D703 Life Narrative Dataset (生活言談資料集)

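Abstract:

We build this dataset from stories about real-life events that crowdworkers wrote at two different times, comparing the differences between the two narrations.

Version: 0

Description

When recalling life experiences, people often forget or confuse life events, which creates a need for information recall services. We therefore built the Life Narrative (NIR) dataset. For its stories, the dataset provides the differences between the narrations written at two different time points (pre-retold and post-retold), labeled by human annotators.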

NIR

Introduction

Hippocorpus was constructed to investigate how narrative flow differs between relating life experiences and telling imaginative stories. We construct NIR by pruning the imaginative stories from Hippocorpus and retaining the stories about real-life events that crowdworkers wrote at two different times, which we call pre-retold stories and post-retold stories. From the story pairs in the dataset we identify five event types: Consistent, Inconsistent, Additional, Forgotten, and Unforgotten.

Schematic diagram

Format

Each object in the JSON file consists of an event_id (i.e., the object key), pair_id, story_type, subject, predicate, object, time, event_type, and the supporting evidence for the event.

Example

{
  "59": {
    "pair_id": "3P4RDNWND6SXR9D7TBY1P0EI0KHJIR",
    "story_type": "post-retold",
    "explicitness": "explicit",
    "subject_token_ids": [
      24,
      25,
      26,
      27,
      28,
      29,
      30
    ],
    "predicate": null,
    "predicate_token_ids": [
      31,
      32,
      33
    ],
    "object_token_ids": [
      35
    ],
    "time_token_ids": [],
    "event_type": "additional",
    "supports": []
  },
  ...
}
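
As a minimal sketch of how a file in this format might be read, the snippet below parses a JSON string shaped like the example above and tallies events per event_type. The helper name count_event_types and the inline sample are illustrative assumptions and are not part of the released scripts; in practice the text would be read from NIR.json.

```python
import json
from collections import Counter

# Inline sample mirroring the example above (lists written on one line).
SAMPLE = """
{
  "59": {
    "pair_id": "3P4RDNWND6SXR9D7TBY1P0EI0KHJIR",
    "story_type": "post-retold",
    "explicitness": "explicit",
    "subject_token_ids": [24, 25, 26, 27, 28, 29, 30],
    "predicate": null,
    "predicate_token_ids": [31, 32, 33],
    "object_token_ids": [35],
    "time_token_ids": [],
    "event_type": "additional",
    "supports": []
  }
}
"""


def count_event_types(events):
    """Return a Counter of event_type values over all events (hypothetical helper)."""
    return Counter(event["event_type"] for event in events.values())


# The top-level keys are the event ids; JSON null becomes Python None.
events = json.loads(SAMPLE)
counts = count_event_types(events)
print(counts)
```

To run this on the real data, replace json.loads(SAMPLE) with json.load on an opened NIR.json file.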

Steps

1. Download the corpus–Hippocorpus

Since we construct our dataset, NIR, by extending Hippocorpus, we need to download Hippocorpus first.

Go to http://aka.ms/hippocorpus
Log in with your Microsoft account.
Download hippoCorpusV2.csv and save it to the parent directory (i.e., data/).
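
The Introduction describes NIR as keeping only the real-life stories from Hippocorpus. The sketch below illustrates that pruning idea on an in-memory CSV. It assumes that the story type is stored in a column named memType with the value imagined for imaginative stories, and that rows carry an AssignmentId column; these names are assumptions that should be checked against the header of hippoCorpusV2.csv. The function real_life_rows is hypothetical, and the provided scripts may select stories differently.

```python
import csv
import io


def real_life_rows(csv_file, type_column="memType", imagined_value="imagined"):
    """Return rows whose story type is not the imagined one.

    The column name and value are assumptions about the Hippocorpus CSV,
    so both are parameters rather than hard-coded.
    """
    reader = csv.DictReader(csv_file)
    return [row for row in reader if row[type_column] != imagined_value]


# Tiny made-up sample standing in for hippoCorpusV2.csv.
sample = io.StringIO(
    "AssignmentId,memType,story\n"
    "a1,recalled,We went hiking.\n"
    "a2,imagined,A dragon visited.\n"
    "a3,retold,We hiked last spring.\n"
)
rows = real_life_rows(sample)
kept_ids = [row["AssignmentId"] for row in rows]
print(kept_ids)
```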

2. Download NIR dataset & Hippocorpus correction file

gdown "https://drive.google.com/uc?id=13F_9A8Z1jL9Eg4IwtRospfec7HQnubOC" -O ../NIR.json
gdown "https://drive.google.com/uc?id=1kaViqs9FDzArV_e8F7i7TZfeoEKkpRnc" -O ../errors.csv

3. Download spacy model

We use spaCy to tokenize the stories, so we need to download the spaCy model.

python -m spacy download en_core_web_sm

4. Tokenize the Stories in Hippocorpus

The released annotations refer to the tokenized form of the Hippocorpus stories, so we provide the tokenization and preprocessing script to ensure that your result is the same as ours.

python tokenize_hippocorpus.py

5. Merge NIR and Hippocorpus

For convenience, we also provide a script that merges the tokenized Hippocorpus with NIR.

python merge.py
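
The behavior of merge.py is not documented here, so the following is only a sketch of what merging could look like under stated assumptions: events are grouped by (pair_id, story_type) and attached to a tokenized story with the same key. The functions group_events and attach_events, the stories dictionary, and the event_type values other than additional are hypothetical.

```python
from collections import defaultdict


def group_events(events):
    """Map (pair_id, story_type) to a list of (event_id, event) tuples."""
    grouped = defaultdict(list)
    for event_id, event in events.items():
        key = (event["pair_id"], event["story_type"])
        grouped[key].append((event_id, event))
    return dict(grouped)


def attach_events(stories, events):
    """Attach each story's events to its token list; stories without events get []."""
    grouped = group_events(events)
    return {
        key: {"tokens": tokens, "events": grouped.get(key, [])}
        for key, tokens in stories.items()
    }


# Made-up events and tokenized stories for illustration only.
events = {
    "1": {"pair_id": "A", "story_type": "pre-retold", "event_type": "forgotten"},
    "2": {"pair_id": "A", "story_type": "post-retold", "event_type": "additional"},
    "3": {"pair_id": "A", "story_type": "post-retold", "event_type": "consistent"},
}
stories = {
    ("A", "pre-retold"): ["I", "moved", "."],
    ("A", "post-retold"): ["I", "moved", "home", "."],
}
merged = attach_events(stories, events)
print(len(merged[("A", "post-retold")]["events"]))
```

Keying on both pair_id and story_type keeps the pre-retold and post-retold versions of a pair separate, which matters because token indices only make sense within a single story.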

Main file download location:

https://github.com/ntunlplab/SEEN

Publication:

You-En Lin, An-Zi Yen, Hen-Hsen Huang and Hsin-Hsi Chen, “SEEN: Structured Event Enhancement Network for Explainable Need Detection of Information Recall Assistance,” The 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022), December 2022.

Development team

Natural Language Processing Laboratory, National Taiwan University (臺大自然語言處理實驗室)

Data description

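Field name	Description
pair_id	ID of the story pair
story_type	Type of the story the event comes from
explicitness	Whether the event is explicit
subject_token_ids	Token indices of the subject
predicate	Predicate
predicate_token_ids	Token indices of the predicate
object_token_ids	Token indices of the object
time_token_ids	Token indices of the time expression
event_type	Event type
supports	IDs of related events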
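
The *_token_ids fields point into a tokenized story, so recovering the text of an event means looking up those positions. The sketch below shows one way to do that. It assumes the ids are 0-based indices into the story's token list, which this page does not state, and it uses a made-up token list and event. The helper span_text is hypothetical.

```python
def span_text(tokens, token_ids):
    """Join the tokens at the given indices; an empty id list gives ''."""
    return " ".join(tokens[i] for i in token_ids)


# Made-up tokenized story; real tokens come from tokenize_hippocorpus.py.
tokens = ["My", "sister", "baked", "a", "cake", "yesterday", "."]

# Made-up event using the field names from the table above.
event = {
    "subject_token_ids": [0, 1],
    "predicate_token_ids": [2],
    "object_token_ids": [3, 4],
    "time_token_ids": [5],
}

# Assumption: ids are 0-based positions in the token list.
roles = {
    role: span_text(tokens, event[f"{role}_token_ids"])
    for role in ("subject", "predicate", "object", "time")
}
print(roles)
```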

Annotation

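List of governance documents

A-1 Data collection procedure
A-2 Data inventory table
A-3 Legal basis and regulations table
A-5 Data sharing and usage management plan
A-6 De-identification procedure
B Dataset description document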
