D703 Life Narrative Dataset (生活言談資料集)

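Abstract:

We build this dataset from stories about real-life events that crowdworkers wrote at two different times, comparing the differences between the two accounts.

Version: 0

Description

When recalling life experiences, people often forget or confuse life events, which creates a need for information recall assistance. We therefore constructed the Life Narrative (NIR) dataset. For its stories, the dataset provides the differences between the narrations written at two points in time (pre-retold and post-retold), labeled by human annotators.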

 

NIR

Introduction

Hippocorpus is constructed for investigating the difference in narrative flow between relating life experiences and telling imaginative stories. We construct NIR by pruning the imaginative stories in Hippocorpus and retaining the stories about real-life events that crowdworkers wrote at two different times, as pre-retold stories and post-retold stories. We summarize the following five event types from the story pairs in the dataset: Consistent, Inconsistent, Additional, Forgotten, and Unforgotten.

Schematic diagram

Format

Each object in the JSON file consists of an event_id (i.e., the object key), pair_id, story_type, subject, predicate, object, time, event_type, and the supporting evidence for the event.

Example

{
  "59": {
    "pair_id": "3P4RDNWND6SXR9D7TBY1P0EI0KHJIR",
    "story_type": "post-retold",
    "explicitness": "explicit",
    "subject_token_ids": [
      24,
      25,
      26,
      27,
      28,
      29,
      30
    ],
    "predicate": null,
    "predicate_token_ids": [
      31,
      32,
      33
    ],
    "object_token_ids": [
      35
    ],
    "time_token_ids": [],
    "event_type": "additional",
    "supports": []
  },
  ...
}
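As a minimal sketch of reading this format, the snippet below parses a one-record stand-in copied from the example above and counts events per event_type. The function name count_event_types is hypothetical and is not part of the released code.

```python
import json
from collections import Counter


def count_event_types(events):
    """Count events per event_type in a mapping of event_id -> event record."""
    return Counter(record["event_type"] for record in events.values())


# One-record stand-in for NIR.json, copied from the example above.
sample = json.loads("""
{
  "59": {
    "pair_id": "3P4RDNWND6SXR9D7TBY1P0EI0KHJIR",
    "story_type": "post-retold",
    "explicitness": "explicit",
    "subject_token_ids": [24, 25, 26, 27, 28, 29, 30],
    "predicate": null,
    "predicate_token_ids": [31, 32, 33],
    "object_token_ids": [35],
    "time_token_ids": [],
    "event_type": "additional",
    "supports": []
  }
}
""")

print(count_event_types(sample))
```

For the full dataset, the same function could be applied to the result of json.load on the downloaded NIR.json.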

Steps

1. Download the corpus–Hippocorpus

Since we construct our dataset, NIR, by extending Hippocorpus, we need to download Hippocorpus first.

  1. Go to http://aka.ms/hippocorpus
  2. Log in to your Microsoft account.
  3. Download hippoCorpusV2.csv and save it to the parent directory (i.e., data/).

2. Download NIR dataset & Hippocorpus correction file

gdown https://drive.google.com/uc?id=13F_9A8Z1jL9Eg4IwtRospfec7HQnubOC -O ../NIR.json
gdown https://drive.google.com/uc?id=1kaViqs9FDzArV_e8F7i7TZfeoEKkpRnc -O ../errors.csv

3. Download spacy model

We use spaCy to tokenize the stories, so we need to download the spaCy model.

python -m spacy download en_core_web_sm

4. Tokenize the Stories in Hippocorpus

Since we release only the annotations, which refer to the tokenized version of Hippocorpus, we provide the tokenization and preprocessing script to ensure that your result is the same as ours.

python tokenize_hippocorpus.py
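To illustrate how the *_token_ids fields might be read against a tokenized story, here is a minimal sketch. It assumes the ids are 0-based positions in the story's spaCy token sequence, which the dataset description does not state explicitly. Both helper names are hypothetical, and the token list is hand-written for illustration rather than taken from the corpus. The provided script may also apply preprocessing that a plain spaCy run does not, so indices should be checked against its output.

```python
def spacy_tokens(text, model="en_core_web_sm"):
    """Tokenize text with spaCy and return the token strings.

    spaCy is imported lazily so that span_text below works without it.
    """
    import spacy

    nlp = spacy.load(model)
    return [token.text for token in nlp(text)]


def span_text(tokens, token_ids):
    """Join the tokens at the given positions (assumed 0-based) with spaces."""
    return " ".join(tokens[i] for i in token_ids)


# Hand-written tokens for illustration only, not from Hippocorpus.
tokens = ["We", "drove", "to", "the", "lake", "last", "summer", "."]
print(span_text(tokens, [1, 2, 3, 4]))  # prints: drove to the lake
```

An empty id list, as in time_token_ids in the example record, yields an empty string.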

5. Merge NIR and Hippocorpus

After parsing Hippocorpus, you can use the script we provide to merge Hippocorpus and NIR for convenience.

python merge.py
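The output format of merge.py is not described here, so the following is only a sketch of what such a merge could look like, not the script's actual behavior. It assumes a hypothetical mapping from (pair_id, story_type) to a story's token list, and it adds hypothetical *_text fields to a copy of each event record.

```python
def merge_events(events, tokenized_stories):
    """Attach surface text for every token-id field to a copy of each event.

    events: mapping of event_id -> event record, as in NIR.json.
    tokenized_stories: assumed mapping of (pair_id, story_type) -> token list.
    """
    merged = {}
    for event_id, record in events.items():
        tokens = tokenized_stories[(record["pair_id"], record["story_type"])]
        enriched = dict(record)  # shallow copy; the input record is not modified
        for role in ("subject", "predicate", "object", "time"):
            ids = record[f"{role}_token_ids"]
            # The *_text keys are illustrative names, not dataset fields.
            enriched[f"{role}_text"] = " ".join(tokens[i] for i in ids)
        merged[event_id] = enriched
    return merged


# Toy data for illustration only.
toy_events = {
    "1": {
        "pair_id": "PAIR_A",
        "story_type": "pre-retold",
        "subject_token_ids": [0],
        "predicate_token_ids": [1],
        "object_token_ids": [2, 3],
        "time_token_ids": [],
        "event_type": "consistent",
        "supports": [],
    }
}
toy_stories = {("PAIR_A", "pre-retold"): ["I", "adopted", "a", "dog", "."]}
merged = merge_events(toy_events, toy_stories)
print(merged["1"]["subject_text"], "|", merged["1"]["object_text"])
```

Keying the stories by both pair_id and story_type reflects that each pair contains a pre-retold and a post-retold story.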

Main file download location:

https://github.com/ntunlplab/SEEN

 

Publication:

You-En Lin, An-Zi Yen, Hen-Hsen Huang and Hsin-Hsi Chen, “SEEN: Structured Event Enhancement Network for Explainable Need Detection of Information Recall Assistance,” The 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP 2022), December 2022.

 

Development team

Natural Language Processing Lab, National Taiwan University (臺大自然語言處理實驗室)

Data description

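Field name            Description
pair_id               Story pair ID
story_type            Type of the story the event comes from (pre-retold or post-retold)
explicitness          Whether the event is explicit
subject_token_ids     Token indices of the subject
predicate             Predicate
predicate_token_ids   Token indices of the predicate
object_token_ids      Token indices of the object
time_token_ids        Token indices of the time expression
event_type            Event type
supports              IDs of related events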
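The field list can be turned into a simple sanity check on loaded records. The sketch below is an assumption-laden illustration: the function name validate_event is hypothetical, and the allowed event_type values are assumed to be the lowercase forms of the five event types named in the introduction (only "additional" appears in the example record).

```python
from typing import Any

# Assumption: lowercase forms of the five event types from the introduction.
EVENT_TYPES = {"consistent", "inconsistent", "additional", "forgotten", "unforgotten"}
# The two story types named in the dataset description.
STORY_TYPES = {"pre-retold", "post-retold"}
TOKEN_ID_FIELDS = (
    "subject_token_ids",
    "predicate_token_ids",
    "object_token_ids",
    "time_token_ids",
)


def validate_event(record: dict[str, Any]) -> list[str]:
    """Return a list of problems found in one event record (empty if none)."""
    problems = []
    if record.get("story_type") not in STORY_TYPES:
        problems.append(f"unexpected story_type: {record.get('story_type')!r}")
    if record.get("event_type") not in EVENT_TYPES:
        problems.append(f"unexpected event_type: {record.get('event_type')!r}")
    for field in TOKEN_ID_FIELDS:
        ids = record.get(field)
        if not isinstance(ids, list) or not all(isinstance(i, int) for i in ids):
            problems.append(f"{field} is not a list of integers")
    if not isinstance(record.get("supports"), list):
        problems.append("supports is not a list")
    return problems


example = {
    "pair_id": "3P4RDNWND6SXR9D7TBY1P0EI0KHJIR",
    "story_type": "post-retold",
    "explicitness": "explicit",
    "subject_token_ids": [24, 25, 26, 27, 28, 29, 30],
    "predicate": None,
    "predicate_token_ids": [31, 32, 33],
    "object_token_ids": [35],
    "time_token_ids": [],
    "event_type": "additional",
    "supports": [],
}
print(validate_event(example))  # prints: []
```

If the released file uses different spellings for event types, the EVENT_TYPES set would need to be adjusted accordingly.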

