D702 健康言談資料集

摘要:

醫療諮詢平台醫師與民眾答詢

版本：0

說明

目的：

提供健康諮詢言談文字生成之學術研究

用途：

訓練語言模型

價值：

為目前學術領域中相對稀少的繁體中文資料集

標註方式：

無標註

介紹：

我們從一個名為「台灣e院」的網上健康咨詢網站中收集了86,399個問答對(QA pairs)。該網站允許用戶就自己的健康狀況提問，並由醫生回答他們的問題。與由一些眾包工作者撰寫的文本相比，來自多個用戶的文本擁有更大的語言多樣性，並且回答中的解釋更加自然。

Learning to Generate Explanation from e-Hospital Services for Medical Suggestion

If you need more detailed information, please refer to our paper “Learning to Generate Explanation from e-Hospital Services for Medical Suggestion” at COLING 2022, as well as our GitHub repository.

Dataset

We construct our dataset by crawling approximately 86k QA pairs from Taiwan e-Hospital website, where an answer is a freetext response written by the physician, containing the explanation and suggestion as Figure 1 shows. An ideal instance would be a triple of (q, s, e), where q is the question asked by the patient, and s and e are the suggestion and the explanation responded by the physician. However, the crawled raw text often carries greeting terms, salutations, and personal information, such as names of the patients and doctors, which are noise for our task and should be pruned. To gather the desired (q, s, e), we propose a rule-based keyword matching method to extract text snippets that belong to the suggestion and explanation, defined as follows.

• Suggestion: The suggested action regarding the patient’s concerns, such as whether to seek medical attention, the department for making an appointment with, or the follow-up examination to undergo.

• Explanation: The text describing why a physician gives the suggestion. Generally, it includes medical knowledge to address the patient’s concerns. The details of our method are shown as follows

Schematic diagram

Dataset Format

You can download our dataset from here.

The folder structure for the dataset is shown below, we saperate suggestion (recmd/) and explanation (expln/) in the testset of R2 and R3 for convinience to perform evaluation.

data
├── train
│   ├── R1_train.csv
│   ├── R2_train.csv
│   └── R3_train.csv
│
└── test
    ├── R1_test.csv
    ├── R2_test.csv
    ├── R3_test.csv
    │
    ├── recmd
    │   ├── R2_test_recmd.csv
    │   └── R3_test_recmd.csv
    │
    └── expln
        ├── R2_test_expln.csv
        └── R2_test_expln.csv

Note

The dataset is originally collected from here.

主要檔案下載位置:

https://github.com/ntunlplab/tw-eH

開發團隊

臺大自然語言處理實驗室

數據說明

數據格式: 文字，CSV檔案。

Annotation

規範文件清單

A-1 資料蒐集程序下載檔案
A-2 資料盤點表下載檔案
A-3 法律依據與規範表下載檔案
A-5 資料共享使用管理計畫下載檔案
A-6 去識別化處理程序下載檔案
B 資料集說明文件下載檔案

請填入個人資料以進行下載或授權申請

送出

分類

標籤