BMC Medical Informatics and Decision Making, 2020 Jul 09, Vol. 20 (Suppl 3)

A benchmark dataset and case study for Chinese medical question intent classification

PMID: 32646426 DOI: 10.1186/s12911-020-1122-3

Publication type:

Journal Article
Research Support, Non-U.S. Gov't

Abstract

BACKGROUND: To provide satisfying answers, a medical QA system must precisely understand the intent behind users' questions. Medical intent classification requires high-quality datasets for training deep-learning models in a supervised way. Currently, there is no public dataset for Chinese medical intent classification, and datasets from other fields are not applicable to medical QA systems. To solve this problem, we construct a Chinese medical intent dataset (CMID) from questions posted on medical QA websites. On this basis, we compare four intent classification models on CMID in a case study.

METHODS: The questions in CMID are obtained from several medical QA websites. The intent annotation standard, developed by medical experts, covers four types and 36 subtypes of user intent. Besides the intent label, CMID provides two kinds of additional information: word segmentation and named entities. We use crowdsourcing to annotate the intent of each Chinese medical question. Word segmentation is obtained with Jieba and named entities with a well-trained Lattice-LSTM model. To make the segmentation more accurate, we loaded a Chinese medical dictionary of 530,000 entries. We also select four popular deep learning-based models and compare their intent classification performance on CMID.

RESULTS: The final CMID contains 12,000 Chinese medical questions and is organized in JSON format. Each question is labeled with its intent, word segmentation, and named entity information. Question length and the number of entities are also analyzed in detail. Among FastText, TextCNN, TextRNN, and TextGCN, FastText and TextCNN achieved the best results on the four-type and the 36-subtype intent classification tasks, respectively.

CONCLUSIONS: In this work, we provide a dataset for Chinese medical intent classification that can be used in medical QA and related fields. We performed an intent classification task on CMID and analyzed the content of the dataset.
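
The abstract states that CMID is organized in JSON, with each question carrying an intent label, a word segmentation, and named entities, and that question length and entity counts were analyzed. The following is a minimal sketch of what reading such records and computing those statistics could look like. The field names (`text`, `type`, `subtype`, `segments`, `entities`), the label values, and the two sample questions are all assumptions invented for illustration; they are not taken from the released dataset.

```python
import json
from collections import Counter

# Hypothetical records: every field name, label, and question below is
# invented for illustration and is NOT copied from the actual CMID files.
RAW = """
[
  {"text": "感冒了吃什么药", "type": "treatment", "subtype": "medication",
   "segments": ["感冒", "了", "吃", "什么", "药"],
   "entities": [{"mention": "感冒", "label": "disease"}]},
  {"text": "高血压有什么症状", "type": "condition", "subtype": "symptom",
   "segments": ["高血压", "有", "什么", "症状"],
   "entities": [{"mention": "高血压", "label": "disease"}]}
]
"""


def summarize(records):
    """Return intent-type counts, mean question length in characters,
    and mean number of named entities per question."""
    type_counts = Counter(r["type"] for r in records)
    mean_len = sum(len(r["text"]) for r in records) / len(records)
    mean_entities = sum(len(r["entities"]) for r in records) / len(records)
    return type_counts, mean_len, mean_entities


records = json.loads(RAW)
type_counts, mean_len, mean_entities = summarize(records)
print(type_counts, mean_len, mean_entities)
```

A real loader would read the published files with `json.load` and adapt the keys to whatever schema the dataset actually uses.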
