Journal of Computational Biology: a journal of computational molecular cell biology. 2019 Jan 01; Vol. 26, issue 1

Word and sentence embedding tools to measure semantic similarity of Gene Ontology terms by their definitions

PMID: 30383443 DOI: 10.1089/cmb.2018.0093

Publication type:

Journal Article
Research Support, N.I.H., Extramural
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

Abstract

The Gene Ontology (GO) database contains GO terms that describe the biological functions of genes. Previous methods for comparing GO terms have relied on the fact that GO terms are organized into a tree structure. Under this paradigm, the locations of two GO terms in the tree dictate their similarity score. In this article, we introduce two new solutions to this problem that focus instead on the definitions of the GO terms. We apply neural network-based techniques from the natural language processing (NLP) domain. The first method does not rely on the GO tree, whereas the second depends on it indirectly. In our first approach, we compare two GO definitions by treating them as two unordered sets of words. Word similarity is estimated by a word embedding model that maps words into an N-dimensional space. In our second approach, we account for word order within a sentence. We use a sentence encoder to embed GO definitions into vectors and estimate how likely one definition is to entail another. We validate our methods in two ways. In the first experiment, we test the model's ability to differentiate a true protein-protein network from a randomly generated network. In the second experiment, we test the model's ability to identify orthologs among randomly matched genes in human, mouse, and fly. In both experiments, a hybrid of the NLP and GO tree-based methods achieves the best classification accuracy.
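
To make the first approach concrete, the following minimal Python sketch compares two definitions as unordered sets of words using word vectors. It is an illustration only, not the authors' implementation: the abstract does not say how word-level similarities are aggregated, so the symmetric best-match average used here is an assumption, and the three-dimensional vectors in `TOY_VECTORS`, along with every function name, are hypothetical.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for illustration.
# A real word embedding model would be trained on a corpus and use
# many more dimensions.
TOY_VECTORS = {
    "transport": (0.9, 0.1, 0.0),
    "movement": (0.8, 0.2, 0.1),
    "ion": (0.1, 0.9, 0.0),
    "calcium": (0.2, 0.8, 0.1),
    "binding": (0.0, 0.1, 0.9),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def tokenize(definition):
    """Lowercase, split on whitespace, and keep only words with a known vector.

    Returning a set discards word order and repeated words.
    """
    return {w for w in definition.lower().split() if w in TOY_VECTORS}


def directed_similarity(words_a, words_b):
    """Average, over the words in A, of the best cosine match found in B."""
    if not words_a or not words_b:
        return 0.0
    total = 0.0
    for a in words_a:
        total += max(cosine(TOY_VECTORS[a], TOY_VECTORS[b]) for b in words_b)
    return total / len(words_a)


def definition_similarity(def_a, def_b):
    """Symmetric set-based similarity (assumed aggregation, see lead-in)."""
    a, b = tokenize(def_a), tokenize(def_b)
    return 0.5 * (directed_similarity(a, b) + directed_similarity(b, a))


if __name__ == "__main__":
    print(definition_similarity("calcium movement", "ion transport"))
    print(definition_similarity("binding", "ion transport"))
```

Because each definition is reduced to a set, reordering its words does not change the score. That is the limitation the second approach addresses by encoding whole sentences.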
