PLoS One, 2017-01-01, Vol. 12, Issue 5

An index-based algorithm for fast on-line query processing of latent semantic analysis

PMID: 28520747
DOI: 10.1371/journal.pone.0177523

Publication type: Journal Article

Abstract

Latent Semantic Analysis (LSA) is widely used to find documents that are semantically similar to a keyword query. Although LSA yields promising results, existing LSA algorithms perform many unnecessary operations in similarity computation and candidate checking during on-line query processing. This is costly in time and prevents them from responding efficiently to query requests, especially when the dataset is large. In this paper, we study the efficiency of on-line query processing for LSA, with the goal of efficiently retrieving the documents similar to a given query. We rewrite the LSA similarity equation in terms of an intermediate value called partial similarity, which is stored in a purpose-built index called the partial index. To reduce the search space, we give an approximate form of the similarity equation and then develop an efficient algorithm for building the partial index, which skips partial similarities lower than a given threshold θ. Based on the partial index, we develop an efficient algorithm called ILSA that supports fast on-line query processing. The given query is transformed into a pseudo-document vector, and the similarities between the query and candidate documents are computed by accumulating the partial similarities obtained from the index nodes that correspond to non-zero entries in the pseudo-document vector. Compared with the LSA algorithm, ILSA reduces the time cost of on-line query processing by pruning unpromising candidate documents and skipping operations that contribute little to the similarity scores. Extensive experiments comparing ILSA with LSA demonstrate the efficiency and effectiveness of the proposed algorithm.
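
The abstract does not give the rewritten similarity equation, so the following is only a minimal sketch of the general idea, not the authors' ILSA implementation. It assumes standard LSA folding-in (query mapped through U_k and the inverse singular values), takes the "partial similarity" of a term and a document to be that term's additive contribution to the document's score, and prunes on magnitude against θ. The function names (`lsa_factors`, `build_partial_index`, `query_index`, `exact_scores`), the toy matrix, and the choice to accumulate over the query's non-zero term entries are all assumptions made for illustration. Scores here are the cosine similarity multiplied by the query's latent norm, which leaves the ranking unchanged.

```python
import numpy as np


def lsa_factors(A, k):
    """Rank-k SVD of a term-by-document matrix A (rows: terms, columns: documents)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k = U[:, :k], s[:k]
    docs = Vt[:k, :].T  # one k-dimensional latent vector per document
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)  # unit length
    return U_k, s_k, docs


def build_partial_index(A, k, theta):
    """Offline step: map each term id to a list of (doc id, partial similarity)."""
    U_k, s_k, docs = lsa_factors(A, k)
    # P[i, j]: what one occurrence of term i adds to the score of document j.
    P = (U_k / s_k) @ docs.T
    index = {}
    for i in range(P.shape[0]):
        # Assumption: entries whose magnitude is below theta are skipped.
        index[i] = [(j, float(P[i, j]))
                    for j in range(P.shape[1]) if abs(P[i, j]) >= theta]
    return index


def query_index(index, query_terms):
    """Online step: accumulate partial similarities for the query's non-zero terms."""
    scores = {}
    for term_id, weight in query_terms.items():
        for doc_id, partial in index.get(term_id, []):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight * partial
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def exact_scores(A, k, query_terms):
    """Baseline: fold the query into the latent space and score every document."""
    U_k, s_k, docs = lsa_factors(A, k)
    q = np.zeros(A.shape[0])
    for term_id, weight in query_terms.items():
        q[term_id] = weight
    q_hat = (q @ U_k) / s_k  # pseudo-document vector of the query
    return docs @ q_hat


# Toy data (hypothetical): 5 terms x 4 documents.
A = np.array([
    [2, 0, 1, 0],
    [1, 0, 2, 0],
    [0, 3, 0, 1],
    [0, 1, 0, 2],
    [1, 1, 0, 0],
], dtype=float)

index = build_partial_index(A, k=2, theta=0.1)
ranked = query_index(index, {0: 1.0, 1: 1.0})  # query containing terms 0 and 1
print(ranked)
```

With θ = 0 nothing is pruned and the accumulated scores match the baseline exactly. Raising θ shortens the posting lists, which is where the query-time saving would come from, at the cost of an approximate score.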
