IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2014-01-01, Vol. 11, Issue 4

A simple but powerful heuristic method for accelerating k-means clustering of large-scale data in life science

PMID: 26356339
DOI: 10.1109/TCBB.2014.2306200

Publication type:

Journal Article
Research Support, Non-U.S. Gov't

Abstract

K-means clustering has been widely used to gain insight into biological systems from large-scale life science data. To quantify the similarities among biological data sets, Pearson correlation distance and standardized Euclidean distance are used most frequently; however, optimization methods have been largely unexplored. These two distance measurements are equivalent in the sense that they yield the same k-means clustering result for identical sets of k initial centroids. Thus, an efficient algorithm used for one is applicable to the other. Several optimization methods are available for the Euclidean distance and can be used for processing the standardized Euclidean distance; however, they are not customized for this context. We instead approached the problem by studying the properties of the Pearson correlation distance, and we invented a simple but powerful heuristic method for markedly pruning unnecessary computation while retaining the final solution. Tests using real biological data sets with 50-60K vectors of dimensions 10-2001 (~400 MB in size) demonstrated marked reduction in computation time for k = 10-500 in comparison with other state-of-the-art pruning methods such as Elkan's and Hamerly's algorithms. The BoostKCP software is available at http://mlab.cb.k.u-tokyo.ac.jp/~ichikawa/boostKCP/.
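
The relationship between the two distances mentioned in the abstract can be illustrated with a small numerical check. The sketch below is not the BoostKCP method and does not reproduce the paper's pruning heuristic. It only verifies a standard identity, assuming that "standardized" means z-scoring each vector by its own mean and population standard deviation: for two standardized vectors of dimension n with Pearson correlation r, the squared Euclidean distance equals 2n(1 - r). The function names are hypothetical and chosen for illustration.

```python
import math
import random


def standardize(v):
    """Z-score a vector: subtract its mean, divide by its population std."""
    n = len(v)
    mean = sum(v) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in v) / n)
    return [(x - mean) / std for x in v]


def pearson_distance(x, y):
    """Pearson correlation distance, defined here as 1 - r."""
    zx, zy = standardize(x), standardize(y)
    r = sum(a * b for a, b in zip(zx, zy)) / len(x)
    return 1.0 - r


def squared_euclidean(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


# Two random vectors (illustrative data only, not from the paper).
rng = random.Random(0)
n = 50
x = [rng.gauss(0, 1) for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]

d_pearson = pearson_distance(x, y)
d_euclid_sq = squared_euclidean(standardize(x), standardize(y))

# For standardized vectors: ||zx - zy||^2 == 2 * n * (1 - r)
assert math.isclose(d_euclid_sq, 2 * n * d_pearson, rel_tol=1e-9)
```

Because the squared distance is an increasing function of 1 - r, ranking standardized vectors by either measure gives the same order. How this carries over to centroids during k-means iterations is addressed in the paper itself and is not shown here.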
