Bioinformatics (Oxford, England), 2017 Sep 01, Vol. 33, Issue 17

emMAW: computing minimal absent words in external memory

PMID: 28407038 DOI: 10.1093/bioinformatics/btx209

Publication type: Journal Article

Abstract

MOTIVATION: The biological significance of minimal absent words has been investigated in genomes of organisms from all domains of life. For instance, three minimal absent words of the human genome were found in Ebola virus genomes. There exists an O(n)-time and O(n)-space algorithm, based on suffix arrays, for computing all minimal absent words of a sequence of length n over a fixed-sized alphabet. A standard implementation of this algorithm, when applied to a large sequence of length n, requires more than 20n bytes of RAM. Such memory requirements are a significant hurdle to the computation of minimal absent words in large datasets.

RESULTS: We present emMAW, the first external-memory algorithm for computing minimal absent words. A free open-source implementation of our algorithm is made available. This allows for computation of minimal absent words on far bigger datasets than was previously possible. Our implementation requires less than 3 h on a standard workstation to process the full human genome when as little as 1 GB of RAM is made available. We stress that our implementation, despite making use of external memory, is fast; indeed, even on relatively smaller datasets, when enough RAM is available to hold all necessary data structures, it is less than two times slower than state-of-the-art internal-memory implementations.

AVAILABILITY AND IMPLEMENTATION: https://github.com/solonas13/maw (free software under the terms of the GNU GPL).

CONTACT: alice.heliou@lix.polytechnique.fr or solon.pissis@kcl.ac.uk.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
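
To make the notion of a minimal absent word concrete, here is a minimal brute-force sketch in Python. It is not the suffix-array algorithm or the emMAW implementation described in the abstract. It assumes the usual definition (a word that does not occur in the sequence although all of its proper factors do), and the function name `minimal_absent_words` and its `alphabet` parameter are hypothetical names chosen for illustration. It enumerates every substring explicitly, so it is only suitable for very short inputs.

```python
def minimal_absent_words(y, alphabet):
    """Brute-force minimal absent words of y (illustrative only).

    A word w is reported if w does not occur in y while both
    w[:-1] and w[1:] do occur. The alphabet is assumed to contain
    every letter of y. Memory use grows quadratically with len(y).
    """
    # Every distinct factor (substring) of y, including the empty word.
    factors = {y[i:j] for i in range(len(y)) for j in range(i, len(y) + 1)}

    maws = set()
    for left in factors:
        if not left:
            continue  # length-1 words are handled separately below
        for b in alphabet:
            w = left + b  # w[:-1] == left occurs in y by construction
            if w not in factors and w[1:] in factors:
                maws.add(w)

    # A single letter that never occurs is minimal absent as well,
    # because its only proper factor is the empty word.
    for c in alphabet:
        if c not in factors:
            maws.add(c)
    return maws


if __name__ == "__main__":
    print(sorted(minimal_absent_words("abaab", "ab")))
```

For the word abaab over the alphabet {a, b}, this yields aaa, aaba, bab and bb: each is absent from abaab, yet removing its first or last letter gives a word that does occur.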
