Enhancing Query Formulation for Universal Image Segmentation

PMID: 38544142
DOI: 10.3390/s24061879

Document type: Journal Article

Abstract

Recent advancements in image segmentation have been notably driven by Vision Transformers. These transformer-based models offer one versatile network structure capable of handling a variety of segmentation tasks. Despite their effectiveness, the pursuit of enhanced capabilities often leads to more intricate architectures and greater computational demands. OneFormer has responded to these challenges by introducing a query-text contrastive learning strategy active during training only. However, this approach has not completely addressed the inefficiency issues in text generation and the contrastive loss computation. To solve these problems, we introduce Efficient Query Optimizer (EQO), an approach that efficiently utilizes multi-modal data to refine query optimization in image segmentation. Our strategy significantly reduces the complexity of parameters and computations by distilling inter-class and inter-task information from an image into a single template sentence. Furthermore, we propose a novel attention-based contrastive loss. It is designed to facilitate a one-to-many matching mechanism in the loss computation, which helps object queries learn more robust representations. Beyond merely reducing complexity, our model demonstrates superior performance compared to OneFormer across all three segmentation tasks using the Swin-T backbone. Our evaluations on the ADE20K dataset reveal that our model outperforms OneFormer in multiple metrics: by 0.2% in mean Intersection over Union (mIoU), 0.6% in Average Precision (AP), and 0.8% in Panoptic Quality (PQ). These results highlight the efficacy of our model in advancing the field of image segmentation.
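
To make the contrast between one-to-one and one-to-many matching concrete, the following is a minimal NumPy sketch. It is not the EQO or OneFormer implementation: the abstract gives no formulas, so the function names, the temperature value of 0.07, the `positive_mask` argument, and the specific loss form (negative log of the softmax attention mass that a single sentence embedding places on its positive queries) are all assumptions chosen for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row (or a single vector) to unit Euclidean length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def one_to_one_contrastive_loss(queries, texts, temperature=0.07):
    """Symmetric InfoNCE-style loss: query i is paired with text i only.

    queries, texts: arrays of shape (N, D). Illustrative baseline, not the
    exact OneFormer loss.
    """
    q = l2_normalize(np.asarray(queries, dtype=float))
    t = l2_normalize(np.asarray(texts, dtype=float))
    logits = q @ t.T / temperature  # (N, N) cosine similarities
    idx = np.arange(q.shape[0])
    query_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_query = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (query_to_text + text_to_query)


def one_to_many_contrastive_loss(queries, sentence, positive_mask,
                                 temperature=0.07):
    """Hypothetical one-to-many loss for a single template sentence.

    queries: (N, D) object queries; sentence: (D,) sentence embedding;
    positive_mask: (N,) booleans marking queries that should match it.
    The sentence attends over all queries with a softmax, and the loss is
    the negative log of the attention mass that lands on the positives.
    """
    q = l2_normalize(np.asarray(queries, dtype=float))
    s = l2_normalize(np.asarray(sentence, dtype=float))
    mask = np.asarray(positive_mask, dtype=bool)
    if not mask.any():
        raise ValueError("positive_mask must mark at least one query")
    scores = q @ s / temperature            # (N,) similarity per query
    log_attn = log_softmax(scores, axis=0)  # log attention over queries
    pos = log_attn[mask]
    peak = pos.max()                        # log-sum-exp over positives
    return -(peak + np.log(np.exp(pos - peak).sum()))
```

Under these assumptions, the one-to-many form needs only one text embedding per image rather than one per query, and it is zero when every query is marked positive, since the attention mass on the positives is then 1.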
