Abstract: |
This study introduces FoundationMorph, a 3D vision-language foundation model for unsupervised deformable 3D medical image registration. Traditional deep learning approaches struggle to generalize and require training a separate model for each image modality and task, which reduces efficiency and can compromise clinical workflows owing to limited training data and the resulting performance inconsistency. FoundationMorph addresses these issues by performing multiple registration tasks with a single model. It integrates a language module that encodes clinical text with a pre-trained language model and a vision module that unifies 2D and 3D image encoders. The 2D encoder, trained on a large-scale mixed medical imaging dataset spanning MRI, CT, and PET, works with a 3D network to learn multiple 3D registration tasks. A multi-dimensional attention module fuses the language, 2D, and 3D features to produce accurate 3D registrations. The model was evaluated on the IXI dataset for inter-patient brain MRI registration and on the DIRLAB dataset for intra-patient lung 4DCT registration. FoundationMorph outperformed competitive methods, achieving the lowest target registration error and demonstrating superior effectiveness across multiple registration tasks, highlighting its potential to support both single- and multi-task registration in clinical practice. © 2025 SPIE