Abstract: To address the challenges of heavy dependence on large-scale labeled data, insufficient feature representation, and single-granularity segmentation results in retinal fundus image segmentation, an unsupervised multi-granularity segmentation method for retinal images is proposed. A novel fully convolutional encoder-decoder architecture is designed to capture both local details and global semantic features of images, enabling efficient reconstruction of multi-level representations. On this basis, a comprehensive loss function is constructed by integrating a pixel-level patch contrastive loss, a representation-level contrastive learning loss, and a global reconstruction loss; this joint optimization constrains the model across multiple feature scales, enhancing its representation capability and aligning the feature space with the structural distribution of the segmentation task. A diffusion-condensation algorithm is then applied in the representation space to aggregate multi-scale semantic information, improving boundary precision and structural coherence and yielding segmentation results with hierarchical, multi-granular characteristics. Experiments on publicly available retinal fundus datasets show that the proposed method improves the Dice coefficient by 3.7% over state-of-the-art unsupervised segmentation approaches, with superior performance in both detail fidelity and structural consistency. These results indicate that the proposed method enables accurate, multi-granularity segmentation of retinal fundus images.
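The abstract names three loss terms (pixel-level patch contrastive, representation-level contrastive, and global reconstruction) but not their exact formulation. As a rough illustrative sketch only, the NumPy code below assumes mean-squared error for the reconstruction term and an InfoNCE-style formulation for both contrastive terms; the function names, weights, and the InfoNCE choice are all assumptions, not the paper's actual implementation.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style contrastive loss: each anchor row should match its
    own positive row against all other positives in the batch.
    (Assumed formulation; the paper may differ.)"""
    # L2-normalize so dot products are cosine similarities
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                  # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # correct (anchor, positive) pairs lie on the diagonal
    return -np.mean(np.diag(log_prob))

def combined_loss(recon, target, patch_a, patch_p, rep_a, rep_p,
                  w_patch=1.0, w_rep=1.0, w_recon=1.0):
    """Weighted sum of the three terms named in the abstract
    (weights w_* are hypothetical hyperparameters)."""
    l_recon = np.mean((recon - target) ** 2)   # global reconstruction (MSE assumed)
    l_patch = info_nce(patch_a, patch_p)       # pixel-level patch contrastive
    l_rep = info_nce(rep_a, rep_p)             # representation-level contrastive
    return w_patch * l_patch + w_rep * l_rep + w_recon * l_recon
```

In this sketch the three terms are simply summed with scalar weights; joint optimization of such a sum is what constrains the encoder-decoder at multiple feature scales simultaneously.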