Abstract: Image colorization is an inherently ambiguous and multimodal problem: given a grayscale image as input, a colorization network can generate many visually plausible color images. Conventional methods rely primarily on semantic information for colorization, yet they still suffer from color contamination and semantic confusion, largely because convolutional neural networks have limited capacity to effectively learn the deep semantic information inherent in images. In this paper, we propose a network structure that addresses these limitations through multi-level semantic information classification and fusion. In addition, we introduce a global semantic fusion network to combat color contamination, and the proposed colorization encoder accurately extracts object-level semantic information from images. To further enhance visual plausibility, we employ a self-supervised adversarial training method. We train the network on datasets of varying sizes and evaluate it on the ImageNet and COCO validation sets. Experimental results demonstrate that the proposed algorithm generates more realistic images than previous approaches and exhibits strong generalization ability.