CCC: Color Classified Colorization

Mrityunjoy Gain, Avi Deb Raha, and Rameswar Debnath*, Member, IEEE Computer Science and Engineering Discipline, Khulna University, Khulna 9208, Bangladesh
E-mail: {gain1624, dev1611, rdebnath}@cseku.ac.bd

Abstract

Automatic colorization of gray images with objects of different colors and sizes is challenging due to inter- and intra-object color variation and the small area of the main objects due to extensive backgrounds. The learning process often favors dominant features, resulting in a biased model. Like the class imbalance problem, a weighted function imposing a higher weight on minority features can solve this feature imbalance problem. In this paper, we formulate the colorization problem into a multinomial classification problem and then apply a weighted function to classes. We propose a set of formulas to transform color values into color classes and vice versa. Class optimization and balancing feature distribution are the keys for good performance. Class levels and feature distribution are fully data-driven. Observing class appearance on various extremely large-scale real-time images in practice, we propose 215 color classes for our colorization task. During training, we propose a class-weighted function based on true class appearance in each batch to ensure proper color saturation of individual objects. We establish a trade-off between major (mostly appearing) and minor classes (rarely appearing) to provide orthodox class prediction by eliminating major classes’ dominance over minor classes. As we apply regularization to enhance the stability of the minor class, occasional minor noise may appear at the object’s edges. We propose a novel object-selective color harmonization method empowered by the Segment Anything Model (SAM) to refine and enhance these edges. We propose a new color image evaluation metric, the Chromatic Number Ratio (CNR), to quantify the richness of color components. We compare our proposed model with state-of-the-art models using five different datasets: ADE, Celeba, COCO, Oxford 102 Flower, and ImageNet, in both qualitative and quantitative approaches. 
The experimental results show that our proposed model outstrips other models in visualization and CNR measurement criteria while maintaining satisfactory performance in regression (MSE, PSNR), similarity (SSIM, LPIPS, UIQI), and generative criteria (FID).

Index Terms:

Colorization, Minority Features, Feature Balancing, Chromatic Number

I Introduction

Figure 1: Imbalanced feature distribution makes the regression task biased.

Human vision perceives thousands of colors, making object identification easier. Color images are a popular way to express creativity and reminisce. Colorizing images from antiquity, medicine, industry, and astronomy helps convey their meanings. Color-coded subjects continue to captivate the public with remastered versions of vintage black-and-white movies, colored books, and online automatic colorization bots.
Colorization is a process that assigns color components to grayscale images. It can be non-linear and ill-posed, allowing multiple colors in a single gray image. For example, a fruit’s color can be light green, yellow, or red. Natural colorization aims to predict credible color distribution, not just the intensity values of a gray image. This process is not limited to the ground truth image color values.
Researchers have used various methods for image coloring, including user-guided [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 31, 36, 51, 52] and learning-based methods[13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 28, 29, 30, 32, 33, 34, 35, 37]. Traditional user-guided methods require significant human interaction, leading to a decline in effectiveness. Learning-based strategies, which involve classical regression[21, 24, 28, 29, 50, 53], object segmentation[25, 32, 33, 35], generative approaches[13, 14, 15, 16, 17, 18, 19, 34, 37], and feature-balancing[20, 22, 23, 30] techniques, are now more popular. Deep learning approaches, particularly in regression, are also gaining popularity for image colorization. These methods are easier to implement and require less human labor. Deep Neural Networks (DNNs) learn representative features and hidden structural knowledge from data through training. The loss function generates feedback to refine the model’s parameters, and networks adjust weights proportionally to the error. However, imbalanced class distributions can cause learning models to incorrectly classify minority class observations, making predicted class probabilities unreliable[26, 27].
The colorization problem primarily involves feature distributions, with unbalanced distributions causing imbalances in the training process. Desaturated color components are more prevalent in training images, impacting the performance of saturated color components. This bias can cause smaller objects’ hues to merge with the background, making learning tasks biased. Handling feature imbalance is essential for achieving the desired learning outcomes.
Class imbalance is often resolved by resampling the dataset or using weighted functions to increase minority class weight. In training, features of a sample determine gradient directions on the loss function. The sample’s spatial resolution determines the input dimension of a learning model, while the output dimension of colorization models is the same. Defining rules to transform feature values into class values can solve feature imbalance problems.

We propose a method to transform continuous color values into discrete color classes and vice versa, so that the model predicts a distribution of possible colors for each pixel. Our analysis revealed that 215 of the 400 color classes are predominantly present, so we reduced the number of classes to 215, as the redundant classes lowered the classification accuracy. To address the class imbalance issue, we determine the class weights by analyzing the true classes of each batch during training, assigning higher weights to rarely appearing classes. This adjustment aims to alleviate desaturation and biases towards predominant features. We also propose a SAM-empowered[57] object-selective color harmonization method to further refine and polish the edges.

Our proposed model reformulates the regression problem as a classification problem, building on CIC[30]. CIC[30] defines static classes, but self-supervision depends largely on the patterns and variety of the data, so we formulate our problem in a data-driven manner. Moreover, we embed SAM, which improves results by removing color bleeding and ensuring object-selective color harmonization in cases where the model fails. We also describe, with experiments, the impact of restricting classes by their appearance count in the images ( $>$ 500). The basic improvements of CCC over CIC[30] are summarized in Tab. I:

TABLE I: Differences between the CIC[30] and our proposed CCC

Contents	CIC[30]	CCC
Formula for color class conversion and vice versa	No	Yes
Data driven class points optimization	No	Yes
Task generalization (adaptability to similar feature imbalance problems)	No	Yes
Data driven class weight formulation	No	Yes
Segmentation based edge refinement	No	Yes
Chromatic diversity evaluation metric	No	Yes

The essence of our method is shown in Fig. I. The following are the contributions to this work:

1. We propose a set of formulas to transform continuous double-channel color values into discrete single-channel color classes and vice versa. Any feature-imbalanced regression problem can be configured as a classification problem using these baseline formulas.
2. We optimize the class levels of the colorization problem by analyzing numerous different images.
3. We propose a class re-weighting formula that gains a high gradient from misclassified rare (seldom appearing) classes to ensure a balanced contribution of all classes to the loss. This removes feature biases, desaturation, and over-saturation from the color distribution and ensures orthodox prediction.
4. We propose a novel object-selective color harmonization method empowered by the Segment Anything Model (SAM) to make the edges more refined and polished.
5. We propose a new color image evaluation metric, Chromatic Number Ratio (CNR), which quantifies the richness of color classes in generated images compared to ground truth images, providing a comprehensive measure of the color spectrum.
6. We present an abundance of quantitative and qualitative results demonstrating that our method significantly outperforms extant state-of-the-art baselines and produces reasonable results.

The rest of the paper is structured as follows: Sec. II reviews the relevant literature; Sec. III, the entire CCC, including problem formulation and solution approach; Sec. IV, the SAM-empowered color harmonization; Sec. V the experimental outcomes and a comparative analysis with other cutting-edge techniques; and Sec. VI, the conclusion.

II Related Literature

Image colorization mainly falls into two categories: user-guided and learning-based colorization.

II-A User Guided Colorization

The user-guided colorization mainly falls into two categories: scribble-based and example-based.
Scribble-based colorization The scribble-based colorization technique uses user input to fill in missing or incomplete sections of an image. Techniques include optimizing color propagation[2], combining non-iterative techniques with adaptive edge extraction[1], introducing color blending[3], propagating color effectively[4] in pattern-continuous and intensity-continuous regions[5], and incorporating U-Net structures[31].
Example-based colorization The example-based colorization minimizes user effort in grayscale image transmission, including global color statistics[6], segmented region-level approaches[7, 8, 36], super-pixel-level[51, 9], and pixel-level methods[10, 52]. However, manual similarity metrics can be prone to error in scenarios with significant variations in intensity and content[11, 12].

II-B Learning Based Colorization

Learning-based colorization is a machine-learning technique that automatically applies color to grayscale or black-and-white photographs using CNNs trained on large datasets. The main challenge is feature balancing for focused objects and backgrounds.
Basic Regression Based Colorization: These methods use a conventional CNN or specialized architectures such as InceptionNet, VGGNet, ResNet, and DenseNet to estimate color channels from grayscale images. The gradient is computed from a regression loss function, as in automated methods[28, 1, 29, 24], encoder-decoder based colorization models[53, 21], and iColoriT[50].
Object Segmentation Based Colorization Various colorization models segment objects within an image, learn color assignment segment-wise or object-wise, and assign colors to segments using techniques like spatial connections or global color coherence. These models have been developed using various techniques, including semantic segmentation[25, 33, 32], adversarial edge-aware models[35], and point annotations.
GAN Based Colorization GAN image colorization models combine discriminator and generator networks to produce realistic, aesthetically pleasing colorized photographs. These models use semantic information[13], CapsNet[34], GAN encoders[14], and other techniques[17, 16] to improve colorization results. Examples include creating ethnic costumes, using GAN encoders for colorization, and using GAN for colorizing medical images[18, 37, 15, 19]. Techniques like transfer learning and deep convolution GAN have been developed for various applications.
Feature Balancing for Colorization Zhang et al.[30] proposed an automatic colorization using CNN, classifying intensity into predetermined color levels and assigning corresponding colors based on classified class levels. An et al.[23] used a VGG-16 CNN model and color rebalancing technique to solve feature imbalance problems. Larsson et al.[20] used unbalanced loss of classification, and Gain et al.[22] proposed a deep localized network for image colorization.

III Color Classified Colorization

III-A Color Space

Conventional RGB is the most commonly used color space, consisting of Red, Green, and Blue. However, its inability to distinguish between color and content information renders it inappropriate for color manipulation tasks involving colorization. CIE LAB[38] is a suitable choice, as it separates color information from context information, allowing for manipulation while keeping context information unchanged. In La*b* (LAB) space, L denotes the brightness or luminosity of the picture, with intensities falling between $[0,100]$ . As $L$ increases, colors become brighter. The a* and b* channels correspondingly represent the image’s proportion of red-green and yellow-blue tones, with red-yellow represented by a positive value and green-blue by a significant negative value, often falling between $[-128,127]$ .

III-B Problem Definition

The colorization problem is considered to predict color channels from a given gray channel. The Lightness (L) channel of La*b* color space can be mapped into the gray channel(intensity) and vice versa[38]. Furthermore, RGB can be mapped into LAB and vice versa. The task can be defined as follows in Eq. 1, 2, 3, and 4.

\mathcal{X}_{ab}=f(\mathcal{X}_{L})

(1)

Distance_{min}(\mathcal{Y}_{ab},\mathcal{X}_{ab})

(2)

\mathcal{X}_{Lab}=concat(\mathcal{X}_{L},\mathcal{X}_{ab})

(3)

\mathcal{X}_{L}\in\mathbb{R}^{H\times W\times 1},\quad\mathcal{X}_{ab}\in\mathbb{R}^{H\times W\times 2},\quad\mathcal{Y}_{ab}\in\mathbb{R}^{H\times W\times 2}

(4)

where $\mathcal{X}_{L}$ is the lightness channel, $\mathcal{X}_{ab}$ is the predicted color channel, $\mathcal{X}_{Lab}$ is the predicted color image, $\mathcal{Y}_{ab}$ is the ground truth color channel, $f(.)$ is the mapping function achieved by deep learning, $Distance_{min}(.)$ is the objective function (any loss function may be used) that the optimizer minimizes during learning, $\mathbb{R}$ is the set of real numbers, and $H$ and $W$ are the image height and width.

Theoretically, the values of the a* and b* channels are continuous within [-128, 127]. Therefore, the prediction is considered a regression problem. That’s why the $Distance_{min}(.)$ of Eq. 2 naturally can be either L1 loss or L2 loss or Huber loss or Log-cosh loss or similar regression loss shown in Eq. 5, 6, 7, 8.

L_{1}(\mathcal{Y}_{ab},\mathcal{X}_{ab})=\frac{1}{N}\sum_{N}|\mathcal{Y}_{ab}-\mathcal{X}_{ab}|

(5)

L_{2}(\mathcal{Y}_{ab},\mathcal{X}_{ab})=\frac{1}{N}\sum_{N}(\mathcal{Y}_{ab}-\mathcal{X}_{ab})^{2}

(6)

L_{\delta}=\begin{cases}\frac{1}{N}\sum\limits_{N}\frac{1}{2}(\mathcal{Y}_{ab}-\mathcal{X}_{ab})^{2},&|\mathcal{Y}_{ab}-\mathcal{X}_{ab}|<\delta\\ \frac{1}{N}\sum\limits_{N}\delta\left(|\mathcal{Y}_{ab}-\mathcal{X}_{ab}|-\frac{1}{2}\delta\right),&\text{otherwise}\end{cases}

(7)

LogCosh(\mathcal{Y}_{ab},\mathcal{X}_{ab})=\frac{1}{N}\sum_{N}\log(\cosh(\mathcal{Y}_{ab}-\mathcal{X}_{ab}))

(8)
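The four regression losses of Eq. 5 to 8 can be sketched in a few lines of dependency-free Python. This is only an illustrative sketch: the function names are hypothetical, the inputs are assumed to be flat sequences of a* or b* values, and the Huber condition is assumed to be applied per element (the usual convention) before averaging.

```python
import math


def l1_loss(y, x):
    """Mean absolute error (Eq. 5) between ground truth y and prediction x."""
    return sum(abs(yi - xi) for yi, xi in zip(y, x)) / len(y)


def l2_loss(y, x):
    """Mean squared error (Eq. 6)."""
    return sum((yi - xi) ** 2 for yi, xi in zip(y, x)) / len(y)


def huber_loss(y, x, delta=1.0):
    """Huber loss (Eq. 7): quadratic for small residuals, linear otherwise."""
    total = 0.0
    for yi, xi in zip(y, x):
        r = abs(yi - xi)
        total += 0.5 * r * r if r < delta else delta * (r - 0.5 * delta)
    return total / len(y)


def log_cosh_loss(y, x):
    """Log-cosh loss (Eq. 8)."""
    return sum(math.log(math.cosh(yi - xi)) for yi, xi in zip(y, x)) / len(y)


# Two equally plausible a* values (+80 and -80) for the same gray input:
# the L2 loss prefers the desaturated prediction 0 over either true color.
print(l2_loss([80, -80], [0, 0]))    # 6400.0
print(l2_loss([80, -80], [80, 80]))  # 12800.0
```

The last two lines give a small numeric instance of the averaging effect: when the same input has several plausible colors, a squared-error objective rewards the desaturated midpoint.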

Background colors like clouds, soil, pavement, and walls dominate real-time images, leading to an imbalanced distribution of features. Handling feature imbalance is crucial because the smaller subsets of features are the features of interest for the learning task. The ambiguity and multimodality of the colorization problem make the above loss functions vulnerable: the prediction that minimizes such a loss lies close to the mean of the plausible colors, and this averaging effect favors the color values that predominantly cover the ground truth image. In an imbalanced feature distribution, the training process is therefore biased towards larger feature subsets. Because the distribution of a*b* values is skewed towards desaturated values, the colors of small objects disappear from the output of the resulting models.

III-C Solution Approach

Continuous Color Range to Discrete Color Classes The a* and b* color channels are continuous within the $[-128,127]$ range. Each a*b* pair with a lightness value $L$ forms an RGB color pixel. We can get an a*b* pair from the a*b* color space, a 2-D space where a* is one direction and b* is the other. For a fixed $L$ , a small change in the a*b* pair has no psychovisual effect, because human perception of the information in an image normally does not involve quantitative analysis of every pixel value. Colorization is a regression problem where the regression model predicts the continuous quantities of a* and b* for a given $L$ . Taking advantage of this psychovisual property, the colorization problem can be represented as a classification problem where the learning model predicts a discrete class level for an a*b* pair. To formulate the problem, the a*b* color space is divided into bins of a fixed grid size, and each bin is assigned a discrete class level. The formula is given in Eq. 9.

\mathcal{C}_{i}=\Big\lfloor\frac{b^{*}_{i}+\beta}{\alpha}\Big\rfloor\cdot\Delta+\Big\lfloor\frac{a^{*}_{i}+\beta}{\alpha}\Big\rfloor,\forall i\in N

(9)

where a* and b* are the continuous color channels, $\mathcal{C}$ is the discrete color class, $\lfloor\cdot\rfloor$ rounds down to the nearest integer so that every value inside a bin receives the same class, $\alpha^{2}$ is the area of a bin, $\beta$ is a shifting constant that shifts a*b* color values into the positive quadrant, $\Delta$ is the number of grids in each a* or b* color channel, and $N$ is the total number of pixels.
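As an illustration of Eq. 9, the following minimal sketch maps an a*b* pair to a class index. It assumes that each quotient is rounded down to an integer bin index and uses the values $\alpha=10$ , $\beta=100$ , $\Delta=20$ adopted later in the paper; the function name `ab_to_class` and the clamping of out-of-range values to the border bins are assumptions made for this example only.

```python
def ab_to_class(a, b, alpha=10, beta=100, delta=20):
    """Map a continuous (a*, b*) pair to a discrete color class (Eq. 9).

    alpha: bin width, beta: shift into the positive quadrant,
    delta: number of bins along each axis.
    """
    # Floor division yields the integer bin index along each axis.
    col = int((a + beta) // alpha)
    row = int((b + beta) // alpha)
    # Assumption: values outside [-beta, beta) are clamped to the border bins.
    col = min(max(col, 0), delta - 1)
    row = min(max(row, 0), delta - 1)
    return row * delta + col


print(ab_to_class(-100, -100))  # 0   (first bin)
print(ab_to_class(0, 0))        # 210
print(ab_to_class(99, 99))      # 399 (last bin)
```

With these settings the a* index varies fastest, so classes 0 to 19 cover the lowest b* row of the $20\times 20$ grid.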
Color Class to Visual Color Mapping To generate a color image, we need to extract a*b* pairs from the color classes $\mathcal{C}$ predicted by the learning model. Each bin is assigned a fixed color class level $\mathcal{C}$ determined by $a^{*}$ and $b^{*}$ . The formulas are given in Eq. 10 and 11, which invert Eq. 9.

a^{*\prime}=[(\mathcal{C}_{i}\bmod\Delta)\cdot\alpha]-\beta+\frac{\alpha}{2},\forall i\in N

(10)

b^{*\prime}=\Big[\Big\lfloor\frac{\mathcal{C}_{i}}{\Delta}\Big\rfloor\cdot\alpha\Big]-\beta+\frac{\alpha}{2},\forall i\in N

(11)

According to the above equations, the maximum loss for each a* or b* value is $\frac{\alpha}{2}-1$ . A higher value of $\alpha$ reduces the number of classes but makes the representation lossy, as a large continuous range is converted to a single class; on the other hand, a problem with fewer classes is easier to handle. A lower value of $\alpha$ increases the number of classes, and in the colorization problem more classes make the prediction less precise. It is therefore important to adjust the number of class levels for the a*b* color space so that the modified a*b* values can still describe the image’s color nature.
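The inverse mapping of Eq. 10 and 11 can be sketched in the same way. The snippet below decodes a class index to the centre of its bin and checks that encoding the decoded pair returns the original class. The function names are hypothetical, the grid width is taken to be $\Delta$ as in Eq. 9, and the encoder repeats the floor-based reading of Eq. 9, which is an assumption.

```python
def ab_to_class(a, b, alpha=10, beta=100, delta=20):
    """Eq. 9 with floored bin indices (assumed): (a*, b*) -> class."""
    return int((b + beta) // alpha) * delta + int((a + beta) // alpha)


def class_to_ab(c, alpha=10, beta=100, delta=20):
    """Eq. 10 and 11: class index -> (a*, b*) at the centre of its bin."""
    a = (c % delta) * alpha - beta + alpha / 2
    b = (c // delta) * alpha - beta + alpha / 2
    return a, b


print(class_to_ab(0))    # (-95.0, -95.0)
print(class_to_ab(210))  # (5.0, 5.0)

# Every class decodes to a pair that encodes back to the same class.
print(all(ab_to_class(*class_to_ab(c)) == c for c in range(400)))  # True
```

The round trip shows that the two mappings are mutually consistent at the class level; what is lost is only the position of the original value inside its bin.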
Color Class reduction Based on Practical Appearance

The a* and b* color values are continuous within $[-128,127]$ in the a*b* color channel. But in practice, the range is found within $[-100,99]$ . We first transform the continuous $[-100,99]$ ranged a*b* color channel to a single plane of 400 color classes by taking $\alpha=10$ , $\beta=100$ and $\Delta=20$ in Equation 9. A 2D grid of bins as $20\times 20$ single plane array is then formed where horizontal axes indicate a* and vertical axes indicate b* color information. Each coordinate is assigned a class value. The class matrix is shown in Fig. 4. Fig. 5 shows the proposed color class to visual color. Fig. 5 shows that the whole image is a smooth representation of different colors. The colors of the nearest bins or blocks are almost similar. Color changes gradually, block by block.
We analyze the 400 color classes using the Place365 Validation dataset. We extracted the classes of 35040 images. The images were downsampled to $56\times 56$ to reduce the number of class samples, resulting in 109885440 class samples with 400 class levels. Each color pixel represents a color class sample with a specific class level. A class level is kept for training only if it has a minimum of 500 (0.000455%) class samples. Class samples whose class level has fewer than 500 samples are mapped to the nearest class level that is present, using fixed-centroid $k$ -means clustering as shown in Eq. 12. The final color bin contains 215 of the 400 color classes, each with more than 500 pixels, as shown in Fig. 6, with their visual colors in Fig. 7. Class optimization is a major issue for the colorization model: fewer classes may make the model less error-prone but may leave some visual colors outside the bin. To keep rare color values in the predicted distribution, these visuals must be active in the training process.

kmeans(\mathcal{C},\mu)=\operatorname{argmin}\sum_{i=1}^{k}\sum_{c\in\mathcal{C}}||c-\mu_{i}||^{2}

(12)

where $\mathcal{C}$ is the input color class vector, $\mu$ is the approved color classes for training, and $k$ is the number of color classes (215). We can define $\mu$ as the fixed value centroid. The iteration will happen a single time, and the centroid value will be unchanged.
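The class reduction step can be sketched as a frequency filter followed by a single nearest-centroid assignment. This is only a sketch under stated assumptions: the function names are hypothetical, the distance is measured between bin coordinates on the $20\times 20$ grid (the text does not state the space in which $||c-\mu_{i}||$ is computed), a class is kept when its count is at least the threshold, and ties go to the lowest class index.

```python
from collections import Counter


def select_frequent_classes(class_samples, min_count=500):
    """Return the sorted class levels that appear at least min_count times."""
    counts = Counter(class_samples)
    return sorted(c for c, n in counts.items() if n >= min_count)


def nearest_kept_class(c, kept, delta=20):
    """Map class c to the closest kept class (one assignment step, Eq. 12).

    Assumption: distance is squared Euclidean distance between the
    (row, col) bin coordinates of the classes on the delta x delta grid.
    """
    row, col = divmod(c, delta)
    return min(
        kept,
        key=lambda k: (k // delta - row) ** 2 + (k % delta - col) ** 2,
    )


# Toy example with a threshold of 2 instead of 500.
samples = [0, 0, 0, 21, 21, 21, 2]
kept = select_frequent_classes(samples, min_count=2)
print(kept)                         # [0, 21]
print(nearest_kept_class(2, kept))  # 21
```

Because the centroids are fixed to the kept classes, no centroid update is needed, which matches the single iteration described above.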

Network Architecture We build our model based on an encoder-decoder architecture. We use DenseNet[54] for the encoder part of our feature extractor. The DenseNet is a high-level feature extractor suitable for good color value generation. For the decoder part, we use conventional CNN. The Network architecture of our proposed method is shown in Fig. 8.

• Feature Extraction DenseNet’s robust connections minimize gradient vanishing and semantic information loss during feature extraction. It concatenates output from each layer, adapts to grayscale input by changing the first convolutional layer, and discards the final linear layer to build a $\frac{H}{32}\times\frac{W}{32}\times 1024$ feature representation.
• Colorization Network The network employs several convolutional and up-sampling layers after receiving an input of a $\frac{H}{32}\times\frac{W}{32}\times 1024$ feature representation. We employ the basic nearest-neighbor method for up-sampling. The network outputs a $56\times 56\times 215$ color class distribution.

Loss Calculation Colorization is generally considered a regression problem as the color values are continuous. But we transform the continuous color values into the discrete color classes. So, we consider the problem a classification problem and use cross-entropy loss instead of MSE or other regression loss. The loss function is shown in Eq. 13.

Loss_{CE}=-\sum_{H,W}\mathcal{W}_{c}\sum_{\mathcal{C}}\mathcal{K}\cdot\log(\overline{\mathcal{K}})

(13)

where $H$ and $W$ are the height and width of the output distribution, $\mathcal{K}$ is the true color class, and $\overline{\mathcal{K}}$ is the estimated color class. $\mathcal{W}_{c}$ is the weight vector of the color classes, defined in Eq. 14.

\mathcal{W}_{c}=\frac{1}{n_{\mathcal{C}}},\forall c\in\mathcal{C}

(14)

where $n_{\mathcal{C}}=215$ is the number of color classes.
Class Confusion Based Weights Adjustment In realistic images, not all color classes are represented equally. Grayish visual color classes are found in a much larger proportion than bright color classes due to the large background areas. In the categorical cross-entropy loss, each true class gets a weight of $\frac{1}{n_{\mathcal{C}}}$ during loss calculation, as shown in Eq. 14. As the minor color classes have far smaller counts, their gradients gradually disappear iteration by iteration. To keep the rarely appearing color classes, we give them higher weights than the frequently appearing color classes. However, this process increases the global loss. Therefore, the weights must be traded off to ensure plausible colors and a minimum loss. To trade off the weights, we propose a new formula, given in Eq. 15.

\mathcal{W}_{new}=\Bigg(\frac{\underset{c\in\mathcal{C}}{\max}(N_{c})}{N_{c}\cdot\Upsilon+\underset{c\in\mathcal{C}}{\max}(N_{c})\cdot\Phi}\Bigg),\forall c\in\mathcal{C}

(15)

where $\mathcal{C}$ is the set of color classes of a particular batch, ${\max}(N_{c})$ is the maximum appearance count of a class, $N_{c}$ is the appearance count of class $c$ , $\mathcal{W}_{new}$ is the new weight matrix of the particular batch, and $\Upsilon$ and $\Phi$ are the trade-off factors, where $\Upsilon$ ranges over (0,1] and $\Phi=\frac{1}{n_{C}}$ .
We initially normalize the weights by dividing the count of the most frequent class in a batch by the count of each of the 215 classes individually, so that the weight of the most frequent class is 1 and the others are scaled up proportionally. However, this approach leads to a significant increase in the weight of classes appearing very infrequently. To strike a balance, we introduce a trade-off mechanism: the individual class count ( $N_{c}$ ) is multiplied by a trade-off factor ( $\Upsilon$ ), and the term ${\max}(N_{c})\cdot\Phi$ is added to the denominator. This supplementary term limits the weight of rarely occurring classes, providing a more balanced class weight determination. The modified loss function is shown in Eq. 16.

Loss_{CBCE}=-\sum_{H,W}\mathcal{W}_{new}\sum_{C}\mathcal{K}\cdot\log(\overline{\mathcal{K}})

(16)
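The batch-wise weights of Eq. 15 and the weighted loss of Eq. 16 can be sketched as follows. The sketch assumes that the true class of each pixel is one-hot, so the inner sum over classes reduces to the log-probability of the true class, and that $\mathcal{W}_{new}$ is looked up by that true class. The function names and the flat list-of-pixels input format are hypothetical, and the batch is assumed to be non-empty.

```python
import math
from collections import Counter


def class_weights(batch_classes, n_classes=215, upsilon=0.5):
    """Per-class weights of Eq. 15 from the true classes of one batch."""
    phi = 1.0 / n_classes                # Phi = 1 / n_C
    counts = Counter(batch_classes)      # N_c for every class in the batch
    n_max = max(counts.values())         # max_c N_c
    return [
        n_max / (counts.get(c, 0) * upsilon + n_max * phi)
        for c in range(n_classes)
    ]


def weighted_cross_entropy(true_classes, probs, weights):
    """Eq. 16 for one-hot targets: -sum_pixels w[true] * log(p[true])."""
    return -sum(
        weights[k] * math.log(p[k]) for k, p in zip(true_classes, probs)
    )


# Toy batch with 4 classes: class 0 dominates, class 1 is rare.
w = class_weights([0] * 8 + [1] * 2, n_classes=4, upsilon=0.5)
print(w[0] < w[1] < w[2])  # True: rarer classes get larger weights
print(w[2])                # 4.0, the cap 1 / Phi for an absent class
```

One consequence of the added term is visible in the toy output: a class that does not appear in the batch receives the largest possible weight, $1/\Phi=n_{C}$ , rather than an unbounded one.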

Color Class Estimation The network outputs $H\times W\times C\times batch$ tensor. Using a softmax probability distribution, we extract $H\times W\times 1\times batch$ class representation.

\overline{\mathcal{K}}=\sigma(\mathcal{K})=\frac{e^{\mathcal{K}_{i}}}{\sum_{j=1}^{k}e^{\mathcal{K}_{j}}}

(17)
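A minimal sketch of the class estimation for a single pixel is given below. The softmax follows Eq. 17; selecting the most probable class with an argmax is an assumption, since the text only states that a single class per pixel is extracted from the softmax distribution. The function names are illustrative.

```python
import math


def softmax(logits):
    """Eq. 17, computed with the usual max-shift for numerical stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(logits):
    """Assumed decoding rule: index of the most probable class."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


print(softmax([0.0, 0.0]))              # [0.5, 0.5]
print(predict_class([0.1, 2.0, -1.0]))  # 1
```

Applying `predict_class` to every pixel of the $H\times W\times C$ output yields the $H\times W\times 1$ class map, which can then be decoded to a*b* values.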

Algorithm 1 SAM Empowered Object-Selective Color Harmonization

Input: gray image, predicted $a^{*}b^{*}$
Output: edge-harmonized $a^{*}b^{*}$

Extract segments ( $\mathcal{S}$ ) from the gray image using SAM
for each segment $s\in\mathcal{S}$ do
    if the number of pixels in $s>\Psi$ then
        Extract $\mathcal{S}^{a}$ from $a^{*}$ using coordinates of $s$
        Calculate mode value ( $\mathcal{M}_{a}$ ) of $\mathcal{S}^{a}$
        for each pixel ( $\mathcal{P}$ ) in $\mathcal{S}^{a}$ do
            if $|\mathcal{P}-\mathcal{M}_{a}|>\delta_{a}$ then
                Replace the value of $\mathcal{P}$ with $\mathcal{M}_{a}$
            end if
        end for
        Extract $\mathcal{S}^{b}$ from $b^{*}$ using coordinates of $s$
        Calculate mode value ( $\mathcal{M}_{b}$ ) of $\mathcal{S}^{b}$
        for each pixel ( $\mathcal{P}$ ) in $\mathcal{S}^{b}$ do
            if $|\mathcal{P}-\mathcal{M}_{b}|>\delta_{b}$ then
                Replace the value of $\mathcal{P}$ with $\mathcal{M}_{b}$
            end if
        end for
    end if
end for
return edge-harmonized $a^{*}b^{*}$
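The per-channel loop of Algorithm 1 can be sketched in plain Python as below. The sketch does not call SAM: segments are assumed to be supplied already, each as a list of (row, column) coordinates, which is a hypothetical input format chosen for illustration. The function name is also illustrative, and the defaults $\Psi=500$ and $\delta=8$ follow the hyper-parameter values reported in the paper.

```python
from collections import Counter


def harmonize_channel(channel, segments, psi=500, delta=8):
    """Apply the inner loops of Algorithm 1 to one channel (a* or b*).

    channel:  2-D list of color values.
    segments: list of segments, each a list of (row, col) coordinates
              (assumed to come from a segmentation model such as SAM).
    """
    out = [row[:] for row in channel]    # leave the input untouched
    for seg in segments:
        if len(seg) <= psi:              # skip segments that are too small
            continue
        values = [channel[r][c] for r, c in seg]
        mode = Counter(values).most_common(1)[0][0]
        for r, c in seg:
            # Pixels far from the segment's dominant value are treated
            # as edge noise and replaced by the mode.
            if abs(channel[r][c] - mode) > delta:
                out[r][c] = mode
    return out


a_star = [[5, 5, 5], [5, 45, 9]]
whole_image = [(r, c) for r in range(2) for c in range(3)]
print(harmonize_channel(a_star, [whole_image], psi=3))
# [[5, 5, 5], [5, 5, 9]]
```

Running the function once on a* and once on b*, with thresholds $\delta_{a}$ and $\delta_{b}$ respectively, gives the harmonized a*b* pair. The mode is well defined here because decoded a*b* values are bin centres and therefore repeat within a segment.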

III-D Chromatic Diversity

We propose a novel color image evaluation metric named Chromatic Number Ratio (CNR). The CNR quantifies the richness of color classes within the generated images compared to the ground truth images. It offers a comprehensive measure of the spectrum of colors in the generated images, enhancing our understanding of color diversity. The metric is shown in Eq. 18.

CNR=\frac{\sum\limits_{i=0}^{m-1}\sum\limits_{j=0}^{n-1}\left(1-\sum\limits_{k=0}^{i-1}\sum\limits_{l=0}^{n-1}[\mathcal{P}_{i,j}=\mathcal{P}_{k,l}]\right)}{\sum\limits_{i=0}^{m-1}\sum\limits_{j=0}^{n-1}\left(1-\sum\limits_{k=0}^{i-1}\sum\limits_{l=0}^{n-1}[\mathcal{G}_{i,j}=\mathcal{G}_{k,l}]\right)}

(18)

where $\mathcal{P}_{i,j}$ and $\mathcal{G}_{i,j}$ are the color class values at row $i$ and column $j$ of the generated color image $\mathcal{P}$ and the ground truth image $\mathcal{G}$ , and $m$ and $n$ are the image’s dimensions in the color class space. The outer summation iterates through all rows ( $i$ ) and columns ( $j$ ) of the image in color class space. The inner summation compares each pixel ( $\mathcal{P}_{i,j}$ / $\mathcal{G}_{i,j}$ ) with all previous pixels in the image to check for uniqueness. $[\mathcal{P}_{i,j}=\mathcal{P}_{k,l}]$ and $[\mathcal{G}_{i,j}=\mathcal{G}_{k,l}]$ are indicator functions that return 1 if the condition is true (the pixel values are equal) and 0 if it is false.

Through the CNR, we have tried to show how many different color components are picked in the color images. We initially set color classes to 400. But in the training set, majorly appeared color classes dominate the minor classes. That’s why the model overlooks the minor class in the prediction. We aim to ensure the minor classes are also in the predicted distribution while maintaining the other measurement criteria satisfactory. The CNR value 1 indicates the number of different color classes is the same. The predicted distribution can also pick more minor color classes making the visual more plausible and the CNR value greater than 1.
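Read as a ratio of distinct color classes, the CNR can be sketched in a few lines. This sketch follows the verbal description (the number of different color classes in the generated image divided by the number in the ground truth image) rather than evaluating the nested summations literally; the function name and the 2-D list input format are assumptions.

```python
def chromatic_number_ratio(pred_classes, gt_classes):
    """Distinct color classes in the prediction over those in the ground truth.

    Both arguments are 2-D lists of color class indices.
    """
    n_pred = len({c for row in pred_classes for c in row})
    n_gt = len({c for row in gt_classes for c in row})
    return n_pred / n_gt


pred = [[1, 1], [2, 3]]   # 3 distinct classes
truth = [[1, 2], [3, 4]]  # 4 distinct classes
print(chromatic_number_ratio(pred, truth))   # 0.75
print(chromatic_number_ratio(truth, truth))  # 1.0
```

A value below 1 means the prediction uses fewer distinct classes than the ground truth, and a value above 1 means it uses more.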

Figure 9: Some results of our proposed method compared to other state-of-the-art methods. Panels: Gray, DeOldify[43], Iizuka[24], Larsson[20], CIC[30], Zhang[31], Su[25], Gain[22], DD[55], CCC, and Ground Truth.

IV SAM Empowered Object-Selective Color Harmonization

As we force regularization of the minor classes, a little noise sometimes appears at the object’s edges. To make the edges more polished, we propose SAM-empowered object-selective color harmonization. SAM is a segmentation model with zero-shot generalization to unfamiliar objects and images without additional training[57]. Our proposed algorithm is shown in Algorithm 1.

V Experiment

Datasets We train the proposed model using the Place365 Train dataset[40]. Our model is developed in a self-supervised manner: we provide no external labels for our data during training and instead generate the model’s supervisory signals or labels from the input data. For testing, we use multiple datasets. We use the Place365 Test dataset[40]; the dataset has 328.5k images with 365 scene categories. In addition, we randomly take 50 images from the ImageNet1k Validation[47], Oxford 102 Flower[48], CelebFaces (CelebA)[49], and COCO[56] datasets.

TABLE II: Different Hyper-parameter and trade-off factor values for CCC

H. P.	$\alpha$	$\beta$	$\Delta$	$\Upsilon$	$\Phi$	$\delta_{a}$	$\delta_{b}$	$\Psi$
Value	10	100	20	0.5	.0046	8	8	500

Implementation Set Up The experiments were conducted on a workstation with an NVIDIA GeForce RTX 2080 Ti graphics processing unit (GPU). The network was constructed using PyTorch[41] version 1.28 in Python version 3.10.9. During training, the batch size is set to 64, and the Adam optimizer is employed with a learning rate of $1\times 10^{-3}$ and momentum parameters $\beta_{1}=0.5$ and $\beta_{2}=0.999$ to update the network parameters. Each ground truth a*b* tensor was resized to $56\times 56$ to reduce the complexity of the loss calculations. We explored a range of hyper-parameters and trade-off factors for our proposed model, with their values determined through experimental analysis. The corresponding values are detailed in Tab. II.

TABLE III: Quantitative comparison of our proposed method with the baseline and SOTA methods using the visuals of Fig. 9.

	MSE $\downarrow$	PSNR $\uparrow$	SSIM $\uparrow$	LPIPS $\downarrow$	UIQI $\uparrow$	FID $\downarrow$	CNR $\uparrow$
DeOldify	0.0064	21.71	0.898	0.183	0.872	0.395	0.615
Iizuka	0.0042	22.09	0.885	0.171	0.871	0.361	0.376
Larsson	0.0052	21.51	0.878	0.197	0.879	0.340	0.585
CIC	0.0061	20.82	0.864	0.194	0.864	0.319	0.619
Zhang	0.0054	21.78	0.878	0.162	0.862	0.303	0.526
Su	0.0071	20.94	0.854	0.233	0.865	0.300	0.648
Gain	0.0059	21.01	0.871	0.210	0.871	0.315	0.691
DD	0.0051	21.68	0.878	0.165	0.869	0.276	0.798
CCC	0.0055	21.46	0.878	0.207	0.872	0.288	0.884

Evaluation Metrics We use mean squared error (MSE[39]), peak signal-to-noise ratio (PSNR[42]), structural similarity index measure (SSIM[42]), learned perceptual image patch similarity (LPIPS[45]), universal image quality index (UIQI[44]), Fréchet inception distance (FID[46]), and our proposed Chromatic Number Ratio (CNR) to compare our proposed model quantitatively with the baselines and state-of-the-art (SOTA) colorization methods.
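For reference, the two regression criteria can be sketched as below. The sketch uses the standard definition of PSNR from MSE and assumes pixel values normalized to $[0,1]$ , so the peak value is 1; the normalization and the function names are assumptions made for this example, not the exact evaluation code.

```python
import math


def mse(y, x):
    """Mean squared error between ground truth y and prediction x."""
    return sum((yi - xi) ** 2 for yi, xi in zip(y, x)) / len(y)


def psnr(y, x, max_value=1.0):
    """Peak signal-to-noise ratio in dB for values in [0, max_value]."""
    err = mse(y, x)
    if err == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / err)


# A uniform error of 0.1 gives an MSE of about 0.01, i.e. about 20 dB.
print(round(psnr([1.0, 1.0], [0.9, 0.9]), 2))  # 20.0
```

Lower MSE and higher PSNR both indicate a closer match to the ground truth, which is why they move in opposite directions in the tables.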
Comparison with Baselines and SOTA: We compare our model with eight baselines and SOTA methods: DeOldify[43], Iizuka[24], Larsson[20], CIC[30], Zhang[31], Su[25], Gain[22], and DD[55]. In Fig. 9, we compare eight images visually against those methods, together with the gray input and the ground truth. The figure shows that our proposed CCC method visually outperforms the baselines and SOTA methods and effectively colors the minor objects along with the major ones. In Tab. III, we quantitatively evaluate the images of Fig. 9. From the visual and quantitative analysis, we find that DeOldify has the best SSIM, Iizuka has the best MSE and PSNR, Larsson has the best UIQI, Zhang has the best LPIPS, DD has the best FID, and CCC has the best CNR. Visually, CCC has a more plausible combination of major and minor colors than the others. This suggests that the MSE, PSNR, SSIM, LPIPS, and FID criteria are not completely suitable for ensuring the presence of minor colors. Ensuring the CNR first and then maintaining those criteria may be the best way to obtain both major and minor colors in the generated images.

In Tab. IV, we evaluate our proposed model against seven baselines and SOTA methods across three datasets using regression criteria. The table shows that our method performs well in all datasets and outperforms the others in the ‘Oxford Flower’ dataset. This is because ‘ADE’ predominantly features natural images and ‘Celeba’ focuses on human faces, both typically presenting a more limited range of color combinations. In contrast, the ‘Oxford Flowers’ dataset is characterized by its diverse array of flower species, each exhibiting a unique and varied color palette. This diversity in coloration provides a more complex and challenging environment for colorization, highlighting the efficacy of our method in handling a wide range of colors and complexities.

TABLE IV: Regression loss comparison of our proposed method with the baseline and SOTA methods using multiple datasets.

	ADE		Celeba		Ox. Flower
	MSE $\downarrow$	PSNR $\uparrow$	MSE $\downarrow$	PSNR $\uparrow$	MSE $\downarrow$	PSNR $\uparrow$
DeOldify	.0043	25.66	.0045	26.06	.0295	16.46
Iizuka	.0035	26.22	.0045	26.00	.0211	18.01
Larsson	.0037	25.94	.0058	26.66	.0245	16.85
CIC	.0053	24.33	.0056	24.79	.0261	17.16
Zhang	.0036	26.07	.0041	26.78	.0295	16.80
Su	.0038	25.37	.0046	25.70	.0265	16.81
DD	.0039	25.22	.0066	25.70	.0273	16.88
CCC	.0058	24.03	.0061	24.12	.0201	18.06

In Tab. V, we evaluate our proposed model against those methods across three datasets using similarity measurement criteria. The table shows that our method performs well on all datasets while maintaining the minor color structure. It is usually easier to achieve high similarity by ignoring the minor color features and focusing only on the major ones. Our proposed method, however, maintains satisfactory similarity while preserving the minor color features.

TABLE V: Structural Similarity comparison of our proposed method with the baseline and SOTA methods using multiple datasets.

	ADE		Celeba		Ox. Flower
	SSIM $\uparrow$	UIQI $\uparrow$	SSIM $\uparrow$	UIQI $\uparrow$	SSIM $\uparrow$	UIQI $\uparrow$
DeOldify	0.96	0.96	0.94	0.94	0.82	0.81
Iizuka	0.95	0.96	0.95	0.94	0.80	0.82
Larsson	0.95	0.96	0.94	0.93	0.82	0.83
CIC	0.95	0.95	0.93	0.92	0.81	0.80
Zhang	0.96	0.96	0.95	0.93	0.81	0.81
Su	0.92	0.96	0.93	0.93	0.77	0.81
DD	0.96	0.96	0.93	0.92	0.81	0.80
CCC	0.91	0.95	0.92	0.92	0.80	0.80
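
For reference, the universal image quality index of Wang and Bovik [44] can be sketched as below. This is a simplified, global version under the assumption that both images are flattened to equal-length sequences; the original index is computed over sliding windows and averaged. The function name `uiqi` is hypothetical, and this is not the evaluation code used for Tab. V.

```python
def uiqi(x, y):
    """Global universal image quality index between two flattened images.

    Q = 4 * cov(x, y) * mean(x) * mean(y)
        / ((var(x) + var(y)) * (mean(x)^2 + mean(y)^2))
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Unbiased sample variances and covariance.
    var_x = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    denom = (var_x + var_y) * (mean_x ** 2 + mean_y ** 2)
    if denom == 0:
        raise ValueError("index is undefined for constant all-zero inputs")
    return 4.0 * cov_xy * mean_x * mean_y / denom


# Toy example: the second signal is a scaled copy of the first.
print(uiqi([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

The index equals 1 only when the two inputs are identical, and it decreases with loss of correlation, luminance shift, or contrast change.
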

In Tab. VI, we evaluate our proposed model against those methods across three datasets using LPIPS and FID criteria. The table shows that our method performs well on all datasets and achieves the best LPIPS and FID on the ‘Oxford Flower’ dataset, which has the highest color diversity compared to ‘ADE’ and ‘Celeba’.

TABLE VI: Learned perceptual image patch similarity (LPIPS) and Fréchet inception distance (FID) comparison of our proposed method with the baseline and SOTA methods using multiple datasets.

	ADE		Celeba		Ox. Flower
	LPIPS $\downarrow$	FID $\downarrow$	LPIPS $\downarrow$	FID $\downarrow$	LPIPS $\downarrow$	FID $\downarrow$
DeOldify	0.15	0.48	0.13	0.43	0.35	3.85
Iizuka	0.16	1.05	0.16	0.45	0.31	3.57
Larsson	0.16	0.62	0.14	0.37	0.34	2.42
CIC	0.18	1.31	0.17	0.58	0.35	4.20
Zhang	0.14	1.12	0.13	0.49	0.34	4.72
Su	0.21	1.24	0.18	0.28	0.41	4.51
DD	0.16	0.30	0.16	0.18	0.32	1.54
CCC	0.15	0.90	0.13	0.43	0.30	1.51

In Tab. VII, we evaluate our proposed model against those methods across five datasets using the CNR criterion. The table shows that our method outperforms all other methods on all datasets. The main objective of our proposed model is to ensure the presence of minor colors along with the major ones. Preserving minor colors makes the colorized images more diverse, because an image typically contains one or two major colors together with many minor colors.

TABLE VII: CNR comparison of our proposed method with the baseline and SOTA methods using multiple datasets.

	ADE	Celeba	COCO	Ox. Flower	ImageNet
DeOldify	0.77	0.62	1.43	0.69	0.61
Iizuka	0.78	0.51	1.49	0.58	0.49
Larsson	0.77	0.64	0.73	0.73	0.63
CIC	0.81	0.86	1.57	0.67	0.58
Zhang	0.73	0.66	1.05	0.66	0.49
Su	0.77	0.80	2.89	0.66	0.66
DD	1.25	1.07	2.23	0.88	0.94
CCC	1.90	1.33	3.53	0.96	1.13
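
To make the CNR comparison concrete, the sketch below computes a ratio of distinct color classes between a generated image and its ground truth. It rests on two assumptions that should be checked against the definition given earlier in the paper: that CNR is the number of color classes present in the generated image divided by the number present in the ground truth, and that classes can be approximated by a uniform grid over (a, b) chroma values. The helper `color_class`, the `bin_size` parameter, and the grid quantizer are hypothetical stand-ins for the proposed color-to-class formulas.

```python
def color_class(a, b, bin_size=10):
    """Hypothetical quantizer: map an (a, b) chroma pair to a grid cell.

    This uniform grid is an assumption for illustration, not the paper's
    color-to-class transformation.
    """
    return (int(a // bin_size), int(b // bin_size))


def chromatic_number_ratio(generated_ab, truth_ab, bin_size=10):
    """Assumed CNR: distinct classes in the generated image over ground truth."""
    generated_classes = {color_class(a, b, bin_size) for a, b in generated_ab}
    truth_classes = {color_class(a, b, bin_size) for a, b in truth_ab}
    if not truth_classes:
        raise ValueError("ground truth must contain at least one pixel")
    return len(generated_classes) / len(truth_classes)


# Toy example: per-pixel (a, b) pairs, values are illustrative only.
truth = [(5, 5), (5, 6), (15, 5)]
generated = [(5, 5), (15, 5), (25, 5), (-5, 5)]
print(chromatic_number_ratio(generated, truth))
```

Under this reading, a value above 1 means the generated image contains more distinct color classes than its ground truth, and a value below 1 means it contains fewer.
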

VI Conclusion

Automatic colorization of grayscale images with objects of varying colors and sizes is complex due to inter- and intra-object color variations and the limited area occupied by the principal objects. The learning process often favors dominant features, leading to biased models. A weighted function can address this feature imbalance by assigning greater importance to minority features. In this paper, we propose a set of formulas to convert color values into corresponding color classes and vice versa. To achieve optimal performance, we optimize the class levels and establish a trade-off between the weights of major and minor classes, considering both types of classes for accurate class prediction. We also propose SAM-empowered object-selective color harmonization that improves the stability of minor classes. We propose a novel color image evaluation metric called Chromatic Number Ratio (CNR) to quantitatively assess the richness of color components. We evaluated our model against eight baseline and SOTA models using five datasets. Experimental findings show that the proposed model surpasses previous models in terms of visualization and the CNR criterion while maintaining satisfactory performance in the regression criteria (MSE, PSNR), the similarity criteria (SSIM, LPIPS, UIQI), and the generative criterion (FID).

References

[1] Yi-Chin Huang, Yi-Shin Tung, Jun-Cheng Chen, Sung-Wen Wang, and Ja-Ling Wu. An adaptive edge detection based colorization algorithm and its applications. In ICM, 2005.
[2] Anat Levin, Dani Lischinski, and Yair Weiss. Colorization using optimization. In SIGGRAPH. pages 689-694, 2004.
[3] Liron Yatziv and Guillermo Sapiro. Fast image and video colorization using chrominance blending. IEEE Transactions on Image Processing, volume 15, number 5, pages 1120-1129, 2006.
[4] Yingge Qu, Tien-Tsin Wong, Pheng-Ann Heng. Manga colorization. ACM Transactions on Graphics (ToG). volume 25, number 3, pages 1214-1220, 2006.
[5] Qing Luan, Fang Wen, Daniel Cohen-Or, Lin Liang, Ying-Qing Xu, Heung-Yeung Shum. Natural image colorization. In ECRT. pages 309-320, 2007.
[6] Tomihisa Welsh, Michael Ashikhmin, Klaus Mueller. Transferring color to greyscale images. In ACCGIT. pages 277-280, 2002.
[7] Revital Ironi, Daniel Cohen-Or, Dani Lischinski. Colorization by Example. Rendering Techniques, volume 29, pages 201-210, 2005.
[8] Yu-Wing Tai, Jiaya Jia, Chi-Keung Tang. Local color transfer via probabilistic segmentation by expectation-maximization. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 1, pages 747-754, 2005.
[9] Alex Yong-Sang Chia, Shaojie Zhuo, Raj Kumar Gupta, Yu-Wing Tai, Siu-Yeung Cho, Ping Tan, Stephen Lin. Semantic colorization with internet images. ACM Transactions on Graphics (ToG), volume 30, number 6, pages 1-8, 2011.
[10] Xiaopei Liu, Liang Wan, Yingge Qu, Tien-Tsin Wong, Stephen Lin, Chi-Sing Leung, Pheng-Ann Heng. Intrinsic colorization. In SIGGRAPH, pages 1-9, 2008.
[11] Austin Sousa, Rasoul Kabirzadeh, Patrick Blaes. Automatic colorization of grayscale images. Department of Electrical Engineering, Stanford University, 2013.
[12] Mingming He, Dongdong Chen, Jing Liao, Pedro V Sander, Lu Yuan. Deep exemplar-based colorization. ACM Transactions on Graphics (TOG), volume 37, number 4, pages 1-16, 2018.
[13] Di Wu, Jianhou Gan, Juxiang Zhou, Jun Wang, Wei Gao. Fine-grained semantic ethnic costume high-resolution image colorization with conditional GAN. International Journal of Intelligent Systems, volume 37, number 5, pages 2952-2968, 2022.
[14] Yanze Wu, Xintao Wang, Yu Li, Honglun Zhang, Xun Zhao, Ying Shan. Towards vivid and diverse image colorization with generative color prior. In CVPR, pages 14377-14386, 2021.
[15] Min Wu, Xin Jin, Qian Jiang, Shin-jye Lee, Wentao Liang, Guo Lin, Shaowen Yao. Remote sensing image colorization using symmetrical multi-scale DCGAN in YUV color space. The Visual Computer, volume 37, pages 1707-1729, 2021.
[16] Haojie Guo, Zhe Guo, Zhaojun Pan, Xuewen Liu. Bilateral Res-Unet for Image Colorization with Limited Data via GANs. In ICTAI, pages 729-735, 2021.
[17] Hyojin Bahng, Seungjoo Yoo, Wonwoong Cho, David Keetae Park, Ziming Wu, Xiaojuan Ma, Jaegul Choo. Coloring with words: Guiding image colorization through text-based palette generation. In ECCV, pages 431-447, 2018.
[18] Yihuai Liang, Dongho Lee, Yan Li, Byeong-Seok Shin. Unpaired medical image colorization using generative adversarial network. Multimedia Tools and Applications. volume 81, number 19, pages 26669-26683, 2022.
[19] Shengrui Zang, Min Chen, Zhenhua Ai, Jieru Chi, Guowei Yang, Chenglizhao Chen, Teng Yu. Texture-aware gray-scale image colorization using a bistream generative adversarial network with multi scale attention structure. Engineering Applications of Artificial Intelligence, volume 122, pages 106094, 2023.
[20] Gustav Larsson, Michael Maire, Gregory Shakhnarovich. Learning representations for automatic colorization. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 577-593, 2016.
[21] Mrityunjoy Gain, Md Arifur Rahman, Rameswar Debnath, Mrim M Alnfiai, Abdullah Sheikh, Mehedi Masud, Anupam Kumar Bairagi. An Improved Encoder-Decoder CNN with Region-Based Filtering for Vibrant Colorization. Comput. Syst. Sci. Eng., volume 46, number 1, pages 1059-1077, 2023.
[22] Mrityunjoy Gain, Rameswar Debnath. A Novel Unbiased Deep Learning Approach (DL-Net) in Feature Space for Converting Gray to Color Image. IEEE Access. volume 11, pages 78918-78933, 2023.
[23] Jiancheng An, Kpeyiton Koffi Gagnon, Qingnan Shi, Hongwei Xie, Rui Cao. Image colorization with convolutional neural networks. In CISP-BMEI. pages 1-4, 2019.
[24] Satoshi Iizuka, Edgar Simo-Serra, Hiroshi Ishikawa. Let there be color! Joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. ACM Transactions on Graphics (ToG). volume 35, number 4, pages 1-11, 2016.
[25] Jheng-Wei Su, Hung-Kuo Chu, Jia-Bin Huang. Instance-aware image colorization. In CVPR. pages 7968-7977, 2020.
[26] K Ruwani M Fernando, Chris P Tsokos. Dynamically weighted balanced loss: class imbalanced learning and confidence calibration of deep neural networks. IEEE Transactions on Neural Networks and Learning Systems. volume 33, number 7, pages 2940-2951, 2021.
[27] Byron C Wallace, Issa J Dahabreh. Improving class probability estimates for imbalanced data. Knowledge and Information Systems. volume 41, number 1, pages 33-52, 2014.
[28] Ryan Dahl. Automatic colorization. 2016. [Online]. Available: https://tinyclouds.org/colorize.
[29] Federico Baldassarre, Diego González Morín, Lucas Rodés-Guirao. Deep koalarization: Image colorization using CNNs and Inception-ResNet-v2. arXiv preprint arXiv:1712.03400. 2017.
[30] Richard Zhang, Phillip Isola, Alexei A Efros. Colorful image colorization. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14, pages 649-666, 2016.
[31] Richard Zhang, Jun-Yan Zhu, Phillip Isola, Xinyang Geng, Angela S Lin, Tianhe Yu, Alexei A Efros. Real-time user-guided image colorization with learned deep priors. ACM Transactions on Graphics (TOG). volume 36, number 4, pages 1-11, 2017.
[32] Lili Xia, Zhiyong Qu, Jianpeng An, Zhongke Gao. A Weakly Supervised Method with Colorization for Nuclei Segmentation using Point Annotations. IEEE Transactions on Instrumentation and Measurement, 2023.
[33] Mennatullah Hesham, Heba Khaled, Hossam Faheem. Image colorization using Scaled-YOLOv4 detector. International Journal of Intelligent Computing and Information Sciences, volume 21, number 3, pages 107-118, 2021.
[34] Gokhan Ozbulak. Image colorization by capsule networks. In CVPR Workshops, pages 0-0, 2019.
[35] Guangqian Kong, Huan Tian, Xun Duan, Huiyun Long. Adversarial edge-aware image colorization with semantic segmentation. IEEE Access. volume 9, pages 28194-28203, 2021.
[36] Guillaume Charpiat, Matthias Hofmann, Bernhard Schölkopf. Automatic image colorization via multimodal predictions. In Computer Vision–ECCV 2008: 10th European Conference on Computer Vision, Marseille, France, October 12-18, 2008, Proceedings, Part III 10, pages 126-139, 2008.
[37] Sandra Treneska, Eftim Zdravevski, Ivan Miguel Pires, Petre Lameski, Sonja Gievska. Gan-based image colorization for self-supervised visual feature learning. Sensors. volume 22, number 4, pages 1599, 2022.
[38] Alan R. Robertson. The CIE 1976 color-difference formulae. Color Research & Application. volume 2, number 1, pages 7-11, 1977.
[39] Zhou Wang, Alan C. Bovik. Mean squared error: Love it or leave it? A new look at signal fidelity measures. IEEE Signal Processing Magazine, volume 26, number 1, pages 98-117, 2009.
[40] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. volume 40, number 6, pages 1452-1464, 2017.
[41] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, and others. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems. volume 32, 2019.
[42] Alain Hore, Djemel Ziou. Image quality metrics: PSNR vs. SSIM. In ICPR, pages 2366-2369, 2010.
[43] J. Antic. A deep learning based project for colorizing and restoring old images (and video!). 2018. [Online]. Available: https://github.com/jantic/DeOldify.
[44] Zhou Wang, Alan C. Bovik. A universal image quality index. IEEE Signal Processing Letters. volume 9, number 3, pages 81-84, 2002.
[45] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, pages 586-595, 2018.
[46] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, volume 30, 2017.
[47] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR. pages 248-255, 2009.
[48] Maria-Elena Nilsback, Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722-729, 2008.
[49] Ziwei Liu, Ping Luo, Xiaogang Wang, Xiaoou Tang. Deep Learning Face Attributes in the Wild. In ICCV, December 2015.
[50] Jooyeol Yun, Sanghyeon Lee, Minho Park, Jaegul Choo. iColoriT: Towards Propagating Local Hints to the Right Region in Interactive Colorization by Leveraging Vision Transformer. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1787-1796, 2023.
[51] Raj Kumar Gupta, Alex Yong-Sang Chia, Deepu Rajan, Ee Sin Ng, Zhiyong Huang. Image colorization using similar images. In ICM. pages 369-378, 2012.
[52] Aurélie Bugeau, Vinh-Thong Ta, Nicolas Papadakis. Variational exemplar-based image colorization. IEEE Transactions on Image Processing. volume 23, number 1, pages 298-307, 2013.
[53] Tram-Tran Nguyen-Quynh, Soo-Hyung Kim, Nhu-Tai Do. Image colorization using the global scene-context style and pixel-wise semantic segmentation. IEEE Access. volume 8, pages 214098-214114, 2020.
[54] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, Kilian Q Weinberger. Densely connected convolutional networks. In CVPR. pages 4700-4708, 2017.
[55] Xiaoyang Kang, Tao Yang, Wenqi Ouyang, Peiran Ren, Lingzhi Li, Xuansong Xie. DDColor: Towards Photo-Realistic Image Colorization via Dual Decoders. In ICCV. pages 328-338, 2023.
[56] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, pages 740-755, 2014.
[57] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick. Segment Anything. In ICCV, pages 4015-4026, October 2023.