The first stage in the Berkeley-Iowa Naked People Finder identifies images that contain large regions whose color and texture are appropriate for skin. Our tests have shown that small regions of skin (e.g. faces) pass the color and texture tests but fail the skin filter simply because they cover too little of the image. Although our test dataset is heavily biased towards white females, additional images were used to tune the filter so that it also works on other races.
This page summarizes the details of our skin finder. Feel free to adapt it, use it in other applications, and tell us how it works! I'm distributing a precise description rather than code, because much of the code is written in Common Lisp.
The color of human skin is created by a combination of blood (red) and melanin (yellow, brown). Skin colors lie between these two extreme hues and are somewhat, but not extremely, saturated. Because more deeply colored skin is created by adding melanin, saturation increases as skin becomes more yellow. This effect must be taken into account; otherwise the skin filter will mark red and light yellow regions whose color does not resemble skin.
Except for extremely hairy subjects, which are rare, skin has only low-amplitude texture.
In our application, we had no control over the illumination level. Furthermore, intensity varies substantially across a curved object such as a human body part. Therefore, we used only features that do not depend on the light level.
In our database, the illumination is almost always close to white, although the exact light color varies slightly from image to image. Color constancy algorithms cannot be used to remove this variation because the selection of colors in our images is not representative. For example, some images depict naked people on pink backgrounds.
The reflectance of human skin has significant specular components. Specular components desaturate the skin. In the extreme case, highlights are essentially the same color as the illumination: white or perhaps even slightly bluish or greenish. Finally, bright skin regions may saturate the very limited dynamic range of many cameras (8-bit), which also results in desaturation of color values. Our algorithm attempts to fill small desaturated regions but may fail to identify large desaturated regions as skin.
This description assumes that input images are in 8-bit RGB format. Images in formats such as JPEG were converted to RGB using standard image conversion software. Some of our images were captured directly in this format, using a special camera interface. Some framegrabber interfaces can produce images with higher dynamic range by frame averaging. In such cases, adapt the following algorithm by using a different multiplier in your log transformation and by scaling the color saturation and texture amplitude bounds.
The zero-response of the camera is estimated as the smallest value in any of the three color planes, omitting locations within 10 pixels of the image edges. This value is subtracted from each R, G, and B value. This avoids potentially significant desaturation of opponent color values if the zero-response is far from zero. (In applications with physical control of the camera, the zero response should be calibrated by taking an image with the lens cap on.)
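As an illustration only (this is a Python/NumPy sketch, not the Common Lisp implementation; the array name rgb and the clipping at zero are my choices), the zero-response step might look roughly like this:

    import numpy as np

    def subtract_zero_response(rgb, margin=10):
        # Estimate the camera zero-response as the smallest value in any of
        # the three color planes, ignoring pixels within `margin` pixels of
        # the image edges.
        interior = rgb[margin:-margin, margin:-margin, :]
        zero = int(interior.min())
        # Subtract the estimate from every R, G, and B value.
        return np.clip(rgb.astype(np.int32) - zero, 0, 255)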
The RGB values are then transformed into log-opponent values I, Rg, and By as follows:

    L(x) = 105 * log10(x + 1 + n)
    I    = L(G)
    Rg   = L(R) - L(G)
    By   = L(B) - (L(G) + L(R)) / 2
The green channel is used to represent intensity because the red and blue channels from some cameras have poor spatial resolution. The constant 105 simply scales the output of the log function into the range [0,254]. n is a random noise value, generated from a distribution uniform over the range [0,1). The random noise is added to prevent banding artifacts in dark areas of the image. The constant 1 added before the log transformation prevents excessive inflation of color distinctions in very dark regions.
The log transformation makes the Rg and By values, as well as differences between I values (e.g. texture amplitude), independent of illumination level.
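Written out as a sketch (assuming the noise n is drawn independently for every value, and using base-10 logarithms so that the 105 multiplier maps 8-bit values into [0,254]):

    def log_opponent(rgb):
        # L(x) = 105 * log10(x + 1 + n), with n uniform on [0, 1).
        rgb = rgb.astype(np.float64)
        n = np.random.uniform(0.0, 1.0, size=rgb.shape)
        L = 105.0 * np.log10(rgb + 1.0 + n)
        I  = L[..., 1]                                  # green channel stands in for intensity
        Rg = L[..., 0] - L[..., 1]                      # red/green opponent value
        By = L[..., 2] - (L[..., 0] + L[..., 1]) / 2.0  # blue/yellow opponent value
        return I, Rg, By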
The algorithm computes a scaling factor SCALE, which measures the size of the image. SCALE is the sum of the height and the width of the image (in pixels), divided by 320. Spatial constants used in the skin filter are functions of SCALE, so that their output does not depend on the resolution of the input image. The images used in our experiments were approximately 128 by 192 pixels, so SCALE was typically near 1.0.
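In code this is simply (height and width in pixels, using the intensity plane from the sketch above):

    height, width = I.shape
    SCALE = (height + width) / 320.0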
Order statistic filters, including the median filter, were implemented using the multi-ring approximation described in our CVPR paper and in our on-line description of the order-statistic algorithm.
The Rg and By arrays are smoothed with a median filter of radius 2*SCALE, to reduce noise.
To compute texture amplitude, the intensity image is smoothed with a median filter of radius 4*SCALE and the result is subtracted from the original image. The absolute values of these differences are then run through a second median filter of radius 6*SCALE. This computes a version of the MAD scale estimator used in robust statistics.
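A sketch of both smoothing steps, using SciPy's exact median filter over a circular footprint as a stand-in for the fast multi-ring approximation (slower, but the same idea):

    from scipy import ndimage

    def disk(radius):
        # Boolean circular footprint of the given radius, in pixels.
        r = max(1, int(round(radius)))
        y, x = np.ogrid[-r:r + 1, -r:r + 1]
        return x * x + y * y <= r * r

    # Smooth the opponent-color planes with a radius 2*SCALE median filter.
    Rg_s = ndimage.median_filter(Rg, footprint=disk(2 * SCALE))
    By_s = ndimage.median_filter(By, footprint=disk(2 * SCALE))

    # Texture amplitude: a MAD-style estimate of local intensity variation.
    smooth_I = ndimage.median_filter(I, footprint=disk(4 * SCALE))
    texture = ndimage.median_filter(np.abs(I - smooth_I), footprint=disk(6 * SCALE))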
Skin is detected using a two-stage filter. It first marks pixels whose color is very likely to be skin. These preliminary skin regions are then enlarged to include pixels whose color and texture are consistent with skin but might also be consistent with other materials. This allows us to fill small desaturated regions (e.g. highlights) without detecting large amounts of other materials whose color and texture are similar to those of skin.
For convenience, define the hue at a pixel to be atan(Rg,By), where Rg and By are the smoothed values computed as in the previous section. The saturation at the pixel is sqrt(Rg^2 + By^2). Because our representation ignores intensity, yellow and brown regions cannot be distinguished: I will call them both "yellow."
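Continuing the sketch, and assuming the hue bounds used below are in degrees:

    # Hue (degrees) and saturation from the smoothed opponent values.
    hue = np.degrees(np.arctan2(Rg_s, By_s))
    sat = np.hypot(Rg_s, By_s)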
The first (tightly-tuned) stage of the skin filter marks all pixels whose texture amplitude is no larger than 5, and (a) whose hue is between 110 and 150 and whose saturation is between 20 and 60 or (b) whose hue is between 130 and 170 and whose saturation is between 30 and 130. The combination of these two regions approximates a diagonal region in color space, going from low-saturated red towards somewhat saturated yellow.
The two-region approximation is a historical accident. A new implementation should use a polygonal region in color space rather than bounds on hue and saturation. Applications using higher-resolution images should consider imposing a lower bound on skin texture, to distinguish skin from materials such as plastic.
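The first stage, written against the hue, sat, and texture arrays from the sketches above:

    # Tightly-tuned first stage: low texture amplitude plus either of two
    # hue/saturation boxes approximating a diagonal region in color space.
    box_a = (hue >= 110) & (hue <= 150) & (sat >= 20) & (sat <= 60)
    box_b = (hue >= 130) & (hue <= 170) & (sat >= 30) & (sat <= 130)
    skin_seed = (texture <= 5) & (box_a | box_b)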
The skin regions are cleaned up and enlarged slightly, to accommodate possible desaturated regions adjacent to the marked regions. Specifically, a pixel is marked in the final map if at least one eighth of the pixels in a circular neighborhood of radius 24*SCALE pixels are marked in the original map. This is done quickly (though only approximately) by slightly modifying the fast median filter.
Specifically, to enlarge the skin regions, non-skin regions are assigned value zero and skin regions a non-zero value in a temporary array. The 12.5th percentile order statistic is then computed over a neighborhood of radius 24*SCALE. To do this, the first iteration (largest ring) of the multi-ring filter returns the second smallest of its 9 input values. Subsequent iterations (smaller rings) then compute the median of their input values. (If this makes no sense, read the CVPR paper cited above and see our on-line description of the order-statistic algorithm.)
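For a new implementation it is probably simplest to compute the neighborhood fraction directly rather than reproduce the fast order-statistic trick; for example, reusing the disk() helper and imports from the texture sketch:

    # Mark a pixel if at least one eighth of the pixels in a circular
    # neighborhood of radius 24*SCALE are marked in the seed map.
    fp = disk(24 * SCALE).astype(np.float64)
    frac = ndimage.convolve(skin_seed.astype(np.float64), fp / fp.sum(), mode='constant')
    skin_map = frac >= 1.0 / 8.0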
If the marked regions cover at least 30% of the image area, the image will be identified as passing the skin filter and passed to the geometrical analysis stage of our naked people detector.
The marked regions are typically an over-estimate of the skin area. To trim extraneous pixels the second filtering stage unmarks any pixels which do not pass a lenient skin filter. This filter requires that the pixel have hue in the range [110,180] and saturation in the range [0,130]. (This filter imposes no constraints on texture amplitude.) The geometrical analysis algorithm tries to locate body parts in these trimmed regions.
The 30% threshold is applied before trimming for historical reasons. A new implementation should consider trimming the regions first, then computing how much of the image contains skin.
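Putting the last two steps together in the historical order (coverage test on the enlarged map, then trimming with the lenient filter), continuing the sketch above:

    # Pass the image on if the enlarged skin regions cover at least 30% of it.
    passes_skin_filter = skin_map.mean() >= 0.30

    # Lenient second-stage filter: unmark pixels whose color is clearly not skin.
    lenient = (hue >= 110) & (hue <= 180) & (sat >= 0) & (sat <= 130)
    trimmed_skin = skin_map & lenient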