Pattern Recognition 33 (2000) 1585–1598
Texture discrimination using discrete cosine transformation shift-insensitive (DCTSIS) descriptors

Richard E. Frye*, Robert S. Ledley

National Biomedical Research Foundation, 3900 Reservoir Road, NW LR-3 Preclinical Science Building, Washington, DC, USA

Received 31 October 1995; received in revised form 14 April 1999; accepted 14 April 1999
Abstract

Many of the numerous texture measurements are based on space-frequency signal decomposition; these include Gabor filters and wavelet-based methods. The discrete cosine transformation (DCT) extracts spatial-frequency (SF) components from a local image region. It is the basis for the JPEG image compression standard and has many fast algorithmic implementations. By using a sliding DCT we derive a SF representation for a region of interest (ROI) surrounding each image pixel. We show that the DCT may represent a single SF as a combination of several coefficients, depending on the offset of the SF waveform maximum from the ROI's beginning. Thus, the DCT coefficients for a texture with a certain SF will change as the transformation is moved over the texture. In order to circumvent this problem, we derive horizontal and vertical SF shift-insensitive measurements from DCT components. Examples are given which show how these DCT shift-insensitive (DCTSIS) descriptors can be used to classify textured image regions. Since a large number of image display, storage and analysis systems are based on DCT hardware and software, DCTSIS descriptors may be easily integrated into existing technology and highly useful. © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Discrete cosine transformation; Texture analysis; Shift insensitive; Spatial-frequency analysis; Discriminant analysis; Texture discrimination
1. Introduction

1.1. Background

Feature extraction is one of the primary steps when decomposing an image into similar regions or objects
Supported by a Fellowship to Dr. Frye from the National Biomedical Research Foundation and a Grant from the Georgetown Institute for Cognitive and Computational Sciences.
* Corresponding author. Tel.: +1-202-687-2121; fax: +1-202-687-1662.
E-mail addresses: [email protected] (R.E. Frye), [email protected] (R.S. Ledley).
Now a Resident in Pediatrics at Jackson Memorial Hospital, Miami, FL, USA. Also at Department of Physiology and Biophysics and Department of Radiology, Georgetown University Medical Center, Washington, DC, USA.
using methods such as segmentation, clustering, or discriminant analysis. Over the past two decades, descriptors of texture have become increasingly popular as feature extraction tools. The idea of texture, for gray-level images, is to describe periodic-like changes in gray level rather than absolute changes associated with object edges or illumination variation. In this way, regional texture characteristics can be described independent of, but complementary to, standard image descriptors such as illumination, contrast, and average gray level, thereby providing additional image characteristic information for input into high-level processing algorithms.

1.2. Texture descriptors in image analysis

Investigators use many methodologies in developing texture measurements, including statistical, geometric, model-based and signal-processing methods. One of the earliest and most widely known techniques is aimed at
characterizing the statistical properties of textures using a spatial gray-level dependence co-occurrence matrix. Haralick [1] and Conners [2,3] were among the first to present texture descriptors calculated from this matrix [4]. Recently, Elfadel and Picard [5] have described how Gibbs random field models and co-occurrence measures are related through a generalized "aura set", suggesting that statistical and model-based texture descriptors represent similar texture characteristics. Another statistically based method, the multiresolution fractal dimension vector, can distinguish between image textures and is supposedly superior to co-occurrence, Fourier power spectrum and Laws' texture energy measurements [6]. A new statistical texture measurement is based on a so-called texture "spectrum". However, the axis of this "spectrum" is not continuous, as in, for example, a frequency spectrum; rather, it is merely an orderly way of classifying textures [7]. Nevertheless, statistical co-occurrence-like texture descriptors can be derived from this "spectrum" and texture filtering can be performed [8]. A true texture "spectrum" was proposed by Ledley et al. [9]. This spectrum characterizes several basic image statistics using a primitive multiresolution approach. The technique has not been widely adopted, perhaps because it measures very general texture features and because the precise spectrum for most textures is not intuitively obvious.

Spatial-frequency (SF) analysis is prominent in a majority of signal processing applications. Although the two-dimensional Fourier transformation (FT) can be used to derive the SF components of an image, other methods are more widely used for measuring texture characteristics. For example, Gabor filters discriminate between textures and have functions shaped similarly to human visual cortical receptive fields [10]. Gabor filters optimally localize textures in the space-frequency domain [11], although evidence disputing this claim exists [12]. The B-spline wavelet can approximate the Gabor function and is useful in texture discrimination [13]. Orthonormal and non-orthonormal wavelets are also useful as texture classifiers [14,15]. A comparison of three statistical and one SF-based texture descriptors showed that co-occurrence and fractal texture descriptors outperformed Markov random field and Gabor filter-based texture measurements [16]. However, a recent study suggests that a SF texture descriptor based on the frequency of local extrema may perform slightly better than co-occurrence and Gabor-filter-derived texture measurements [17].

1.3. Improving spatial-frequency-based descriptors

One limitation of some SF-based measurements is pattern shift sensitivity. For example, the magnitude of a given SF component can change depending on the
position of the SF waveform peak within the region of interest (ROI). When applying a texture descriptor to an image, the texture at each pixel is given by a specific-sized area surrounding the pixel. To analyze the next horizontal pixel, the ROI is shifted one horizontal pixel and the texture descriptor is recomputed. As the region is shifted either horizontally or vertically, the positions of the maxima and minima of a given SF waveform will change. This can affect the coefficients' values and hence the derived texture measurement. Many studies apply smoothing operators to the output of the texture descriptor. Although this can minimize the descriptor variation which arises from pattern shift sensitivity, it also reduces information extraction and degrades texture classification. Hence, a texture descriptor is desired which does not inherently vary as a pattern is shifted (i.e., is shift-insensitive), but does vary with changes in texture SF (i.e., is frequency-sensitive).
2. The discrete cosine transformation (DCT)

2.1. Relevance for texture description

Many theories have proposed that SF is the cue used by the visual cortex to perceive texture [18]. Until recently, studies have modeled SF-based texture processing as a decomposition of textures into sinusoidal components. In this respect, it has been argued that SF analysis utilizing FTs or Gabor filters best describes the texture recognition process of human vision [19]. Like the sine function, the cosine function can also be used to model the SFs of texture by simply changing the phase of the basis function by 90°. The discrete cosine transformation (DCT) provides information about the SFs of a signal in a manner similar to the FT, but has many advantages over the FT. The DCT can compress the spectrum of a signal into fewer coefficients than the FT and does not generate spurious SF coefficients. Although this is a great advantage in image compression, it can also assist in the recognition of distinct SF components, since fewer coefficients need to be considered. Unlike the FT, the DCT does not assume an implicit periodicity of the signal beyond the sampled domain, and it is symmetric about the transformation domain. In addition, the DCT is mathematically and computationally easier to work with since it does not calculate an imaginary component [20].

Some have argued that texture classification based on SF analysis assumes that a texture is a repetition of some primitive pattern following a certain displacement rule, and that such rules are both hard to determine and complex [7,8]. Others have argued that SF analysis is only useful for textures whose spectra do not overlap [21]. However, texture can be identified by the
relative SF coefficient values and, unlike statistical texture classifiers, SF methods can easily eliminate slow changes in image intensity and contrast due to phenomena such as shading, reflectance and highlighting. For example, although a texture may be the same in two regions of an image, one region may be superimposed on a shaded portion of the image, resulting in a reduction in image intensity. Such slow changes in intensity can be ignored when analyzing texture with SF analysis by examining texture characteristics based on medium and fast SFs. The DCT separates out the mean image intensity (i.e., the DC image component), allowing the remainder of the coefficients to represent SF components independent of mean brightness.

2.2. Deriving DCT coefficients

The forward two-dimensional DCT, shown as Eq. (1), can be applied to any n×n ROI in order to derive a set of n² coefficients which are typically arranged in a two-dimensional n×n array:
$$\mathrm{DCT}(fc_r, fc_c) = \frac{4\,W(fc_r)\,W(fc_c)}{n^2} \sum_{p_v=0}^{n-1} \sum_{p_h=0}^{n-1} \mathrm{Data}(p_v, p_h)\, \cos\frac{(2p_v+1)\,fc_r\,\pi}{2n}\, \cos\frac{(2p_h+1)\,fc_c\,\pi}{2n}. \quad (1)$$
In Eq. (1), W(fc) = 1/√2 for fc = 0 and otherwise W(fc) = 1; fc_r and fc_c are the row and column coefficient indices, which range from 0 to (n−1); p_v and p_h are the vertical and horizontal pixel offsets, which range from 0 to (n−1); and n is the horizontal and vertical size of the ROI to be processed. Fig. 1 depicts the basis functions for an 8×8 DCT. The first DCT coefficient (0, 0) represents the DC component of the ROI. All other components represent the values of the SF coefficients which compose the ROI. These coefficients can be thought of as the gray-level variance around the ROI's gray-level mean DC component. The first column of DCT coefficients represents the vertical SFs in the image which do not contain any horizontal SFs. Likewise, the first row of DCT coefficients represents the horizontal image SFs which do not contain any vertical SFs. The remainder of the coefficients represent the interactions between the various horizontal and vertical frequencies. The SF in cycles/pixel of each frequency coefficient can be determined by

$$sf(fc) = \frac{fc}{2n}, \qquad fc = 0, \ldots, (n-1), \quad (2)$$

with n being the transformation size.
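As a concrete illustration, the following Python sketch implements Eq. (1) directly and maps a coefficient index to its spatial frequency via Eq. (2). The paper describes no implementation, so the function names and the use of NumPy are our own assumptions; the normalization follows the reconstruction of Eq. (1) above.

```python
import numpy as np

def dct2(roi):
    """Forward 2-D DCT of an n x n region, following Eq. (1)."""
    n = roi.shape[0]
    w = np.ones(n)
    w[0] = 2.0 ** -0.5                       # W(fc) = 1/sqrt(2) for fc = 0, else 1
    p = np.arange(n)
    # basis[p, fc] = cos((2p + 1) * fc * pi / 2n) for every offset p and index fc
    basis = np.cos(np.outer(2 * p + 1, np.arange(n)) * np.pi / (2 * n))
    coeffs = basis.T @ roi @ basis           # sums over p_v (rows) and p_h (columns)
    return 4.0 * np.outer(w, w) * coeffs / n ** 2

def spatial_frequency(fc, n):
    """Cycles/pixel of coefficient index fc for an n x n transform (Eq. (2))."""
    return fc / (2.0 * n)
```

A sliding DCT then amounts to calling dct2 on the n×n ROI surrounding each image pixel in turn.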
3. DCT shift-insensitive spatial frequency descriptors

3.1. Deriving the shift-insensitive descriptor

Since the DCT does not explicitly represent SF phase information, a SF in a texture which has the same frequency as one of the basis functions but is not in phase with it may be represented by a combination of positive and negative coefficients. For example, Fig. 2(a) represents a horizontal SF of 0.33 cycles/pixel on an 8×8 grid, and Fig. 2(d) shows the DCT coefficients for this pattern. Since this SF has been selected to correspond to only one horizontal DCT coefficient, the value of only one horizontal DCT coefficient is significant. However, when this pattern is shifted horizontally by one pixel (Fig. 2(b)), several horizontal DCT coefficients become significant (Fig. 2(e)). Likewise, when the original pattern is shifted two horizontal pixels (Fig. 2(c)), the horizontal DCT coefficients change again (Fig. 2(f)). This demonstrates a significant problem with using DCT coefficients to identify a texture: as the ROI is moved across a texture, the DCT coefficients will change.

In order to circumvent this problem, a texture measurement must be derived which identifies the significant characteristics of the texture independent of the offset of the texture's pattern. Since we know the value of the SF component at a zero offset, we might try to determine how the values of the other coefficients are related to the zero-offset SF component values when the pattern has been shifted. In order to determine such a function, we first assume that the function is an addition of some function of the other coefficients. Since the values of the horizontal SF coefficients change in the example above, yet all of the values are located at one vertical SF regardless of the offset, we only examined the horizontal SF coefficients for the vertical SF of DCT coefficient two. Figs. 3(a)-(c) show graphs of the horizontal SF coefficients at a vertical SF of 0.33 cycles/pixel (i.e., vertical DCT coefficient two) for the patterns in Figs. 2(a)-(c). Figs. 3(e)-(g) show the absolute values of the coefficients and Figs. 3(i)-(k) show the power spectrum (i.e., the squared coefficients). The sum totals of the coefficients following these transformations are shown graphically in Figs. 3(d), (h) and (l). Notice that the sum totals of the power spectrum values are very similar regardless of the horizontal pattern offset.

3.2. Statistical confirmation of shift-insensitivity

The graphical example above describes one case in which the addition of the power spectrum results in very similar values regardless of the pattern offset. However, it is not known whether this result is true for all patterns and all pattern offsets.
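The behaviour shown in Figs. 2 and 3 can be reproduced numerically with the hypothetical dct2 helper sketched after Eq. (2); the pattern below is our own simplified stand-in, not the authors' exact test pattern. A DCT-aligned cosine is shifted horizontally by zero, one and two pixels: the individual coefficients change with the offset, while the sum of their squares stays nearly constant.

```python
import numpy as np
# assumes dct2() from the sketch after Eq. (2)

n, fc = 8, 2
i = np.arange(n)
for offset in (0, 1, 2):                      # the offsets of Figs. 2(a)-(c)
    # horizontal cosine matched to DCT coefficient fc, shifted by `offset` pixels
    row_wave = np.cos((2 * (i + offset) + 1) * fc * np.pi / (2 * n))
    pattern = np.tile(row_wave, (n, 1))       # constant along the vertical direction
    coeffs = dct2(pattern)[0, :]              # horizontal coefficients at vertical SF 0
    print(offset, coeffs.round(3),
          np.abs(coeffs).sum().round(3),      # sum of absolute values: drifts
          (coeffs ** 2).sum().round(3))       # sum of squares: nearly constant
```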
Fig. 1. The DCT basis functions for an 8×8 sized transformation. The indices along the top and left side of the image represent the vertical and horizontal spatial frequency coefficient indices. Notice the interaction of the spatial frequencies as the indices of both spatial frequency coefficients increase.
Furthermore, since the power spectrum illustrated above covers only one vertical frequency, we do not know whether the power spectrum sums for the other vertical and horizontal frequencies also demonstrate the same shift-insensitive characteristic. In this section we outline the DCT shift-insensitive descriptors and our method of confirming this insensitivity.

3.2.1. Texture descriptor calculations

To confirm the shift-insensitive characteristic of the power spectrum values, we calculated the power values (i.e., summed power spectrum values) for each horizontal and vertical SF and for all SFs of simulated patterns. The power value for each vertical SF is calculated by
Eq. (3) below, the power value for each horizontal SF is calculated by Eq. (4) below, and the power value for all SFs is calculated by Eq. (5) below:

$$P_v(fc) = \sum_{i=0}^{n-1} \mathrm{DCT}^2(fc, i), \qquad fc = 0, \ldots, (n-1), \quad (3)$$

$$P_h(fc) = \sum_{i=0}^{n-1} \mathrm{DCT}^2(i, fc), \qquad fc = 0, \ldots, (n-1), \quad (4)$$

$$P_t = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} \mathrm{DCT}^2(i, j). \quad (5)$$
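In code, the three descriptors of Eqs. (3)-(5) reduce to simple sums over the squared coefficient array. This sketch (our own naming, not the authors') complements the dct2 helper above.

```python
import numpy as np

def dctsis(coeffs):
    """DCTSIS power values of an n x n DCT coefficient array, per Eqs. (3)-(5)."""
    sq = coeffs ** 2
    p_v = sq.sum(axis=1)   # Eq. (3): one value per vertical SF (sum along each row)
    p_h = sq.sum(axis=0)   # Eq. (4): one value per horizontal SF (sum along each column)
    p_t = sq.sum()         # Eq. (5): total power over all SFs
    return p_v, p_h, p_t
```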
Fig. 2. A simulated 8×8 pattern with horizontal and vertical spatial frequencies of 0.33 cycles/pixel, and its DCT coefficients: (a) pattern with zero offset; (b) pattern (a) shifted by one horizontal pixel; (c) pattern (a) shifted by two horizontal pixels; (d) the values of the DCT coefficients for (a); (e) the values of the DCT coefficients for (b); (f) the values of the DCT coefficients for (c).
3.2.2. Pattern simulation

The simulated patterns are created from a cosine-based equation similar to the basis function of the DCT. Eq. (6) below is used to derive the values of the 8×8 sized simulated patterns:

$$p(fc_h, fc_v, i, j, n, o_h, o_v) = \cos\!\left(\frac{(2i+1)\,fc_h\,\pi}{2n} + \frac{fc_h\,o_h\,\pi}{n}\right) \cos\!\left(\frac{(2j+1)\,fc_v\,\pi}{2n} + \frac{fc_v\,o_v\,\pi}{n}\right), \quad (6)$$

where fc = 0, …, (n−1); o_v = 0, …, (n−1); o_h = 0, …, (n−1); i = 0, …, (n−1); and j = 0, …, (n−1).
In Eq. (6), fc represents an integer which indexes the n possible vertical or horizontal SFs, o_v and o_h represent the n possible offsets of the SFs from alignment with the DCT basis function, and i and j represent the data indices for the simulated pattern. All combinations of vertical (fc_v) and horizontal (fc_h) SFs with all combinations of vertical (o_v) and horizontal (o_h) offsets are simulated, and the SF power values are derived using Eqs. (3)-(5).
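The simulation of Section 3.2.2 can be sketched as follows; the exhaustive sweep mirrors the all-combinations design described in the text, and the code reuses the hypothetical dct2 and dctsis helpers sketched earlier.

```python
import numpy as np
from itertools import product
# requires dct2() and dctsis() from the sketches above

def simulated_pattern(fc_h, fc_v, o_h, o_v, n=8):
    """Simulated n x n pattern of Eq. (6): a product of offset cosines."""
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return (np.cos((2 * i + 1) * fc_h * np.pi / (2 * n) + fc_h * o_h * np.pi / n)
          * np.cos((2 * j + 1) * fc_v * np.pi / (2 * n) + fc_v * o_v * np.pi / n))

# All combinations of vertical/horizontal SFs and offsets, as in Section 3.2.2.
records = []
for fc_h, fc_v, o_h, o_v in product(range(8), repeat=4):
    p_v, p_h, p_t = dctsis(dct2(simulated_pattern(fc_h, fc_v, o_h, o_v)))
    records.append((fc_h, fc_v, o_h, o_v, p_v, p_h, p_t))
```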
3.2.3. Data analysis

The dependence of the power values on the SF and offset parameters was determined by performing 17 analyses of variance (ANOVAs). The ANOVAs were implemented by a linear regression model, with each model having one of the power values as the dependent variable and each simulation parameter represented as a set of categorical independent variables [22]. The F-ratios for each categorical independent variable were determined by partial F-tests and are presented in Table 1.

3.2.4. Results

Table 1 strongly suggests that manipulating the vertical SF offset influences the vertical SF power values and the total SF power value, but only one horizontal SF power value. Likewise, manipulating the horizontal SF offset influences the horizontal SF power values and the total power value, but only one vertical SF power value. Interestingly, the total power of all SFs is significantly influenced by manipulating both the vertical and horizontal SF offsets. From these data, it appears that most of the vertical SF power values are completely horizontally shift-insensitive, while most of the horizontal SF power values are vertically shift-insensitive. The total SF power value is not completely resistant to pattern shifts; however, the F-ratios for the horizontal and vertical offset parameters are two orders of magnitude smaller than those for the horizontal and vertical frequency parameters for the total SF power value.
Fig. 3. Coefficient values for all vertical frequency coefficients for horizontal frequency coefficient two from the patterns in Fig. 2: (a) coefficient profile for the pattern presented in 2(a); (b) coefficient profile for the pattern presented in 2(b); (c) coefficient profile for the pattern presented in 2(c); (d) sum of each pattern's coefficient profile; (e) absolute value of the coefficient profile for pattern 2(a); (f) absolute value of the coefficient profile for pattern 2(b); (g) absolute value of the coefficient profile for pattern 2(c); (h) the sum of the absolute value coefficient profiles; (i) squared coefficient profile for pattern 2(a); (j) squared coefficient profile for pattern 2(b); (k) squared coefficient profile for pattern 2(c); (l) sum of the squared coefficient profiles.
Thus, we refer to these power functions as DCT shift-insensitive (DCTSIS) descriptors.

It can be shown mathematically that the sum of the squares of a cosine function multiplied by a shift-invariant function is shift-invariant along the dimension of the summation. Thus, it is not surprising that the DCTSIS descriptors are insensitive to pattern shifts along the direction in which each specific descriptor is summed. This fact provides support for the validity of the analysis in this section, and the results suggest that this analysis can provide a quantitative measure of shift insensitivity.
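The ANOVA procedure of Section 3.2.3 can be sketched with statsmodels; the data layout, column names and the choice of type-II tests for the partial F-ratios are our assumptions, not the authors' implementation. Only the model for the total power value P_t is shown; the paper fits 17 such models (eight vertical, eight horizontal, one total).

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# `records` holds one row per simulated pattern (see the sweep after Eq. (6)).
df = pd.DataFrame(
    [(fh, fv, oh, ov, pt) for fh, fv, oh, ov, _, _, pt in records],
    columns=["fc_h", "fc_v", "o_h", "o_v", "P_t"],
)
# Each simulation parameter enters as a set of categorical independent variables.
model = ols("P_t ~ C(fc_h) + C(fc_v) + C(o_h) + C(o_v)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))   # partial F-ratio for each factor
```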
4. Application to texture discrimination

4.1. Representative texture images

4.1.1. Texture preprocessing

Three composite texture test images were constructed to test the applicability of the DCTSIS descriptors.
Before the composite images were constructed, the intensity and contrast of each texture were adjusted by setting the mean and standard deviation of the image histogram to 128 and 64, respectively. This equalized each texture with respect to the first and second moments of its statistical distribution, so that the textures cannot be distinguished by global mean and standard deviation measures. It is unclear whether such an equalization of intensity and global contrast is performed in many scientific articles; failure to take this step can bias experiments, leading to unrepeatable results and invalid conclusions.
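A minimal sketch of the standardization described above; clipping to the 8-bit range is our assumption, by analogy with the handling of out-of-range descriptor values in Section 4.2.

```python
import numpy as np

def standardize(img, mean=128.0, std=64.0):
    """Force a texture's gray-level mean and standard deviation to fixed values."""
    z = (img.astype(float) - img.mean()) / img.std()
    return np.clip(z * std + mean, 0, 255).astype(np.uint8)
```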
R.E. Frye, R.S. Ledley / Pattern Recognition 33 (2000) 1585}1598 Table 1 Statistical tests for the shift-insensitive characteristics of the DCTSIS descriptors for all vertical and horizontal DCT spatial frequencies. F-ratios represent the dependence of the DCTSIS descriptors on the parameters of the simulated texture waveforms Frequency Vertical
O!set Horizontal
Vertical frequency power 0 786.10 0.86 1 131.47 38.84 2 127.60 34.44 3 115.33 38.92 4 102.37 26.80 5 128.77 38.72 6 133.48 36.57 7 153.40 23.79
Vertical
Horizontal
3.67 12.99 16.16 14.05 22.15 12.57 12.90 14.97
8.45 0.03 0.09 0.03 0.08 0.04 0.03 0.05
power 786.30 131.47 127.60 115.33 102.37 128.77 133.48 153.40
8.45 0.03 0.09 0.03 0.08 0.04 0.03 0.05
3.67 12.99 16.16 14.06 22.15 12.57 12.90 14.97
All frequency power 203.51 203.51
4.43
4.43
Horizontal frequency 0 0.60 1 38.84 2 34.44 3 38.92 4 26.80 5 38.72 6 36.57 7 23.79
p(0.001.
4.1.2. Test image 1 (TI1; Photogear Square)

Four medium-resolution (266 ppi) textures were chosen from the Photogear library of photographic natural textures on CD-ROM (Image Club Graphics, Inc.). The textures consisted of one texture of crumpled paper (paper_01: Texture 1), one marble texture (marbl_04: Texture 2), and two wood textures (wood_01: Texture 3; wood_02: Texture 4). Each image was converted from TIFF format to PCX format and reduced from color to gray scale. A 128×128 pixel portion of each resulting image was selected and saved. These four images were combined into one image such that each subimage was positioned in one of the four quadrants of the resulting square image. Fig. 4 shows the original image, the DCT components resulting from a 4×4 DCT, and the DCTSIS descriptors.

4.1.3. Test image 2 (TI2; Photogear Checkerboard)

Each 128×128 texture was divided into four images, and a checkerboard pattern was then created with these 64×64 images (Fig. 5(d)). The positions of the textured regions were counterbalanced using a Latin-square design so that for any textured region the adjacent textures were never located on the same side twice. This test image doubles the amount of texture edges, thereby allowing us to investigate the robustness of our image descriptors for distinguishing texture transition regions. The counterbalancing arrangement ensures that characteristics of adjacent textures do not skew the results.
4.1.4. Test image 3 (TI3; Brodatz Textures)

It is more common to test texture processing algorithms using the natural textures in the Brodatz [23] texture album. Four of the common natural textures were digitized at a resolution of 100×100 dpi and stored as PCX image files. Regions of size 128×128 were extracted from the textures French Canvas (Brodatz C21; Texture 1), Herring Bone Weave (Brodatz C17; Texture 2), Long Island Beach Straw (Brodatz C15; Texture 3) and Woven Aluminum Wire (Brodatz C6; Texture 4). These extracted textures were combined by arranging them in a square pattern in a manner identical to Test Image 1. Fig. 5(g) shows the resulting test image.

4.2. DCT measurements used

One basic DCT parameter is the ROI size. The fact that the size of the DCT can easily be modified allows this transformation to be ubiquitous with regard to image resolution and dynamic with regard to multiresolution analysis. In light of the fact that studies have demonstrated the utility of multiresolution techniques such as wavelets, and that multiresolution has been advocated for texture analysis, this study uses a multiresolution design to define the ROI size parameters. Thus, DCT coefficients for 2×2, 4×4, 8×8 and 16×16 sized ROIs were computed. From each of the DCT coefficient sets, the DCTSIS descriptors described above were derived, without using the DC component of the DCT analysis. The DC components of the DCT analysis were included in the discriminant functions. Once the measurements were derived for each image, the results were stored as 8-bit gray-scale files. In order to optimize the numerical range, the computed values were standardized and stored as a distribution with a mean of 128 and a standard deviation of 64. All values outside the gray-scale range were given the value 0 or 255, depending on whether they fell below or above the gray-scale limits.

4.3. Discriminant function derivation

A multiple-group discriminant function was derived for each image. Within-group mean-corrected sums-of-squares matrices were computed for the texture measurements derived from a 64×64 pixel region of each texture. Note that these 64×64 images were different from any of the 64×64 images used to make TI2. A total mean-corrected sums-of-squares matrix was then computed for each image and the between-group corrected sums-of-squares matrices were derived. The eigenvalues and eigenvectors were then computed for the W⁻¹B matrix and all significant eigenvectors were retained [24]. Discriminant functions derived from these eigenvectors classified each pixel by determining the closest group centroid.
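A sketch of the derivation in Section 4.3: within-group (W) and between-group (B) corrected sums-of-squares matrices are accumulated, and the discriminant axes are the leading eigenvectors of W⁻¹B. The retention test for "significant" eigenvectors and the nearest-centroid classifier are omitted; the names and data layout are our assumptions.

```python
import numpy as np

def discriminant_axes(X, y):
    """Eigenvalues/eigenvectors of W^-1 B for multiple-group discriminant analysis.

    X: (samples, features) texture measurements; y: integer group labels.
    """
    grand_mean = X.mean(axis=0)
    k = X.shape[1]
    W = np.zeros((k, k))                     # within-group corrected sums of squares
    B = np.zeros((k, k))                     # between-group corrected sums of squares
    for g in np.unique(y):
        Xg = X[y == g]
        d = Xg - Xg.mean(axis=0)
        W += d.T @ d
        m = (Xg.mean(axis=0) - grand_mean)[:, None]
        B += len(Xg) * (m @ m.T)
    evals, evecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(evals.real)[::-1]     # eigenvalues of W^-1 B are real here
    return evals.real[order], evecs.real[:, order]
```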
The classification accuracy was derived from these results. The significance of each DCTSIS descriptor was determined by calculating its total F-ratio as the weighted sum of its F-ratios over all discriminant functions. The F-ratio of each DCTSIS descriptor was determined for each discriminant function by calculating its statistical contribution to the discriminant function's output value.

4.4. Results

The 2×2 DCT coefficients for TI1, as well as the DCTSIS descriptors derived from these coefficients, are shown in Fig. 4. Table 2 shows that the DCT coefficients could not classify the textures, whereas the DCTSIS descriptors distinguished the textured regions well. Fig. 5(b) shows the mask used to create the composite image and
Fig. 5(c) shows the discriminant function classification based on the DCTSIS descriptors. Fig. 5(f) shows the classification of TI2 by the discriminant function. Classification results similar to TI1 were found for TI3, and the DCTSIS descriptors classified the Brodatz textures with higher accuracy than the Photogear textures. Fig. 5(h) shows the mask used to create the composite image and Fig. 5(i) depicts the group classifications for TI3. Table 3 shows the detailed classification results using the DCT coefficients for all images, and Table 4 shows the detailed classification results using the DCTSIS descriptors for all images.

The large sample sizes used in the image statistic calculations result in inflated F-ratios which are almost always significant. Therefore, the F-ratios must be considered in relation to each other.
Fig. 4. Test image 1, and its 4×4 DCT and DCTSIS descriptor components. The upper right-hand side shows the image layout scheme. The numbers separated by commas represent the DCT coefficients, which are surrounded by the DCTSIS descriptors. PA indicates the power of all of the coefficients, PVf represents the power of vertical spatial frequency coefficient f, and PHf represents the power of horizontal spatial frequency coefficient f.
Table 2
Discriminant function classification performance on test images based on DCT coefficients and DCT shift-insensitive (DCTSIS) descriptors. Notice that the classification accuracy is much higher for the DCTSIS descriptors

Test image    Classification accuracy
              DCT coefficients    DCTSIS descriptors
1             27.0%               93.7%
2             26.7%               87.1%
3             25.6%               95.7%
Although most DCTSIS descriptors contributed significantly to the discriminant function, we present only the most significant ones. Table 5 lists the 22 most significant DCTSIS descriptors used to classify TI1, and Table 6 lists the 18 most significant DCTSIS descriptors used in classifying TI3. While the variation in F-ratio values of the most influential variables is significant for TI1, the F-ratio values for the influential variables are similar for TI3. While the most influential DCTSIS descriptors for TI3 appear to be in the vertical direction, it appears that a mixture of DCTSIS descriptors is responsible for classifying TI3 pixels. Total DCTSIS descriptors for the 2 and 16 size
ROIs appear in Table 5; however, none of the total DCTSIS descriptors appear in Table 6. Although we did not expect any of the shift-sensitive DCTSIS descriptors to be on either list, the ones that do appear are not the most influential variables used by the discriminant functions.
5. Application to image processing and future development

We have demonstrated that texture descriptors based on the discrete cosine transformation can be used, with reasonable accuracy, to classify image textures. We have also shown evidence that texture measurements must be derived from combinations of DCT coefficients in order to perform texture discrimination, and that the DCT coefficients themselves do not appear to be good texture measurements. In addition, it appears that shift insensitivity is an important characteristic for a texture descriptor. We have also outlined a procedure for determining the shift and frequency sensitivity of a texture descriptor. Indeed, it might be useful to evaluate this property in other existing texture descriptors; if a descriptor lacks the property, it may be useful to determine how it can be included.
Fig. 5. Test images, and their classification masks and predictions: (a) original test image 1; textures are numbered 1, 2, 4, 3 starting in the upper left corner and proceeding clockwise; (b) true classification mask used to build the image; (c) classification of test image 1's pixels resulting from the discriminant analysis procedure; (d) original test image 2; (e) true classification mask used to build the image; (f) classification of test image 2's pixels resulting from the discriminant analysis procedure; (g) original test image 3; textures are numbered 1, 2, 3, 4 starting in the upper left corner and proceeding clockwise; (h) true classification mask used to build the composite image; (i) classification of test image 3's pixels resulting from the discriminant analysis procedure.
Table 3
Confusion matrices for DCT-based discriminant function classification of test images

                  Predicted classification
Actual class      1          2          3          4

Test Image 1
1                 5.48%      39.77%     4.63%      50.11%
2                 2.48%      49.05%     2.72%      45.75%
3                 6.07%      45.98%     3.90%      44.06%
4                 4.68%      42.77%     2.56%      23.88%

Test Image 2
1                 5.27%      42.16%     4.59%      47.97%
2                 2.43%      47.20%     2.70%      47.67%
3                 5.85%      45.52%     4.18%      44.45%
4                 4.46%      42.88%     2.70%      49.96%

Test Image 3
1                 45.41%     10.77%     7.12%      36.70%
2                 45.78%     6.31%      3.01%      44.89%
3                 42.19%     7.65%      4.57%      45.59%
4                 45.60%     4.99%      3.49%      45.92%

Table 4
Confusion matrices for DCTSIS descriptor-based discriminant function classification of test images

                  Predicted classification
Actual class      1          2          3          4

Test Image 1
1                 88.58%     0.28%      8.57%      2.57%
2                 0.08%      92.93%     5.91%      1.08%
3                 0.10%      0.40%      97.32%     2.17%
4                 0.00%      4.00%      0.00%      96.00%

Test Image 2
1                 77.92%     1.78%      14.94%     5.36%
2                 0.06%      86.48%     9.91%      3.55%
3                 1.35%      2.10%      90.07%     6.49%
4                 0.00%      4.94%      1.08%      93.98%

Test Image 3
1                 98.90%     0.73%      0.02%      0.35%
2                 0.37%      97.16%     2.26%      0.21%
3                 0.38%      8.72%      90.24%     0.65%
4                 0.09%      3.16%      0.31%      96.44%
The DCT is a widely used mathematical tool for image compression with many fast hardware and software implementations [25,26]. Although the utility of each DCT coefficient in pattern recognition has never been outlined, these coefficients are useful for image processing and signal enhancement. Indeed, DCT coefficients can be used to perform both basic and complex image processing operations, such as transparent and opaque masking, translation, scaling, rotation, and linear filtering; signal enhancement through DCT-based adaptive filtering has also been investigated [27-29]. Correlating texture type with DCT coefficient values, or with DCT-derived texture measurements like the DCTSIS
descriptors, might allow various classes of texture types to be identified in compressed image data. Indeed, the validity of any regional texture descriptor assumes texture homogeneity throughout the analysis region. The variability in a DCTSIS descriptor across a region may indicate the relative homogeneity of the region. This would provide image storage systems with guidance for adjusting compression block size and coefficient weighting. Such a quick and simple automated pattern recognition system would provide efficient image storage and retrieval.
Table 5
Significance of the first 22 DCTSIS descriptors for the discriminant function of test image 1. Variable loadings sorted by F-ratio. Note: Only F-ratios > 100 are presented

Size   Direction    Frequency   F-value       Size   Direction    Frequency   F-value
2      Constant     -           369,551.0     16     Horizontal   4           278.9
2      Vertical     1           1943.6        2      Total        -           261.6
2      Horizontal   1           949.4         8      Vertical     1           223.7
16     Constant     -           872.0         16     Vertical     2           220.3
16     Horizontal   1           787.7         16     Horizontal   15          218.7
16     Horizontal   3           586.1         8      Vertical     3           207.2
8      Constant     -           571.6         16     Vertical     1           182.8
16     Vertical     3           526.1         16     Horizontal   0           174.9
16     Horizontal   2           517.2         4      Vertical     1           136.5
8      Vertical     2           446.4         16     Vertical     6           128.3
16     Total        -           375.0         16     Vertical     15          103.0
Table 6
Significance of the first 18 DCTSIS descriptors for the discriminant function of test image 3. Variable loadings sorted by F-ratio. Note: Only F-ratios > 10,000 are presented

Size   Direction    Frequency   F-value     Size   Direction    Frequency   F-value
8      Vertical     2           28,742      8      Vertical     7           12,562
8      Vertical     1           28,689      16     Horizontal   1           11,482
8      Vertical     3           28,326      4      Vertical     3           11,378
4      Vertical     1           25,468      16     Vertical     5           11,217
8      Vertical     4           21,290      16     Vertical     2           10,761
4      Vertical     2           19,469      4      Horizontal   2           10,628
8      Vertical     5           17,700      16     Vertical     3           10,354
16     Constant     2           14,140      16     Vertical     1           10,147
4      Horizontal   0           13,502      16     Horizontal   3           10,135
A DCT-based compression algorithm which uses quadtrees to classify 4×4 image regions into variable-sized low-detail areas and high-detail edges has shown good compression and reconstruction quality [30]. If textures could also be detected, modeled and coded in a manner similar to that used with high-detail edges, greater compression and reconstruction quality might be achieved. Since existing systems use block coding, such a system would also allow a thumbnail view of an image to be displayed without reconstructing the image.

Although the textures were classified using discriminant functions in this study, other types of classification schemes might produce different results. Since the vertical and horizontal SF power measurements are not both vertically and horizontally shift-insensitive, they cannot optimally equate both horizontally and vertically adjacent textures. A segmentation procedure which uses the vertical DCTSIS descriptors to determine the equality of horizontally adjacent textures, and the horizontal DCTSIS descriptors to determine the equality of vertically adjacent textures, might demonstrate improved classification performance. However, it is evident from the F-ratio statistics in Table 1 that a descriptor's value is more markedly changed by changing the SF than by shifting the texture in the non-shift-insensitive direction.

The fact that TI1 had a higher classification accuracy than TI2 may be because regions near adjoining textures tended to be misclassified: texture measurements based on larger regions of interest (i.e., 16×16) are not purely derived from one texture when they are calculated at texture transition regions. These transitional regions are more numerous in TI2 and may require a different type of classification algorithm. A multiscale segmentation algorithm which dynamically utilizes smaller-ROI DCTSIS descriptors when larger-ROI DCTSIS descriptors become significantly different from the group centroid, or an edge-finding algorithm which formally defines boundaries around regions based on transitional variations in the DCTSIS descriptor values, might retain the high accuracy obtained in TI1 classification.
There are many new and old techniques for measuring texture characteristics. Due to the lack of standardization in the textures evaluated, texture resolution, test patterns, image preprocessing, and classification algorithms, a comparison of our technique with others is difficult. Such a study is needed before the full potential of our technique can be determined. However, the ability of the DCTSIS descriptors developed in this paper to classify textures suggests that these texture measurement tools will work well in other imaging applications such as texture segmentation and recognition.
6. Summary

The discrete cosine transformation (DCT) is the basis of JPEG image compression and has many fast algorithmic implementations. It decomposes images by extracting spatial-frequency components from a local image region. Many texture analysis methods, including those which utilize Gabor filters and wavelets, are based on space-frequency signal decomposition. Since the DCT is in wide use, using its coefficients to measure image characteristics would be of great utility. Many spatial-frequency-based texture measurements are limited by their sensitivity to the specific position of the analysis region on the texture pattern. If this so-called "shift-sensitivity" is eliminated, an important source of variation is greatly reduced.

A sliding DCT is used to derive a spatial-frequency representation for a region of interest. Horizontal and vertical spatial-frequency shift-insensitive measurements are derived from these DCT components by examining several simple mathematical manipulations of the spatial-frequency values. Simulated textures with systematically varying frequency components are used to test the validity of the derived shift-insensitive measurements. Examples demonstrate how the derived DCT shift-insensitive descriptors
can classify image textures using discriminant functions. Since a large number of image display, storage and analysis systems are based on DCT hardware and software, DCT shift-insensitive descriptors can be easily integrated into existing technology and may be highly useful.
References

[1] R.M. Haralick, K. Shanmugan, I. Dinstein, Texture features for image classification, IEEE Trans. System Man Cybernet. 3 (1973) 610-621.
[2] R.W. Conners, Towards a set of statistical features which measure visually perceivable qualities of texture, Proceedings Pat. Recog. Imag. Conference, 1979, pp. 382-390.
[3] R.W. Conners, C.A. Harlow, A theoretical comparison of texture algorithms, IEEE Trans. Pattern Anal. Mach. Intell. 2 (1980) 204-222.
[4] C.M. Wu, Y.C. Chen, K.S. Hsieh, Texture features for classification of ultrasonic liver images, IEEE Trans. Med. Imag. 11 (1992) 141-152.
[5] I.M. Elfadel, R.W. Picard, Gibbs random fields, co-occurrences, and texture modeling, IEEE Trans. Pattern Anal. Mach. Intell. 16 (1994) 24-37.
[6] P.P. Ohanian, R.C. Dubes, Performance evaluation for four classes of textural features, Pattern Recognition 25 (1992) 819-833.
[7] D.C. He, L. Wang, Texture features based on texture spectrum, Pattern Recognition 24 (1991) 391-399.
[8] D.C. He, L. Wang, Textural filters based on the texture spectrum, Pattern Recognition 24 (1991) 1187-1195.
[9] R.S. Ledley, Y.G. Kulkarni, L.S. Rotolo, T.J. Golab, J.B. Wilson, TEXAC: a special purpose processing texture analysis computer, Proceedings IEEE Comp. Soc. International Conference, 1977, pp. 66-71.
[10] T.R. Reed, H. Wechsler, Segmentation of textured images and Gestalt organization using spatial/spatial-frequency representation, IEEE Trans. Pattern Anal. Mach. Intell. 12 (1990) 1-12.
[11] J.G. Daugman, Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by two-dimensional visual cortical filters, JOSA 2 (1985) 1160-1169.
[12] S.A. Klein, B. Beutter, Minimizing and maximizing the joint space-spatial frequency uncertainty of Gabor-like functions: comment, JOSA Comm. 9 (1992) 337-340.
[13] M. Unser, A. Aldroubi, M. Eden, A family of polynomial spline wavelet transforms, Signal Processing 20 (1993) 141-162.
[14] P.H. Carter, Texture discrimination using wavelets, SPIE Applications of Digital Image Processing XIV 1567 (1991) 432-438.
[15] A. Laine, J. Fan, Texture classification by wavelet packet signatures, IEEE Trans. Pattern Anal. Mach. Intell. 15 (1993) 1186-1191.
[16] P.P. Ohanian, R.C. Dubes, Performance evaluation for four classes of textural features, Pattern Recognition 25 (1992) 819-833.
[17] J. Strand, T. Taxt, Local frequency features for texture classification, Pattern Recognition 27 (1994) 1397-1406.
[18] J.J. Kulikowski, P.O. Bishop, Fourier analysis and spatial representation in the visual cortex, Experientia 37 (1981) 160-163.
[19] J.J. Atick, A.N. Redlich, Mathematical description of the responses of simple cells in the visual cortex, Biol. Cybernet. 63 (1990) 99-109.
[20] M. Rabbani, P.W. Jones, Digital image compression techniques, Tut. Tex. Opt. Eng. TT7 (1991) 102-128.
[21] M.K. Tsatsanis, G.B. Giannakis, Object and texture classification using higher order statistics, IEEE Trans. Pattern Anal. Mach. Intell. 14 (1992) 733-750.
[22] D.G. Kleinbaum, L.L. Kupper, K.E. Muller, Applied Regression Analysis and Other Multivariable Methods, PWS-KENT Publishing Company, Boston, 1988.
[23] P. Brodatz, Textures: A Photographic Album for Artists and Designers, Dover, New York, 1966.
[24] W.R. Dillon, M. Goldstein, Multivariate Analysis: Methods and Applications, Wiley, New York, 1984.
[25] L. Lee, F.Y. Huang, Restructured recursive DCT and DST algorithms, IEEE Signal 42 (1994) 1600-1609.
[26] M. Matsui, H. Hara, Y. Uetani, L.S. Kim, T. Nagamatsu, Y. Watanabe, K. Matsuda, T. Sakurai, A 200 MHz 13 mm 2-D DCT macrocell using sense-amplifying pipeline flip-flop scheme, IEEE J. Soli. 29 (1994) 1482-1490.
[27] S.F. Chang, D.G. Messerschmitt, Manipulation and compositing of MC-DCT compressed video, IEEE J. Sel. 13 (1995) 1-11.
[28] R.S. Ledley, The processing of medical images in compressed format, Biomedical Image Processing and Biomedical Visualization: Proc. of SPIE 1905 (1993) 677-687.
[29] Z.Y. Lin, J.D.Z. Chen, Recursive running DCT algorithm and its application in adaptive filtering of surface electrical recording of small intestine, Med. Bio. Engng Comp. 32 (1994) 317-322.
[30] M.H. Lee, G. Crebbin, Classified vector quantization with variable block-sized DCT models, IEE P. Vis. I 141 (1994) 39-48.
About the Author: ROBERT S. LEDLEY has been President of the National Biomedical Research Foundation in Washington, DC, since 1960 and also holds the rank of Professor at Georgetown University Medical Center in the Departments of Physiology and Biophysics, and of Radiology. He is a Fellow of the American Association for the Advancement of Science, a Founding Fellow of the American College of Medical Informatics, and a Senior Member of the Institute of Electrical and Electronics Engineers. For his invention of the world's first whole-body computerized tomograph (CT) scanner, he was inducted into the National Inventors Hall of Fame in 1990, and in 1997 was honored by the President of the United States with the National Medal of Technology, the nation's highest honor for technological achievement. He is founder and Editor-in-Chief of four reviewed scientific journals published by Elsevier Science, including Pattern Recognition. He holds over 60 patents, and is the author of a number of books on electronic engineering, computer science, and computers in medicine.
About the Author: RICHARD E. FRYE obtained a Bachelor of Arts degree in Psychobiology from Long Island University in 1986, a Master of Science degree in Biomedical Science and Biostatistics from Drexel University in 1993, and a Doctor of Medicine and Doctor of Philosophy degree in Physiology and Biophysics from Georgetown University in 1998. After receiving his B.A., Dr. Frye, as a principal investigator of an NIH-funded project, developed computerized tools to measure the biophysical characteristics of the nasal airway. During medical and graduate school, Dr. Frye was a Research Fellow at the National Biomedical Research Foundation, where he investigated methods to optimize plane film stereo X-ray imaging and developed a model of visual cortical spatial-frequency processing for texture analysis and stereo object correspondence recognition. Dr. Frye is currently a Resident Physician in the Department of Pediatrics at Jackson Memorial Hospital, and plans to finish a residency in Child Neurology at Boston Children's Hospital.
Pattern Recognition 33 (2000) 1599–1610
Improved orientation estimation for texture planes using multiple vanishing points

Eraldo Ribeiro, Edwin R. Hancock*

Department of Computer Science, University of York, York YO1 5DD, UK

Received 11 February 1999
Abstract

Vanishing point locations play a critical role in determining the perspective orientation of textured planes. However, if only a single vanishing point is available then the problem is underdetermined: the tilt direction must be computed using supplementary information such as the texture gradient. In this paper we describe a method for multiple vanishing point, and hence complete perspective pose, estimation which obviates the need to compute the texture gradient. The method is based on local spectral analysis. It exploits the fact that spectral orientation is uniform along lines that radiate from vanishing points on the image plane. We experiment with the new method on both synthetic and real-world imagery. This demonstrates that the method provides accurate pose angle estimates, even when the slant angle is large. © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Perspective pose recovery; Planar shape from texture; Spectral analysis; Multiple vanishing points
1. Introduction

The perspective foreshortening of surface patterns is an important cue for the recovery of surface orientation from 2D images [1,2]. Broadly speaking, there are two routes to recovering the parameters of perspective projection for texture patterns. The first of these is to estimate the texture gradient [3,4]. Geometrically, the texture gradient determines the tilt direction of the plane in the line-of-sight of the observer, and its magnitude determines the slant angle of the plane. A more direct and geometrically intuitive alternative route to the local slant and tilt parameters of the surface is to estimate the whereabouts of vanishing points [5,6]. When only a single vanishing point is available, the direction of the texture gradient is still a prerequisite, since the surface orientation parameters can only be determined provided
that the tilt direction is known. However, if two or more vanishing points are available, then not only can the slant and tilt be determined uniquely, they can also be determined more accurately. Unfortunately, locating even a single vanishing point from the texture distribution is not itself a straightforward task. If direct analysis is attempted in the spatial domain, then the tractability of the problem hinges on the regularity and structure of the texture primitives [5,6]. Moreover, multiple vanishing point detection may be even more elusive. It is for this reason that frequency-domain analysis offers an attractive alternative. The main reason for this is that the analysis of spectral moments can provide a convenient means of identifying the individual radial patterns associated with multiple vanishing points.

1.1. Related literature
* Corresponding author. Tel.: +44-1904-43-2767; fax: +44-1904-43-2767.
E-mail address: [email protected] (E.R. Hancock).
Supported by CAPES-BRAZIL, under grant BEX1549/95-2.
To set the work reported in this paper in context, we provide an overview of the related literature. We commence by considering texture gradient methods for shape-from-texture. Broadly speaking, there are two ways in which texture gradients can be used for shape estimation.
The "rst of these is to perform a structural analysis of pre-segmented texture primitives using the geometry of edges, lines or arcs [7}9]. The attractive feature of such methods is the way in which their natural notion of perspective appeals to the psychophysics of visual perception. Unfortunately, structural approaches are heavily dependant on the existence of reliable methods for extracting the required geometric primitives from the raw texture images. Moreover, they cannot cope with scenes in which there are no well-de"ned texture elements. As mentioned above, the second approach is to cast the problem of shape-from-texture in the frequency domain [10}13]. The main advantage of the frequency domain approach is that it does not require an image segmentation as a pre-requisite. For this reason it is potentially more robust than its structural counterpart. Bajcsy and Lieberman [14] were among the "rst to use frequency analysis in order to recover shape from texture. Their work is based on the analysis of the frequency energy gradient for oblique views of outdoor images. Subsequently, the frequency domain has been exploited to provide both local and global descriptions of the perspective distortion of textured surfaces. Speci"cally, Super and Bovik [11] describe texture information in terms of local spectral frequencies. Perspective pose is recovered by back-projecting the spectral energy to minimise a variance criterion. Stone [15] has proposed an iterative method based on a similar local frequency decomposition. The novel feature is to include feedback into the back-projection process, whereby the local frequency "lters undergo a$ne transformation. In this way the method is able to improve the spectral analysis for planes at high slant angles by using an adaptive scale. Underpinning these latter two methods is the local a$ne analysis of the spectral decomposition of textures under global perspective projection. In this respect, there are similarities with the work of Malik and Rosenholtz [16] which provides an a$ne quilting for neighbouring texture patches. Alternatively, planar orientation can be recovered provided that vanishing points in the image plane can be detected. Since most textures are normally rich in geometric distortion when viewed under perspective geometry, vanishing points can be detected based on the projected texture and used to solve for planar surface orientation [5,6]. For example, Kender has proposed an aggregation transform which accumulates edges directions of the texture primitives in order to estimate a vanishing point for planar surfaces [5]. Kwon et al., on the other hand have employed mathematical morphology in order to estimate planar surface orientation from the location of a single vanishing point [6]. These methods, however, approach the texture analysis in a structural manner. They are hence limited by the inherent drawbacks mentioned above.
1.2. Paper outline

The aim of the work reported in this paper is to combine the advantages of the local spectral analysis of textures with the geometry of vanishing point detection in order to recover planar surface orientation. To achieve this goal, we model the texture content in terms of the local spectral frequency. This provides a more general representation of texture. It is therefore both more flexible and more stable to local texture variations. However, rather than using the radial frequency [11,14], we use the angular information provided by the local spectral distribution.

Our aims are twofold. Firstly, we introduce the prerequisites for our study by providing a detailed review of the properties of the local spectral moments under perspective geometry. We commence by considering the local distortions of the spectral moments. Here, we follow Super and Bovik [11] and Krumm and Shafer [10] by using the Taylor expansion to make a local linear approximation to the transformation between the texture plane and the image plane. In other words, we identify the affine transformation that locally approximates the global perspective geometry. With this local approximation to hand, we follow Malik and Rosenholtz [13] by applying Bracewell's [17] affine Fourier theorem to compute the local frequency-domain distortions of the spectral distribution. Based on this analysis we make a novel contribution and show that lines of uniform spectral orientation radiate from the vanishing point. The practical contribution of this paper is to exploit this property to locate multiple vanishing points. We provide an analysis to show how the slant and tilt parameters can be recovered from a pair of such points. We illustrate the new method for estimating plane orientation on a variety of real-world and synthetic imagery. This study reveals that the method is accurate even when the slant angle becomes large.

The outline of this paper is as follows. In Section 2 we review the perspective geometry of texture planes. Section 3 shows how the vanishing point position is related to the slant and tilt angles of the texture plane. Details of the projective distortion of the power spectrum are presented in Section 4. These three sections provide the mathematical prerequisites for our study. They provide a synopsis of existing results widely used for shape-from-texture. For instance, Super and Bovik [11] and Krumm and Shafer [10] have both made use of locally affine approximations of perspective geometry. The relationship between perspective pose parameters and vanishing point position is nicely described in the textbook of Haralick and Shapiro [18]. The Fourier analysis of local spectra under affine projection is exploited by Malik and Rosenholtz [13] in their work on curved surface analysis. Section 5 develops the novel idea underpinning the paper, namely, that lines radiating from vanishing points
have a uniform spectral angle. Section 6 experiments with the new method for estimating perspective pose on both synthetic and real-world images. Finally, Section 7 offers some conclusions.
2. Geometric modelling

We commence by reviewing the projective geometry for the perspective transformation of points on a plane. Specifically, we are interested in the perspective transformation between the object-centred co-ordinates of points on the texture plane and the viewer-centred co-ordinates of the corresponding points on the image plane, as shown in Fig. 1. Suppose that the texture plane is a distance h from the camera, which has focal length f < 0. Consider two corresponding points that have co-ordinates X_t = (x_t, y_t)^T on the texture plane and X_i = (x_i, y_i)^T on the image plane. The perspective transformation between the two co-ordinate systems is

$$X_i = T_p X_t, \quad (1)$$

where T_p is the perspective transformation matrix. We represent the orientation of the viewed surface plane using the slant angle σ and tilt angle τ.
This parametrisation is a natural way to model local surface orientation. Fig. 2 illustrates the slant and tilt model. For a given plane, considering a viewer-centred representation, the slant is the angle between the viewer's line of sight and the normal vector of the plane. The tilt is the angle of rotation of the normal vector around the sight-line axis. The elements of the transformation matrix T_p can be computed using the slant angle σ and tilt angle τ in the following manner:
$$T_p = \frac{f \cos\sigma}{h - x_t \sin\sigma} \begin{pmatrix} \cos\tau & -\sin\tau \\ \sin\tau & \cos\tau \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 1/\cos\sigma \end{pmatrix}. \quad (2)$$
The perspective transformation in Eq. (2) represents a non-linear geometric distortion of a surface texture pattern onto an image plane pattern. Unfortunately, the non-linear nature of the transformation makes Fourier-domain analysis of the texture frequency distribution somewhat intractable. In order to proceed, we therefore derive a local linear approximation to the perspective transformation. However, it should be stressed that the global quilting of the local approximations preserves the perspective effects required for the recovery of shape-from-texture. With this linear model, the perspective distortion can be represented as shown in Fig. 3.
Fig. 1. Perspective projection of a planar surface onto an image plane. The projection of a local patch over the texture plane onto the image plane is also shown.
diagram, T*_p is the linear version of the perspective transformation given by Eq. (2). Linear approximations of the local spectral distortion due to perspective projection have been employed by several authors in spectral shape-from-texture. Super and Bovik [11] derive their geometric model in terms of the instantaneous frequency at a point of the image plane. Krumm and Shafer [10] use a first-order Taylor series approximation of the perspective projection at a local point. Malik and Rosenholtz [16] employ a differential approximation of the local spectral perspective distortion. We follow Krumm and Shafer [10] and linearise T_p using a first-order Taylor formula. Our transformation, however, is given in terms of the slant and tilt angles instead of the normal vector components p and q.

Let (x_{0t}, y_{0t}, h) be the origin, or expansion point, of the local co-ordinate system of the resulting affine transformation. This origin projects to the point (x_{0i}, y_{0i}, f) on the image plane. We denote the corresponding local co-ordinate system on the image plane by X̃_i = (x̃_i, ỹ_i, f), where x_i = x̃_i + x_{0i} and y_i = ỹ_i + y_{0i}. The linearised version of T_p in Eq. (2) is obtained through the Jacobian J(·) of X_i, where each partial derivative is calculated at the point X̃_i = 0. After the necessary algebra, the resulting linear approximation is

T*_p = J(X_i)|_{X̃_i = 0} = J(T_p X_t)|_{X̃_i = 0},   (3)

where

J(X_i) = [ ∂x_t(x_i, y_i)/∂x_i   ∂x_t(x_i, y_i)/∂y_i ; ∂y_t(x_i, y_i)/∂x_i   ∂y_t(x_i, y_i)/∂y_i ].   (4)

Rewriting T*_p in terms of the slant and tilt angles, we have

T*_p = (1/(h f cos σ)) · [ x_{0i} sin σ + f cos τ cos σ   −f sin τ ; y_{0i} sin σ + f sin τ cos σ   f cos τ ],   (5)

where Ω = f cos σ + sin σ (x_{0i} cos τ + y_{0i} sin τ). Hence T*_p depends on the expansion point (x_{0i}, y_{0i}), which is a constant.

The transformation T*_p in Eq. (5) operates from the texture plane to the image plane. Later on, when we come to consider the Fourier transform from the observed texture distribution on the image plane to that on the texture plane, it is the inverse of the transpose of the transformation matrix, i.e. (T*_p^{-1})^T, which will be of interest. The matrix is given by
Fig. 2. Slant and tilt parametrisation of the plane orientation.
Fig. 3. Perspective mapping of X_t to X_i, where T*_p is a linear version of the perspective projection. T*_p is an affine transformation between X_t and X_i.
(T*_p^{-1})^T = (h cos σ / Ω) · [ f cos τ   −y_{0i} sin σ − f sin τ cos σ ; f sin τ   x_{0i} sin σ + f cos τ cos σ ].   (6)
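As a quick consistency check on Eqs. (5) and (6) as reconstructed here, the sketch below (ours; the parameter values are illustrative) builds T*_p at an expansion point and verifies numerically that the closed form of Eq. (6) equals the inverse transpose of Eq. (5).

```python
import numpy as np

def T_star(x0, y0, sigma, tau, f=-1.0, h=10.0):
    """Local affine approximation T*_p of Eq. (5) at expansion point (x0, y0)."""
    A = np.array([
        [x0 * np.sin(sigma) + f * np.cos(tau) * np.cos(sigma), -f * np.sin(tau)],
        [y0 * np.sin(sigma) + f * np.sin(tau) * np.cos(sigma),  f * np.cos(tau)],
    ])
    return A / (h * f * np.cos(sigma))

def T_star_inv_T(x0, y0, sigma, tau, f=-1.0, h=10.0):
    """Closed form of (T*_p^{-1})^T from Eq. (6)."""
    omega = f * np.cos(sigma) + np.sin(sigma) * (x0 * np.cos(tau) + y0 * np.sin(tau))
    M = np.array([
        [f * np.cos(tau), -y0 * np.sin(sigma) - f * np.sin(tau) * np.cos(sigma)],
        [f * np.sin(tau),  x0 * np.sin(sigma) + f * np.cos(tau) * np.cos(sigma)],
    ])
    return h * np.cos(sigma) / omega * M

x0, y0, sigma, tau = 0.3, -0.2, np.radians(50.0), np.radians(20.0)
lhs = np.linalg.inv(T_star(x0, y0, sigma, tau)).T
rhs = T_star_inv_T(x0, y0, sigma, tau)
assert np.allclose(lhs, rhs)   # Eqs. (5) and (6) are mutually consistent
```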
3. Vanishing points
The net effect of the global perspective transformation is to distort the viewer-centred texture pattern in the direction of the vanishing point V = (x_v, y_v)^T in the image plane. Suppose that the object-centred texture pattern consists of a family of parallel lines which are oriented in the direction of the vanishing point. When transformed into the image-centred co-ordinate system, this family of lines can be represented using the following set of parametric equations [18]:

L = {X | X = A + λB},   (7)

where λ is the parameter of the family. The three constants forming the vector B = (b_1, b_2, b_3)^T are the direction cosines for the entire family L. The individual lines in
L are each parametrised by the vector A = (a_1, a_2, a_3)^T. The co-ordinates of the vanishing point V = (x_v, y_v)^T are computed using the standard perspective projection equations x_i = (f/(z_t − h)) x_t and y_i = (f/(z_t − h)) y_t, where z_t is the depth co-ordinate of the texture-plane point. The vanishing point is found by taking the limit of the co-ordinates as λ → ∞, i.e.

x_v = lim_{λ→∞} x_i = lim_{λ→∞} f (a_1 + λb_1)/(a_3 + λb_3 − h) = f b_1/b_3,

y_v = lim_{λ→∞} y_i = lim_{λ→∞} f (a_2 + λb_2)/(a_3 + λb_3 − h) = f b_2/b_3.   (8)

Suppose that the vector N = (p, q, 1)^T represents the surface normal to the texture plane in the viewer-centred co-ordinate system of the image. Since every line lying on the texture plane will be perpendicular to the normal vector, then by using Eq. (8) we have

N · B = p b_1 + q b_2 + b_3 = p (x_v b_3/f) + q (y_v b_3/f) + b_3 = 0.   (9)
Since b_3 ≠ 0, this implies that p x_v + q y_v + f = 0. In order to solve this equation and to determine the surface-plane normal vector, we need to estimate two different vanishing points in the image plane. Suppose that the two points are V_1 = (x_{v1}, y_{v1})^T and V_2 = (x_{v2}, y_{v2})^T. The resulting normal vector components p and q are found by solving the system of simultaneous linear equations

[ x_{v1}   y_{v1} ; x_{v2}   y_{v2} ] [ p ; q ] = −[ f ; f ].   (10)

The solution parameters p and q are

p = f (y_{v1} − y_{v2}) / (x_{v1} y_{v2} − x_{v2} y_{v1}),   q = f (x_{v2} − x_{v1}) / (x_{v1} y_{v2} − x_{v2} y_{v1}).   (11)

Using the two slope parameters, the slant and tilt angles are computed using the formulae

σ = arccos( 1 / √(p² + q² + 1) ),   τ = arctan(q/p).   (12)
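Recovering σ and τ from a pair of vanishing points via Eqs. (10)-(12) amounts to solving a 2 × 2 linear system. A short sketch (ours, with hypothetical names) together with a round-trip check:

```python
import numpy as np

def slant_tilt_from_vps(v1, v2, f):
    """Recover slant and tilt from two vanishing points, Eqs. (10)-(12)."""
    D = np.array([v1, v2], dtype=float)        # rows are (x_v, y_v)
    p, q = np.linalg.solve(D, [-f, -f])        # Eq. (10)
    sigma = np.arccos(1.0 / np.sqrt(p**2 + q**2 + 1.0))   # Eq. (12)
    tau = np.arctan2(q, p)   # arctan(q/p), with the quadrant resolved
    return sigma, tau

# Round trip: pick a normal N = (p, q, 1) and generate two points
# on the vanishing line p*x + q*y + f = 0.
f, p, q = -1.0, 0.5, 0.3
v1 = (1.0, -(f + p * 1.0) / q)
v2 = (2.0, -(f + p * 2.0) / q)
sigma, tau = slant_tilt_from_vps(v1, v2, f)
print(np.degrees(sigma), np.degrees(tau))
```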
4. Projective distortion of the power spectrum

In Section 2 we described the affine approximation of the perspective projection using a first-order Taylor series. The approximation allows us to model the global perspective geometry using local affine patches which are quilted together. The model is similar to the scaled orthographic projection [19]. Furthermore, in Section 3 we presented the geometric relationships necessary to recover planar surface orientation from two vanishing points in the image. In this section we will develop these
ideas one step further by showing how the spectral content of the locally affine patches relates to the global parameters of perspective geometry. Specifically, we will describe how the vanishing point position V = (x_v, y_v) can be estimated from local spectral information.

The Fourier transform provides a representation of the spatial frequency distribution of a signal. In this section we show how the local spectral distortion resulting from our linear approximation of the perspective projection of a texture patch can be computed using an affine transformation of the Fourier representation. We commence by using an affine transform property of the Fourier domain [17]. This property relates the linear effect of an affine transformation A in the spatial domain to the frequency-domain distribution. Suppose that G(·) represents the Fourier transform of a signal. Furthermore, let X be a vector of spatial co-ordinates and let U be the corresponding vector of frequencies. According to Bracewell et al., the distribution of image-plane frequencies U_i resulting from the Fourier transform of the affine transformation X_i = A X_t + B is given by

G(U_i) = (1/|det(A)|) e^{2πj U_t^T A^{-1} B} G[(A^T)^{-1} U_t].   (13)

In our case, the affine transformation is T*_p as given in Eq. (5), and there are no translation coefficients, i.e. B = 0. As a result, Eq. (13) simplifies to

G(U_i) = (1/|det(T*_p)|) G[(T*_p^T)^{-1} U_t].   (14)

In other words, the affine transformation of co-ordinates T*_p induces an affine transformation (T*_p^T)^{-1} on the texture-plane frequency distribution. The spatial-domain transformation matrix and the frequency-domain transformation matrix are simply the inverse transpose of one another.

We will consider here only the affine distortion over the frequency peaks, i.e., the energy amplitude will not be considered in the analysis. For practical purposes we will use the local power spectrum as the spectral representation of the image. This describes the energy distribution of the signal as a function of its frequency content. In this way we ignore complications introduced by phase information. Using the power spectrum, small changes in phase due to translation will not affect the spectral information, and Eq. (14) will hold.

Our overall goal is to consider the effect of the perspective transformation on the power spectrum. In practice, however, we will be concerned with periodic textures in which the power spectrum is strongly peaked. In this case we can confine our attention to the way in which the dominant frequency components transform. According to our affine approximation and Eq. (14), the way the Fourier domain transforms locally is governed by

U_i = (T*_p^T)^{-1} U_t.   (15)
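Eq. (15) can be verified numerically: warp a sinusoidal texture by an affine map A (with B = 0) and check that its FFT peak moves to (A^T)^{-1} U_t. The following sketch is our own illustration, not the authors' implementation; the frequency values are arbitrary.

```python
import numpy as np

n = 256
A = np.array([[1.2, 0.3],          # some invertible affine map, B = 0
              [-0.2, 0.9]])
U_t = np.array([12.0, 5.0]) / n    # texture-plane frequency (cycles/pixel)

# Image pattern g_i(X_i) = g_t(A^{-1} X_i): evaluate the warped cosine directly.
ys, xs = np.mgrid[0:n, 0:n]
X = np.stack([xs.ravel(), ys.ravel()])
Xt = np.linalg.inv(A) @ X
g = np.cos(2 * np.pi * (U_t @ Xt)).reshape(n, n)

# Locate the FFT peak (it lands in the FFT bin nearest the true frequency).
F = np.abs(np.fft.fft2(g))
F[0, 0] = 0
v, u = np.unravel_index(np.argmax(F), F.shape)
u = u - n if u > n // 2 else u
v = v - n if v > n // 2 else v

U_pred = np.linalg.inv(A.T) @ U_t   # Eq. (15): U_i = (A^T)^{-1} U_t
print('measured  :', u / n, v / n)
print('predicted :', U_pred, '(up to sign, since the signal is real)')
```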
In the next section we will use this local spectral distortion model to establish some properties of the projected spectral distribution in the viewer-centred co-ordinate system. In particular, we will show that lines radiating from the vanishing point connect points with identically oriented spectral components. We will exploit this property to develop a geometric algorithm for recovering the image-plane position of the vanishing-point, and hence, for estimating the orientation of the texture-plane.
5. Lines of constant spectral orientation

In this section we focus on the directional properties of the local spectrum distribution. We will show how the uniformity of the spectral angle over the image plane can be used to estimate the vanishing point location and hence compute planar surface orientation.

On the texture plane the frequency-domain angle of the unprojected spectral component is given by β = arctan(v_t/u_t). Using the affine transformation of frequencies given in Eq. (15), and the transformation in Eq. (5), after perspective projection the corresponding frequency-domain angle in the image plane is
α = arctan(v_i/u_i) = arctan[ (u_t f sin τ + v_t (x_{0i} sin σ + f cos τ cos σ)) / (u_t f cos τ − v_t (y_{0i} sin σ + f sin τ cos σ)) ].   (16)

For simplicity, we confine our attention to a rotated system of image-plane co-ordinates in which the x-axis is aligned in the tilt direction. In this rotated system of co-ordinates, the above expression for the image-plane spectral angle simplifies to

α = −arctan[ (f cos σ + x_{0i} sin σ) v_t / (y_{0i} v_t sin σ − f u_t) ].   (17)

Let us now consider a line in the image plane radiating from a vanishing point which results from the projection of a family of horizontal parallel lines on the texture plane. This family of parallel lines would originally be described by the spectral component U_s = (0, v_t)^T. After perspective projection this family of parallel lines can be written in the "normal-distance" representation as

L : r = x_{0i} cos θ + y_{0i} sin θ   ∀(x_{0i}, y_{0i}) ∈ L,   (18)

where r is the length of the normal from the line to the origin, and θ is the angle subtended between the line normal and the x-axis. Fig. 4 illustrates the geometry of this line representation together with the angle α of the spectral component at the point of expansion (x_{0i}, y_{0i}).

Fig. 4. Normal form for lines radiating from a vanishing point (x_v, y_v). A local spectral component centred at a point (x_{0i}, y_{0i}) on the line is also shown.

Since this line passes through the vanishing point V = (x_v, y_v) = (−f cos σ/sin σ, 0)^T on the image plane, for f < 0, r can also be written as

r = x_v cos θ + y_v sin θ = −(f cos σ/sin σ) cos θ.   (19)

Substituting Eq. (19) into Eq. (18) and solving for y_{0i} we obtain

y_{0i} = −(f cos σ + x_{0i} sin σ) / (tan θ sin σ).   (20)

By substituting the above expression in Eq. (17) and using the fact that u_s = 0, after some simplification we find that α = θ, ∀(x_{0i}, y_{0i}) ∈ L. As a result, each line belonging to the family L connects points on the image plane whose local spectral distributions have a uniform spectral angle α. These lines intersect at a unique point, which is a vanishing point on the image plane.

By using this property we can find the co-ordinates of vanishing points on the image plane by connecting those points whose local spectral components have identical spectral angles. We meet this goal by searching for lines along which the angular correlation between the spectral moments is maximum. To proceed we adopt a polar representation for the power spectrum. Suppose P(η, α) is the power spectrum in polar co-ordinates, where η = √(u_i² + v_i²) is the radial variable and α = arctan(v_i/u_i) is the angular variable. Integrating over the radial variable, the angular power spectrum is given by
P(α) = ∫_η P(η, α) dη.   (21)
The angular distribution of spectral power at any given image location can now be matched against those of similar orientation by maximising the angular correlation. For the
purpose of matching we use the normalised correlation

ρ = ( ∫_α P_1(α) P_2(α) dα ) / ( ∫_α P_1(α) dα · ∫_α P_2(α) dα ),   (22)
where P_1(α) and P_2(α) are the two angular distributions being compared. The points with the highest values of ρ can then be connected to determine a line pointing in the direction of the vanishing point.

To compute the angular power distribution P(α) we require a way of sampling the local power spectrum. In particular, we need a sampling procedure which provides a means of recovering the angular orientation information residing in the peaks of the power spectrum. We accomplish this by simply searching for local maxima over a filtered representation of the local power spectrum. Since we are interested in the angular information rather than the frequency content of the power spectrum, we ignore the very low-frequency components of the power spectrum, since these mainly describe micro-texture patterns or very slow energy variation. In order to obtain a smooth spectral response we therefore use the Blackman-Tukey (BT) power-spectrum estimator, which is the frequency response of the windowed autocorrelation function. We employ a triangular smoothing window w(X) [20] due to its stable spectral response. The spectral estimator is then

P(U_i)_{BT} = F{ c_xx(X_i) · w(X_i) },   (23)

where c_xx is the estimated autocorrelation function of the image patch.

Providing that we have at least two representative spectral peaks, we can directly generate line directions according to the angular property we have described in the previous section of this paper. We can use as many distinct spectral components as we can estimate; however, a two-component decomposition is sufficient for our purposes.
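The angular statistics of Eqs. (21)-(23) are straightforward to prototype. The sketch below is ours, not the authors' code: the Blackman-Tukey estimate is formed by windowing the circular autocorrelation with a separable triangular (Bartlett) window, the angular power spectrum is accumulated by binning over orientation, and two distributions are compared with the normalised correlation of Eq. (22).

```python
import numpy as np

def bt_power_spectrum(patch):
    """Blackman-Tukey-style estimate: FFT of the windowed autocorrelation (Eq. (23))."""
    f = np.fft.fft2(patch - patch.mean())
    acf = np.real(np.fft.ifft2(np.abs(f) ** 2))   # circular autocorrelation
    n = patch.shape[0]
    t = 1.0 - np.abs(np.fft.fftfreq(n) * 2.0)     # triangular window, lag 0 at index 0
    w = np.outer(t, t)
    return np.abs(np.fft.fft2(acf * w))

def angular_spectrum(P, nbins=64):
    """P(alpha): integrate the polar power spectrum over the radial variable (Eq. (21))."""
    n = P.shape[0]
    u = np.fft.fftfreq(n)
    uu, vv = np.meshgrid(u, u)
    alpha = np.mod(np.arctan2(vv, uu), np.pi)     # orientation is pi-periodic
    radius = np.hypot(uu, vv)
    mask = radius > 0.05                          # drop the very low frequencies
    hist, _ = np.histogram(alpha[mask], bins=nbins,
                           range=(0, np.pi), weights=P[mask])
    return hist

def angular_correlation(P1, P2):
    """Normalised correlation of two angular distributions (Eq. (22), discretised)."""
    return np.sum(P1 * P2) / (np.sum(P1) * np.sum(P2))
```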
6. Experiments

In this section we provide some results which illustrate the accuracy of planar pose estimation achievable with our new shape-from-texture algorithm. This evaluation is divided into three parts. We commence by considering textures with known ground-truth slant and tilt. This part of the study is based on both synthetic textures and projected Brodatz textures [21]. The second part of our experimental study focuses on natural texture planes where the ground truth is unknown. In order to give some idea of the accuracy of the slant and tilt estimation process, we back-project the textures onto the fronto-parallel plane. Since the textures are man-made and rectilinear in nature, the inaccuracies in the estimation process manifest themselves as residual skew. Finally, we compare the sensitivity of the new method, based on multiple vanishing points, with that of a gradient-based, or single vanishing point, method.
Fig. 5. Artificial texture images: Group 1, sinusoidal texture images.
6.1. Synthetic texture planes

We commence with some examples for a synthetic regular sinusoidal texture. Fig. 5 shows the synthetic texture in a number of poses. Superimposed on the textures are some of the detected lines radiating from the vanishing points. In Table 1 we list the ground-truth and estimated orientation angles. Also listed is the absolute error. The agreement between the ground-truth and estimated values is generally very good. A point to note from this first set of images is that the estimated slant error is
Table 1
Actual vs. estimated slant and tilt values (Artificial Group 1); angles in degrees

Image   Actual σ   Actual τ   Estimated σ   Estimated τ   Abs. error σ   Abs. error τ
(a)     20         0          19.7          0.0           0.3            0.0
(b)     30         −30        30.0          −28.2         0.0            1.8
(c)     45         45         44.7          46.0          0.3            1.0
(d)     50         225        51.1          225.4         1.1            0.4
(e)     60         120        58.7          118.3         1.3            1.7
(f)     70         0          72.5          −0.3          2.5            0.3
proportional to the slant angle. This is due to the significant variations of the texture scale for larger slant angles. This suggests that an adaptive-scale process should be used to improve the accuracy of the method. From the results obtained from this set of images, there is no systematic variation in the tilt angle errors, which have an average value of 0.8°. The average error for the slant angle is 0.9°.

A second group of artificial images is shown in Fig. 6. In this group we have taken three different artificial textures and have projected them onto planes of known slant and tilt. The textures are composed of regularly spaced geometric primitives of uniform size. Specifically, we have chosen elliptical, rectangular and lattice-shaped primitives. However, it is important to stress that in this case the texture elements are not oriented in the direction of the vanishing point. Superimposed on the projected textures are the estimated lines of uniform spectral orientation. In Table 2 we list the ground-truth and estimated values of the slant and tilt angles together with the corresponding errors. The agreement between the estimated and ground-truth angles is good. Moreover, the computed errors are largely independent of the slant angle. The average slant error is 1° and the average tilt error is 2.5°.
Fig. 6. Artificial texture images: Group 2, regular geometric images.
A third group of images is shown in Fig. 7. In this group we have taken three different texture images from the Brodatz album and have projected them onto planes of known slant and tilt. The textures are regular natural textures with an almost regular element distribution. Superimposed on the projected textures are the estimated lines of uniform spectral orientation. The values for the estimated orientation angles are listed in Table 3.
6.2. Real-world examples
Table 2
Actual vs. estimated slant and tilt values (Artificial Group 2); angles in degrees

Image   Actual σ   Actual τ   Estimated σ   Estimated τ   Abs. error σ   Abs. error τ
(a)     20         0          20.1          1.0           0.1            1.0
(b)     45         45         46.2          46.4          1.2            1.4
(c)     30         −30        30.0          −32.4         0.0            2.4
(d)     50         225        49.5          222.4         0.5            2.6
(e)     30         −30        30.1          −35.2         0.1            5.2
(f)     45         45         41.2          44.0          3.8            1.0
This part of the experimental work focuses on real-world textures with unknown ground truth. The textures used in this study are two views of a brick wall, a York pantile roof, and the lattice casing enclosing a PC monitor. The images were collected using a Kodak DC210 digital camera and are shown in Fig. 8. There is some geometric distortion of the images due to the camera optics. This can be seen by placing a ruler or straight edge on the
Table 3
Actual vs. estimated slant and tilt values (Brodatz textures); angles in degrees

Image   Actual σ   Actual τ   Estimated σ   Estimated τ   Abs. error σ   Abs. error τ
(a)     30         0          34.5          0.0           4.5            0.0
(b)     50         225        53.9          223.5         3.9            1.5
(c)     30         0          27.5          0.0           2.5            0.0
(d)     45         45         51.7          44.6          6.7            0.4
(e)     30         −30        30.4          −23.3         0.4            6.7
(f)     60         120        59.6          125.0         0.4            5.0
Fig. 7. Natural texture images: Group 3, artificially projected Brodatz textures. (a) and (b) D101; (c) and (d) D1; (e) and (f) D20.
Fig. 8. Outdoor texture images. (a) and (b) Brick wall; (c) Roof; (d) PC casing.
Fig. 9. Back-projected outdoor texture images. (a) and (b) Brick wall; (c) Roof; (d) PC casing.
brick-wall images and observing the deviations along the lines of mortar between the bricks. Superimposed on the images are the lines of uniform spectral orientation. In the case of the brick-wall images these closely follow the mortar lines. In Fig. 9 we show the back-projection of the textures onto the fronto-parallel plane using the estimated orientation angles. In the case of the brick wall, any residual skew is due to error in the estimation of the slant and tilt parameters. It is clear that the slant estimates are accurate, but there is some residual skew due to poorer tilt estimation.
6.3. Sensitivity analysis: multiple vanishing points and the gradient-based method

Finally, we provide some comparison between our two-vanishing-point method and the use of a single vanishing point in conjunction with texture gradient information [22]. In this analysis we investigate the behaviour of the estimated orientation with varying slant and tilt angles. We use a sinusoidal image texture of the sort used in Fig. 5. We commence by studying the errors for the slant and tilt angle estimation when only the slant angle varies. The graphs in Fig. 10 plot the slant and tilt errors when the orientation varies from 10° to 80° and the tilt angle remains constant at 0°. In Fig. 10(a), the slant error is considerably larger for the gradient-based method than for the multiple vanishing point method. Moreover, our
new method provides better estimates for small slant angles. It is worth commenting that, since we operate with a single fixed-scale representation of the power spectrum, there is a danger of texture sampling errors for large slant angles. However, the average slant error is still small, at around 2°. The average slant error for the gradient method in this case is 9.5°. Turning our attention to the tilt error in Fig. 10(b), we observe that the new method accurately estimates the tilt angle, except when the slant angle is greater than 50°. Again the fixed-scale problem can be observed. The accuracy of the gradient-based method is low for very small slant angles. This is due to difficulties in estimating the direction of the energy gradient caused by small variations in texture density with location on the image plane. In the case of tilt estimation, both methods have similar accuracy for medium slant angles. However, the gradient-based method also loses accuracy for larger slant angles due to problems related to the texture scale.

In Fig. 11 we show the estimation error under varying tilt and fixed slant angle. The tilt varies over the range from 0° to 170°, whilst the slant angle remains constant at 40°. In Fig. 11(a) we show the slant error. The error for the new multiple vanishing point method is very small and almost constant, with an average of 0.8°. By contrast, the average error for the gradient method is 13°. In Fig. 11(b), where we plot the tilt error, there is little to distinguish the two methods. The average value of the tilt error for the new method is 2.4°, while for the gradient-based method it is 2.8°.
Fig. 10. Slant/tilt error plot for both methods. Image slant varying from 10° to 80°. (a) Estimated slant error; (b) estimated tilt error.
Fig. 11. Slant/tilt error plot for both methods. Image tilt varying from 0° to 170°. (a) Estimated slant error; (b) estimated tilt error.
7. Conclusions

We have described an algorithm for estimating the perspective pose of textured planes from pairs of vanishing points. The method searches for sets of lines that connect points which have identically oriented spectral moments. These lines intersect at the vanishing point. The main advantage of the method is that it does not rely on potentially unreliable texture-gradient estimates to constrain the tilt angle. As a result, the estimated tilt angles are more accurately determined.

There are a number of ways in which the ideas presented in this paper can be extended. In the first instance, we are considering ways of improving the search for the vanishing points. Specific candidates include Hough-based voting methods. The second line of investigation is to extend our ideas to curved surfaces, using the method to estimate local slant and tilt parameters. Studies aimed at developing these ideas are in hand and will be reported in due course.
References

[1] D. Marr, Vision: A Computational Investigation into the Human Representation and Processing of Visual Information, Freeman, New York, 1982.
[2] J.J. Gibson, The Perception of the Visual World, Houghton Mifflin, Boston, 1950.
[3] E.J. Cutting, R.T. Millard, Three gradients and the perception of flat and curved surfaces, J. Exp. Psychol. 113 (2) (1984) 198-216.
[4] K.A. Stevens, On texture gradients, J. Exp. Psychol. 113 (2) (1984) 217-220.
[5] J.R. Kender, Shape from texture: an aggregation transform that maps a class of texture into surface orientation, in: 6th IJCAI, Tokyo, 1979, pp. 475-480.
[6] J.S. Kwon, H.K. Hong, J.S. Choi, Obtaining a 3-d orientation of projective textures using a morphological method, Pattern Recognition 29 (1996) 725-732.
[7] K. Ikeuchi, Shape from regular patterns, Artif. Intell. 22 (1984) 49-75.
[8] J. Aloimonos, M.J. Swain, Shape from texture, Biol. Cybernet. 58 (5) (1988) 345-360.
[9] K. Kanatani, T. Chou, Shape from texture: general principle, Artif. Intell. 38 (1989) 1-48.
[10] J. Krumm, S.A. Shafer, Texture segmentation and shape in the same image, IEEE International Conference on Computer Vision, 1995, pp. 121-127.
[11] B.J. Super, A.C. Bovik, Planar surface orientation from texture spatial frequencies, Pattern Recognition 28 (5) (1995) 729-743.
[12] M.J. Black, R. Rosenholtz, Robust estimation of multiple surface shapes from occluded textures, in: IEEE International Symposium on Computer Vision, 1995, pp. 485-490.
[13] J. Malik, R. Rosenholtz, A differential method for computing local shape-from-texture for planar and curved surfaces, IEEE Conference on Vision and Pattern Recognition, 1993, pp. 267-273.
[14] R. Bajcsy, L. Lieberman, Texture gradient as a depth cue, Comput. Graphics Image Process. 5 (1976) 52-67.
[15] J.V. Stone, S.D. Isard, Adaptive scale filtering: a general method for obtaining shape from texture, IEEE Trans. Pattern Anal. Mach. Intell. 17 (7) (1995) 713-718.
[16] J. Malik, R. Rosenholtz, Recovering surface curvature and orientation from texture distortion: a least squares algorithm and sensitivity analysis, Lecture Notes in Computer Science - ECCV'94, Vol. 800, 1994, pp. 353-364.
[17] R.N. Bracewell, K.-Y. Chang, A.K. Jha, Y.-H. Wang, Affine theorem for two-dimensional Fourier transform, Electron. Lett. 29 (3) (1993) 304.
[18] R.M. Haralick, L.G. Shapiro, Computer and Robot Vision, Addison-Wesley, Reading, MA, 1993.
[19] J. Aloimonos, Perspective approximations, Image Vision Comput. 8 (3) (1990) 179-192.
[20] S.M. Kay, Modern Spectral Estimation: Theory and Application, Prentice-Hall, Englewood Cliffs, NJ, 1988.
[21] P. Brodatz, Textures: A Photographic Album for Artists and Designers, Dover, New York, 1966.
[22] E. Ribeiro, E.R. Hancock, 3-d planar orientation from texture: estimating vanishing point from local spectral analysis, IX British Machine Vision Conference, September 1998, pp. 326-335.
About the Author: ERALDO RIBEIRO is currently undertaking research towards a D.Phil. degree in computer vision in the Department of Computer Science at the University of York. Prior to this he gained his Master of Science degree with distinction in Computer Science (Image Processing) at the Federal University of Sao Carlos (UFSCar-SP), Brazil in 1995. His first degree is in Mathematics from the Catholic University of Salvador, Brazil (1992). His research interests are in shape-from-texture techniques and 3-D scene analysis.

About the Author: EDWIN HANCOCK gained his B.Sc. in physics in 1977 and Ph.D. in high energy nuclear physics in 1981, both from the University of Durham, UK. After a period of postdoctoral research working on charm-photo-production experiments at the Stanford Linear Accelerator Centre, he moved into the fields of computer vision and pattern recognition in 1985. Between 1981 and 1991, he held posts at the Rutherford-Appleton Laboratory, the Open University and the University of Surrey. He joined the University of York as a lecturer in the Department of Computer Science in July 1991. After being promoted to Senior Lecturer in October 1997 and to Reader in October 1998, he was appointed Professor of Computer Vision in December 1998. He leads a group of some 15 researchers in the areas of computer vision and pattern recognition. He has published about 180 refereed papers in the fields of high energy nuclear physics, computer vision, image processing and pattern recognition. He was awarded the 1990 Pattern Recognition Society Medal and received an Outstanding Paper Award in 1997. Professor Hancock serves as an Associate Editor of the journal Pattern Recognition and has been a guest editor for the Image and Vision Computing Journal. He is currently guest-editing a special edition of the Pattern Recognition journal devoted to energy minimisation methods in computer vision and pattern recognition. He chaired the 1994 British Machine Vision Conference and has been a programme committee member for several national and international conferences.
Pattern Recognition 33 (2000) 1611-1620
Wavelet coefficients clustering using morphological operations and pruned quadtrees
Eduardo Morales, Frank Y. Shih*
Computer Vision Laboratory, Department of Computer and Information Science, New Jersey Institute of Technology, Newark, NJ 07102, USA
Received 9 February 1999; received in revised form 1 June 1999; accepted 1 June 1999
Abstract Transform coding has been extensively applied in image compression. The wavelet transform possesses the characteristic of providing spatial and frequency domain information. This characteristic plays an important role in image compression, so that identification and selection of the significant coefficients in the wavelet transform become easier. The result has the advantages of better compression ratio and better image quality. The paper presents a new approach to create an efficient clustering of the significant coefficients in the wavelet transform based on morphological operations and pruned quadtrees. In this way, only the significant coefficients and their map will be encoded and transmitted. The decoder process will use the map to place the significant coefficients in the correct locations and then apply the inverse wavelet transform to reconstruct the original image. Experimental results show that the combination of morphological operations and pruned quadtrees outperforms the conventional quadtrees by a compression ratio of 2 to 1 with similar image quality. 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved. Keywords: Wavelet transform; Image compression; Quadtree; Mathematical morphology
1. Introduction

Image compression using transform coding techniques [1-4] has been an active research area for many years. One of their characteristics is the ability to decorrelate data. As a consequence, a "dense" signal is transformed into a "sparse" signal where most of the information is concentrated in just a few coefficients. Compression can be achieved by setting all non-significant coefficients to a single value and then applying entropy encoding. The resulting bit sequence will mostly contain the information about the significant coefficients.

One popular transform during the last few years has been the Discrete Cosine Transform (DCT). Industrial standards for compressing still images (e.g. JPEG) and motion pictures (e.g. MPEG) have been based on the
* Corresponding author. Tel.: +1-973-596-5654; fax: +1-973-596-5777. E-mail address:
[email protected] (F.Y. Shih).
DCT. Both standards have produced good results, but have limitations at high compression ratios [5,6]. At low data rates, the DCT-based transforms suffer from a "blocking effect" due to the unnatural block partition that is required in the computation. Other drawbacks include mosquito noise and aliasing distortions. Furthermore, the DCT does not improve the performance, or reduce the complexity, of motion compensation and estimation in video coding [6].

Due to the shortcomings of the DCT, the discrete wavelet transform (DWT) has become increasingly important. The main advantage of the DWT is that it provides the space-frequency decomposition of images [7,8], overcoming the DCT and the Fourier transform, which only provide the frequency decomposition. By providing the space-frequency decomposition, the DWT allows energy compaction at the low-frequency subbands and the spatial localization of edges at the high-frequency subbands. Furthermore, the DWT does not present a blocking effect at low data rates.

In this paper, we present a new algorithm for identifying significant coefficients in the DWT and finding an
0031-3203/00/$20.00 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved. PII: S0031-3203(99)00147-8
efficient way to describe their spatial locations. By doing so, image compression coders can benefit from obtaining lower bits-per-pixel ratios. The coders are able to encode only the coefficients that contain most of the image information, as well as their respective spatial locations. The coefficients that represent noise and details that are almost imperceptible will be excluded from the encoding process.

The paper is organized as follows. Section 2 briefly reviews the basic information of the wavelet transform. Section 3 introduces the steps in the transform coding process. Sections 4 and 5 discuss how to identify and represent the significant transform coefficients. Experimental results are presented in Section 6. Finally, conclusions are made.
2. The wavelet transform

Wavelets are functions that integrate to zero, waving above and below the x-axis [9]. Like sines and cosines in the Fourier transform, wavelets are used as the basis functions for signal and image representation. Such basis functions are obtained by dilating and translating a mother wavelet ψ(x) by amounts s and τ, respectively:

ψ_{τ,s}(x) = ψ((x − τ)/s),   (τ, s) ∈ R × R⁺.
The translation and dilation allow the wavelet transform to be localized in time and frequency. Also, wavelet basis functions can represent functions with discontinuities and spikes in a more compact way than sines and cosines [8]. The continuous wavelet transform (CWT) can be defined as
cwt_x(τ, s) = (1/√|s|) ∫ x(t) ψ*_{τ,s}(t) dt,   (1)

where ψ*_{τ,s} is the complex conjugate of ψ_{τ,s} and x(t) is the input signal defined in the time domain. The inverse CWT can be obtained from the transpose of Eq. (1):

x(t) = (1/C_ψ) ∫_τ ∫_s (1/s²) cwt_x(τ, s) ψ_{τ,s}(t) dτ ds,   (2)

where C_ψ is a constant and depends on the wavelet used.

To discretize the CWT, the simplest case is the uniform sampling of the time-frequency plane. However, the sampling can be made more efficient by using Nyquist's rule:

N_1 = (s_2/s_1) N_2,   (3)

where N_1 and N_2 denote the number of samples at scales s_1 and s_2, respectively, and s_1 > s_2. This rule means that at higher scales (lower frequencies), the number of samples can be decreased. The sampling rate obtained is the minimum rate that allows the original signal to be reconstructed from a discrete set of samples. A dyadic scale satisfies Nyquist's rule by discretizing the scale parameter into a logarithmic series, and then the time parameter is discretized by applying Eq. (3) with respect to the corresponding scale parameters. The following equations set the translation and dilation to the dyadic scale with a logarithmic series of base 2 for ψ_{k,j}:

τ = k 2^j,   s = 2^j.

We can view these coefficients as filters which are classified into two types. One set, H, works as a low-pass filter and the other, G, as a high-pass filter. These two types of coefficients are called quadrature mirror filters, used in the pyramidal algorithms.

For a two-dimensional (2-D) signal, it is not necessary, although straightforward, to extend the 1-D wavelet transform to a 2-D one. The strategy is described as follows. The 1-D transform can be applied individually to each of the dimensions of the image. By using the quadrature mirror filters we can decompose an n × n image I into the wavelet coefficients as below. Filters H and G are applied on the rows of the image, splitting it into two subimages of dimensions n/2 × n (half the columns) each. One of these subimages, H_r I (where the subscript r denotes row), contains the low-pass information and the other, G_r I, contains the high-pass
Fig. 1. Wavelet decomposition of an image.
information. Next, the filters H and G are applied to the columns of both subimages. Finally, four subimages with dimensions n/2 × n/2 are obtained. Subimages H_cH_rI, G_cH_rI, H_cG_rI and G_cG_rI (where the subscript c denotes column) contain the low-low, high-low, low-high and high-high passes, respectively. Fig. 1 illustrates this decomposition. The same procedure is applied iteratively to the subimage containing the most low-band information until the subimage's size reaches 1 × 1. Therefore, the initial dimensions of the image are required to be powers of two. In practice, it is not necessary to carry out all the possible decompositions until the size of 1 × 1 is reached; usually, just a few levels are sufficient. Fig. 2 shows the "Lena" image decomposed in two levels. Each of the resulting subimages is known as a subband.
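The row/column filtering scheme can be illustrated with a minimal sketch (ours, not the authors' code). It uses the two-tap Haar pair purely for brevity; the paper itself argues for smoother wavelets (Section 4.1), so treat this as a structural illustration only.

```python
import numpy as np

H = np.array([1.0, 1.0]) / np.sqrt(2.0)   # low-pass (Haar)
G = np.array([1.0, -1.0]) / np.sqrt(2.0)  # high-pass (Haar)

def analyse_rows(img, filt):
    """Filter along the rows and keep every second sample."""
    out = np.stack([np.convolve(row, filt, mode='full')[1::2] for row in img])
    return out[:, :img.shape[1] // 2]

def dwt2_level(img):
    """One level of the separable 2-D DWT: four n/2 x n/2 subbands."""
    lo = analyse_rows(img, H)          # Hr I : low-pass rows
    hi = analyse_rows(img, G)          # Gr I : high-pass rows
    ll = analyse_rows(lo.T, H).T       # Hc Hr I : low-low
    hl = analyse_rows(lo.T, G).T       # Gc Hr I : high-low
    lh = analyse_rows(hi.T, H).T       # Hc Gr I : low-high
    hh = analyse_rows(hi.T, G).T       # Gc Gr I : high-high
    return ll, hl, lh, hh

img = np.random.rand(8, 8)
print([b.shape for b in dwt2_level(img)])   # four (4, 4) subbands
```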
3. The transform coding process

The transform coding of images consists of three stages: a linear decorrelating transform, a quantization process, and entropy coding of the quantized data [3]. Fig. 3 shows the transform coding flow chart. The quantization step reduces the floating-point coefficients to a small set of discrete quantities. The entropy encoding step finds an efficient way to represent the quantized values as a bit stream based on their occurrence frequency. In order to obtain the original information from the compressed data, the transform must be invertible. Even if the transform allows for perfect reconstruction, it is impossible to obtain the original data, because only an approximation to the original coefficients is provided by the reverse quantization process. This is unavoidable: quantization implies a loss of precision, since it is required to convert the coefficients into discrete quantities suitable for the entropy coding. Therefore, all methods that use transform coding inherently cause loss.

With the proposed algorithm, the image coders will not quantize the whole set of coefficients. Instead, coders will identify the significant coefficients, then proceed to characterize their spatial location and quantize them into discrete values. The entropy encoder will process the selected coefficients as well as the spatial information to obtain a cost-effective bit stream. Essentially, the proposed algorithm replaces the entropy encoding of non-significant coefficients with the encoding of the spatial information. It is desirable that the latter be brief and use only a limited number of symbols (Fig. 4).
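The quantization stage of Fig. 3 can be prototyped with a uniform quantizer; the sketch below is ours and purely illustrative (the paper's own multi-level quantizer is only loosely described, in Section 6). It shows why reverse quantization only approximates the original coefficients.

```python
import numpy as np

def quantize(coeffs, step):
    """Uniform quantizer: map floating-point coefficients to integer levels."""
    return np.round(coeffs / step).astype(int)

def dequantize(levels, step):
    """Reverse quantization: only an approximation of the original coefficients."""
    return levels * step

c = np.random.randn(8) * 10
lv = quantize(c, step=2.0)
print(np.max(np.abs(c - dequantize(lv, 2.0))) <= 1.0)   # error bounded by step/2
```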
4. Significant coefficients identification

4.1. Applying the DWT
Fig. 2. Two-level wavelet decomposition of the Lena image.
The DWT is applied to an image, and the number of wavelet decompositions depends on the dimensions of the image. In our experiments, most images are decomposed into five levels, yielding 16 subbands. Without loss of generality, Fig. 5 shows the numbering scheme for a decomposition of two levels. It is desirable that the smallest subband be about 32 × 32 pixels. This is because the DWT will have most of the energy compacted at the higher scales, meaning that all coefficients in such subbands are most likely to be significant and therefore thresholding will have no effect at those scales.

Any type of wavelet can be used in the decomposition. However, for better results a smooth wavelet is preferable. Wavelets such as the B-spline are better choices than non-continuous wavelets such as the Haar. In general, smooth wavelets produce better reconstruction of the original image from a fewer (or modified) set of original coefficients. In the following sections, we adopt
Fig. 3. An image coder.
Fig. 4. A proposed image coder.
Fig. 5. Subband numbering scheme.
the discrete finite variation (DFV) 7/9 wavelet as described by Odegard [10]. Fig. 6 shows a 512 × 512 Lena image and its respective 5-level wavelet transformation using the DFV 7/9 wavelet.

4.2. Thresholding

After the wavelet transform is applied to an image, the most significant coefficients can be extracted. The procedure is to eliminate non-significant coefficients by thresholding, since these have a magnitude close to zero. After thresholding, only the desired coefficients remain. There are several ways to choose the threshold value λ. Universal thresholding by Donoho and Johnstone [11] sets

λ = σ √(2 log n) / √n,

where σ is the standard deviation of the coefficients and n is the total number of samples. Another possibility is quantile thresholding, where λ is statistically set so as to replace a percentage of the coefficients with the smallest magnitude by zero. Another option is hard thresholding, where λ is set to a value which produces a satisfactory visual effect and good compression ratios. The higher the λ, the higher the compression ratio, but the distortion increases. We have chosen to use hard thresholding because the significance of a coefficient is directly related to its magnitude, regardless of its probability distribution.

The compression ratio is directly proportional to the threshold value. Ideally, a user should be able to specify a compression ratio so that the threshold value can be calculated accordingly. This can be done by testing several rate-distortion values for different λ. We can also affect the compression ratio by modifying the quantization parameters. Fig. 7 shows subband 11 of the Lena image with a hard threshold of 25.
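Hard thresholding is a one-line operation; the sketch below (our illustration, with hypothetical names) also reports the surviving fraction of coefficients, the quantity that drives the compression ratio.

```python
import numpy as np

def hard_threshold(coeffs, lam):
    """Zero every coefficient whose magnitude is below lam (hard thresholding)."""
    return np.where(np.abs(coeffs) >= lam, coeffs, 0.0)

coeffs = np.random.randn(128, 128) * 30.0   # stand-in for a wavelet subband
kept = hard_threshold(coeffs, lam=25.0)
print('significant fraction:', np.count_nonzero(kept) / kept.size)
```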
4.3. Clustering
Once thresholding is applied, the resulting image can be used as a map to find the significant coefficients. However, prior to transmission it is desirable to efficiently represent their locations in space. Many techniques have been proposed. One of the most notable is the zerotree scheme by Shapiro [4]. It exploits the interrelationship among different levels of subbands to identify non-significant coefficients in low-frequency subbands, and infers that similar locations at high-frequency subbands will also contain non-significant coefficients. The result is a map with regions of mostly zeroes. All coefficients not contained in the map are quantized and entropy encoded for transmission or storage. Although this technique performs well, it has the disadvantage that it maps zero space.

The approach proposed in this paper attempts to map the significant coefficients instead of mapping zero space as zerotrees do. One of the goals is to provide a map with a better "fit" and fewer empty pockets. The first step is to identify the areas with high density of significant coefficients and form the clusters of relevant data. The next
Fig. 7. Subband 11 of the Lena image thresholded at 25.
step is to efficiently bound and represent those clusters for storage or transmission.

Mathematical morphology [12-14] is used to create the clusters of significant coefficients. The thresholded image is processed by consecutive dilations (denoted by ⊕) followed by an erosion (denoted by ⊖):

I_clus = I_λ ⊕ k ⊕ k ⊖ k,   (4)

where k is a circular structuring element whose size is 3. The first dilation joins significant coefficients in close proximity. The next dilation and erosion form a closing operation, which fuses narrow breaks and fills in small holes without altering the shape of the cluster significantly. Fig. 8(b) shows the result of the morphological clustering produced by Eq. (4) in subband 11. Figs. 8(a) and (c) show the results of the comparable operations I_λ ⊕ k ⊖ k and I_λ ⊕ k ⊕ k ⊕ k ⊖ k, respectively. We can observe that Fig. 8(a) does not properly cluster the coefficients in Lena's hair. Fig. 8(c) does a much better job; however, the clusters do not offer a "tight fit", and a large number of zero coefficients are introduced. Fig. 9 shows the resulting clusters after applying Eq. (4) to Fig. 7.
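Eq. (4) maps directly onto standard binary morphology. The sketch below is our illustration, not the authors' code: it uses scipy.ndimage with a 3 × 3 disc-like structuring element as a stand-in for the paper's size-3 circular element.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def cluster_map(significant, k=None):
    """I_clus = I ⊕ k ⊕ k ⊖ k: two dilations followed by one erosion (Eq. (4))."""
    if k is None:                      # 3 x 3 disc-like structuring element
        k = np.array([[0, 1, 0],
                      [1, 1, 1],
                      [0, 1, 0]], dtype=bool)
    m = binary_dilation(significant, structure=k)
    m = binary_dilation(m, structure=k)
    return binary_erosion(m, structure=k)

sig = np.random.rand(64, 64) > 0.9     # stand-in thresholded-coefficient map
print(cluster_map(sig).sum(), 'pixels covered vs', sig.sum(), 'significant')
```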
Fig. 6. (a) Original Lena; (b) 5-level wavelet transform.

5. Cluster shape description
To describe the shape of the clusters for storage or transmission, we use a modified version of quadtrees. The resulting tree is a map that covers the locations of the significant coefficients; a "coarser" version of the tree could also be used as a map of "areas of interest" for the search of motion vectors.

5.1. Quadtrees

Quadtrees [15] are generally used to describe squared binary regions of data. A region can be black, white or gray. A black (white) region is assigned when all pixels are turned off (on), and a gray region is assigned when the pixels are mixed-typed. The quadtree of an image is obtained by subdividing it into four quadrants and
Fig. 8. (a) One dilation. (b) Two dilations. (c) Three dilations.
Fig. 9. Clusters of significant coefficients.
coding them as either black, white or gray. The process is applied recursively on all gray quadrants. The result is a tree where all internal nodes are gray and all the leaves are either black or white. Thus, by grouping regions of similar color, the quadtree is able to represent a binary image as a compact structure.

For wavelet compression, we do not need an exact representation of the clusters. The clusters themselves are an approximation of an area with a high density of significant coefficients. We shall focus instead on how to economically encode a quadtree of the image by sacrificing some non-significant details.
Considering Fig. 10(a), we have a cluster that fits within a quadrant of a quadtree. Such a quadrant could be directly encoded by a sequence of 16 bits. When calculating its quadtree, we obtain a tree with 17 nodes (Fig. 10(b)), which has a cost of 28 bits using Huffman encoding. It is clear that the cost to represent a cluster using the quadtree is expensive. If we are willing to sacrifice some accuracy of shape description, we could assume that the whole quadrant is of color white. In that case the encoding cost will be only 1 bit. However, we introduce five extra pixels into our map. When we use this map to select the significant coefficients from the thresholded transform, these five pixels will introduce five zero-value coefficients to the encoding process. Thus the encoding cost is increased. However, the cost will not increase much, since the clusters usually contain zeroes in between significant coefficients. When all clusters are considered for entropy coding, zero coefficients have the largest frequency. Experimental results show that the entropy encoders usually encode zero coefficients with 2 or 3 bits. This brings the overall cost to 11 (1 + 5 × 2) or 16 (1 + 5 × 3) bits.

We observe that the encoding cost can be reduced if the clusters are represented by quadrants rather than by pixels. We modify the quadtree algorithm as follows. Before characterizing a quadrant with color gray, we obtain the cost to code such a quadtree. At the same time, the cost of characterizing the quadrant with color white, as well as the zero pixels, is computed. We choose the characterization with less cost. Fig. 11 shows a map obtained by reconstructing the resulting quadtree using the technique explained above. We observe that most clusters are preserved with little modification. A conventional quadtree for the 128 × 128 Lena image in Fig. 2 requires 793 bytes to encode. The modified quadtree has a cost of only 416.
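The cost-based pruning rule can be sketched recursively: at each quadrant, compare the cost of subdividing (a gray node plus four recursively coded children) against the cost of declaring the quadrant white and paying the entropy cost of the extra zero coefficients. The bit costs below are illustrative placeholders, not the paper's measured entropy-coder figures.

```python
import numpy as np

NODE_BITS = 2        # assumed cost of one quadtree node symbol (black/white/gray)
ZERO_COEFF_BITS = 2  # assumed entropy cost of encoding one zero coefficient

def prune_quadtree(mask):
    """Return (cost_bits, label) for a square 0/1 significance mask (size 2^k)."""
    ones = int(mask.sum())
    if ones == 0:
        return NODE_BITS, 'black'
    if ones == mask.size:
        return NODE_BITS, 'white'
    # Cost of subdividing: gray node plus four recursively coded quadrants.
    h = mask.shape[0] // 2
    quads = [mask[:h, :h], mask[:h, h:], mask[h:, :h], mask[h:, h:]]
    sub_cost = NODE_BITS + sum(prune_quadtree(q)[0] for q in quads)
    # Cost of pruning: call the quadrant white; the zeroes inside it must
    # then be entropy coded along with the significant coefficients.
    flat_cost = NODE_BITS + (mask.size - ones) * ZERO_COEFF_BITS
    return (flat_cost, 'white*') if flat_cost < sub_cost else (sub_cost, 'gray')

mask = np.random.rand(16, 16) > 0.4
print(prune_quadtree(mask))
```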
Fig. 10. (a) Quadrant with cluster; (b) quadtree.
Table 1
Information about the area covered by the quadtree

                    Pruned                      Not pruned
Image     Pixels    Area covered   Nodes        Area covered   Nodes
Lena      262,144   15.9%          15,105       13.9%          29,277
Goldhill  262,144   27.2%          22,805       24.2%          43,617
Table 2
Compression trade-off (bytes)

Image     Significant + Zeroes   Significant + Quadtree
Lena      42,470                 12,679
Goldhill  45,095                 18,729
Fig. 11. Map obtained from the modified quadtree.
6. Experimental results

The proposed algorithm is applied to the Lena and Goldhill images, both of dimensions 512 × 512. As described before, the first step is to perform the wavelet transform, followed by thresholding to identify the significant coefficients. The next step is to cluster these coefficients with a series of dilations and erosions. A circular structuring element was used for this task. Once the clusters have been identified, we compute the quadtree map.

In Table 1 we observe that the significant-coefficients map produced by the pruned quadtree technique clearly outperforms the conventional quadtree by almost a ratio of 2 : 1. In the case of the Lena image, the pruned quad-
tree produces 15,105 nodes (black, white or gray), almost half of what the conventional quadtree would produce. We also observe that the pruned quadtree is almost as accurate as the conventional one. The pruned quadtree only covers 2% more pixels, all of them being insignificant (zero valued).

The quadtree proves to be useful for image compression. Table 2 shows a comparison, in bytes, of compressing the entire set of quantized coefficients against the compression of the pruned quadtree nodes plus only the significant coefficients. Figs. 12 and 13 show the Lena image reconstructed at 0.5 bpp (bits-per-pixel) and 0.25 bpp. The coder is modeled according to Fig. 4. The quantizer used is a simple multi-level quantizer. The entropy encoding used was arithmetic coding. The quadtrees and the significant coefficients were encoded separately. Similarly, Figs. 14-16 show the original Goldhill image and its reconstructions at 0.5 and 0.25 bpp. Let the measure
Fig. 12. Lena reconstructed at 0.5 bpp.
Fig. 14. Original Goldhill.
Fig. 13. Lena reconstructed at 0.25 bpp.
Fig. 15. Goldhill reconstructed at 0.5 bpp.
of peak-signal-to-noise ratio (PSNR) be

PSNR = (peak value of g(x, y)) / √( (1/NM) Σ_{x=0}^{N−1} Σ_{y=0}^{M−1} (g(x, y) − f(x, y))² ),

where f(x, y) and g(x, y) denote the original and reconstructed images. Table 3 shows the PSNR of the reconstructed images and the comparison to Shapiro's zerotree encoder.
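A direct transcription of this definition is given below (ours, with hypothetical names; note that the values in Table 3 appear to follow the conventional decibel scale, so the 20 log10 of this ratio is also printed).

```python
import numpy as np

def psnr_ratio(f, g):
    """Peak value of g divided by the RMS error, per the definition above."""
    rmse = np.sqrt(np.mean((g.astype(float) - f.astype(float)) ** 2))
    return g.max() / rmse

f = np.random.randint(0, 256, (64, 64))
g = np.clip(f + np.random.randint(-5, 6, f.shape), 0, 255)
r = psnr_ratio(f, g)
print('ratio:', r, ' in dB:', 20 * np.log10(r))
```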
7. Conclusions
Pruned quadtrees have proved to be an efficient approach for mapping all the significant coefficients from
Fig. 16. Goldhill reconstructed at 0.25 bpp.
Table 3
PSNR for zerotree and pruned quadtree (Lena image)

Rate (bpp)   Zerotree   Quadtree
0.25         33.17      32.22
0.50         36.28      34.98
1.00         39.55      36.13
the wavelet transform. Such a map allows us to compress images by entropy encoding only the relatively few significant coefficients along with a map represented by a pruned quadtree. Without the map, the decoder would be unable to place the coefficients in their correct positions. Future work will focus on image compression using embedded quantizers along with embedded quadtrees. We will further extend our technique to work with multi-resolution motion compensation and video compression by only mapping and tracking the regions of interest.

References
[1] J.D. Eggerton, M.D. Srinath, A visually weighted quantization scheme for image bandwidth compression at low data rates, IEEE Trans. Commun. 34 (1986) 840-847.
[2] R.A. DeVore, B. Jawerth, B. Lucier, Image compression through wavelet transform coding, IEEE Trans. Inform. Theory 38 (1992) 719-746.
[3] M. Vetterli, J. Kovačević, Wavelets and Subband Coding, 1st Edition, Prentice-Hall, Englewood Cliffs, NJ, 1995.
[4] J.M. Shapiro, Embedded image coding using zerotrees of wavelet coefficients, IEEE Trans. Signal Process. 41 (1993) 3445-3462.
[5] M.K. Mandal, E. Chan, X. Wang, S. Panchanathan, Multiresolution motion estimation techniques for video compression, Opt. Eng. 35 (1996) 128-136.
[6] J. Luo, C.W. Chen, K.J. Parker, T.S. Huang, A scene adaptive and signal adaptive quantization for subband image and video compression using wavelets, IEEE Trans. Circuits Systems Video Technol. 7 (1997) 343-357.
[7] I. Daubechies, Orthonormal bases of compactly supported wavelets, Commun. Pure Appl. Math. 41 (7) (1988) 909-996.
[8] A. Graps, An introduction to wavelets, IEEE Comput. Sci. Eng. 2 (1995).
[9] R.M. Rao, A.S. Bopardikar, Wavelet Transforms: Introduction to Theory and Applications, 1st Edition, Addison-Wesley, Reading, MA, 1998.
[10] J.E. Odegard, C.S. Burrus, Smooth biorthogonal wavelets for applications in image compression, Proceedings of the DSP Workshop, September 1996.
[11] D. Donoho, I. Johnstone, G. Kerkyacharian, D. Picard, Density estimation by wavelet thresholding, Technical Report, Stanford University, 1993.
[12] J. Serra, Image Analysis and Mathematical Morphology, Academic Press, New York, 1982.
[13] F.Y. Shih, O.R. Mitchell, Threshold decomposition of grayscale morphology into binary morphology, IEEE Trans. Pattern Anal. Mach. Intell. 11 (1989) 31-42.
[14] F.Y. Shih, C.T. King, C.C. Pu, Pipeline architectures for recursive morphological operations, IEEE Trans. Image Process. 4 (1995) 11-18.
[15] R.C. Gonzalez, R.E. Woods, Digital Image Processing, 3rd Edition, Addison-Wesley, Reading, MA, 1992.
About the Author: EDUARDO MORALES received the B.S. degree from Indiana University of Pennsylvania in 1991, the M.S. degree from New Jersey Institute of Technology (NJIT) in 1993, and is currently working towards a Ph.D. degree at NJIT, all in Computer Science. He is presently a special lecturer in the Department of Computer and Information Science at NJIT. His research interests include image processing, computer graphics, object-oriented design and distributed processing.

About the Author: FRANK Y. SHIH received the B.S. degree from National Cheng-Kung University, Taiwan, in 1980, the M.S. degree from the State University of New York at Stony Brook, in 1984, and the Ph.D. degree from Purdue University, West Lafayette, Indiana, in 1987, all in electrical engineering. He is presently a professor jointly appointed in the Department of Computer and Information Science (CIS) and the Department of Electrical and Computer Engineering (ECE) at New Jersey Institute of Technology, Newark, New Jersey. He currently serves as an associate chairman of the CIS department and the director of the Computer Vision Laboratory. Dr. Shih is
on the Editorial Board of the International Journal of Systems Integration. He is also an associate editor of the International Journal of Information Sciences, and of the International Journal of Pattern Recognition. He has served as a member of several organizing committees for technical conferences and workshops. He was the recipient of the Research Initiation Award from the National Science Foundation in 1991. He was the recipient of the Winner of the International Pattern Recognition Society Award for Outstanding Paper Contribution. He has received several awards for distinguished research at New Jersey Institute of Technology. He has served several times in the Proposal Review Panel of the National Science Foundation on Computer Vision and Machine Intelligence. He holds the IEEE senior membership. Dr. Shih has published over 100 technical papers in well-known prestigious journals and conferences. His current research interests include image processing, computer vision, computer graphics, artificial intelligence, expert systems, robotics, computer architecture, fuzzy logic, and neural networks.
Pattern Recognition 33 (2000) 1621-1636
Binary object representation and recognition using the Hilbert morphological skeleton transform Essam A. El-Kwae, Mansur R. Kabuka* Center for Medical Imaging and Medical Informatics, Department of Electrical and Computer Engineering, University of Miami, 1251 Memorial Drive, MCA Rm 406, Miami, FL 33146, USA Received 8 May 1998; received in revised form 9 July 1999; accepted 9 July 1999
Abstract A binary shape representation called the Hilbert Morphological Skeleton Transform (HMST) is introduced. This representation combines the Morphological Skeleton Transform (MST) with the clustering capabilities of the Hilbert transform. The HMST preserves the skeleton properties, including information preservation, progressive visualization and compact representation. Then, an object recognition algorithm, the Hilbert Skeleton Matching Algorithm (HSMA), is introduced. This algorithm performs a single sweep over the HMSTs and renders the similarity between them as a distance measure. Testing the HSMA against the Skeleton Matching Algorithm (SMA) and invariant moments revealed that the HSMA achieves slightly better object recognition rates while substantially reducing the complexity. In an experiment of 14,400 shape matches, the HSMA achieved a 90.36% recognition rate as opposed to 89.76% for the SMA and 89.49% for invariant moments. On the other hand, the HSMA improved the SMA processing time by more than 40%. 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved. Keywords: Shape recognition; Skeleton; Morphological skeleton transform; Hilbert curve; Space-filling curves
1. Introduction

Image databases have received enormous attention in recent years. The application areas based on images include, but are not limited to, Medical Imaging, Medical Information Systems, Document Image Processing, Office Information Systems, Remote Sensing, Management of Earth Resources, Geographic Information Systems (GIS), Cartographic Modeling, Mapping and Land Information Systems, Robotics, Interactive Computer-Aided Design and Computer-Aided Manufacturing Systems. Queries on image databases are based on image content such as: color, texture, shape, motion, volume, semantic constraints or spatial constraints. Since queries of this nature are imprecise, the database is required to
* Corresponding author. Tel.: +1-305-284-2212; fax: +1-305-284-4044. E-mail address:
[email protected] (M.R. Kabuka).
return similar images, ranked by similarity, in response to a given query. Similarity retrieval is a two-step process. First, an index is searched to retrieve the candidate set of images. This step acts as a filter to avoid unnecessary searches (false alarms). Next, the candidate set is submitted to a similarity algorithm, which ranks images according to their degree of similarity to the query image.

Eakins classified image queries into three levels that range from the highly concrete to the very abstract [1]. Level 1, the lowest level, comprises retrieval by primitive features such as texture, color, and shape. Systems supporting level 1 retrieval usually rely on automatic extraction and matching of primitive features, such as the QBIC system [2] and MIT's Photobook [3]. Level 2 is more general; it comprises retrieval by derived attributes involving logical inference about the objects depicted in the image. Level 3 comprises retrieval by abstract attributes, involving a high degree of abstract and possibly subjective reasoning about the meaning and purpose of the objects or scenes depicted. At this level, the problems have so far proved insurmountable.
0031-3203/00/$20.00 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved. PII: S 0 0 3 1 - 3 2 0 3 ( 9 9 ) 0 0 1 6 9 - 7
Most existing image retrieval systems use low-level features for image retrieval. Given improved mechanisms for building up knowledge bases of image feature interpretations (see, for example, [4,5]), it should be possible to construct systems that offer level 2 image retrieval within restricted but non-trivial domains such as photographs of sporting events.

Shape is one of the strongest cues for retrieval of content information from images [6]. Shape representation is usually accomplished by adopting a representation scheme that reduces a shape to a number of shape descriptors and then classifying the shape according to the values of the descriptors [7]. Alternatively, a shape can be represented by a set of shape primitives and possibly a set of connection operators, and then a structural recognition scheme can be used to perform the recognition task [8,9].

In image databases, objects may be represented in various forms. One form is to store significant points from each object such as the centroid, the corners of the minimum bounding rectangle (MBR) [10,11] or the skeleton [7]. The skeleton is a compact abstract object representation that preserves the shape and the size of the original object. Encoding an image as a set of object skeletons allows the image to be partially reconstructed and progressively refined back to the original image. Spatial reasoning becomes more accurate, while visualization is more flexible in terms of displaying pictures with various degrees of accuracy to meet different users' requirements. Information-preserving shape descriptors are extremely useful in image database applications, especially in distributed environments where physical images are stored at distributed nodes and logical images are stored at each local node. An approximate reconstruction of the original image may be generated without having to retrieve the physical image from a remote location. Only those images that are deemed by the user to be relevant to the given query need to be physically retrieved. This scheme is both time and bandwidth efficient, which makes it suitable for large image databases.

One of the most used techniques for obtaining the skeleton of an object is the morphological skeleton transform (MST). Trahanias [7] presented a scheme for binary shape recognition which uses the MST representation. This scheme was based on a skeleton matching algorithm (SMA) which renders the similarity between two MSTs as a distance measure. The computational complexity of the SMA algorithm is O(N³) in the worst case, where N is the number of rows or columns in the image grid, and of order O(2kN) on average. An added difficulty is that there are no definite start and end points for the skeletons to be matched. In addition, no apparent sequence of pixels that belong to an MST can be defined [7]. Thus, a more efficient skeleton matching algorithm is required. A binary shape representation called the Hilbert Morphological Skeleton Transform (HMST) is introduced.
This representation combines the Morphological Skeleton Transform (MST) with the superior clustering capabilities of the Hilbert transform. The HMST preserves the favored properties of skeleton representation, including information preservation, progressive visualization and compact representation. Then, an efficient object recognition algorithm called the Hilbert Skeleton Matching Algorithm (HSMA) is introduced. This algorithm renders the similarity between two objects represented by HMSTs as a distance measure. The HSMA performs a single sweep over the HMSTs and has a linear worst-case complexity.

The rest of this paper is organized as follows. In Section 2, the MST extraction is reviewed. The introduced HMST representation and the HSMA are explained in Section 3. The various experiments used to test the HSMA against the SMA are discussed in Section 4, followed by conclusions.

2. The Morphological Skeleton Transform (MST)

One of the most used techniques for obtaining the skeleton of an object is based on mathematical morphology and is thus called the morphological skeleton transform (MST). Three fundamental morphological operations are used for obtaining skeletons and reconstructing binary objects: erosion, dilation and opening. They are defined as follows. Let X be any binary shape and B a symmetric structuring element. The erosion of X by B is

$$X \ominus B = \bigcap_{b \in B} X_{-b} = \{z : (B + z) \subseteq X\}.$$

The dilation of X by B is

$$X \oplus B = \bigcup_{b \in B} X_{+b} = \{x + b : x \in X \text{ and } b \in B\}.$$

The opening of X by B is

$$X \circ B = (X \ominus B) \oplus B.$$

Erosion is a shrinking operation while dilation is an expanding operation. The output of an erosion is the set of translation points such that the translated structuring element is contained in the input set X. Similarly, the output of a dilation is the set of translation points such that the translate of the reflected structuring element has a non-empty intersection with X. The morphological skeleton SK(X) of a discrete binary shape X can be obtained by the formula

$$SK(X) = \bigcup_{n=0}^{N} S_n(X) = \bigcup_{n=0}^{N} \left[(X \ominus nB) - \left((X \ominus nB) \circ B\right)\right],$$
where N = max{n : X ⊖ nB ≠ ∅} and "−" is the set difference operation. S_n is referred to as the nth skeleton subset of X. From the sets S_n, the original object X can be reconstructed as

$$X \circ kB = \bigcup_{k \le n \le N} S_n(X) \oplus nB, \quad 0 \le k \le N.$$

If k = 0, then all the skeleton components are used and X ∘ 0B = X (exact reconstruction). If k = m, m > 0, then we obtain the opening (a smoothed version) of X by mB. Maragos and Schafer [12] have given two algorithms for skeleton decomposition and reconstruction which operate in linear time. The decomposition algorithm takes into account that the erosion of X by nB can be performed by n successive erosions:

$$nB = \underbrace{B \oplus B \oplus \cdots \oplus B}_{n \text{ times}},$$

$$X \ominus nB = X \ominus (B \oplus B \oplus \cdots \oplus B) = (((X \ominus B) \ominus B) \ominus \cdots \ominus B).$$

The information contained in all the skeleton subsets S_n can be compactly represented by adopting the morphological skeleton function (SKF), defined as follows:

$$[SKF(X)](x, y) = \begin{cases} n + 1, & (x, y) \in S_n(X), \\ 0, & (x, y) \notin S_n(X). \end{cases}$$
Since the skeleton subsets are disjoint sets, the SKF is a single-valued function. The set of pixels transformed from the skeleton points with large SKF values represents the main body of the shape, while those with small values patch the boundary details. Thus, a sketch picture can be quickly reconstructed from skeleton subsets with larger SKF values, then incrementally refined by other skeleton subsets with smaller SKF values, making progressive visualization possible in picture browsing [13].
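To make the decomposition concrete, the following is a minimal sketch of MST decomposition and SKF construction, assuming X is a boolean NumPy array and B a small symmetric structuring element (e.g. the rhombus/cross); the function and variable names are ours, not the paper's.

import numpy as np
from scipy.ndimage import binary_erosion, binary_dilation, binary_opening

def mst_decompose(X, B):
    # Skeleton subsets S_n = (X eroded by nB) - opening(X eroded by nB, B),
    # together with the SKF image ([SKF(X)](x, y) = n + 1 on S_n).
    subsets, skf = [], np.zeros(X.shape, dtype=int)
    eroded, n = X.astype(bool), 0
    while eroded.any():                  # stops when X eroded by nB is empty
        S_n = eroded & ~binary_opening(eroded, structure=B)
        subsets.append(S_n)
        skf[S_n] = n + 1
        eroded = binary_erosion(eroded, structure=B)   # next erosion level
        n += 1
    return subsets, skf

def mst_reconstruct(subsets, B, k=0):
    # X opened by kB = union over n >= k of (S_n dilated by nB); k = 0 is exact.
    out = np.zeros(subsets[0].shape, dtype=bool)
    for n, S_n in enumerate(subsets[k:], start=k):
        d = S_n
        for _ in range(n):               # S_n dilated n successive times by B
            d = binary_dilation(d, structure=B)
        out |= d
    return out

# Example structuring element: B = np.array([[0,1,0],[1,1,1],[0,1,0]], bool)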
The MST representation is compact because only a few points need to be stored for each object while, at the same time, exact reconstruction of the object is possible. Illustrative examples of the MST of some simple geometrical shapes are shown in Fig. 1. The reconstruction steps for the upper left shape of Fig. 1 are shown in Fig. 2. The maximum SKF value in this case was 15; thus, it takes 15 steps to skeletonize or reconstruct the shape.

Trahanias [7] presented a scheme for binary shape recognition that uses the MST representation. This scheme was based on a skeleton matching algorithm (SMA) which renders the similarity between two MSTs as a distance measure. Small distance measures indicate similar shapes whereas large distance measures indicate dissimilar shapes. Based on the distance measure, a shape is classified as the shape from which its distance is minimal. The shapes to be matched are first normalized with respect to translation, scaling, and rotation. The SMA algorithm tries to align the two MSTs by matching each pixel in the first MST with its nearest in the second MST. Pixels in the second MST not considered in this process are subsequently matched with their nearest ones in the first MST. The cost associated with each match is computed as the distance between the two pixels weighted by a coefficient that is a function of their SKFs. This weighting is important in order to ensure that the matching of pixels representing different parts of the two shapes is expensive whereas the matching of two similar parts is cheap. The sum of all the costs is the final distance measure computed by the SMA algorithm. The distance computed by the SMA is not minimal. However, using a local cost function is justified since it is computationally cheaper, and it resembles human perception of shape similarity, which usually starts from an examination of the similarity of parts and then makes the final decision based on the overall geometry (spatial relations). By matching each pixel in the first MST with its closest in the second MST and accumulating the costs associated with each match into the global distance measure, the SMA actually performs a local non-linear alignment of the two MSTs in both directions in the image plane. The cost
Fig. 1. MST of different shapes with respect to the rhombus structuring element.
Fig. 2. Reconstruction steps for the upper left shape in Fig. 1.
accumulation in the global distance measure provides a means for determining the overall similarity of the two shapes. The use of the weighting factor guarantees that each pair of pixels being matched has a larger contribution to the whole distance measure if they represent different parts of the two shapes, whereas it has a small contribution if they represent similar parts of the shapes.

The computational complexity of the SMA algorithm is difficult to compute analytically, since the number of iterations needed to find the closest pixel from one MST to the other is not known in advance. Similarly, the number of pixels not visited in the second MST is not known in advance. In the worst case, the whole N×N image grid would have to be searched in order to find the nearest pixel to a given pixel, and thus the worst-case computational complexity of searching for the closest points is O(n₁N²), where n₁ is the number of pixels in the first MST. If n_v out of the n₂ pixels in the second MST have been visited in the first pass, then the computational complexity of visiting the remaining points is O((n₂ − n_v)N²). Since n₁ and n₂ are usually of the order of N, the worst-case complexity of the SMA algorithm is O(N³). In practice, only a small number of iterations is needed for the search for a nearest pixel, which can be taken as a constant k. Thus, the two passes of calculations require O(n₁k) and O((n₂ − n_v)k), respectively. Taking n to be the maximum of n₁ and n₂ − n_v, the complexity of the whole algorithm is O(2kN).
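For contrast with the HSMA introduced next, the first pass of an SMA-style matcher can be sketched as follows. This is our own illustrative code; in particular, the exact SKF weighting coefficient is not reproduced in this paper, so the weight used below is an assumption.

import math

def sma_first_pass(mst1, mst2):
    # mst1, mst2: lists of (x, y, skf) triples.
    # Each point of mst1 is matched to its nearest point of mst2 by exhaustive
    # Euclidean search, which is the costly step in the SMA.
    total, visited = 0.0, set()
    for x1, y1, f1 in mst1:
        j = min(range(len(mst2)),
                key=lambda jj: (mst2[jj][0] - x1) ** 2 + (mst2[jj][1] - y1) ** 2)
        x2, y2, f2 = mst2[j]
        visited.add(j)
        w = 1.0 + abs(f1 - f2)           # assumed SKF-based weighting
        total += w * math.hypot(x2 - x1, y2 - y1)
    return total, visited                # unvisited mst2 points get a second pass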
3. The Hilbert morphological skeleton transform (HMST)

The computational complexity of the SMA algorithm is O(N³) in the worst case, where N is the number of rows or columns in the image grid, and of order O(2kN) on average when k iterations are needed to search for the nearest point on one skeleton relative to the other skeleton. In a large image database, the number of different shape classes is expected to be large. In the SMA matching algorithm, an added difficulty is that there are no definite start and end points for the skeletons to be matched. In addition, no apparent sequence of pixels that belong to an MST can be defined [7]. Thus, a more efficient skeleton matching algorithm is required.

An object representation, called the Hilbert morphological skeleton transform (HMST), is introduced. This representation combines the MST with the Hilbert space-filling curve to generate the object representation. A space-filling curve is a continuous path which visits every point in a k-dimensional grid exactly once and never crosses itself. Space-filling curves provide a way to order the points of a grid linearly. The goal is to preserve distance, that is, points which are close in k-dimensional space should remain close together in the linear order [14]. The Z-order (or Morton key order, or bit-interleaving, or Peano curve) (Fig. 3(a)), the gray-code curve (Fig. 3(b)), and the Hilbert curve (Fig. 3(c)) are examples of space-filling curves. It was previously shown that the Hilbert curve achieves the best clustering among the above three methods [14].

The basic Hilbert curve on a 2×2 grid (denoted by H₁) is shown in Fig. 4(a). To derive a curve of order i, each vertex of the basic curve is replaced by the curve of order i−1, which may be appropriately rotated and/or reflected. Figs. 4(b) and (c) show the Hilbert curves of orders 2 and 3, respectively. Algorithms to draw the high-dimensional curve are given in Ref. [15].
Fig. 3. Space-filling curves. (a) Z-curve, (b) gray code, (c) Hilbert curve.
Fig. 4. Hilbert curves of order 1, 2, and 3. (a) First step H₁, (b) second step H₂, (c) third step H₃.
The path of a space-filling curve imposes a linear ordering on the grid points, which may be calculated by starting at one end of the curve and following the path to the other end. Fig. 4(b) shows one such ordering for a 4×4 grid (see curve H₂). For example, the point (0,0) on the H₂ curve has a Hilbert value of 0, while the point (1,1) has a Hilbert value of 2. Due to its clustering capabilities, the Hilbert transformation was used for packing R-trees to achieve better storage utilization and response times [16,17]. It was also used to design a good distance-preserving mapping to improve the performance of secondary-key retrieval and spatial access methods where multidimensional points have to be stored on a one-dimensional medium (e.g. disk) [14]. The algorithm to find the Hilbert value of a two-dimensional point (x, y) given the order of the curve is given in Ref. [14] and is included in Appendix A. If the order n of the curve is known in advance, such as in cases where the image size is known, a table is created for all possible h-values from 0 to 2^{2n} − 1. In this case, finding the Hilbert value of any point involves only a table lookup operation, which requires constant time.
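As a concrete aid, the following sketch computes the Hilbert value of a point using the standard bit-manipulation formulation of this mapping; it is our own rendering, not the pseudocode of Ref. [14] or Appendix A.

def hilbert_value(order, x, y):
    # Hilbert value of (x, y) on a 2^order x 2^order grid.
    d = 0
    s = 1 << (order - 1)                 # half the grid side, halved each step
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate the quadrant appropriately
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d

# On the 4 x 4 curve H2: hilbert_value(2, 0, 0) == 0 and
# hilbert_value(2, 1, 1) == 2, matching the example above.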
Exploiting the superior clustering achieved by the Hilbert curve, a linear ordering was imposed on the MST representation to generate the HMST representation. The basic idea behind the HMST is to store the Hilbert value of each pixel in the MST instead of its 2D coordinates. The resulting skeleton representation is then sorted by the Hilbert values of its pixels. Thus, the HMST is a list of the SKF values of the pixels belonging to the MST, sorted by their Hilbert values. Since the Hilbert curve possesses superior distance-preserving properties [14,16,17], the approximation introduced in the skeleton representation is minimal. The difficulties mentioned for the MST do not exist in the HMST because of the linearization process performed by the Hilbert curve. Since the HMST is a linear representation of the MST, it has a definite start, end, and pixel sequence. The algorithm that converts the MST representation to the HMST (Fig. 5) is simple and induces a very small overhead.
Fig. 5. An algorithm to create HMST representation from the MST representation.
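A minimal sketch of the conversion described by Fig. 5 (our own code, reusing hilbert_value() from the sketch above): each MST pixel's 2D coordinates are replaced by its Hilbert value, and the resulting (Hilbert value, SKF) pairs are sorted.

def mst_to_hmst(mst, order):
    # mst: list of (x, y, skf) triples; returns (h, skf) pairs sorted by h.
    hmst = [(hilbert_value(order, x, y), skf) for x, y, skf in mst]
    hmst.sort(key=lambda t: t[0])        # impose the linear Hilbert ordering
    return hmst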
The skeletonization process is done only once for each object. On the other hand, shape matching is applied between a query shape and all database shapes in order to assign the query shape to the most similar shape in the database. Thus, an improvement in skeleton matching is expected to far outweigh the small overhead incurred in the skeleton representation.

The HMST and the MST are both invariant under translation but not under scaling and rotation. Thus, a normalization process is required before skeletonization. Shape normalization ensures representation invariance and consequently recognition invariance. Any existing shape normalization technique may be used with the HMST representation. Some of these techniques are discussed below.

A geometric normalization procedure may be used for the polygonal representation of a shape [18]. The centroid of the shape may be used to obtain the translation information. In addition, a scale factor which can be obtained from the shape, such as the perimeter, the region area or a specified length, can be used for scale normalization. For rotation information, the principal-axis finding method [19] is a common approach to obtain the shape's orientation. However, this method cannot accurately determine the orientation of a shape that is symmetric with respect to its centroid [20]. A shape can be normalized with respect to translation and scaling using the shape's centroid and a specified factor g, which is the longest distance from the boundary points to the shape's centroid [18]. Then, each point on the original shape is normalized by subtracting the centroid and dividing by g.
Yüceer and Oflazer [21] proposed a normalization technique for binary shapes that is invariant under translation, scaling and rotation. The center of gravity of the shape is calculated and the origin of the shape is translated to that point to achieve translation invariance:
$$f_T(x, y) = f\left(x + \frac{m_{10}}{m_{00}},\; y + \frac{m_{01}}{m_{00}}\right) \quad \forall x, y.$$

For scale invariance, the image is normalized so that the average radius r̄ of the shape pixels equals a quarter of the input grid dimension N:

$$\bar r = \frac{1}{m_{00}} \sum_j \sum_k f(x_j, y_k)\sqrt{x_j^2 + y_k^2},$$

$$f_S(x, y) = f\left(\frac{4\bar r}{N}\, x,\; \frac{4\bar r}{N}\, y\right) \quad \forall x, y.$$
The rotation invariance is achieved by rotating the image so that the direction of maximum variance coincides with the x-axis. The direction of maximum variance is given by the eigenvector corresponding to the largest eigenvalue of the covariance matrix of the set of vectors of activated pixels. Yüceer and Oflazer derived an expression for the required angle of rotation, thus obtaining a function that is invariant under all three kinds of transformation. Since there are always two directions of maximum variance (in opposite directions), two normalized images have to be used to define each input shape [22].

For the sake of testing the HSMA, the technique used to normalize objects in this paper was the same as the one
described in Ref. [7]. The same technique was also used in Refs. [13,23]. This way, a fair comparison to the SMA is ensured. To obtain translation normalization, the axial system is moved so that the origin coincides with the center of mass of the shape. That is, if (x_c, y_c) are the coordinates of the object's center of mass and f(x, y) represents the initial binary shape, the translated version f_t(x, y) is given as

$$f_t(x, y) = f(x + x_c, y + y_c).$$

To ensure scale invariance, it is required that the number of pixels in the given shape is constant. Thus, the shape is enlarged or reduced so that the number of pixels n_p in it becomes a predefined constant b. This can be accomplished by transforming the function f(x, y) to f_s(x, y) [24]:

$$f_s(x, y) = f(x/a, y/a), \quad \text{where } a = \sqrt{b/n_p}.$$

To obtain rotation invariance, the shape is rotated by an angle φ so that the angle between the major axis of the shape and the horizontal direction takes a predefined value ψ. The initial angle θ between the major axis and the horizontal direction is computed as [7]:
$$\theta = \frac{1}{2} \arctan\left(\frac{2\mu_{11}}{\mu_{20} - \mu_{02}}\right),$$
where μ_pq denotes the (p+q)th-order central moment, computed as

$$\mu_{pq} = \sum_x \sum_y (x - x_c)^p (y - y_c)^q f(x, y).$$

Rotation is performed by computing the new location (x′, y′) of each pixel (x, y) as follows:
$$\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}.$$
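The three normalization steps just described can be summarized in a short sketch. It operates on a point set rather than an image raster, the helper name is ours, and atan2 is used for numerical robustness of the angle computation.

import math

def normalize_shape(points, b):
    # points: list of (x, y) pixels of a binary shape; b: target pixel count.
    n_p = len(points)
    xc = sum(x for x, _ in points) / n_p
    yc = sum(y for _, y in points) / n_p
    pts = [(x - xc, y - yc) for x, y in points]      # translate to centroid
    a = math.sqrt(b / n_p)                           # scale factor a = sqrt(b/n_p)
    pts = [(a * x, a * y) for x, y in pts]
    mu11 = sum(x * y for x, y in pts)                # central moments of the
    mu20 = sum(x * x for x, y in pts)                # translated, scaled shape
    mu02 = sum(y * y for x, y in pts)
    theta = 0.5 * math.atan2(2 * mu11, mu20 - mu02)  # major-axis angle
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x + s * y, -s * x + c * y) for x, y in pts]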
An algorithm called the Hilbert skeleton matching algorithm (HSMA) (Fig. 6) is used for matching two shapes represented by the HMST, referred to as S₁ and S₂. The matching algorithm assumes that the shapes to be matched have been normalized with respect to translation, scaling and rotation. The HSMA algorithm aligns the two HMSTs by matching each Hilbert value in S₁ with its nearest Hilbert value in S₂. Hilbert values in S₂ that are skipped, as not being closest to any value in S₁, are directly detected and matched within the same loop with their nearest Hilbert values in S₁. The cost associated with each match is computed as the Hilbert (linear) distance between the two values weighted by a coefficient that is a function of their SKFs. The sum of
all the costs is the final distance measure computed by the HSMA algorithm. The cost accumulation in the global distance measure provides a means for determining the overall similarity of the two shapes. The use of the weighting factor guarantees that each pair of pixels being matched has a larger contribution to the whole distance measure if they represent different parts of the two shapes, whereas it has a small contribution if they represent similar parts of the shapes.

The HSMA requires only one pass over the object skeletons to calculate the distance between S₁ and S₂, compared to two passes in the SMA. In addition, distance calculation requires simple integer addition, as compared to the more complex Euclidean distance used in the SMA. In terms of memory requirements, the HSMA algorithm does not need to keep track of visited pixels as in the SMA, because all the processing is done in one pass.

The computational complexity of the HSMA algorithm can be estimated as follows: a single pass is required over the Hilbert values of S₁ and, for each value, the closest point on S₂ is located and the distance calculated. In the meantime, non-visited pixels from S₂ are directly spotted, their closest values in S₁ are located, and the distance adjusted. Thus, the complexity of the HSMA algorithm corresponds to two passes, one over the points of S₁ and the other over the points of S₂, i.e. the complexity of the HSMA is O(n₁ + n₂). Since n₁ and n₂ are usually of the order of N (for an image of size N×N), the complexity of the HSMA algorithm is O(2N). This value represents both the average and the worst-case complexity. This is a significant improvement over the O(N³) complexity of the SMA in the worst case and the O(2kN) complexity of the SMA in the average case.

Consider the example shown in Table 1, where H₁ and H₂, partial lists of the Hilbert values of skeletons S₁ and S₂, are shown. If the search for H₁ starts at index I = 0, since the distance |10 − 18| is less than |10 − 24|, the closest Hilbert value to H₁ = 10 is 18, i.e. i₂ will be equal to 0. The distance D is adjusted and the H₁ pointer is moved to the next value, 32. Now, i₂ is moved until the closest Hilbert value is found at H₂ = 37. This is because |32 − 37| < |32 − 26| < |32 − 24| < |32 − 18|, while |32 − 60| > |32 − 37|. This means that, for H₂, the points 24 and 26 were not visited and will never be visited while scanning the H₁ points. The HSMA algorithm now loops on these two points to find their closest H₁ points and adjust the distance accordingly. This is done using another pointer, i₁, that is moved over H₁ starting from I = 0. For the point H₂ = 24, the point H₁ = 32 is the closest, since |24 − 32| < |24 − 10| while |24 − 33| > |24 − 32|. This process is repeated until all points on both H₁ and H₂ are visited by moving the two pointers i₁ and i₂ all the way from I = 0 to I = 7, in this case. This shows that only one pass over the points of H₁ and H₂ is required, which means that the complexity of the HSMA is indeed linear in both the average and the worst cases.
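The single-pass matching loop described above (and given in Fig. 6) can be sketched as follows; the code is our own reading of the algorithm, and the SKF weighting function is again an assumption.

def hsma_distance(hmst1, hmst2):
    # hmst1, hmst2: lists of (hilbert value, skf), each sorted by Hilbert value.
    def weight(f1, f2):
        return 1.0 + abs(f1 - f2)        # assumed SKF-based weighting

    def nearest(target, lst, start):
        # Index of the entry of sorted lst nearest to target, scanning forward.
        i = start
        while i + 1 < len(lst) and \
                abs(lst[i + 1][0] - target) <= abs(lst[i][0] - target):
            i += 1
        return i

    dist, i1, i2, last = 0.0, 0, 0, -1
    for h1, f1 in hmst1:
        i2 = nearest(h1, hmst2, i2)
        for j in range(last + 1, i2):    # hmst2 values jumped over are matched
            h2, f2 = hmst2[j]            # to their nearest hmst1 value, within
            i1 = nearest(h2, hmst1, i1)  # the same loop
            dist += weight(hmst1[i1][1], f2) * abs(hmst1[i1][0] - h2)
        h2, f2 = hmst2[i2]
        dist += weight(f1, f2) * abs(h1 - h2)
        last = max(last, i2)
    for j in range(last + 1, len(hmst2)):    # trailing unvisited hmst2 values
        h2, f2 = hmst2[j]
        i1 = nearest(h2, hmst1, i1)
        dist += weight(hmst1[i1][1], f2) * abs(hmst1[i1][0] - h2)
    return dist

Running this sketch on the partial lists of Table 1 reproduces the walkthrough above: 10 is matched to 18, 32 to 37, and the skipped values 24 and 26 are matched back to 32 within the same loop.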
Fig. 6. The HSMA skeleton matching algorithm.
Table 1
Example of skeleton matching using the HSMA algorithm

I    H₁    H₂
0    10    18
1    32    24
2    33    26
3    35    37
4    38    60
5    39    62
6    41    66
7    51    68
4. Testing the HMST and the HSMA

The HMST representation and the HSMA object recognition algorithm were subjected to a series of tests. The test database is similar to that used to test the SMA algorithm [7]. In the first test, the degree of matching between the recognition results obtained from the HSMA and the SMA was examined. The shape database used for testing the SMA algorithm in Ref. [7] was regenerated and used as a test set for the HSMA algorithm. This set consists of the 12 shapes shown in Fig. 7, after being normalized such that the number of pixels in each is approximately 500. The MST of these shapes is shown superimposed on the original shapes. The number of points in the resulting skeletons is given in Table 2. The similarity between each
Fig. 7. Shape database used for testing object recognition rate.
Table 2
Number of points on the skeletons of the test database

Shape    No. of points
0        49
1        99
2        44
3        54
4        40
5        44
6        138
7        62
8        58
9        98
10       142
11       142
pair of shapes was calculated using the SMA and the HSMA algorithms. The confusion matrices for the distance measures computed by the HSMA and the SMA algorithms are shown in Tables 5 and 6, respectively. It can be verified from Table 5 that the results in the confusion matrix of the HSMA match human perception of shape similarity, in that similar shapes result in smaller distance measures than other shapes. For example, the most similar shape to the square (shape 2) is the ellipse (shape 4), followed by the rectangle (shape 3); the most similar shapes to the rectangle (shape 3) are the ellipse (shape 4), then the rhombus (shape 5) and the square (shape 2); and the most similar shape to the O (shape 11) is the 8 (shape 10).

The R_norm measure, introduced in the LIVE-Project [25], was then used to assess the degree of matching between the results obtained from the HSMA and the SMA. Calculation of R_norm requires two rank orderings of the database shapes relative to the query shape. The first is considered the system-provided rank ordering (represented here by the HSMA) and the second is the expert-provided rank ordering that defines the desired system output (represented here by the SMA). R_norm values range from 0.0 to 1.0, where an R_norm value of 1.0 indicates that the system provided an acceptable rank ordering with respect to that provided by the expert. The higher the value of R_norm, the higher the similarity between the two rank orderings.

To calculate R_norm, let I be a finite set of shapes with a user-defined similarity preference relation ⪰. Let ⪰_SM be the rank ordering of I induced by the SMA algorithm. Also, let ⪰_HSM be the rank ordering of I induced by the similarity values computed by the HSMA. Then R_norm is defined by
$$R_{norm}(\succeq_{HSM}) = \frac{1}{2}\left(1 + \frac{S^+ - S^-}{S^+_{max}}\right),$$
where
- S⁺ is the number of shape pairs where a more similar shape in ⪰_SM is ranked ahead of a less similar one in ⪰_HSM,
- S⁻ is the number of shape pairs where a less similar shape in ⪰_SM is ranked ahead of a more similar one in ⪰_HSM,
- S⁺_max is the maximum possible value of S⁺.
For example, consider the following two rank orderings: ⪰_HSM = (i₅, i₁ = i₃, i₂ = i₄) and ⪰_SM = (i₁ = i₂, i₃ = i₄, i₅). According to the user, i₁ and i₂ have the highest preference, followed by both i₃ and i₄ at the next level of preference, followed by i₅ at the lowest level of preference. On the other hand, the system considers i₅ as having the highest preference, followed by i₁ and i₃, while i₂ and i₄ have the lowest level of preference. In this example, from ⪰_SM, we can conclude that S⁺_max is equal to 8 (since in ⪰_SM: i₁ is ranked better than i₃, i₄ and i₅; i₂ is ranked better than i₃, i₄ and i₅; i₃ is ranked better than i₅; and i₄ is ranked better than i₅). The HSMA agrees with the SMA on one ranking only (i₁ is ranked better than i₄), which means that S⁺ is equal to 1, while it disagrees with the SMA in 5 cases (i₅ is ranked better than i₁ and i₂; i₅ is ranked better than i₃; i₅ is ranked better than i₄; and i₃ is ranked better than i₂), which means that S⁻ is equal to 5. Therefore, R_norm is calculated as follows:
$$R_{norm}(\succeq_{HSM}) = \frac{1}{2}\left(1 + \frac{1 - 5}{8}\right) = 0.25.$$

Now consider another rank ordering, ⪰′_HSM = (i₁, i₂ = i₃, i₄ = i₅). This second system agrees with the user on 6 rankings (i₁ is ranked better than i₃, i₄ and i₅; i₂ is ranked better than i₄ and i₅; and i₃ is ranked better than i₅), which means that S⁺ is equal to 6. The system does not disagree with the user in any case, which means that S⁻ is equal to 0. Therefore, R_norm is calculated as follows:
$$R_{norm}(\succeq'_{HSM}) = \frac{1}{2}\left(1 + \frac{6 - 0}{8}\right) = 0.875.$$

This example shows that a higher R_norm is given to an HSMA ranking when it is more similar to the SMA ranking.
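A small sketch of the R_norm computation, with rankings given as preference levels (level 0 = most preferred); the representation and names are ours.

from itertools import combinations

def r_norm(user_level, system_level, items):
    # user_level, system_level: dicts mapping item -> preference level
    # (smaller = more preferred); returns R_norm of the system's ranking.
    s_plus = s_minus = s_max = 0
    for a, b in combinations(items, 2):
        du = user_level[a] - user_level[b]
        ds = system_level[a] - system_level[b]
        if du == 0:
            continue                     # user expresses no strict preference
        s_max += 1
        if du * ds > 0:
            s_plus += 1                  # system agrees with the user
        elif du * ds < 0:
            s_minus += 1                 # system reverses the user preference
    return 0.5 * (1 + (s_plus - s_minus) / s_max)

# First example above: user = {'i1': 0, 'i2': 0, 'i3': 1, 'i4': 1, 'i5': 2},
# system = {'i1': 1, 'i2': 2, 'i3': 1, 'i4': 2, 'i5': 0}
# r_norm(user, system, list(user)) -> 0.25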
To test the similarity between the rankings calculated for the shape database using the SMA and the HSMA, the R_norm measure was calculated for each shape using the confusion matrices computed for the HSMA and the SMA (Tables 5 and 6, respectively). The results are shown in Table 7. The average value of R_norm over the shape database is 0.780303, which shows that there is a high similarity between the shape rankings provided by the SMA and the HSMA.

In the second test, shape recognition in the test database was performed for the HSMA, the SMA and invariant moments. Invariant moments are a very popular shape measure which is invariant under affine transformations (translation, rotation and scaling). However, invariant moments are not information preserving, since it is not possible to reconstruct a reasonable approximation of a shape given its moments. Invariant moments have been given considerable attention in the literature and various reports exist for their experimental results [26]. In the recognition experiment, the set of 7 moments was calculated as defined in [26]. Since some of the values of invariant moments tend to be quite small, the values were normalized: the moments of the entire collection of shapes were calculated, the limits of each moment invariant were found, and these limits were then used to normalize the moment invariants between 0 and 1. The normalized moments were then used to create a feature vector f_m = (m₁, m₂, …, m₇) for each shape. The distance between two shapes s_Q and s_I was calculated as the distance between two such feature vectors, and a shape was assigned to the class from which it has the smallest distance:

$$d(s_Q, s_I) = \sqrt{\sum_{i=1}^{7} \left(m_i^Q - m_i^I\right)^2}.$$

The shapes, shown in Fig. 7, were first corrupted by noise using a boundary deformation procedure similar to that in Ref. [7]. This procedure produces a noisy version of a binary shape in two steps. First, for each contour pixel of the initial shape, a pixel is randomly selected from the set consisting of this contour pixel and the background pixels neighboring it. Then, the value of the selected pixel is changed with a probability c. Higher deformations can be produced by increasing c or by repeating the process d times for a given c. Boundary deformation introduces spurious pixels into the MST. Sample noisy shapes, with their MSTs superimposed, are shown in Fig. 8 for d = 3 and c = 10%. In the object recognition experiment, the value of c was kept constant while d was varied from 1 to 10. For each value of d, a deformed version of each shape was produced, resulting in a test set of 1200 deformed shapes (100 for each shape, or 120 for each value of d). The set of initial shapes was used as reference shapes and each of the 1200 shapes was presented to the HSMA algorithm and the SMA algorithm for classification. Each test shape was
classified to the shape to which it had the minimum distance as calculated by the HSMA and the SMA. This process was repeated 25 times with different initial random seeds to reduce statistical variations. The average recognition errors for various values of d and for each shape are given in Table 8 for the HSMA, Table 9 for the SMA and Table 10 for invariant moments. For the HSMA, it can be verified that no errors were made for d = 1 for any class of shapes in any of the 25 trials. This was also the case, for all d values, for 3 out of the 12 shape classes. For 5 out of the remaining 9 shapes, less than one error occurred on average for any of the tested d values. The majority of errors were made for similar shapes that were mistaken one for the other. These shapes were the square (shape 2), the rectangle (shape 3), the ellipse (shape 4), and the rhombus (shape 5). Most of the errors occurred at a high noise level (d ≥ 6). A summary of the results of this experiment is shown in Table 3. The average number of errors for the HSMA was 115.68, compared to 122.88 for the SMA algorithm and 126.08 for invariant moments. This accounts for a recognition rate of 90.36% for the HSMA, 89.76% for the SMA and 89.49% for invariant moments.

The third test was to measure the improvement in processing speed of the HSMA over the SMA. The processing time of the shape recognition experiment described above was measured. The tests were carried out on a PC with a Pentium processor (200 MHz) and 64 MB of RAM. The time measured was the total time to match the 1200 test shapes against the 12 shapes in the shape database, i.e. the total number of shape matches was 14,400. A summary of the results of this experiment is shown in Table 4. The average processing time was 166.6 s for the HSMA (86.4 shape matches/s) while it was 279 s for the SMA (51.6 shape matches/s). The improvement in HSMA processing speed was about 40.3% (Tables 5–10).

To explain the speed improvement of the HSMA over the SMA, a cost analysis of both algorithms is used. The costs of matching two shapes i and j using the SMA algorithm (C_SM) and the HSMA algorithm (C_HSM) may be estimated as follows:

$$C_{SM} = \sum_{k=1}^{N_i}\left(C_{cp_{ij}} + C_{calc}\right) + \sum_{l=1}^{N'_j}\left(C_{cp_{ji}} + C_{calc}\right),$$

$$C_{HSM} = \sum_{k=1}^{N_i}\left(C_{ch_{ij}} + C_{calc}\right) + \sum_{l=1}^{N'_j}\left(C_{ch_{ji}} + C_{calc}\right),$$

where N_i is the number of points on the skeleton of shape i, N′_j is the number of points on shape j that were not accessed during the first pass of matching, C_cp_xy is the cost of finding the closest point on shape y to a given point on shape x, C_ch_xy is the cost of finding the closest Hilbert value on shape y to a given Hilbert value on shape x, and C_calc is the cost of calculating the incremental
Fig. 8. Noisy shapes produced with the 'boundary deformation' procedure (d = 3 and c = 10%).
Table 3
Recognition errors for the HSMA, the SMA and MOMENTS

Errors                HSMA      SMA       MOMENTS
Minimum               104       114       105
Maximum               134       137       141
Average               115.68    122.88    126.08
Standard deviation    7.4       6.77      9.35

Table 4
Processing time comparison of the HSMA and the SMA

Time                  HSMA      SMA
Minimum               166       277
Maximum               168       281
Average               166.6     279
Standard deviation    0.58      1.08
distance between a point on a certain shape and its closest point on the other shape.

Comparing the two equations, it may be seen that the number of loops in both equations is quite similar. The first loop of each equation is performed N_i times, which is a constant, while the second is performed N′_j times, where N′_j is a dynamic number whose average was calculated during the test experiment to be 67.64/shape match
Table 5
The confusion matrix for the shape database in Fig. 7 using the HSMA

Shape  0       1       2        3       4       5       6        7       8       9       10      11
0      0       7915    10,318   7868    6033    4399    29,656   11,460  8025    6106    17,826  36,365
1      8017    0       13,274   22,433  22,042  8595    31,575   21,123  10,494  4077    11,124  28,811
2      10,521  13,588  0        6529    4532    25,699  136,359  32,488  10,873  16,857  77,613  180,798
3      8044    22,385  6535     0       1644    5491    19,519   10,542  12,461  23,244  21,446  40,507
4      5953    21,983  4535     2232    0       5943    23,792   9706    13,789  22,129  18,402  38,512
5      4349    8494    25,330   5450    5945    0       9982     10,029  13,073  11,884  7117    14,446
6      29,749  31,532  136,349  19,489  23,787  10,001  0        37,210  70,741  40,036  7177    12,291
7      11,481  21,122  32,488   10,465  9708    10,193  37,221   0       36,991  21,107  21,540  56,352
8      8028    10,313  10,758   12,448  13,326  13,183  70,354   36,995  0       10,000  54,889  79,079
9      6017    4012    16,533   23,188  22,082  12,133  40,079   21,054  10,086  0       16,777  40,559
10     17,734  11,136  77,565   21,622  18,600  7036    7159     21,519  54,880  16,711  0       4921
11     36,268  28,886  179,950  40,507  38,512  14,572  12,143   56,402  79,003  40,584  4924    0
Table 6
The confusion matrix for the shape database in Fig. 7 using the SMA

Shape  0     1     2     3     4     5     6     7     8     9     10    11
0      0     1445  832   1032  672   563   2119  1141  1119  1275  1653  1813
1      1416  0     763   1836  1824  1639  2197  2503  1378  553   934   1335
2      876   791   0     223   386   944   1592  889   1069  1345  1852  2670
3      1061  2093  214   0     215   753   1082  663   1667  2147  2649  3037
4      714   1810  360   213   0     314   1637  579   1185  2000  1998  2940
5      536   1625  851   686   305   0     1919  848   1480  1890  1974  2326
6      2195  2194  1498  1047  1582  1899  0     1879  3680  2099  1349  1331
7      1051  2689  704   576   543   897   1947  0     2428  2180  2466  2822
8      1154  1292  906   1432  1263  1578  3825  2705  0     1633  3662  4562
9      1290  515   1413  1938  2073  1927  2286  1990  1833  0     1506  1923
10     1728  908   1796  2569  2025  1975  1432  2244  3724  1316  0     569
11     1913  1287  2888  3326  3033  2377  1717  2750  4574  1757  572   0
Table 7
The R_norm measure calculated for each shape in the shape database

Shape    R_norm
1        0.818182
2        0.672727
3        0.836364
4        0.872727
5        0.890909
6        0.654545
7        0.654545
8        0.763636
9        0.818182
10       0.800000
11       0.800000
12       0.781818
for the HSMA and 68.72/shape match for the SMA. In addition, the cost required for calculating C_calc is the same for the HSMA and the SMA. This means that the main source of performance improvement in the HSMA is the replacement of the C_cp cost with the C_ch cost. The cost C_cp requires finding the closest point on the second skeleton to the current point. This requires calculating the Euclidean distance to all points on the second skeleton and finding the minimum distance. The cost C_ch, on the other hand, requires finding the closest Hilbert value on the second skeleton, which requires calculating a linear distance only to a small subset of the points on the second skeleton.
5. Conclusions

A binary shape representation called the Hilbert Morphological Skeleton Transform (HMST) is introduced. This representation combines the Morphological
Table 8
HSMA average recognition errors for the shapes in Fig. 7 deformed at various d levels

        Shape
d       0     1     2     3     4     5     6     7     8     9     10    11
1       0     0     0     0     0     0     0     0     0     0     0     0
2       0     0     0     0.12  0.04  0.16  0     0     0     0     0.88  0
3       0     0     0.04  0.84  0.12  1.04  0     0     0     0     0.76  0
4       0     0     0.4   1.48  0.36  1.64  0     0     0     0     0.44  0
5       0     0     1.08  2.96  0.96  2.56  0     0.04  0     0     0.28  0.04
6       0     0     2.08  3.68  2.36  3.28  0     0     0     0.04  0.16  0.04
7       0     0     3.56  5     3.48  3.4   0     0.2   0     0     0.04  0.08
8       0     0.04  5.32  5.96  4.16  4.08  0     0.4   0     0.04  0     0.04
9       0     0     6.92  6.84  5.52  4.4   0     0.72  0     0.2   0.08  0.2
10      0     0.04  7.64  7.6   6.32  4.12  0     0.84  0     0.44  0     0.12
Table 9
SMA average recognition errors for the shapes in Fig. 7 deformed at various d levels

        Shape
d       0     1     2     3     4     5     6     7     8     9     10    11
1       0     0     0     0     0     0     0     0     0     0     0     0
2       0     0     0     0     0.04  0.2   0     0     0     0     0     0
3       0     0     0     0.56  0     0.68  0     0     0     0     0     0
4       0     0     0.16  2.12  0.28  1.88  0     0     0     0     0     0
5       0     0     0.2   5.32  1.08  2.48  0     0.04  0.2   0     0     0
6       0     0     0.36  6.96  2.2   2.88  0     0.12  0.28  0     0     0
7       0     0     0.96  8.32  3.84  2.68  0     0.4   0.48  0     0     0
8       0.08  0     1.6   8.96  5.64  3.72  0     1     1.24  0     0     0
9       0.04  0     3.32  9.56  6.56  3.64  0     1.8   1.52  0     0     0
10      0     0     4.2   9.64  7.68  3.68  0     2.24  2.04  0     0     0
Table 10
MOMENTS average recognition errors for the shapes in Fig. 7 at various d levels

        Shape
d       0     1     2     3     4     5     6     7     8     9     10    11
1       0     0     0     0     0     0     0     0     0     0     0     0
2       0     0     0     0.72  0     2.2   0.44  0     0     0     0.24  0.16
3       0     0     0     1.56  0.08  3.2   1.48  0.28  0     0     0.52  0.68
4       0     0     0     2.08  0.48  3.8   1.76  0.52  0     0     0.6   0.96
5       0     0     0     2.04  0.56  3.8   1.96  1.04  0     0     0.52  0.92
6       0     0.04  0.12  2.64  0.64  3.52  2.84  1.28  0     0.04  0.64  1.44
7       0     0.08  0.6   2.96  0.84  3.76  3.16  1.24  0     0     0.76  2.12
8       0     0.32  1.48  3.24  0.64  3     3.36  1.28  0     0.12  1.04  2.76
9       0     0.64  2.24  3.36  0.64  4.28  4.04  2.08  0     0.2   1.32  3.56
10      0     1     3.72  3.72  1.2   3.68  4.2   2.28  0     0.2   1.28  3.88
Skeleton Transform (MST) with the superior clustering capabilities of the Hilbert transform. The HMST preserves the favored properties of skeleton representation, including information preservation, progressive visualization and compact representation. Then, an object recognition algorithm called the Hilbert Skeleton Matching Algorithm (HSMA) is introduced. This algorithm renders the similarity between two objects represented by
HMSTs as a distance measure. The HSMA performs a single sweep over the HMSTs and has a linear worst-case complexity.

Several experiments were done to test the HSMA against the SMA. The object recognition achieved by the HSMA and the SMA was compared using a shape database previously used to test the SMA. The R_norm measure was used to measure the similarity between the confusion matrices obtained. An average R_norm value of 0.78 was obtained, which confirms that the similarity rankings obtained by the HSMA are similar to those obtained by the SMA. Shape recognition rates on the shape database corrupted by noise were also tested. The test set consisted of 1200 deformed shapes, and the test was repeated 25 times with different initial random seeds. The average number of errors for the HSMA was 115.68, compared to 122.88 for the SMA algorithm and 126.08 for invariant moments. This accounts for a recognition rate of 90.36% for the HSMA, 89.76% for the SMA and 89.49% for invariant moments. During the shape recognition experiment, the processing time of the HSMA was found to be 40.3% better than that of the SMA. A cost analysis of both algorithms was used to show that the saving in processing time was mainly due to the use of the 1D Hilbert curve instead of the traditional 2D grid for shape representation.

Future extensions of this work include studying different rotation normalization techniques to be used for general shapes. The major-axis normalization used in the experiments in this paper is sensitive to shape changes, especially in the case of compact shapes. Inaccurate rotation normalization will lead to inaccurate similarity assessment by the HSMA. Thus, additional analysis and experimentation with different preprocessing techniques are required. Another future extension of this work is to study the effect of removing skeleton redundancy [27] on the HMST and the HSMA. A third extension is to study the extension of the binary HSMA approach to grayscale morphology [28].
References

[1] J.P. Eakins, Automatic image content retrieval – are we going anywhere? Proceedings of the Third International Conference on Electronic Library and Visual Information Research, De Montfort University, Milton Keynes, May 1996.
[2] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele, P. Yanker, Query by image and video content: the QBIC system, IEEE Comput. 9 (1995) 23–32.
[3] A. Pentland, R.W. Picard, S. Sclaroff, Photobook: content-based manipulation of image databases, SPIE Storage and Retrieval for Image and Video Databases II 2185, San Jose, CA, Feb. 6–10, 1994, pp. 34–47.
[4] W.A. Grosky, R. Mehrotra, Index-based object recognition in pictorial data management, Comput. Vision, Graphics, Image Understanding 52 (3) (1990) 416–436.
[5] E. El-Kwae, M. Kabuka, A Boolean neural network approach for image understanding, in: Proceedings of the Artificial Neural Networks in Engineering Conference (ANNIE'96), St. Louis, Missouri, Nov. 10–13, 1996, pp. 437–442.
[6] M. Kliot, E. Rivlin, Invariant-based shape retrieval in pictorial databases, Comput. Vision Image Understanding 71 (2) (1998) 182–197.
[7] P.E. Trahanias, Binary shape recognition using the morphological skeleton transform, Pattern Recognition 25 (1) (1992) 1277–1288.
[8] P.M. Griffin, B.L. Deuermeyer, A methodology for pattern matching of complex objects, Pattern Recognition 23 (1990) 245–254.
[9] L.G. Shapiro, R.S. MacDonald, S.R. Sternberg, Ordered structural shape matching with primitive extraction by mathematical morphology, Pattern Recognition 20 (1987) 75–90.
[10] V.N. Gudivada, A unified framework for retrieval in image databases, Ph.D. Thesis, University of Southwestern Louisiana, May 1993.
[11] D. Papadias, Y. Theodoridis, T. Sellis, M.J. Egenhofer, Topological relations in the world of minimum bounding rectangles: a study with R-trees, Proceedings of the ACM-SIGMOD International Conference on Management of Data (SIGMOD'95), San Jose, CA, ACM Press, May 1995.
[12] P.A. Maragos, R.W. Schafer, Morphological skeleton representation and coding of binary images, IEEE Trans. Acoust. Speech Signal Process. ASSP-34 (5) (1986) 1228–1244.
[13] P.W. Huang, Y.R. Jean, Reasoning about pictures and similarity retrieval for image information systems based on SK-set knowledge representation, Pattern Recognition 28 (12) (1995) 1915–1925.
[14] C. Faloutsos, S. Roseman, Fractals for secondary key retrieval, Eighth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS), Philadelphia, PA, March 29–31, 1989, pp. 247–252.
[15] T. Bially, Space-filling curves: their generation and their application to bandwidth reduction, IEEE Trans. Inform. Theory IT-15 (6) (1969) 658–664.
[16] I. Kamel, C. Faloutsos, Hilbert R-trees: an improved R-tree using fractals, VLDB '94.
[17] I. Kamel, C. Faloutsos, On packing R-trees, Proceedings of the Second International Conference on Information and Knowledge Management (CIKM), Washington, DC, Nov. 1–5, 1993.
[18] L.-K. Huang, M.-J.J. Wang, Efficient shape matching through model-based shape recognition, Pattern Recognition 29 (2) (1996) 207–215.
[19] D.H. Ballard, C.M. Brown, Computer Vision, Prentice-Hall, Englewood Cliffs, NJ, 1982.
[20] M.H. Han, D.J. Jang, J. Foster, Inspection of 2-D objects using pattern matching method, Pattern Recognition 22 (1989) 567–575.
[21] C. Yüceer, K. Oflazer, A rotation, scaling and translation invariant pattern classification system, Pattern Recognition 26 (5) (1993) 687–710.
[22] J. Wood, Invariant pattern recognition: a review, Pattern Recognition 29 (1) (1996) 1–17.
[23] Z. Xiaoqi, Y. Baozong, Shape description and recognition using the high order morphological pattern spectrum, Pattern Recognition 28 (9) (1995) 1333–1340.
[24] A. Khotanzad, J.-H. Lu, Classification of invariant image representations using neural networks, IEEE Trans. Acoust. Speech Signal Process. 38 (6) (1990) 1028–1038.
[25] P. Bollmann, F. Jochum, U. Reiner, V. Weissmann, H. Zuse, The LIVE-project – retrieval experiments based on evaluation viewpoints, Proceedings of the Eighth International ACM/SIGIR Conference on Research and Development in Information Retrieval, Montreal, Canada, June 1985, pp. 213–214.
[26] B.M. Mehtre, M.S. Kankanhalli, W.F. Lee, Shape measures for content based image retrieval: a comparison, Inform. Process. Management 33 (3) (1997) 319–337.
[27] R. Kresch, D. Malah, Morphological reduction of skeleton redundancy, Signal Process. 38 (1994) 143–151.
[28] S.R. Sternberg, Grayscale morphology, Comput. Vision, Graphics Image Process. 35 (1986) 333–355.
About the Author*ESSAM A. EL-KWAE received his B.Sc. and M.Sc. degrees in Computer Science from Alexandria University, Egypt in 1990 and 1994, respectively. He received his Ph.D. degree in Computer Engineering from the University of Miami in 1997. Since January 1998, he has been a Research Associate with the Center of Medical Imaging and Medical Informatics at the University of Miami. He is a member of IEEE, ACM and SPIE. His research interests include image and video databases, data mining and intelligent systems. About the Author*DR. MANSUR R. KABUKA is a Professor of Computer Engineering and Radiology. He received his Ph.D. degree in Computer Engineering from the University of Virginia at Charlottesville in 1983. He has extensive experience in Information Technology and Medical Informatics. He has published numerous research papers addressing several aspects of the above areas. Dr. Kabuka is the founder and director of the Information Technology program at the University of Miami, Florida. His research interests include Multimedia Computing and Medical Informatics.
Pattern Recognition 33 (2000) 1637–1649
An algorithm for the rapid computation of boundaries of run-length encoded regions Francis K.H. Quek* Vision Interfaces and Systems Laboratory (VISLab), Computer Science and Engineering Department, Wright State University, Dayton, OH, USA Received 25 March 1998; accepted 3 August 1998
Abstract

A finite state machine-based algorithm for the rapid extraction of boundaries of run-length encoded regions is presented. The algorithm operates directly on the run data structure contained in run-length encoding, and yields boundaries in the form of four- or eight-connected point lists describing closed, positively directed contours of four- or eight-connected regions. The algorithm differentiates between external and internal boundaries (boundaries of holes in the region), and can handle multiple external and internal boundaries for each run-length encoded region. The four-state finite state machine can be reduced to a two-state machine. A state transition occurs as the boundary traverses each run. The transition predicates can be implemented as a simple decision tree. Run-length encoding is a common image/region encoding technique. This algorithm complements the existing operators which work directly on such region descriptors, making them more attractive as a representation for region manipulation. 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Run-length encoding; Regions; Region boundary; Finite state machine; Computer vision; Graphics
1. Introduction

Run-length encoding (RLE), which was invented in the 1950s for television signal compression [1–4], is a popular representation scheme for image regions [5]. Examples of current RLE use are in image digitizers [6], common image formats [6], satellite image representation [7], and in region representation for computer vision [8–12] and graphics [13]. Given a region description, it is often desired to obtain its closed boundaries. This paper describes a fast finite state machine-based algorithm to extract the boundary of RLE regions directly from the RLE data structure. The extracted boundaries are connected (4- or 8-connected) point lists which describe closed, positively directed (region to the left of the boundary) contours. The algorithm
* E-mail address:
[email protected] (F.K.H. Quek).
is capable of extracting and distinguishing between external and internal boundaries. It handles regions with multiple external boundaries (which might arise, for example, when regions are fragmented by occlusion) as well as multiple internal boundaries (boundaries of holes in regions). The algorithm has been applied in [9,11].
2. Run-length regions

Run-length encoding represents each scan line of an image as a series of runs of constant pixel value. Each run may be described by a two-tuple comprising a pixel value and a pixel repetition count. For binary images, the pixel value is elided and the representation is a list of repetition counts of alternately '0' and '1' pixels. A number of operations and image computations may be performed directly on an RLE representation.
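As a minimal illustration of the binary convention just described (our own sketch), a scan line reduces to repetition counts of alternating '0' and '1' pixels:

from itertools import groupby

def rle_encode_binary(line):
    # e.g. [0, 0, 1, 1, 1, 0] -> [2, 3, 1]; a leading count of 0 is emitted
    # when the line starts with a '1' pixel, preserving the 0/1 alternation.
    counts = [len(list(g)) for _, g in groupby(line)]
    if line and line[0] == 1:
        counts.insert(0, 0)
    return counts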
0031-3203/00/$20.00 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved. PII: S 0 0 3 1 - 3 2 0 3 ( 9 8 ) 0 0 1 1 8 - 6
Pixelwise Boolean operations such as image 'and', 'or' and 'negation' can be readily performed on RLE data [14]. Geometric properties like area and centroid may be computed directly from RLE as well [5]. Furthermore, the RLE representation lends itself readily to connected component labelling [14]. When RLEs are used to represent region objects in machine vision and graphics applications, it is convenient to think of an image region as a list of runs, as shown in Fig. 1. We shall designate the ith run of a region R as rᵢ = (yᵢ, x1ᵢ, x2ᵢ), where yᵢ is the y coordinate (line number in the image), and x1ᵢ and x2ᵢ are the start (lowest column index) and the end (highest column index) of the run. Thus rᵢ in region R is the set of pixels:
$$r_i = \{(x, y_i) \in R : x1_i \le x \le x2_i\}.$$

If region R is made up of N runs, then

$$R = \bigcup_{i=1}^{N} r_i, \qquad r_j \cap r_k = \emptyset \quad \forall\, j \ne k.$$

2.1. Run-length region extraction algorithms

A variety of algorithms exist for extracting RLEs from a segmented binary image. Typically, one may use a region labelling algorithm [15] which applies a region equivalency table. Kim et al. [16] describe an algorithm applying a run labeling technique that employs five different labels and assumes the runs have been precomputed and are sorted in raster order (top to bottom and right to left). One might think of these labels for run rᵢ as: potential start or D-run (i.e. no adjacent run rⱼ for rᵢ where j < i); potential final or F-run (i.e. no adjacent run rⱼ for rᵢ where j > i); right resume or R-run (i.e. rⱼ that is adjacent to both rᵢ and rᵢ₋₁ where j > i); left start or S-run (i.e. rⱼ that is adjacent to both rᵢ and rᵢ₊₁ where j < i); and other runs or X-runs (i.e. any run that is not a D-, F-, S- or R-run). Fig. 2 illustrates these labels. As shown in Fig. 3, a run may have more than one of these labels simultaneously, yielding D/F-run, R/S-run, D/R-run, and F/S-run labels. From these labeled runs, the algorithm produces two tables: a COD-table that tracks the labels of runs and an ABS-table that maintains the abscissas of the corresponding runs. Kim et al. [16] do not address how these tables may be used to label multiple objects, nor how runs describing different objects may be separated.

2.2. An object-oriented extraction algorithm

Our run-length region extraction algorithm is similar to [16] in that our RLE algorithm begins with the set of runs in raster order. Instead of a labeling scheme, we organize the runs into a run table that is indexed by the image row (see Fig. 4). Our algorithm uses an active object list (AOL) that maintains all objects that intersect the current line y_c. When y_c no longer intersects an object, the object is removed from the AOL. A current run (CR) pointer keeps track of the run currently under consideration. A previous line runs (PLR) pointer points to the first run on the previous line (y_c − 1) that may be adjacent to the CR. As the CR advances, the PLR advances accordingly. For each line of runs, we do the following:
FOR each line y of runs in the runtable:
- Set y_c to y, CR to the first run in y_c, and PLR to the first run on the previous line that is not completely to the left of CR.
- WHILE PLR and CR are not NULL:
  - If PLR is adjacent to CR, add CR to the object in the AOL containing PLR.
  - Advance PLR until it points to the first run that is completely to the right of CR.
  - Advance CR to the next run on y_c.
- Remove any object in the AOL that did not have a run added to it.

The key to this algorithm is the observation that all connectivity information for some run rᵢ with all runs {rⱼ : yⱼ < yᵢ} is derivable from its relationship with the runs on line yᵢ − 1. By using an object-oriented representation of the objects in the AOL, the algorithm avoids the costly process of maintaining a 'label image' and updating a region equivalency table, as is required in traditional connected-component algorithms. It also improves on the algorithm described in [16], which needs to evaluate adjacency of runs in both the previous and next image lines and maintain a special set of run labels. Also, [16] does not discuss how multiple regions are extracted as different run lists. The algorithm discussed here produces
Fig. 1. A run-length region representation of spatial occupancy in an image.
object descriptions of different regions in object-oriented fashion. In this format, one can easily compute such region properties as the region bounding box, width and height. The area of a region is simply

$$\sum_{i=1}^{N} (x2_i - x1_i + 1).$$

The first-order moments may be computed by

$$m_{01} = \sum_{i=1}^{N} y_i (x2_i - x1_i + 1),$$

$$m_{10} = \sum_{i=1}^{N} \frac{(x2_i + x1_i)(x2_i - x1_i + 1)}{2},$$

$$m_{11} = \sum_{i=1}^{N} \frac{y_i (x2_i + x1_i)(x2_i - x1_i + 1)}{2}.$$

General region moments may be computed as follows:

Fig. 2. Connected properties of runs. (Reproduced from Kim et al. [16].)
$$m_{pq} = \sum_{i=1}^{N} y_i^q \sum_{x = x1_i}^{x2_i} x^p.$$

The ability to perform such computations on RLE regions makes the representation attractive for machine vision and graphics applications.
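The run-based property computations above can be written directly from the definitions; the following short sketch (our names) computes the area and general moments from the run list.

def region_area(runs):
    # runs: list of (y, x1, x2) with inclusive column bounds
    return sum(x2 - x1 + 1 for y, x1, x2 in runs)

def region_moment(runs, p, q):
    # m_pq = sum_i y_i^q * sum_{x = x1_i .. x2_i} x^p
    return sum(y ** q * sum(x ** p for x in range(x1, x2 + 1))
               for y, x1, x2 in runs)

# Centroid example: cx = m_10 / area, cy = m_01 / area.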
3. Extracting region boundaries
Fig. 3. Runs with composite labels. (Reproduced from Kim et al. [16].)
Algorithms exist for the extraction of region boundaries from a labelled image [17,18]. These algorithms typically scan an image array until a background-to-region transition occurs, and then perform a contour-following operation to track the boundary. Obviously, if region objects are represented as RLE entities, it is undesirable to create an image array, write the region out as an image and then compute the boundary. Rosenfeld and Kak [14] describe an algorithm for computing the boundary of RLE regions. The algorithm entails creating an alternate string-based data structure in which each run is translated into a string of numerals (comprising characters from 0 to 3). A set of rules based upon the adjacency of runs on a line to those on the previous and next lines is used to generate this code. After the code is generated, the boundary may be extracted in the form of 'crack' code by rescanning the coded strings. The algorithm is involved, and the reader may refer to [14] for details. The algorithm has three disadvantages. First, it involves an initial stage of translating the run encoding into the string-based encoding. Second, it generates the boundary segments in scan order and not as a closed directed boundary. To obtain such a boundary, the crack code has to be analyzed and linked. Third, it does not differentiate between internal and external boundaries. Capson [19] describes a boundary extraction algorithm that is similar to the algorithm we present in that it operates directly on RLE data. The single-pass algorithm operates in O(n) time, where n is the number of
Fig. 4. Runtable for the object in Fig. 3.
transitions per line (and, by extension, the number of transitions in an image). The algorithm begins by converting the RLE data into a vertex table. These vertices are analyzed in line order to extract a set of objects, where each object has an external boundary and a list of internal boundaries. A doubly linked 'active list' of objects is maintained, and these objects may be merged or split depending on the topology of the runs. This algorithm differs from the one presented in this paper in that it makes no assumption about the connectivity of the RLE data. A common approach in pattern recognition is to first apply a connected-component analysis algorithm to extract the regions in an image [15]. If such an RLE-based region description is available, it would be superfluous to reorganize all the runs from the extracted regions into a common vertex list and, in effect, to re-segment the image based on the boundary information. The algorithm presented in this paper assumes that an RLE description of such connected regions is available, and does not need the bookkeeping required by the Capson algorithm. We are hence able to apply a novel and more compact finite state machine approach. Another advantage of the algorithm described here is that it is able to extract the boundaries of both 4- and 8-connected regions instead of only the 4-connected boundaries presented in [19]. Kim et al. [16] describe an efficient algorithm that relies on the labels produced by their run-length extraction algorithm described in Section 2.1. The algorithm first constructs two tables, both of which are indexed by the image row. The first table, the COD-table, contains a list of label codes for the runs on that row arranged from left to right. The second table, the ABS-table, maintains the corresponding list of run begin-end pairs for the runs on each row. They make the observation that external boundaries must contain the right-most point of a D-run and internal boundaries must contain the right-
most point of an S-run. Hence, the algorithm uses these points and their corresponding runs as the initial points and runs for tracking the external and internal boundaries, respectively. For the external boundary, for example, the current run is initialized to the topmost D-run. In similar fashion, the algorithm of [16] makes extensive use of the other run labels throughout the boundary tracking process. The algorithm resembles an elaborate rule base that is applied iteratively, the current run pointer advancing in each iteration to the next adjacent run in the run list. The algorithm presented here differs from the Kim et al. algorithm in that it does not require the special run labels produced by their region extraction process. This is especially useful for the extraction of boundaries from RLE-encoded TIFF or Silicon Graphics RLE format files. These formats may be produced by scanners or other image capture devices. Furthermore, as observed earlier, the RLE extraction algorithm in [16] does not address the separation of the runs into different regions, as is assumed by the boundary tracking algorithm; additional overhead is necessary to make this separation. The boundary tracking algorithm described in this paper avoids the elaborate rule base of [16] by exploiting the information inherent in the RLE structure, and results in a more straightforward implementation.
4. Finite state machine approach to boundary extraction

Given the popularity of the RLE representation, it would be advantageous to have an algorithm which can extract region boundaries as closed directed contours directly from the RLE data structure. In this section, a finite state machine-based algorithm, which takes advantage of the run list topology and the boundary direction constraints, is presented.
4.1. The states

The RLE representation affords us certain topological and connectivity information which may be exploited for computing the boundary of the region. Consider the region morphology in Fig. 5. The region shown is made up of seven runs: (i, k, k+14), (i+1, k, k+15), (i+2, k, k+15), (i+3, k, k+10), (i+3, k+13, k+14), (i+4, k, k+8) and (i+5, k, k+5). The positively directed eight-connected boundary begins by traversing the length of the run on line i+5, and continues along portions of other runs in the list. The key to our algorithm is knowing which portions of the various runs to include in the boundary. To do this, we may exploit the second constraint, that the required boundary is positively directed. This constraint requires that the immediate history of the boundary traversal be maintained to determine the direction in which to traverse the RLE data. The algorithm presented does this by maintaining the state of the current traversal as a state in a finite state machine. The original RLE data serves as input to the machine to determine state transitions. Observe that for a positively directed region boundary (either internal or external), boundary segments may be grouped into four forms (see Fig. 6):

Up-Right: where the boundary is moving up and/or right with the region above and/or to the left.
Down-Left: where the boundary is moving down and/or left with the region beneath and/or to the right.
Down-Right: where the boundary is moving down and/or right with the region above and/or to the right.
Up-Left: where the boundary is moving up and/or left with the region beneath and/or to the left.

These segment types constitute the states of our finite state machine.

4.2. State transition

In the computation of the boundary of some region R, let p be the current point on the boundary, t be the current run, and j be the next run in a positively directed 'trek' around the boundary. Let the positively directed region boundary comprise the ordered boundary segments {b_1, b_2, ..., b_N}, where each boundary segment is a list of connected boundary points with constant y values (i.e. on the same run). A state transition occurs for each move from one segment to the next consecutive segment. In our example in Fig. 5, assuming that the initial state is Up-Right, state transitions take place at point (k+5, i+5) to state Up-Right, at point (k+8, i+4) to state Up-Right, at point (k+10, i+3) to state Up-Right,
Fig. 5. A positively directed boundary segment of an RLE region.
Fig. 6. Boundary segment labels for a positively directed boundary.
at point (k+12, i+2) to state Down-Right, at point (k+14, i+3) to state Up-Right, and at point (k+15, i+1) to state Up-Left. Notice that from (k, i+5) to (k+12, i+2), three state transitions occur where the resulting state and the originating state are the same.

4.2.1. Transition predicate definitions

Let R(n) denote the set of runs in R on line n;
R(n)_{lc}, R(n)_{rc} denote the subsets of R(n) to the left and right of point c, respectively;
R(n)_{lg}, R(n)_{rg} denote the subsets of R(n) completely to the left and right of run g, respectively;
left(g), right(g) denote the leftmost and rightmost points of run g, respectively;
x(c) denote the x (column) value of point c;
line(g), line(c) denote the y (line or row) value of run g and point c, respectively.

State transitions occur when certain predicates are satisfied. Two sets of predicates are needed to handle 4- and 8-connected regions. The 4-connected region predicates are straightforward, and are defined as follows:

IN(c, R): searches a set of runs R for a run g which satisfies the condition
x(left(g)) ≤ x(c) ≤ x(right(g)). The function returns g if it is found; else NIL is returned.

I_l(R, g): searches a set of runs R for the leftmost run l which satisfies the condition x(left(g)) ≤ x(left(l)) ≤ x(right(g)). The function returns l if it is found; else NIL is returned.

I_r(R, g): searches a set of runs R for the rightmost run l which satisfies the condition x(left(g)) ≤ x(right(l)) ≤ x(right(g)). The function returns l if it is found; else NIL is returned.
The 8-connected region predicates, on the other hand, are sensitive to the relationship between the test point c (always the current point p) and the run on which it resides. Since the algorithm hops from run to run, the test point c will always be an end (either left or right) of a run. The 8-connected region predicates are:

IN(c, R): searches a set of runs R for a run g which satisfies the condition x(left(g)) ≤ x(c) ≤ x(right(g)) + 1 if c is a left-end, and x(left(g)) - 1 ≤ x(c) ≤ x(right(g)) if c is a right-end. The function returns g if it is found; else NIL is returned.

I_l(R, g): searches a set of runs R for the leftmost run l which satisfies the condition x(left(g)) ≤ (x(left(l)) - 1) ≤ x(right(g)). The function returns l if it is found; else NIL is returned.

I_r(R, g): searches a set of runs R for the rightmost run l which satisfies the condition x(left(g)) ≤ (x(right(l)) + 1) ≤ x(right(g)). The function returns l if it is found; else NIL is returned.
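These predicates translate directly into code. The sketch below is our illustration (runs as (row, x1, x2) triples, points reduced to their x-values, NIL rendered as None); the eight flag widens the acceptance interval by one pixel exactly as in the conditions above.

def IN(x, R, eight=False, left_end=True):
    # Find a run g in R whose extent covers the test abscissa x.
    for g in R:
        lo, hi = g[1], g[2]
        if eight:
            if left_end: hi += 1       # x(left(g)) <= x <= x(right(g)) + 1
            else:        lo -= 1       # x(left(g)) - 1 <= x <= x(right(g))
        if lo <= x <= hi:
            return g
    return None                        # NIL

def I_l(R, g, eight=False):
    # Leftmost run l whose left end satisfies the stated condition.
    best = None
    for l in R:
        probe = l[1] - 1 if eight else l[1]
        if g[1] <= probe <= g[2] and (best is None or l[1] < best[1]):
            best = l
    return best

def I_r(R, g, eight=False):
    # Rightmost run l whose right end satisfies the stated condition.
    best = None
    for l in R:
        probe = l[2] + 1 if eight else l[2]
        if g[1] <= probe <= g[2] and (best is None or l[2] > best[2]):
            best = l
    return best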
4.2.2. Operator definitions

For bookkeeping purposes, each run may be tagged as <left_used> and/or <right_used>. If neither tag appears on a run, it is assumed to be <unused> (the default). Finally, we define the 'walking' functions WALK_r(n, c, end(g)) and WALK_l(n, c, end(g)), which add points to the boundary list moving to the right (increasing x) and to the left (decreasing x), respectively, from point c toward the specified end point of run g (end may be either left or right, designating left(g) or right(g), respectively), such that:

- All the points are on line n, except for the final point, which will be the specified end point of run g. Run g will always be within ±1 line of n.
- If c is on line n, the set of points begins with (x(c)+1, n) for a rightward walk, and (x(c)-1, n) for a leftward walk.
- If c is not on line n, the set of points begins with (x(c), n) for 4-connected boundaries. For 8-connected boundaries, it begins with (x(c)+1, n) for a rightward walk, and (x(c)-1, n) for a leftward walk.
- If g is on line n, end(g) (i.e. left(g) or right(g)) is the final point of the point set.
- If g is not on line n, provision must be made to handle 4-connected and 8-connected boundaries. For 4-connected boundaries, the point list includes the point (x(end(g)), n). For 8-connected boundaries, (x(end(g)), n) is excluded from the boundary point list.

At the end of the WALK, g is tagged as <left_used> or <right_used>, with end being left or right, respectively. For example, WALK_r(i, p, left(j)) (recall that t is the current run, j is the next run and p is the current point) does the following to obtain a 4-connected boundary:

1. If p is not on line i, add point (x(p), i) to the boundary list.
2. Add points (x(p)+1, i), (x(p)+2, i), ..., (x(left(j)), i) to the boundary list.
3. If line(j) ≠ i, add left(j) to the boundary list.
4. Tag j as <left_used>.
5. Set p := left(j) and t := j.

Hence WALK() concludes with the current run t and current point p set to their new values after the WALK.

4.3. The four-state transition table

Fig. 7 is the transition table making use of these operations to generate a closed boundary. The rows of the transition table represent the FROM states, while the columns are the target TO states. Each transition box in the table has one or more figures showing the run configuration. The horizontal lines in these figures represent the runs in the region, with the solid black run being the current run t. The current point p before the transition is marked by the dot on j. Each transition box also contains one or two sets of 'Test-Operation' pairs. Notice that some tests lead to ambiguous TO states. The ambiguities in rows 1 and 4 are in columns 2 and 3, and the ambiguities of rows 2 and 3 are in columns 1 and 4. This is not a problem because rows 1 and 4 are identical in Test and Operation, as are rows 2 and 3. An ambiguity in any transition is thus resolved in subsequent transitions. The transition table therefore specifies a finite state machine with ε-moves [20,21]. For example, from state UP-RIGHT, satisfying the test of condition R yields two possible target states:
Fig. 7. Transition table for the four-state finite state machine to compute region boundaries.
DOWN-LEFT and DOWN-RIGHT. From states DOWN-LEFT and DOWN-RIGHT, however, all the tests and operations are identical, and there is no ambiguity between DOWN-LEFT and DOWN-RIGHT as TO states, thereby resolving the ambiguity.

4.4. The reduced two-state machine

The finite state machine may be reduced to two states, with UP-RIGHT and UP-LEFT merging into an UP state, and DOWN-LEFT and DOWN-RIGHT merging into a DOWN state. (The two-state machine is easier to implement, but is more difficult to understand and explain than the four-state version.) Fig. 8 shows the state diagram of the two-state finite state machine. Each state has four arcs originating from it and four arcs terminating at it. Each arc embodies a Test-Operation pair identical to the contents of a transition box in Fig. 7. The 2-tuple label of each arc specifies the [row, column] entry of the transition table whose contents constitute the test to select the arc and the operation to be executed when the arc is selected.
The two predicates which make up each test in the finite state machine may be implemented as a decision tree. Fig. 9 shows the decision trees of both the UP and DOWN states of the two-state finite state machine. The predicates of the transition table are represented by the following symbols:

κ ≡ IN(p, R(i-1)),
λ ≡ I_l(R(i)_{rt}, j),
ι ≡ I_r(R(i-1)_{lp}, t),
α ≡ IN(p, R(i+1)),
β ≡ I_r(R(i)_{rt}, j),
γ ≡ I_l(R(i-1)_{rp}, t).

4.5. External and internal boundaries

Given a starting point and an initial state, the finite state machine (the two- and four-state machines produce the same result) will generate a closed, positively directed boundary. The machine is indifferent as to whether the boundary is internal or external to the region. The starting point of a boundary is always the top-leftmost unused end of a run; this can be either a left or a right end. As can be seen in Fig. 10, if the first unused end encountered is a left-end, the boundary is external, and the initial state of the two-state machine is set to DOWN (for the four-state machine, it is set to DOWN-LEFT). If the starting unused end is a right-end, the boundary is internal, and the initial state of the two-state machine is set to UP (for the four-state machine, it is set to UP-RIGHT). The machine may thus be used to extract as many internal and external boundaries as may be present, given a list of runs. It does not matter if the 'region' is made up of more than one connected sub-region (as may be the case when a single surface is broken up owing to occlusion).
Fig. 8. State diagram of the two-state finite state machine to compute region boundaries.
Fig. 11. Organization of the region run lists for boundary extraction.
Fig. 9. (a) The UP state decision tree. (b) The DOWN state decision tree.
Fig. 10. Finite state machine initialization for (a) external and (b) internal boundaries.

4.6. The algorithm

Let the runs be organized into an array of run lists indexed by line number and ordered from left to right, as shown in Fig. 11. The algorithm to extract all the internal and external boundaries is as follows:
LOOP while there are <unused> ends:
  IF the first <unused> end is a left-end:
    Set STATE := DOWN; p to the <unused> end point; and s := p.
    Tag the <unused> end as <left_used>.
  ELSE IF the first <unused> end is a right-end:
    Set STATE := UP; p to the <unused> end point; and s := p.
    Tag the <unused> end as <right_used>.
  LOOP
    Apply the finite state machine for one state transition.
  UNTIL p = s
END LOOP
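This driver loop can be rendered compactly. In the following skeleton (our sketch, not the original code), first_unused_end, tag and step_fsm stand for the run-end bookkeeping and for one transition of the two-state machine; none of these is shown here.

def extract_boundaries(run_lists, first_unused_end, tag, step_fsm):
    # Trace every internal and external boundary of one RLE region.
    boundaries = []
    while True:
        found = first_unused_end(run_lists)   # top-leftmost untagged run end
        if found is None:
            break
        end_kind, p = found                   # 'left' or 'right', and its point
        state = 'DOWN' if end_kind == 'left' else 'UP'
        tag(p, end_kind)                      # mark this end as used
        s, boundary = p, [p]
        while True:
            state, p = step_fsm(state, p, boundary, run_lists)
            if p == s:                        # contour has closed
                break
        kind = 'external' if end_kind == 'left' else 'internal'
        boundaries.append((kind, boundary))
    return boundaries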
4.6.1. Algorithm efficiency

The algorithm described here is straightforward and operates directly on an RLE description of image regions. For data that is already in RLE format (e.g. TIFF-RLE and Silicon Graphics RLE images), all that is necessary is for the runs to be first separated and organized into distinct regions. Our algorithm for obtaining the RLE lists as separate objects (described in Section 2.2) accomplishes this in a single
iteration through the runs. By maintaining the active objects in the AOL, all object regions are extracted during a single pass. Since the algorithm needs only to examine the runs in the previous image row, even the organization of the runs into the run table (Fig. 4) may be done in the same pass. Let N be the number of runs in the RLE description. For each run, the algorithm only needs to test its adjacency with the runs in the previous line. Furthermore, by using the PLR pointer, we need not inspect all runs in the previous line for adjacency. On average, there are as many object-line intersections in line y as in line y-1. This yields an average of two adjacency tests per run (one that is adjacent, and a second test to find the first non-adjacent run in the previous line). Our RLE extraction algorithm therefore needs to perform an average of 2N adjacency tests. This analysis holds true for the extraction of RLE objects from a labeled binary image as well. Once the RLE objects have been extracted, the boundary extraction algorithm is even more straightforward. Let there be K object regions in the data, and let B be the average boundary length of each region. In accordance with the discussion in the previous paragraph, an average of two predicate tests (from Section 4.2.1) are performed for each run. Hence, the boundary extraction algorithm for all objects requires KB point inclusions (the walking operations) and 2N predicate operations. It requires a single pass through the RLE set to find all the initial points for both internal and external boundaries. The preceding analysis shows that the algorithm is basically of order N. We do not present any timing experiments because the computation requirements of both algorithms are so modest that the timing would be more a measure of the efficiency of the compiler, operating system, and memory management system (for allocating/deallocating memory for runs, regions, and run tables) than a measure of the algorithm.
Fig. 12. A 32×32 image of the test region.
5. Results

Fig. 12 is a 32×32 image of a region with two holes embedded in it. The region is described by

(def_region (state 1)
 (runs (1 18 23) (2 1 1) (2 16 24) (3 1 1) (3 6 7) (3 14 25) (4 1 1) (4 4 7) (4 13 26) (5 1 1) (5 3 27) (6 1 29) (7 3 29) (8 6 29) (9 8 20) (9 22 29) (10 8 8) (10 10 19) (10 23 29) (11 8 8) (11 10 18) (11 24 29) (12 8 8) (12 10 17) (12 25 29) (13 8 8) (13 10 17) (13 24 28) (14 8 8) (14 10 18) (14 24 25) (15 1 19) (15 23 24) (16 6 18) (16 20 25) (17 7 17) (17 22 25) (18 8 16) (18 25 29) (19 8 14) (19 23 27) (20 8 11) (20 22 26) (21 9 13) (21 22 24) (22 10 28) (23 10 28) (24 9 28) (25 6 27) (26 2 24) (27 3 9) (27 14 24) (28 5 7) (28 17 27) (29 6 6) (29 22 26) (30 22 26)))

The algorithm found three boundaries: one external and two internal. The following is a transition trace of the finite state machine operating on the RLE code. The transitions are tabulated by transition count, the transition specifier in the form of the [row, column] tuple of Fig. 8, p (the current point upon entering each state), the direction of the WALK, and the list of points added to the boundary during the WALK.

EXTERNAL BOUNDARY: 1
No. Trans Pcurr WALK Points Added
1. [2, 2] (18 1) LEFT (18, 2) (17, 2) (16, 2)
2. [2, 2] (16 2) LEFT (16, 3) (15, 3) (14, 3)
3. [2, 2] (14 3) LEFT (14, 4) (13, 4)
4. [2, 4] (13 4) LEFT (13, 5) (12, 5) (11, 5) (10, 5) (9, 5) (8, 5) (7, 5) (7, 4)
5. [1, 1] (7 4) RIGHT (7, 3)
6. [1, 2] (7 3) LEFT (6, 3)
7. [2, 2] (6 3) LEFT (6, 4) (5, 4) (4, 4)
8. [2, 2] (4 4) LEFT (4, 5) (3, 5)
9. [2, 4] (3 5) LEFT (3, 6) (2, 6) (1, 6) (1, 5)
10. [1, 1] (1 5) RIGHT (1, 4)
11. [1, 1] (1 4) RIGHT (1, 3)
12. [1, 1] (1 3) RIGHT (1, 2)
13. [1, 2] (1 2) LEFT
14. [2, 2] (1 2) LEFT (1, 3)
15. [2, 2] (1 3) LEFT (1, 4)
16. [2, 2] (1 4) LEFT (1, 5)
17. [2, 2] (1 5) LEFT (1, 6)
18. [2, 3] (1 6) RIGHT (2, 6) (3, 6) (3, 7)
19. [2, 3] (3 7) RIGHT (4, 7) (5, 7) (6, 7) (6, 8)
20. [2, 3] (6 8) RIGHT (7, 8) (8, 8) (8, 9)
21. [2, 2] (8 9) LEFT (8, 10)
22. [2, 2] (8 10) LEFT (8, 11)
23. [2, 2] (8 11) LEFT (8, 12)
24. [2, 2] (8 12) LEFT (8, 13)
25. [2, 2] (8 13) LEFT (8, 14)
26. [2, 2] (8 14) LEFT (8, 15) (7, 15) (6, 15) (5, 15) (4, 15) (3, 15) (2, 15) (1, 15)
27. [2, 3] (1 15) RIGHT (2, 15) (3, 15) (4, 15) (5, 15) (6, 15) (6, 16)
28. [2, 3] (6 16) RIGHT (7, 16) (7, 17)
29. [2, 3] (7 17) RIGHT (8, 17) (8, 18)
30. [2, 2] (8 18) LEFT (8, 19)
31. [2, 2] (8 19) LEFT (8, 20)
32. [2, 3] (8 20) RIGHT (9, 20) (9, 21)
33. [2, 3] (9 21) RIGHT (10, 21) (10, 22)
34. [2, 2] (10 22) LEFT (10, 23)
35. [2, 2] (10 23) LEFT (10, 24) (9, 24)
36. [2, 2] (9 24) LEFT (9, 25) (8, 25) (7, 25) (6, 25)
37. [2, 2] (6 25) LEFT (6, 26) (5, 26) (4, 26) (3, 26) (2, 26)
38. [2, 3] (2 26) RIGHT (3, 26) (3, 27)
39. [2, 3] (3 27) RIGHT (4, 27) (5, 27) (5, 28)
40. [2, 3] (5 28) RIGHT (6, 28) (6, 29)
41. [2, 1] (6 29) RIGHT
42. [1, 1] (6 29) RIGHT (6, 28) (7, 28)
43. [1, 1] (7 28) RIGHT (7, 27) (8, 27) (9, 27)
44. [1, 3] (9 27) RIGHT (9, 26) (10, 26) (11, 26) (12, 26) (13, 26) (14, 26) (14, 27)
45. [2, 3] (14 27) RIGHT (15, 27) (16, 27) (17, 27) (17, 28)
46. [2, 3] (17 28) RIGHT (18, 28) (19, 28) (20, 28) (21, 28) (22, 28) (22, 29)
47. [2, 2] (22 29) LEFT (22, 30)
48. [2, 1] (22 30) RIGHT (23, 30) (24, 30) (25, 30) (26, 30)
49. [1, 1] (26 30) RIGHT (26, 29)
50. [1, 1] (26 29) RIGHT (26, 28) (27, 28)
51. [1, 4] (27 28) LEFT (26, 28) (25, 28) (24, 28) (24, 27)
52. [1, 1] (24 27) RIGHT (24, 26)
53. [1, 1] (24 26) RIGHT (24, 25) (25, 25) (26, 25) (27, 25)
54. [1, 1] (27 25) RIGHT (27, 24) (28, 24)
55. [1, 1] (28 24) RIGHT (28, 23)
56. [1, 1] (28 23) RIGHT (28, 22)
57. [1, 4] (28 22) LEFT (27, 22) (26, 22) (25, 22) (24, 22) (24, 21)
58. [1, 1] (24 21) RIGHT (24, 20) (25, 20) (26, 20)
59. [1, 1] (26 20) RIGHT (26, 19) (27, 19)
60. [1, 1] (27 19) RIGHT (27, 18) (28, 18) (29, 18)
61. [1, 4] (29 18) LEFT (28, 18) (27, 18) (26, 18) (25, 18) (25, 17)
62. [1, 1] (25 17) RIGHT (25, 16)
63. [1, 4] (25 16) LEFT (24, 16) (24, 15)
64. [1, 1] (24 15) RIGHT (24, 14) (25, 14)
65. [1, 1] (25 14) RIGHT (25, 13) (26, 13) (27, 13) (28, 13)
66. [1, 1] (28 13) RIGHT (28, 12) (29, 12)
67. [1, 1] (29 12) RIGHT (29, 11)
68. [1, 1] (29 11) RIGHT (29, 10)
69. [1, 1] (29 10) RIGHT (29, 9)
70. [1, 1] (29 9) RIGHT (29, 8)
71. [1, 1] (29 8) RIGHT (29, 7)
72. [1, 1] (29 7) RIGHT (29, 6)
73. [1, 4] (29 6) LEFT (28, 6) (27, 6) (27, 5)
74. [1, 4] (27 5) LEFT (26, 5) (26, 4)
75. [1, 4] (26 4) LEFT (25, 4) (25, 3)
76. [1, 4] (25 3) LEFT (24, 3) (24, 2)
77. [1, 4] (24 2) LEFT (23, 2) (23, 1)
78. [1, 2] (23 1) LEFT (22, 1) (21, 1) (20, 1) (19, 1) (18, 1)
INTERNAL BOUNDARY: 1
No. Trans Pcurr WALK Points Added
1. [1, 3] (20 9) RIGHT (20, 8) (21, 8) (22, 8) (22, 9)
2. [2, 3] (22 9) RIGHT (23, 9) (23, 10)
3. [2, 3] (23 10) RIGHT (24, 10) (24, 11)
4. [2, 3] (24 11) RIGHT (25, 11) (25, 12)
5. [2, 2] (25 12) LEFT (25, 13) (24, 13)
6. [2, 2] (24 13) LEFT (24, 14)
7. [2, 2] (24 14) LEFT (24, 15) (23, 15)
8. [2, 2] (23 15) LEFT (23, 16) (22, 16) (21, 16) (20, 16)
9. [2, 3] (20 16) RIGHT (21, 16) (22, 16) (22, 17)
10. [2, 3] (22 17) RIGHT (23, 17) (24, 17) (25, 17) (25, 18)
11. [2, 2] (25 18) LEFT (25, 19) (24, 19) (23, 19)
12. [2, 2] (23 19) LEFT (23, 20) (22, 20)
13. [2, 2] (22 20) LEFT (22, 21)
14. [2, 4] (22 21) LEFT (22, 22) (21, 22) (20, 22) (19, 22) (18, 22) (17, 22) (16, 22) (15, 22) (14, 22) (13, 22) (13, 21)
15. [1, 4] (13 21) LEFT (12, 21) (11, 21) (11, 20)
16. [1, 1] (11 20) RIGHT (11, 19) (12, 19) (13, 19) (14, 19)
17. [1, 1] (14 19) RIGHT (14, 18) (15, 18) (16, 18)
18. [1, 1] (16 18) RIGHT (16, 17) (17, 17)
19. [1, 1] (17 17) RIGHT (17, 16) (18, 16)
20. [1, 1] (18 16) RIGHT (18, 15) (19, 15)
21. [1, 4] (19 15) LEFT (18, 15) (18, 14)
22. [1, 4] (18 14) LEFT (17, 14) (17, 13)
23. [1, 1] (17 13) RIGHT (17, 12)
24. [1, 1] (17 12) RIGHT (17, 11) (18, 11)
25. [1, 1] (18 11) RIGHT (18, 10) (19, 10)
26. [1, 1] (19 10) RIGHT (19, 9) (20, 9)
INTERNAL BOUNDARY: 2
No. Trans Pcurr WALK Points Added
1. [1, 3] (8 10) RIGHT (8, 9) (9, 9) (10, 9) (10, 10)
2. [2, 2] (10 10) LEFT (10, 11)
3. [2, 2] (10 11) LEFT (10, 12)
4. [2, 2] (10 12) LEFT (10, 13)
5. [2, 2] (10 13) LEFT (10, 14)
6. [2, 4] (10 14) LEFT (10, 15) (9, 15) (8, 15) (8, 14)
7. [1, 1] (8 14) RIGHT (8, 13)
8. [1, 1] (8 13) RIGHT (8, 12)
9. [1, 1] (8 12) RIGHT (8, 11)
10. [1, 1] (8 11) RIGHT (8, 10)
A listing of the resulting closed, positively directed contours is presented below. Each point list is attributed with the region from which it was extracted (in this experiment, the region numbers are all '1' since only one region was fed to the program), the type of boundary, and the list of points. These boundaries are plotted in the images shown in Fig. 13.

(def_point_list (region 1)
 (type External_Boundary)
 (pts (18 2) (17 2) (16 2) (16 3) (15 3) (14 3) (14 4) (13 4) (13 5) (12 5) (11 5) (10 5) (9 5) (8 5) (7 5) (7 4) (7 3) (6 3) (6 4) (5 4) (4 4) (4 5) (3 5) (3 6) (2 6) (1 6) (1 5) (1 4) (1 3) (1 2) (1 3) (1 4) (1 5) (1 6) (2 6) (3 6) (3 7) (4 7) (5 7) (6 7) (6 8) (7 8) (8 8) (8 9) (8 10) (8 11) (8 12) (8 13) (8 14) (8 15) (7 15) (6 15) (5 15) (4 15) (3 15) (2 15) (1 15) (2 15) (3 15) (4 15) (5 15) (6 15) (6 16) (7 16) (7 17) (8 17) (8 18) (8 19) (8 20) (9 20) (9 21) (10 21) (10 22) (10 23) (10 24) (9 24) (9 25) (8 25) (7 25) (6 25) (6 26) (5 26) (4 26) (3 26) (2 26) (3 26) (3 27) (4 27) (5 27) (5 28) (6 28) (6 29) (6 28) (7 28) (7 27) (8 27) (9 27) (9 26) (10 26) (11 26) (12 26) (13 26) (14 26) (14 27) (15 27) (16 27) (17 27) (17 28) (18 28) (19 28) (20 28) (21 28) (22 28) (22 29) (22 30) (23 30) (24 30) (25 30) (26 30) (26 29) (26 28) (27 28) (26 28) (25 28) (24 28) (24 27) (24 26) (24 25) (25 25) (26 25) (27 25) (27 24) (28 24) (28 23) (28 22) (27 22) (26 22) (25 22) (24 22) (24 21) (24 20) (25 20) (26 20) (26 19) (27 19) (27 18) (28 18) (29 18) (28 18) (27 18) (26 18) (25 18) (25 17) (25 16) (24 16) (24 15) (24 14) (25 14) (25 13) (26 13) (27 13) (28 13) (28 12) (29 12) (29 11) (29 10) (29 9) (29 8) (29 7) (29 6) (28 6) (27 6) (27 5) (26 5) (26 4) (25 4) (25 3) (24 3) (24 2) (23 2) (23 1) (22 1) (21 1) (20 1) (19 1) (18 1)))
Fig. 13. (a) The external boundary. (b) The internal boundary.
Fig. 14. Boundary extracted from a segment of a carotid artery in an X-ray angiogram image.
(def_point_list (region 1)
 (type Internal_Boundary)
 (pts (20 8) (21 8) (22 8) (22 9) (23 9) (23 10) (24 10) (24 11) (25 11) (25 12) (25 13) (24 13) (24 14) (24 15) (23 15) (23 16) (22 16) (21 16) (20 16) (21 16) (22 16) (22 17) (23 17) (24 17) (25 17) (25 18) (25 19) (24 19) (23 19) (23 20) (22 20) (22 21) (22 22) (21 22) (20 22) (19 22) (18 22) (17 22) (16 22) (15 22) (14 22) (13 22) (13 21) (12 21) (11 21) (11 20) (11 19) (12 19) (13 19) (14 19) (14 18) (15 18) (16 18) (16 17) (17 17) (17 16) (18 16) (18 15) (19 15) (18 15) (18 14) (17 14) (17 13) (17 12) (17 11) (18 11) (18 10) (19 10) (19 9) (20 9)))
Fig. 15. Boundaries extracted from a thresholded scanned image of text data.
(def2d_point_list (region 1)
 (type Internal_Boundary)
 (pts (8 9) (9 9) (10 9) (10 10) (10 11) (10 12) (10 13) (10 14) (10 15) (9 15) (8 15) (8 14) (8 13) (8 12) (8 11) (8 10)))
Notice that the boundaries are positively directed. The contour runs counterclockwise for the external boundary, and clockwise for the internal boundaries. For the external boundary, transitions 9-15 travel down and backtrack along a one-pixel-wide isthmus, as do transitions 26-27. This is necessary if the contour is to be closed. Fig. 14 shows the boundary extracted from a segment of a carotid artery in an X-ray angiogram image. Fig. 15 shows the boundaries extracted from a thresholded scanned image of text data.
6. Conclusion

A fast finite state machine-based algorithm that provides a straightforward solution to the computation of the boundaries of run-length encoded regions has been presented. The algorithm operates directly on the runs and yields closed, positively directed contours for all external and internal boundaries of the regions. Each closed boundary is labelled as internal or external. The algorithm handles four-connected and eight-connected regions, producing four-connected and eight-connected boundary point lists as required. The algorithm enhances the use of RLE as an object-based region representation by adding a significant capability to the suite of operations and computations which may be performed on the data structure.
References

[1] G.G. Gouriet, Bandwidth compression of television signals, IEE Proc., London 104B (1957) 256-272.
[2] C. Cherry, M.B. Kubba, D.E. Pearson, M.P. Barton, An experimental study of the possible bandwidth compression of visual signals, Proc. IEEE 51 (1963) 1507-1517.
[3] A.H. Robinson, C. Cherry, Results of a prototype television bandwidth compression scheme, Proc. IEEE 55 (1967) 356-364.
[4] A. Rosenfeld, A.C. Kak, Digital Picture Processing, 2nd ed., Vol. 1, Academic Press, New York, 1982.
[5] B. Klaus, P. Horn, Robot Vision, MIT Press, Cambridge, MA, 1986.
[6] C.A. Lindley, Practical Image Processing in C, Wiley, New York, 1991.
[7] R.C. Gonzalez, P. Wintz, Digital Image Processing, Addison-Wesley, Reading, MA, 1987.
[8] P.J. Besl, Surfaces in early range image understanding, Ph.D. Thesis, The University of Michigan, Ann Arbor, MI, March 1986, Electr. Eng. Comp. Sci. Dept. RSD-TR-10-86.
[9] F.K.H. Quek, On three-dimensional object recognition and pose determination: an abstraction-based approach, Ph.D. Thesis, University of Michigan, Ann Arbor, MI, March 1990, Electr. Eng. Comp. Sci. Dept.
[10] B. Mitchell, A. Gillies, Model-based computer vision system for recognizing handwritten zip codes, Machine Vision Appl. 2 (1989) 231-241.
[11] F. Quek, M. Petro, Interactive map conversion: combining machine vision and human input, Proc. IEEE Workshop on Applications of Computer Vision, Palm Springs, CA, Nov. 30-Dec. 2, 1992, pp. 255-264.
[12] G. Sen, E. Cohen, External word segmentation of off-line handwritten text lines, Pattern Recognition 27 (1994) 41-52.
[13] W.M. Newman, R.F. Sproull, Principles of Interactive Computer Graphics, 2nd ed., McGraw-Hill, New York, 1979.
[14] A. Rosenfeld, A.C. Kak, Digital Picture Processing, 2nd ed., Vol. 2, Academic Press, New York, 1982.
[15] R. Jain, R. Kasturi, B.G. Schunck, Machine Vision, McGraw-Hill, New York, 1995.
[16] S.-D. Kim, J.-H. Lee, J.-K. Kim, A new chain-coding algorithm for binary images using run-length codes, Computer Vision, Graphics, Image Process. 41 (1988) 114-128.
[17] D.H. Ballard, C.M. Brown, Computer Vision, Prentice-Hall, Englewood Cliffs, NJ, 1982.
[18] R.O. Duda, P.E. Hart, Pattern Classification and Scene Analysis, Wiley, New York, 1973.
[19] D.W. Capson, An improved algorithm for the sequential extraction of boundaries from a raster scan, Computer Vision, Graphics, Image Process. 28 (1984) 109-125.
[20] J.E. Hopcroft, J.D. Ullman, Introduction to Automata Theory, Languages, and Computation, Addison-Wesley, Reading, MA, 1979.
[21] Z. Kohavi, Switching and Finite Automata Theory, McGraw-Hill, New York, 1978.
About the Author: FRANCIS QUEK is currently an Associate Professor in the Department of Computer Science and Engineering at Wright State University. He has formerly been affiliated with the University of Illinois at Chicago, the University of Michigan Artificial Intelligence Laboratory, the Environmental Research Institute of Michigan (ERIM) and the Hewlett-Packard Human Input Division. Francis received both his B.S.E. summa cum laude (1984) and M.S.E. (1984) in electrical engineering from the University of Michigan in two years. He completed his Ph.D. in C.S.E. at the same university in 1990. He also has a Technician's Diploma in Electronics and Communications Engineering from the Singapore Polytechnic (1978). Francis is a member of the IEEE and ACM. He is director of the Vision Interfaces and Systems Laboratory (VISLab), which he established for computer vision, medical imaging, vision-based interaction, and human-computer interaction research. He performs research in multimodal verbal/non-verbal interaction, vision-based interaction, multimedia databases, medical imaging, collaboration technology, and computer vision. For more information and a publications list see http://www.cs.wright.edu/~quek and http://vislab.cs.wright.edu.
Pattern Recognition 33 (2000) 1651-1664
Snakes simplified

N.E. Davison*, H. Eviatar, R.L. Somorjai

Department of Physics, University of Manitoba, Winnipeg, MB, Canada
Department of Informatics, Institute for Biodiagnostics, National Research Council, 435 Ellice Ave, Winnipeg, MB, Canada R3B 1Y6
Institute for Biodiagnostics, National Research Council, 435 Ellice Ave, Winnipeg, MB, Canada

Received 16 June 1998; received in revised form 21 June 1999; accepted 21 June 1999
Abstract

The snake formulation of Eviatar and Somorjai has the advantages of being both conceptually simple and rapidly convergent. We extend this formulation in two ways, by exploring additional energy terms whose interpretation is transparent and by using a simple minimization technique. The usefulness of the simplified model is illustrated using artificial images as well as images obtained with MRI, optical microscopy and ultrasound. 2000 Published by Elsevier Science Ltd on behalf of Pattern Recognition Society.

Keywords: Snakes; Active contours; Energy terms; Simplified
1. Introduction

The detection and description of boundaries in images is one of the fundamental tasks of early vision. It is often found, however, that because of the presence of noise and image anomalies, robust boundaries cannot be detected solely on the basis of intensity and contrast. Additional problems may arise when the boundaries found by an edge detection algorithm do not conform well with the subjective expectations of a human observer. For instance, there may be parts of the boundary where the observer feels the curvatures are unphysically large. Similarly, the vicinity of the boundary may be noisy, and an objective edge detector may find many edges where the observer prefers to believe that only a single boundary is present. The problem of defining a boundary thus depends on a model of what is acceptable. Active contours or 'snakes' are often the preferred approach to boundary identification because they are based on the paradigm that the presence of an edge depends not only on local gradient information but also
* Corresponding author. Tel.: 001-204-984-6219. E-mail address: [email protected] (N.E. Davison).
on the long-range spatial distribution of the gradient. Snakes incorporate this long-range view by combining continuity and curvature constraints with local gradient strength. A snake may be thought of as an elastic curve that, through minimization of an energy functional, deforms and adjusts its initial shape on the basis of additional image information to provide a continuous boundary. The energy functional being minimized is derived from forces which are both internal and external to the snake. Internal forces are responsible for snake response, that is, its elasticity and rigidity, whereas external forces come from the image or from higher-level processes and are responsible for moving the snake toward important features in the image. A central problem in the use of snakes is the choice of appropriate forms for the internal and external energies. The classic formulation [1] allows the problem to be reduced to a matrix form, but implicitly places restrictions on the energy functionals. The work of Eviatar and Somorjai [2] employs a less elaborate model than the classic one. First, the energy functionals are chosen to be simple, and second, energy minimization is carried out by adjusting the vertex positions using Powell's method. While less elegant in some ways than the classic formulation, this approach allows a wider range of energy functionals.
0031-3203/00/$20.00 2000 Published by Elsevier Science Ltd on behalf of Pattern Recognition Society. PII: S0031-3203(99)00140-5
The purpose of this paper is to extend the simplified snake of Eviatar and Somorjai [2], and to present additional energy functionals found to be useful in specific applications. The behavior of snakes that employ these new energy functionals is discussed and compared to the behavior of snakes that use functionals previously presented in the literature. The remainder of this paper is subdivided as follows. In Section 2, an introduction to snakes is given. A more detailed presentation of energy terms is given in Section 3, and the effects of these terms are illustrated in Section 4 using artificial images. The application of snakes to several realistic problems is discussed in Section 5, and finally, we present our overall experience in Section 6.
2. Overview of snakes

The focus of this paper is on the use of active contours in static 2D images. We begin with the 'classic' snake introduced by Kass et al. [1], and then proceed to the simplified model of Eviatar and Somorjai [2]. We represent the snake as a contour C (open or closed) that is described by coordinate pairs {x(s), y(s)}, where s is the normalized arc length along C (0 ≤ s ≤ 1). With the snake is associated an 'energy' consisting of three terms: (i) an internal energy determined by the elasticity and rigidity of the snake, (ii) an image energy representing the characteristics of the image (intensity, gradient, etc.) in the immediate vicinity of the snake, and (iii) a constraint energy that results from user interaction with the snake, from prior knowledge, etc. The desired contour is found by minimizing the energy functional
E_snake = \int_0^1 {E_int[x(s), y(s)] + E_image[x(s), y(s)] + E_con[x(s), y(s)]} ds.  (1)
In what follows, we make no use of external constraints represented by E_con. In the classic snake model, the internal energy can be written as

E_int(s) = a(s)[x_s^2(s) + y_s^2(s)] + b(s)[x_ss^2(s) + y_ss^2(s)],  (2)

where the subscripts on x and y indicate derivatives with respect to the arc length s. The coefficients a(s) and b(s) give, respectively, the strength of the elastic and rigidity forces in the snake. A large value of a(s) encourages the snake to shrink and penalizes the development of positional discontinuities. A large value of b(s) discourages sharp bends, whereas a small value allows corners to develop. In principle, the internal energy could be extended to include other characteristics of the snake. In the above expressions, the internal snake energy and the image energy are formulated as integrals over the
continuous variable s. For practical applications, the problem must be discretized, usually through sampling the energies at N equally spaced 'vertices' v_i around the contour C, so that v_i = v(s)|_{s=ih} and h = 1/N. When the equations are rendered in discrete form, the energies can be sampled either at the N vertices alone or integrated between vertices. A variational approach has usually been employed to minimize the snake energy E_snake [1,3-9]. The variation in the energy then reduces to a pentadiagonal matrix in the vertex positions. Such a matrix is easily inverted to obtain the vertex positions. Unfortunately, the desire to retain a band-diagonal matrix when other terms are added to the energy can be very restrictive. It will be seen in the discussion below that there are often significant advantages in removing this restriction by using other methods to minimize the snake energy. Eviatar and Somorjai [2], in a simplified approach, treat the problem somewhat differently in that the vertex positions are adjusted using Powell's method [10]. This method does not depend for efficiency on the problem being formulated in a special matrix form. Although Powell's method is efficient, it is not particularly intuitive. For this reason, we have carried out some of our calculations using a more pedestrian algorithm that is easily grasped. The method involves successively displacing the vertices over fixed distances in each of several directions. The displacement, if any, yielding the greatest reduction in energy compared to that of the former vertex position is taken as the new vertex position. Perhaps surprisingly, this algorithm tends to be somewhat faster than Powell's algorithm when used to fit snake vertices to a contour. The algorithm has very low 'overhead' that more than compensates for its lesser sophistication, at least in these applications. In addition, its intuitive operation makes it useful especially when the principal goal of an investigation is exploration of the effects of a novel energy term.
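A minimal sketch of this pedestrian minimizer follows, assuming only a callable energy that returns the total snake energy of a vertex list; the function and its name are ours, not those of Refs. [2,10].

def greedy_minimize(vertices, energy, step, max_sweeps=100):
    # Trial-move each vertex a fixed step in eight directions and keep
    # the displacement giving the greatest reduction in total energy.
    moves = [(step, 0), (-step, 0), (0, step), (0, -step),
             (step, step), (step, -step), (-step, step), (-step, -step)]
    for _ in range(max_sweeps):
        improved = False
        for i, (x, y) in enumerate(vertices):
            best_e, best_v = energy(vertices), (x, y)
            for dx, dy in moves:
                vertices[i] = (x + dx, y + dy)
                e = energy(vertices)
                if e < best_e:
                    best_e, best_v = e, vertices[i]
            vertices[i] = best_v
            improved = improved or best_v != (x, y)
        if not improved:      # no vertex can improve: stop at this step size
            break
    return vertices

In a multi-stage fit, the same routine is simply re-run with a smaller step and a different energy callable at each stage.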
3. Energy terms in snakes

3.1. Internal energy terms
The internal energy of a snake represents its elasticity and rigidity, plus other characteristics deemed useful to include. In the classic snake, only elastic and rigidity terms appear. Eq. (2), when discretized, yields the form

E_int = (1/2h) \sum_{i=1}^{N-1} a_i |v_i - v_{i-1}|^2 + (1/2h) \sum_{i=1}^{N-1} b_i |v_{i-1} - 2v_i + v_{i+1}|^2,  (3)
where v_i = v(s)|_{s=ih}. As noted above, the first term corresponds to an elastic energy in the snake and the second
term to a rigidity energy. Note that in principle, the values of a and b can be functions of the position of the vertex along the snake.

3.1.1. Elastic energy

The form for the elastic energy given in expression (3) has the disadvantage that there is no minimum size for the contour, and it can therefore shrink to a point if it is not constrained by image forces. One approach to circumvent this problem is to add an additional force normal to the contour so that there is no net tendency for the snake to expand or contract [11,12]. Other authors [2,8], while retaining the inherent tendency for a snake to contract, have modified the elastic energy so that there is an intervertex distance greater than zero for which the elastic energy goes to zero. This is particularly convenient when the distance between the vertices is known approximately from the initial contour. Eviatar and Somorjai [2] used an elastic energy of the form

E_intx(i) = κ_i (|x_i - x_{i+1}| - r_i)^2  (4)
for the X-coordinate, and a similar expression for the Y-coordinate. This potential has the form of a harmonic potential and describes a 'stretching or compression' energy with a minimum when the difference in the X-coordinates is r_i. The total energy was the sum of the expressions for the X- and Y-coordinates. In the present work, we have used a form of the elastic energy that is attractive for large intervertex distances and repulsive for short distances:

E_el = \sum_{i=1}^{N-1} [k_c / r_i^2 + k_e r_i^2].  (5)

Here k_c and k_e are constants for compression and extension, respectively, and

r_i^2 = (x_i - x_{i+1})^2 + (y_i - y_{i+1})^2  (6)

is the square of the distance between neighboring vertices. With this energy, it is also possible to choose the intervertex distance that will yield a minimum energy. It can be demonstrated that the energy is a minimum if all the intervertex distances are equal to (k_c/k_e)^{1/4}. The form of this potential is shown in Fig. 1 for the case (k_c = 1; k_e = 1).

Fig. 1. Graph of the elastic energy given by expression (5) for the case where both of the adjustable constants, k_c and k_e, are set to 1. It may be seen that the minimum of the elastic energy is at r = 1, as expected.

3.1.2. Rigidity energy

Most applications of active contours have employed the classic formulation for the rigidity energy shown in expression (3). Unfortunately, unless the vertices remain uniformly spaced around the snake, this form depends not only on the curvature of the snake, but on the spacing as well. This problem may be circumvented by using the angles between segments explicitly and formulating the rigidity energy as

E_r = k_r \sum_{i=1}^{N-1} θ_i^2.  (7)

In this expression, -π ≤ θ_i ≤ π, and the angle θ_i is defined as the angle between the vector from vertex i-1 to vertex i and the vector from vertex i to vertex i+1, as shown in Fig. 2.

Fig. 2. Definition of the angle θ_i between line segments as used in expression (7).

This form of the rigidity energy can be modified readily for the case where the angles at the vertices are expected to be approximately the same as in the initial approximate contour. A suitable rigidity energy would then be

E_r = k_r \sum_{i=1}^{N-1} [(θ_i - θ̄_i) / θ̄_i]^2,  (8)

where the angles θ̄_i are those at the vertices of the initial approximation. This form of the rigidity energy is used in the calculations described below.
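As an illustration, these two internal energies may be coded directly from Eqs. (5) and (7). The sketch below is ours, written for a closed contour given as a list of (x, y) vertices, and uses atan2 to obtain the turn angle θ_i.

import math

def elastic_energy(v, k_c, k_e):
    # Eq. (5): sum of k_c/r_i^2 + k_e*r_i^2 over neighbouring vertices
    total = 0.0
    for i in range(len(v)):
        (x0, y0), (x1, y1) = v[i], v[(i + 1) % len(v)]
        r2 = (x0 - x1) ** 2 + (y0 - y1) ** 2
        total += k_c / r2 + k_e * r2
    return total

def rigidity_energy(v, k_r):
    # Eq. (7): k_r times the sum of the squared turn angles theta_i
    total = 0.0
    for i in range(len(v)):
        (ax, ay), (bx, by), (cx, cy) = v[i - 1], v[i], v[(i + 1) % len(v)]
        t1 = math.atan2(by - ay, bx - ax)
        t2 = math.atan2(cy - by, cx - bx)
        theta = (t2 - t1 + math.pi) % (2 * math.pi) - math.pi  # wrap to (-pi, pi]
        total += k_r * theta ** 2
    return total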
3.1.3. Area energy

With a snake employing only the elastic and rigidity energies of the classic formulation, one observes the following: (i) the snake, when not subjected to any external image forces, finds its equilibrium as a point or a line, depending on the elastic and rigidity parameters, and (ii) when the initial contour is not sufficiently close to relevant image features, it is not attracted to them. In order to circumvent these problems, various workers [7,13,14] have employed a 'balloon' force which causes the snake to expand or contract as a whole. The additional force has been formulated as

F_bal = k_bal n(s),  (9)

where k_bal is a constant and n(s) is a unit vector perpendicular to the snake at the point s. Note that this is a force, not an energy. In the present work, we have investigated an energy proportional to the area enclosed by the snake, which also has the effect of causing the contour to expand or contract as a whole. With this energy, the snake tends either to contract or expand indefinitely until stopped by elastic or image forces. The form we have used is

E_A = k_A A,  (10)

where A is the area enclosed by the snake. The sign of k_A determines whether the snake will tend to expand or contract. The area is calculated from the expression

A = (1/2) \sum_{i=1}^{N} [(x_i - x_m)(y_{i+1} - y_m) - (x_{i+1} - x_m)(y_i - y_m)],  (11)

where x_m is the smallest X-coordinate of all the vertices, and y_m is the smallest Y-coordinate. The vertex at (x_{N+1}, y_{N+1}) is taken to be identical to that at (x_1, y_1). This expression holds for any polygon, convex or otherwise, provided that the vertices are ordered around the contour and that no line segments joining the vertices intersect. During energy minimization with an area energy, an explicit check for intersecting line segments is required to ensure that this relationship is applicable.
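A direct transcription of Eqs. (10) and (11) follows; this is our sketch, and the magnitude is taken so that the sign of k_A alone controls expansion or contraction.

def enclosed_area(v):
    # Eq. (11): shoelace sum over the closed polygon through vertices v,
    # shifted by the minimum coordinates (x_m, y_m) as in the text
    x_m = min(x for x, _ in v)
    y_m = min(y for _, y in v)
    total = 0.0
    for i in range(len(v)):
        (x0, y0), (x1, y1) = v[i], v[(i + 1) % len(v)]
        total += (x0 - x_m) * (y1 - y_m) - (x1 - x_m) * (y0 - y_m)
    return 0.5 * abs(total)

def area_energy(v, k_A):
    return k_A * enclosed_area(v)          # Eq. (10)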
3.1.4. Symmetry energy

When an object is blurred, broken or apparently unphysically convoluted, boundaries obtained by thresholding or standard edge detectors can exhibit unphysical departures from symmetry. In such a case, imposing a tendency toward symmetry can be essential. In the present work, we have used an energy that explicitly depends on the departure from mirror symmetry with respect to an axis already approximately known. The symmetry axis was allowed to move to minimize the total energy, but was assumed to remain straight.

Fig. 3. Definition of the distance d for vertex 2 in the symmetry potential of expression (12). Vertex 2 is reflected across the symmetry axis (dashed line) to the position 2' shown by the open circle. The distance d is defined as the Euclidean distance between the point 2' and the corresponding vertex (5), represented by the solid disc.
The calculation of the symmetry energy was carried out as follows. Each point (x_i, y_i) was reflected across the axis, and the Euclidean distance d_i to the nearest point on the other side was calculated. The symmetry energy was then calculated using

E_sym = k_sym \sum_{i=1}^{N} d_i^2.  (12)

Fig. 3 illustrates the procedures used in the calculation.
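For illustration, the reflection step can be coded as below. This is our sketch: the axis is given by a point a and a unit direction u, and for simplicity the nearest 'point on the other side' is approximated by the nearest snake vertex.

def symmetry_energy(v, a, u, k_sym):
    # Eq. (12): reflect each vertex across the axis and accumulate the
    # squared distance to the nearest vertex
    ax, ay = a
    ux, uy = u                         # unit vector along the axis
    total = 0.0
    for (x, y) in v:
        dx, dy = x - ax, y - ay
        t = dx * ux + dy * uy          # component of (p - a) along the axis
        rx = ax + 2 * t * ux - dx      # reflected point
        ry = ay + 2 * t * uy - dy
        total += min((rx - qx) ** 2 + (ry - qy) ** 2 for (qx, qy) in v)
    return k_sym * total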
3.2. Image energy terms

Generally, one is interested in causing the snake to move toward strong image features, or in ensuring that the snake avoids strong features. This is accomplished using an image energy of the form

E_I = k_I \sum_{i=1}^{N} g(x_i, y_i).  (13)
The value of g(x_i, y_i) may be taken as the grey value for the pixel at coordinates (x_i, y_i); alternately, it may be interpreted as the average of grey values for the pixels along the line segments joining vertex (x_i, y_i) to its nearest neighbors. Since the total energy is to be minimized, a positive value of k_I will cause the snake to seek out regions in the image with small values of g(x_i, y_i). Extensions to this basic form may be found in Refs. [15,16].
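In its simplest form, Eq. (13) samples the image at the vertices only; the short sketch below is ours and assumes a grey-level image stored row-major as img[y][x].

def image_energy(v, img, k_I):
    # Eq. (13) with g taken as the grey value at each vertex
    return k_I * sum(img[int(y)][int(x)] for (x, y) in v)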
When the goal is to fit a snake to a boundary within an image, it is useful to preprocess the image with an edge detector so that the points of maximum gradient are emphasized. The edge detector most commonly employed uses the gradient of the image convolved with a Gaussian smoothing function. This was modified somewhat in Ref. [4] where, in addition, an edge detector without smoothing was employed. The first term allowed the snake to 'feel' the edge from some distance away, and then, as it approached the edge, to fall into the 'pit' provided by the unsmoothed edge detector.

3.3. Comments on energy terms

The internal energy functions are not 'mutually orthogonal', in that some of the energy functionals have effects similar to those of others. For instance, attractive first- or second-neighbor energies will cause contraction, much as an area energy will. On the other hand, a repulsive second-neighbor term will have the effect of removing regions of high curvature, as will a rigidity energy expressed in terms of angles. In our experience, it is best to avoid two or more energy functionals that have similar effects. It appears, in fact, to be most effective to use as few energy terms as possible, and to subdivide an analysis, if necessary, into several stages with different energy terms. Finally, it should be noted that some constraints on a snake may best be imposed through an explicit calculation rather than through an energy term. This is illustrated below, where an area energy, which encourages contraction, can cause vertices to collapse into long filaments of small area and can result in line segments crossing each other. Both effects are serious and are difficult to correct once they have appeared in an analysis. While these problems can usually be avoided by using a weak repulsive term in the elastic energy, it is still necessary to monitor for their appearance and to add a penalty term to any vertex displacement that appears to be leading to difficulties.
4. Behavior of energy terms

In this section, the behavior of several energy terms is examined using artificial images. A comparison is also made of results obtained using the new terms and those obtained using more standard terms. Finally, some comments are made regarding the amount of user intervention required to obtain satisfactory results. Applications to realistic problems are considered in the next section. It is assumed in the following discussion that the fitting procedure occurs in several stages and that different numbers of vertices and different energy terms may be used in each stage. This allows the user to cause the snake to behave differently during successive stages of the fitting procedure. For instance, in one stage, the snake might be encouraged to lie close to, but exterior to, a closed contour. In a subsequent stage, the parameters might be changed to cause the snake to lie on top of the contour to the greatest extent possible. This strategy not
only speeds up the analysis by requiring fewer vertices in the initial stages, but also allows the user to minimize the number of energy functionals in each stage.

4.1. Elastic energy

Consider a problem in which an initial, approximate snake is known to lie outside the contour to be fitted, and it is required to cause the snake to approach the contour and eventually 'hug' it in a satisfactory manner. It is assumed that the snake is closed; that is, the last vertex is joined to the first. When the initial snake lies far outside the contour, it is necessary that the snake collapse until it comes under the influence of image forces near the contour. If the contour is blurred or can be blurred, say by convolving it with a Gaussian kernel, the snake need collapse only until it reaches the tail of the blurred contour. Image forces can then be used to draw the snake into the desired parts of the image. The snake can be caused to move inward in a variety of ways. Two are considered here: (i) attractive elastic forces and (ii) minimization of the area within the snake. We have used the elastic energy formulation of expression (5). This form has both attractive and repulsive terms and allows the user to encourage the vertices to space themselves uniformly as the snake contracts. The classic formulation, which uses expression (2), is the same as the one we have used with the repulsive forces set to zero (k_c = 0 in expression (5)). Fig. 4 shows the results of drawing a snake toward a contour using an attractive elastic force. The contours to be fitted were blurred by convolution with Gaussian kernels using σ = 0, 1, 2 and 5 pixels, respectively. The blurring was extended for 3σ around the contour. Note, however, that the contours, as shown in the figure, are 3 pixels wide and unblurred. The fitting was done in three stages with the parameters shown in the first part of Table 1. In the first stage, 15 vertices were used, and step sizes were 10 pixels. A repulsive elastic force was included to encourage the vertices to space themselves uniformly around the contour as it collapsed. Each vertex was considered in succession for a 10-pixel move in each of several directions, and the process was repeated until no vertex could make such a move. In the second stage, 30 vertices were used, the repulsive force was removed and the step size was reduced to 3 pixels. Finally, in the third stage, 60 vertices were used and the step size was reduced to 2 pixels. It may be seen that the contour has not dropped into the deeper, concave feature for σ = 0, 1. This is because it is necessary to stretch the snake in the vicinity of a concave region. If the blurring is sufficiently large, σ = 5, the snake can be captured by the contour before the elastic energy becomes too large. Generally, it has been our experience that if the range of the blurring is significantly less than the spacing of the vertices in the final
Fig. 4. Behavior of a snake using an elastic energy to draw the snake toward the contour to be fitted. In the first two parts of the figure (top row), for which σ = 0, 1, the snake does not enter the concave feature. For the last two parts, σ = 2, 5, the fits are satisfactory.
stage, the snake does not reliably drop into concave features when elastic forces are used. It is also clear that if the elastic energy is too large, the snake may fall through the contour and form a roughly circular figure, with the vertex spacing determined by the ratio of the coefficients for the attractive and repulsive elastic energies as described above. When only image and elastic energies are used, it is usually best to use a strong image energy so that there is little chance of the snake dropping through the contour.
4.2. Area energy

The results of using the area energy to draw the snake toward the contour are shown in Fig. 5, and the parameters used are shown in the second part of Table 1. The results are more satisfactory than those obtained using an elastic energy, especially in the concave region with small blurring (sigma = 0, 1, 2). The most important problem encountered with the area energy was a tendency for those parts of the snake that had to move a large distance to approach the contour to collapse into filaments of small area. This was overcome by using a small repulsive elastic energy in the first stage of the analysis. It may require some experimentation to ensure that the snake continues to collapse in spite of the repulsive force between the vertices, but parameters similar to those given in Table 1 have usually been found to be satisfactory.

4.3. Symmetry energy

To study the behavior of the symmetry energy term, we have used a contour similar to that used for the elastic and area energies, but with a fraction of the points removed. The resulting contour, shown in the first part of Fig. 6, can be seen to have an approximate symmetry axis. No blurring of the contour was used in this study. The goal in this case is to draw the snake close to the contour, allow the snake to be "captured" by the contour where the contour is available, and to use the symmetry energy to obtain a reasonable result where it is not. An initial approximate symmetry axis was given and the program was allowed to search for an improved axis. If no symmetry potential is used and the snake is drawn to the contour using the area and image energies only, the snake tends to move too far into that part of the figure where the contour is broken. On the other hand, if the symmetry energy is too large, it may dominate the area energy and lead to unexpected results, as shown in the second part of Fig. 6. In this case, the resulting snake is quite symmetric. Unfortunately, no vertex can move without the increase in the symmetry energy exceeding the decrease in the area energy. A slight reduction in the strength of the symmetry potential gave the results shown in the third part of the figure. It may be seen that the fit is now quite satisfactory and that the symmetry axis is well constrained by the upper part of the figure. The parameters used are shown in the third part of Table 1. Appropriate values for the ratios of the symmetry, area and image energies must be found by experimentation. This calculation is somewhat more complicated than those described above because there are now three energies (image, area and symmetry) competing for control of the snake.

Table 1
Parameters used in fitting artificial images

I: Draw snake to contour using attractive elastic energy
Stage        1        2        3
Attractive   0.001    0.001    0.001
Repulsive    1.0      0.0      0.0
Symmetry     0.0      0.0      0.0
Area         0.0      0.0      0.0
Image        0.2      0.2      0.2

II: Draw snake to contour using area energy
Stage        1        2        3
Attractive   0.0      0.0      0.0
Repulsive    1.0      0.0      0.0
Symmetry     0.0      0.0      0.0
Area         0.0001   0.0001   0.0001
Image        0.2      0.2      0.2

III: Use of symmetry energy
Stage        1        2        3
Attractive   0.0      0.0      0.0
Repulsive    1.0      0.0      0.0
Symmetry     0.0004   0.0004   0.1
Area         0.0001   0.0001   0.0
Image        0.2      0.2      0.2

The parameters are the values used in the expressions for the energy given above. Attractive: k_c, Eq. (5); repulsive: k_e, Eq. (5); symmetry: k_sym, Eq. (12); area: k, Eq. (11); image: k_I, Eq. (13).

4.4. Summary

We have found that several energy functions, which do not yield a simple set of matrix equations when discretized, can nevertheless be very useful. By using simple minimization algorithms as outlined below, these functionals, as well as explicit constraints on the vertices, can be readily incorporated into a snake analysis.
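As a concrete illustration of the simple minimization algorithms referred to above, the staged, vertex-at-a-time search of Sections 4.1 and 4.2 can be sketched as follows. The energy callback and the shoelace area term are generic stand-ins under stated assumptions, not the authors' code; step sizes and stage counts follow part I of Table 1.

```python
import numpy as np

def polygon_area(verts):
    """Unsigned area enclosed by the closed snake (shoelace formula)."""
    x, y = verts[:, 0], verts[:, 1]
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def greedy_stage(verts, energy, step, n_dirs=8):
    """One stage of the simplified snake: try moving each vertex in turn
    by `step` pixels in several directions, keep any move that lowers the
    total energy, and repeat until no vertex can improve."""
    dirs = [np.array([np.cos(a), np.sin(a)]) * step
            for a in np.linspace(0.0, 2 * np.pi, n_dirs, endpoint=False)]
    improved = True
    while improved:
        improved = False
        for i in range(len(verts)):
            best_e = energy(verts)
            best_pos = verts[i].copy()
            for d in dirs:
                verts[i] = best_pos + d
                e = energy(verts)
                if e < best_e:
                    best_e, best_pos = e, verts[i].copy()
                    improved = True
            verts[i] = best_pos
    return verts
```

Successive stages double the number of vertices and shrink the step size (10, 3 and 2 pixels in part I of Table 1), exactly as described in the text.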
5. Applications to realistic problems

In this section, the principles discussed above are applied to three specific problems. The emphasis is on the use of a snake model with simple energy terms and simple energy minimization procedures.

Section through a human brain (magnetic resonance imaging, MRI): In the example presented, the interest is in fitting the boundary of the anterior cingulate. In an MR image, there is usually considerable detail other than the object of interest, and the initial contour is drawn by hand. The snake is to fit itself snugly against the boundary of the object of interest, where the change in the image grey level is most rapid. This calculation appeared earlier in Ref. [2], but is included here for completeness.

Section through a grain kernel (optical microscopy): In this case, the image consists of a transverse section through a wheat kernel. It is usually possible to preprocess these images and produce binary images with few artifacts outside the kernel boundary, although there may be significant structure within the kernel, and there may be discontinuities in the boundary of the kernel. The snake is to fit itself closely to the outer edge of the kernel while yielding a "reasonable" boundary at discontinuities.

Coronal section through a fetal skull (ultrasound): Ultrasound images are often of low quality as compared to those obtained with MRI or optical microscopy. As a result, standard edge detectors do not work well. The apparent boundaries obtained by thresholding are very convoluted, but the convolutions are due to noise and have little or no physical significance. It is therefore necessary to impose strong regularity constraints on the fitting contour. One may proceed by adjusting the parameters of a symmetric shape (e.g., a superellipse) to give an approximate fit to the skull boundary. The resulting contour requires slight adjustments to conform better to the actual skull in the image, and this is easily carried out using a snake in which the energy terms are dependent on the degree of departure from the initial approximation as well as on image features.

5.1. Section through a human brain

It is often desirable to know the shape of a portion of an organ or the area enclosed by a structure such as the
Fig. 5. Behavior of a snake using an area energy to draw the snake toward the contour to be fitted. In this case, the fits are satisfactory for all values of sigma. The values of sigma are the same as those used in Fig. 4.
skull. When the boundaries are distinct, it is often possible to employ only a few energy terms, and sometimes to do away with even supposedly basic terms such as the snake rigidity energy. This is illustrated by a calculation shown in Fig. 7, which was described earlier in Ref. [2]. The goal was to delineate one side of the anterior cingulate (the dark area near the snake). An open snake was used. The lower of the two contours is the initial snake, drawn fairly close to the anterior cingulate, but within the nearby light area. The final position of the snake is shown by the upper contour. The endpoints of the snake were fixed so that they did not move during optimization. The internal energy was taken to be

E_int(i) = kappa_i (|x_i - x_{i+1}| - r_i)^2    (14)

for the X-coordinate, with a similar term for the Y-coordinate. Note that no rigidity term was used. The value of kappa_i was defined as

kappa_i = kappa / |x_i - x_{i+1}|.
This meant that pairs of vertices that were close together were strongly influenced by the elasticity of the snake, whereas those far apart were more strongly influenced by image forces. Such an energy term cannot be reduced to matrix form. The image energy was chosen so that the snake would be attracted to edges within the image. To allow the snake to feel the edges at some distance, a Gaussian filter was used to blur the image. A Sobel operator was then used to extract the image gradient in the X and Y directions, and finally, the image energy was taken to be the negative square of the gradient. To simplify the calculation as much as possible, the image energy was calculated only at the vertex positions. A total of 12 vertices were used with kappa = 50 and r_i = (vertex separation in the initial snake)/16. Powell's method of conjugate directions was used to minimize the energy functional by direct manipulation of the vertex positions. In spite of the simplicity of the model, the fit in Fig. 7 is quite satisfactory.
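For reference, the image energy just described (Gaussian blur, Sobel gradients, negative squared gradient magnitude sampled only at the vertices) can be sketched as follows. The blur width `sigma` is an assumed value, since the text does not state the one used for this figure.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def image_energy(image, verts, sigma=3.0):
    """Negative squared gradient magnitude of the blurred image,
    sampled at the (rounded) snake vertex positions."""
    blurred = gaussian_filter(image.astype(float), sigma)
    gx = sobel(blurred, axis=1)   # gradient along X (columns)
    gy = sobel(blurred, axis=0)   # gradient along Y (rows)
    grad_sq = gx ** 2 + gy ** 2
    rows = np.clip(np.round(verts[:, 1]).astype(int), 0, image.shape[0] - 1)
    cols = np.clip(np.round(verts[:, 0]).astype(int), 0, image.shape[1] - 1)
    return -np.sum(grad_sq[rows, cols])
```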
Fig. 6. Behavior of a snake when a symmetry energy is used. The upper left part of the figure shows the approximately symmetric contour used. In the second part, the symmetry potential is too strong, and in the third part, the snake conforms well to the contour and fills in a satisfactory shape where the contour is not available.
5.2. Section through a grain kernel
Fig. 7. An open snake used to delineate one side of the anterior cingulate (the dark area near the snake). The rightmost of the two contours is the initial snake, and the leftmost contour shows the final position of the snake.
Fig. 8 shows a typical transverse section through a kernel of wheat as viewed through an optical microscope. The purpose of the analysis of grain kernels was to determine the morphological characteristics of the outer kernel boundary. Suitable thresholding allowed the greyscale image to be converted into a binary image (Fig. 9), and subsequent morphological operations allowed a band of pixels representing the boundary of the kernel to be extracted, as in Fig. 10. The images shown illustrate one of the difficulties encountered in this analysis. In some places, the outermost layer of the kernel is very thin just outside the "giant cells" (the light layer just inside the kernel boundary), and the thresholding algorithm has erroneously placed the boundary on the inside of the giant cells rather than at the outside of the kernel. This has resulted in undesirable irregularities in the kernel boundary. This analysis illustrates the use of an energy proportional to the area enclosed within the snake. The analysis was carried out using a very simple energy minimization algorithm and only two energy functionals: (i) an area energy which caused the snake to contract from its initial position and (ii) a strong image energy which initially caused the snake to avoid contact with the band of pixels
Fig. 8. Digitally recorded television scan of section through a kernel of wheat obtained using a microscope and visible light. The section is taken approximately at the midpoint of the kernel. The supporting matrix is epoxy.
Fig. 9. The kernel in the previous figure thresholded to produce a binary image.
representing the kernel boundary and subsequently caused it to lie within the band. The snake was moved inward from its initial position in three stages. In the first two stages, the image energy was repulsive, so that the snake avoided any contact with the kernel boundary. The image energy was formulated
Fig. 10. The thresholded image after processing with morphological operators to remove noise and to produce a band of pixels corresponding to the boundary of the kernel.
to include not only the vertices, but also snake pixels between the vertices. As a result, the contour avoided contact not only at the vertices but also at any points on the straight line segments joining the vertices. In the third stage, the sign of the image energy was reversed so that the contour had a strong tendency to place itself within the band of pixels representing the boundary (Fig. 10). With a binary image, it is important that the initial stages of the analysis use a repulsive image force. Otherwise, a line segment may come to lie across the boundary (typically 3 pixels wide) with one vertex outside the boundary and the next one inside. Should this situation arise, the ability of the algorithm to move one vertex inwards and the other outwards, so that the line segment between the vertices lies as much as possible on the boundary, depends sensitively on the spacing of the vertices and the magnitude of the image and area energies. In the first stage, a vertex was allowed to move 10 pixels at a time. Each vertex was considered in succession for a 10-pixel move and the process repeated until no vertex could make such a move. At the end of the first stage, the number of vertices in the snake was doubled, typically from 15 to 30 vertices. The spacing of the new points was chosen to make the spacings of all the vertices as nearly uniform as possible without moving any of the original vertices. In the second stage, the same energies were used and steps of 3 pixels were allowed. In the final stage, when the snake was close to the kernel boundary but nowhere touching it, the number of
vertices was again doubled, the step size was reduced to 1 pixel and the sign of the image energy was reversed so that the snake was encouraged to lie as much as possible within the band of pixels defining the boundary of the kernel. The area energy tending to cause the snake to contract was retained, but the image energy was sufficiently large that the snake did not fall through the band of pixels defining the boundary.

Note that this analysis is very forgiving as far as the parameter choices are concerned. Since the repulsive image forces come into play only when the snake actually comes in contact with the kernel boundary, the user needs only to choose a strong repulsive image force and a relatively weak area force to cause the snake to lie close to, but everywhere outside, the boundary. Switching then to a strong attractive force, while retaining the relatively weak area force, causes the snake to fit the boundary well. Very little empirical parameter optimization is required.

Although it was found that this algorithm worked well, several precautions were necessary. When the snake was initially placed far from the boundary, the algorithm sometimes caused a set of neighboring vertices to collapse into a thin filament. When this happened, the vertices could no longer adjust themselves to reduce the area further, nor could they fit themselves closely to the kernel boundary. This was remedied by using a repulsive elastic energy, similar to that shown in Table 1, to hold vertices apart in the first stage of the analysis. A more serious problem was that line segments could sometimes intersect as the vertices were moved. This problem was circumvented by checking for intersecting line segments on each trial vertex displacement and adding a large energy penalty to any case with intersecting segments. Note that this would be awkward to formulate in the classic snake formalism.

Fig. 11 shows the snake superimposed on the wheat kernel of Fig. 8. It may be seen that the fit is excellent except in the region of the crease. A better fit in this region could have been obtained by blurring the kernel boundary slightly so that image forces could be felt at some distance from the boundary. For the kernel analysis, however, it was considered more practical to fit the crease region in a separate calculation.

Fig. 11. The original image of the wheat kernel with the snake superimposed.

5.3. Coronal section through a fetal skull
The purpose of this analysis was to use a snake to make fine adjustments in the fit to a very noisy image in order to improve the subjective quality of the fit. An ultrasound image showing a coronal section through a fetal skull in utero is shown in Fig. 12. As may be seen from the figure, ultrasound images tend to be of low quality, so that standard methods of finding the boundary of the skull do not work well. In addition, the image is able to provide information as to the position of the skull over less than half the perimeter of the skull. The snake is thus essentially unconstrained over more than half its perimeter. The procedure employed was somewhat different from that of the above two applications. It was noted that a coronal section through the fetal skull above the line of
Fig. 12. Ultrasound image of the fetal skull showing a superellipse with parameters adjusted to approximate the outer boundary of the skull. The superellipse is used as the initial approximation to the snake.
the jaw is well represented by a superellipse. A superellipse was therefore placed near the fetal skull and the parameters adjusted to conform as well as possible to the position, orientation and shape of the skull. A snake was then used to improve the quality of the fit under the assumption that the shape of the snake would differ only slightly from that of the initial superellipse. A superellipse centered at the coordinate system origin and with its major and minor axes along the X- and Y-axes has the functional form
x L y L # "1, a b
(15)
where a and b are the lengths of the semi-major and semi-minor axes, respectively, and n is an exponent, not necessarily integer. When n = 2, the superellipse is an ellipse, whereas as n tends to infinity, it approaches a rectangle. The superellipse can be rotated through an angle theta by transforming (x, y) to a rotated coordinate system, and the center can be displaced to an arbitrary point (x_0, y_0) by replacing x by x - x_0 and y by y - y_0. An arbitrary superellipse can therefore be described by a total of six parameters: a, b, n, x_0, y_0 and theta.

The image was preprocessed in the following manner. The image gradient in the vicinity of the skull was obtained by applying a 3x3 Sobel operator and averaging the resulting gradients using a 5x5 kernel. The gradients obtained were sufficiently smooth to allow the parameters of a superellipse to be adjusted to the region of maximum gradient, as shown in Fig. 12. Although the superellipse provides a reasonable fit to the fetal skull, the use of a mathematical expression imposes a very high degree of symmetry. A snake was therefore used to allow the fit to be more responsive to the image while still maintaining approximate symmetry. The initial snake was obtained by sampling the superellipse at several (usually 20) equally spaced points. Since the snake was expected to deviate only slightly from the superellipse, the internal energies were chosen to reflect this expectation. Three internal energies were used:

1. An elastic energy corresponding to a force that was repulsive at close range and attractive at long range, similar to that given in Eqs. (5) and (6). Eq. (6) was modified slightly so that the distance was that between the current vertex position, (x_i, y_i), and its initial position in the superellipse, (x_i^0, y_i^0). The elastic energy therefore measured the distortion of the superellipse.
2. A rigidity energy (Eq. (8)) that was zero when the angles formed by the straight line segments of the snake between vertices were the same as in the sampled superellipse.
3. A symmetry energy that was zero when the snake was symmetric about a line through the vertices originally on the major axis of the superellipse. Since the image was of poor quality, the symmetry axis was not allowed to move during the analysis.

One image force was used; namely, a gradient energy encouraging the snake to maximize its overlap with edges in the image. The symmetry energy was important in this calculation because the elastic and rigidity energies did not introduce sufficient long-range correlation to overcome the effects of the noisy images. The symmetry energy evaluated at vertex v_i sampled distant points which would not otherwise be correlated with v_i. As expected, tests showed that when the symmetry energy was not employed, the snake became unphysically deformed. The Powell minimizer was used to minimize the snake energy. The energies listed above, and particularly the symmetry energy, impose strong correlations on the manner in which the vertices should move in order to minimize the energy. Since Powell's method is a conjugate direction method [10], it is ideally suited for this type of calculation.

The fetal skull is shown with the final snake in Fig. 13. As mentioned above, the initial contour was a fairly good representation of the fetal skull, and it is gratifying that there has been only a slight movement of the contour from its initial position, in spite of the fact that the contour is poorly constrained by the image for much of its length. The symmetry energy has ensured that the unconstrained part of the contour tends to match the part that is constrained by the image.

As mentioned above, the use of a snake in situations such as the fetal skull analysis is mainly to ensure that the resulting contour both conforms to the image and satisfies the user's expectations. If the elastic and rigidity forces were both strong, the snake would be unable to move in response to either the symmetry or image energies. The strategy of the analysis is therefore to reduce the strengths of the elastic and rigidity energies until the snake responds to the image in a manner that satisfies the user. A fairly large symmetry energy allows the elastic and rigidity energies to be reduced sufficiently far that the snake actually can respond to the image energy without losing its overall symmetry. In such an analysis, it is clear that considerable exploration may be necessary to find the correct balance of parameters.

Fig. 13. Ultrasound image of the fetal skull, showing the final position of the snake. In spite of the fact that the image does little to constrain the snake over much of its perimeter, the symmetry energy has successfully retained the overall symmetry while allowing the snake to conform to the image somewhat better than did the superellipse.
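The superellipse sampling and the Powell-based refinement can be sketched as follows. The parameterization below is a standard one satisfying Eq. (15); `total_energy` stands in for the combined elastic, rigidity, symmetry and image terms, whose exact weights are not reproduced here, so this is an illustration rather than the authors' code.

```python
import numpy as np
from scipy.optimize import minimize

def superellipse(a, b, n, x0, y0, theta, n_pts=20):
    """Sample a rotated, translated superellipse at equally spaced angles.
    With x = a*sign(cos t)*|cos t|^(2/n), |x/a|^n = |cos t|^2, so the
    points satisfy |x/a|^n + |y/b|^n = 1 (Eq. (15))."""
    t = np.linspace(0.0, 2 * np.pi, n_pts, endpoint=False)
    x = a * np.sign(np.cos(t)) * np.abs(np.cos(t)) ** (2.0 / n)
    y = b * np.sign(np.sin(t)) * np.abs(np.sin(t)) ** (2.0 / n)
    c, s = np.cos(theta), np.sin(theta)
    return np.column_stack([x0 + c * x - s * y, y0 + s * x + c * y])

def refine_with_powell(init_verts, total_energy):
    """Refine the sampled superellipse by minimizing the snake energy
    directly over the vertex coordinates with Powell's conjugate
    direction method, as in the text."""
    res = minimize(lambda p: total_energy(p.reshape(-1, 2)),
                   init_verts.ravel(), method="Powell")
    return res.x.reshape(-1, 2)
```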
6. Discussion

In this paper, we have discussed a simplified snake and used it to carry out several calculations spanning a wide range of image quality. The snake employed in this work has several distinct advantages, especially during the exploratory portions of an investigation:

1. The analysis is intuitive and transparent. This is especially true of the simplified energy minimization procedure in which the vertices are moved one at a time.
2. Additional internal energy terms (e.g., the area and symmetry terms) may easily be added without unduly complicating the minimization problem. In the classic snake formulation, either term would have destroyed the pentadiagonal character of the matrices.
3. The use of the area energy involves monitoring the vertices to ensure that no line segments joining the vertices intersect. In addition, it is useful to monitor whether any vertices have collapsed to form long filaments. Both of these tasks are easily included in our simplified snake formulation; a sketch of the intersection test is given below.

These snakes have been applied to several images with very different characteristics, including artificial images, binary images and noisy images. It is encouraging that the simplified snakes can treat all these images in a coherent manner.
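The intersection test referred to in point 3 can be implemented with the standard counter-clockwise predicate; the O(n^2) pairwise check below is entirely adequate for snakes of a few tens of vertices. This is a generic sketch, not the authors' code.

```python
import numpy as np

def _ccw(a, b, c):
    """True if the points a, b, c are in counter-clockwise order."""
    return (c[1] - a[1]) * (b[0] - a[0]) > (b[1] - a[1]) * (c[0] - a[0])

def segments_cross(p1, p2, p3, p4):
    """Proper-intersection test for segments p1p2 and p3p4."""
    return (_ccw(p1, p3, p4) != _ccw(p2, p3, p4) and
            _ccw(p1, p2, p3) != _ccw(p1, p2, p4))

def snake_self_intersects(verts):
    """Check every pair of non-adjacent edges of the closed snake.
    A trial vertex move that makes this true can simply be rejected
    (equivalently, given a large energy penalty)."""
    n = len(verts)
    edges = [(verts[i], verts[(i + 1) % n]) for i in range(n)]
    for i in range(n):
        for j in range(i + 2, n):
            if i == 0 and j == n - 1:   # adjacent edges around the wrap
                continue
            if segments_cross(*edges[i], *edges[j]):
                return True
    return False
```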
7. Summary

Active contours, or "snakes", are often used to identify boundaries within images that are too irregular or noisy to be detected reliably by conventional means. A snake may be thought of as an elastic curve that, through minimization of an energy functional, deforms and adjusts its initial shape on the basis of image information to provide a continuous boundary. A central problem in the use of snakes is the choice of appropriate forms for the energy functionals. The classic Kass et al. [1] formulation allows the problem to be reduced to matrix form, but places severe restrictions on the types of energy functionals which can be used. Eviatar and Somorjai [2] developed a less elaborate formalism, which employed Powell's method to carry out energy minimization on the vertex positions directly. There were no restrictions on the form of the energy functionals, which were chosen to be simple. The present paper extends this simplified snake and presents additional energy functionals found to be useful in specific applications. We describe them and apply them to artificial and realistic problems, considering their advantages and disadvantages in each case. The realistic problems include an MRI section through a human brain, an optical microscopy section through a grain kernel, and an ultrasound section through a fetal skull. We conclude that the simplified snake can provide satisfactory results for a wide range of applications.
Acknowledgements N.E.D. wishes to acknowledge the hospitality of the Institute for Biodiagnostics during the period that this work was carried out.
References

[1] M. Kass, A. Witkin, D. Terzopoulos, Snakes: active contour models, Int. J. Comput. Vision 1 (1988) 321-331.
[2] H. Eviatar, R.L. Somorjai, A fast simple active contour algorithm for biomedical images, Pattern Recognition Lett. 17 (1996) 969-974.
[3] S. Ranganath, Contour extraction from cardiac MRI studies using snakes, IEEE Trans. Med. Imaging 14 (1995) 328-338.
[4] T. McInerney, D. Terzopoulos, A dynamic finite element surface model for segmentation and tracking in multidimensional medical images with application to cardiac 4D image analysis, Comput. Med. Imaging Graphics 19 (1995) 69-83.
[5] D. Terzopoulos, A. Witkin, M. Kass, Constraints on deformable models: recovering 3D shape and nonrigid motion, Artif. Intell. 36 (1988) 91-123.
[6] R. Samadini et al., Evaluation of an elastic curve technique for finding the auroral oval from satellite images automatically, IEEE Trans. Geosci. Remote Sensing 28 (1990) 590-597.
[7] L.D. Cohen, On active contour models and balloons, CVGIP: Image Understanding 53 (1991) 211-218.
[8] I. Carlbom, D. Terzopoulos, K.M. Harris, Computer-assisted registration, segmentation, and 3D reconstruction from images of neuronal tissue sections, IEEE Trans. Med. Imaging 13 (1994) 351-362.
[9] V. Chalana, D.T. Linker, D.R. Haynor, Y. Kim, A multiple active contour model for cardiac boundary detection on echocardiographic sequences, IEEE Trans. Med. Imaging 15 (1996) 290-298.
[10] W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery, Numerical Recipes, Cambridge University Press, Cambridge, 1992.
[11] Gang Xu, Eigo Segawa, Saburo Tsuji, Robust active contours with insensitive parameters, Pattern Recognition 27 (1994) 879-884.
[12] S.R. Gunn, M.S. Nixon, A robust snake implementation; a dual active contour, IEEE Trans. Pattern Anal. Mach. Intell. 19 (1997) 63-68.
[13] L.D. Cohen, I. Cohen, A finite element method applied to new active contour models and 3D reconstruction from cross section, IEEE Computer Society Conference, Osaka, Japan, 1990, pp. 587-591.
[14] N. Ayache et al., Steps toward the automatic interpretation of images, in: K.H. Hohne (Ed.), 3D Imaging in Medicine, NATO ASI Series, Springer, Heidelberg, 1990, pp. 107-120.
[15] D. Androutsos, P.E. Trahanias, A.N. Venetsanopoulos, Application of active contours for photochromic tracer flow extraction, IEEE Trans. Med. Imaging 16 (1997) 284-293.
[16] K.P. Ngoi, J.C. Jia, A new colour image energy for active contours in natural scenes, Pattern Recognition Lett. 17 (1996) 1271-1277.
About the Author: NORM DAVISON was born in Hamilton, Canada in 1944. He received his Ph.D. in Nuclear Physics from the University of Alberta in 1969. Following two years as a Postdoctoral Fellow in Strasbourg, France, he joined the University of Manitoba in 1971 and performed research in nuclear physics until 1992, when he began to work in image processing, primarily at the NRC Institute for Biodiagnostics. His current interests include image processing, object recognition and classification.

About the Author: HADASS EVIATAR was born in Jerusalem, Israel, in 1962. She graduated from the University of Amsterdam in 1988, and received a Ph.D. in Physics from the University of Utrecht in 1994. She joined the Informatics Group of the Institute for Biodiagnostics in 1995, where she worked on the application of active contours to biomedical images. Since 1997, she has been involved in the development of software for field plotting, magnet shimming and motion correction, in the Magnetic Resonance Technology Group at the Institute for Biodiagnostics.

About the Author: RAY SOMORJAI was born in Budapest, Hungary in 1937. He received his Ph.D. in Theoretical Physics and Theoretical Chemistry from Princeton University in 1963. He was a NATO Postdoctoral Fellow in Cambridge, UK during 1963-65. He joined the National Research Council of Canada in 1965. Since 1992, he has been Head of the Informatics Group of the Institute for Biodiagnostics of NRC. His current research interests include supervised and unsupervised pattern recognition and image processing methods as they may apply to biomedical problems and processes.
Pattern Recognition 33 (2000) 1665-1674
Learning of view-invariant pattern recognizer with temporal context

K. Inoue, K. Urahama*

Faculty of Visual Communication Design, Kyushu Institute of Design, 4-9-1 Shiobaru, Minami-ku, Fukuoka-shi 815-8540, Japan

Received 12 November 1998; accepted 11 June 1999
Abstract

Radial basis function (RBF) neural networks have been used for view-invariant recognition of three-dimensional objects on the basis of a combination of some prototypical two-dimensional views of the objects. Supervised learning algorithms have been used for optimizing their parameters with the input of an image set of some objects from several views. We modify this structure slightly and develop an unsupervised learning algorithm with the input of a time sequence of continuously varying view images. A feedback path is added to RBF networks for utilizing class-membership outputs at a previous time as prior information at the current time. The network is trained unsupervisedly by a maximum likelihood scheme with the input of time-variant view images of some objects for developing view-invariant recognition capability. Robustification of the data distribution enables the network to reject outlier data and to extract an object expected from the recognition at previous times. 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Object recognition; RBF network; Membership feedback; Robust estimation
1. Introduction

An effective and biologically plausible method for view-invariant pattern recognition is an approach modeling three-dimensional objects by the combination of some prototypical two-dimensional views of the objects. Ullman and Basri [1] have presented a linear combination model, and the RBF (radial basis function) neural network developed by Poggio [2] is a more general nonlinear combination model. An RBF network is illustrated in Fig. 1. In this example, the number of objects is two, one of which is modeled by a combination of three prototypical view images r_1, r_2, r_3 and the other of which is modeled by two prototypical images r_4, r_5. The bottom plane in Fig. 1
* Corresponding author. Tel.: +81-92-553-4510; fax: +81-92-553-4593.
E-mail address: [email protected] (K. Urahama).
represents the space of input images d. Neurons denoted by "R" above it are RBF neurons whose output is a radial basis function e^{-alpha ||d - r_i||^2} of the input d. Neurons denoted by "+" above them calculate linear sums of the responses of the "R" neurons. The top neurons denoted by "W" are WTA (winner-take-all) neurons whose response q_j becomes one if the corresponding output of the "+" neuron is larger than the other one; otherwise q_j becomes zero. RBF networks reveal behaviors similar to psychological observations, and physiological observations have also been reported about view-centered neurons and object-centered ones; the former corresponds to "R" neurons and the latter corresponds to "+" or "W" neurons [3,4]. Physiological data on face neurons [3] are shown in Fig. 2, where the abscissa represents the direction of the faces illustrated below it and the ordinate is the spike frequency of the neurons. The left graph shows the response of view-centered neurons whose optimal input is the right profile of a face, and the right graph shows the response of an object-centered neuron.
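A minimal sketch of the forward pass of the network in Fig. 1 may help fix ideas; the width parameter `alpha` and the hard winner-take-all are assumptions consistent with the description above, not values from the paper.

```python
import numpy as np

def rbf_network(d, prototypes, groups, alpha=0.1):
    """Forward pass of the network in Fig. 1.

    `prototypes` is an (m, dim) array of prototype views r_i and `groups`
    maps each group j to the indices I_j of its prototypes, e.g.
    {0: [0, 1, 2], 1: [3, 4]} for the two objects in Fig. 1.
    Returns the index of the winning group."""
    act = np.exp(-alpha * np.sum((prototypes - d) ** 2, axis=1))  # "R" neurons
    sums = np.array([act[idx].sum() for idx in groups.values()])  # "+" neurons
    return int(np.argmax(sums))                                   # "W" (WTA)
```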
0031-3203/00/$20.00 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved. PII: S 0 0 3 1 - 3 2 0 3 ( 9 9 ) 0 0 1 4 6 - 6
Poggio et al. [2] have used a supervised learning algorithm for adjusting the connection weights from "R" neurons to "+" ones with the input of a labeled data set of images. The label denotes the class to which each input image belongs. The necessity of label information implies that the system is not self-contained and needs a supervisor, which is, however, hard to find in our brain. The neural networks in our brain are obliged to resort to extraction of some classification information by themselves in place of explicit supervision. An approach has been presented for getting classification information from the temporal context in data streams. Images are assumed to vary continuously in the time interval in which we view one object. Discontinuous change in
images can occur only when our view shifts to another object. View-invariant pattern recognizers can be devised by unsupervised learning utilizing this temporal information [5]. Such learning algorithms have been developed for modeling neural networks for view-invariant face recognition [6,7]. Learning based on temporal context has also been observed in physiological experiments [8]. An RBF network with lateral connections between "R" neurons whose connection weights are learned by temporal context has been proposed [9], and RBF net models with feedback of the outputs of "W" neurons to "R" neurons have also been presented [10]. The present paper addresses a feedback-based model and an unsupervised learning algorithm based on maximum likelihood estimation of the prototype images r_i of the "R" neurons. The points on which our model differs from Becker's model [10] are that q_j is directly fed back without gating nets, and that the attentive effect is attained by robustifying the distribution of the data.
2. RBF network

Poggio et al. [2] have derived RBF networks from function approximation based on the standard regularization scheme and they have applied it to pattern recognition tasks. Here we derive it directly from the Bayes' classification rule based on a mixture density distribution of data. Let the distribution of data d be represented by a mixture density

p(d) = sum_{i=1}^{m} p(i) p(d|i),    (1)

Fig. 1. RBF network.

Fig. 2. Responses of face neurons.
where p(i) is the prior probability of the ith cluster and p(d|i) is the component probability density of the ith cluster of data d. p(i) represents top-down information and p(d|i) is the bottom-up signal from the input data. Their multiplication in each component density results from an assumption of their mutual independence. If no top-down information exists, then p(i) becomes the uniform density p(i) = 1/m. Since the posterior probability p(i|d) of the ith cluster is p(i)p(d|i) / sum_{k=1}^{m} p(k)p(d|k), a datum d is classified into the cluster with the maximal posterior probability

arg max_i p(i) p(d|i).    (2)
We next gather some clusters into a group, which is a cluster of clusters; thus we get a hierarchical clustering with two stages. Let the m clusters be partitioned into n (<= m) groups and the set of clusters included in the jth group be denoted by I_j. The posterior probability of the jth group is then expressed by sum_{i in I_j} p(i)p(d|i) / sum_{k=1}^{m} p(k)p(d|k). A datum d is classified into the group with the largest posterior probability

arg max_j sum_{i in I_j} p(i) p(d|i).    (3)

Fig. 3. Temporal context network.
Fuzzi"cation of this `maxa operation into a `softmaxa leads to a fuzzy membership of d in jth group e@ GZ'H NGNBG q( j)" , L e@ GZ'I NGNBG I
(4)
where b is a positive constant controlling the fuzziness. For instance m"5, n"2 and I "+1, 2, 3,, I "+4, 5, with the density of every cluster being Gaus sian p(d"i)Je\?,B\PG , ; then q in Eq. (4) can be calculated H by the RBF net in Fig. 1. In Poggio's model [2], connection weights, which correspond to probability of clusters in each group, from `Ra neurons to `#a ones are variable and learnt under supervisedly. In our model these weights are "xed to an equal value. Temporal information has been dropped in these RBF net models by Poggio's group. Test datum is classi"ed only by its current value irrespective of the history of presentation of data. Class of datum is estimated by using current input d and current top}down information p(i). Here we assume that the classi"cation is in#uenced by data history, particularly the classi"cation result just before the current image becomes top}down information at the current time, i.e. the prior probability p(i) at the time t is the membership qR\( j) at the previous time t!1. In a more precise expression, for all clusters in jth group i3I , we set pR(i)"qR\( j). Then Eq. (4) H
e@ GZ'H OR\HNBRG qR( j)" , (5) L e@ GZ'I OR\INBRG I where dR is input datum at time t. Classi"cation result at time t is given by qR( j). This classi"cation scheme is illustrated in Fig. 3 where feedback paths are added to the network in Fig. 1. Rectangles on feedback paths represent one time delay and circles marked with `;a are neurons outputting product of their two inputs. In this model, classi"cation result at each time exerts in#uence on classi"cation at the next time. At the initial time t"0, there is no prior information, hence p(i) is set uniform, i.e. every cluster is equiprobable.
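A minimal sketch of the membership update of Eq. (5): beta = 2 follows the value used later in the text, while alpha is an assumed cluster width.

```python
import numpy as np

def membership_update(d_t, prototypes, groups, q_prev, alpha=0.1, beta=2.0):
    """One step of Eq. (5): the previous memberships q^{t-1}(j) act as the
    prior p(i) for every cluster i in group j."""
    p_d_i = np.exp(-alpha * np.sum((prototypes - d_t) ** 2, axis=1))
    scores = np.array([sum(q_prev[j] * p_d_i[i] for i in idx)
                       for j, idx in groups.items()])
    e = np.exp(beta * scores)
    return e / e.sum()          # q^t(j) for every group j

# Usage: start from uniform memberships (no prior information at t = 0)
# and feed each frame of the image sequence through membership_update.
```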
3. Learning algorithm

Similar to the learning in hidden Markov models, the prototype points r = [r_1, ..., r_m] of the clusters are estimated from the time sequence of data d^t (t = 0, 1, 2, ...) by using the maximum likelihood formula

max_r sum_t ln p(d^t).    (6)

r_i is adjusted according to the steepest ascent rule

r_i^{t+1} = r_i^t + eta * d(ln p(d^t))/d(r_i),    (7)

where eta is a small positive constant. Since p(d^t) = sum_{j=1}^{n} q^{t-1}(j) sum_{i in I_j} e^{-alpha ||d^t - r_i||^2}, Eq. (7) is explicitly
written as

r_i^{t+1} = r_i^t + eta * [q^{t-1}(j(i)) e^{-alpha ||d^t - r_i^t||^2} / p(d^t)] (d^t - r_i^t),    (8)
where j(i) denotes the group including cluster i, i.e. the j such that i in I_j.

Let us consider some parameters in this learning scheme. Firstly, eta should be decreased gradually according to stochastic approximation rules; however, it is fixed to 0.01 here for simplicity. Next, beta should be set to an appropriately medium value for the fuzziness of the WTA in Eq. (5). If beta is too large, q_j approaches zero or one and the context influence becomes too strong to switch the classification at the borders between the sequences of each pattern. On the contrary, if beta is too small, every q_j approaches 1/n and the context effect disappears. We set beta = 2 in the experiments of this paper. Finally, as for the parameter alpha, which is the inverse of the variance of the Gaussian density of the clusters, we vary alpha according to the deterministic annealing technique in order to avoid the steepest ascent process being trapped in a local optimum. Initially, alpha is set to a sufficiently small value and gradually increased to a large value. When alpha is small initially, all prototypes centralize to the mean of the data, and as alpha increases they depart from it; thus we can obtain a final arrangement of prototypes independent of the initial states. This annealing process is analyzed in the appendix. A nearly optimal set of prototypes r is obtained by this annealing, which finishes at some large value of alpha. The value of alpha used in the classification of test data after the learning is set to the inverse of the variance of the obtained clusters. Since the initial value of the prototypes is arbitrary, we set it uniformly at random.

Since no information is given about the labels of the data, the entire framework of this learning is unsupervised, but at each time point the learning can be regarded as a supervised one with the classification result at the previous time point as a teaching signal for the current time. This learning process results in a hierarchical clustering where the first stage is usual clustering based on the similarity of the feature values d, while at the second stage the set of clusters of the first stage is partitioned into groups on the basis of their temporal proximity. Successively presented clusters are gathered into a group. When the input data d are two-dimensional images of various views of some three-dimensional objects, images of one object with various views are displayed throughout a time interval and then images of another object are presented in the next time interval (in each time interval, the view is not necessarily changed continuously). Iteration of these display sequences leads to the above-mentioned two-stage grouping of the entire data. In the first stage, images of one object with a particular view, e.g. left profile faces of one person, compose one cluster, and at the second stage the clusters of one object constitute one group. Thus
each group contains images of various views of one object; hence the memberships of data in these groups become view invariant.
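The steepest-ascent update of Eq. (8), together with the deterministic annealing loop described above, can be sketched as follows; eta = 0.01 follows the text, while the annealing schedule itself is an assumed choice.

```python
import numpy as np

def learning_step(d_t, prototypes, groups, q_prev, alpha, eta=0.01):
    """One steepest-ascent update of the prototypes, Eq. (8)."""
    act = np.exp(-alpha * np.sum((prototypes - d_t) ** 2, axis=1))
    # q^{t-1}(j(i)) for every cluster i, where j(i) is i's group
    group_of = {i: j for j, idx in groups.items() for i in idx}
    prior = np.array([q_prev[group_of[i]] for i in range(len(prototypes))])
    p_dt = np.sum(prior * act)                  # p(d^t)
    gain = eta * (prior * act / p_dt)[:, None]
    return prototypes + gain * (d_t - prototypes)

# Deterministic annealing: sweep the whole image sequence several times,
# increasing alpha from a small value so that the prototypes first collapse
# to the data mean and then spread out (see the appendix), e.g.
# for alpha in np.geomspace(1e-3, 1.0, 20):
#     for d_t in sequence: prototypes = learning_step(...)
```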
4. Experiments

A typical and simple example which reveals contextual effects is shown first in Fig. 4, where the dimension of the data d is two. The upper left data and upper right ones are presented 50 times alternately, then the lower left and lower right data are input 50 times, and this cycle is iterated. We examined the case m = n = 2, i.e. the data are partitioned into two groups and each group is composed of only one cluster. If we omit the temporal information and perform clustering based only on the data values, the prototype points of the clusters are arranged as shown by the right and left black rectangles; hence the data are partitioned horizontally. If the temporal context is incorporated, the prototypes are arranged as the upper and lower rhombuses; hence the data are partitioned vertically. This example reveals that the learning process cannot be divided into two steps in which the prototypes of the clusters are first arranged by clustering without temporal context and these clusters are then grouped on the basis of temporal context.

The next example, shown in Fig. 5, is a case of more entangled grouping of clusters. The data are partitioned into two groups, each of which is composed of four clusters. The data denoted by white rectangles are presented 200 times randomly, next the white rhombus data are presented 200 times, and this cycle is iterated. The obtained prototype points of the clusters are shown by black rectangles and black rhombuses corresponding to their groups. Thus, it was verified that the prototypes are arranged correctly by the present learning algorithm.
Fig. 4. Example of two-dimensional data.
We next examined translation-invariant image recognition by using a simple example of gray scale images shown in Fig. 6, where the upper five 11x11 images constitute one group and the lower five images compose another group. Images of each group are presented 150 times alternately. Each group has three prototype images, which start from the random ones shown in Fig. 7 and finally converge to the images in Fig. 8 after learning. It is obvious that these six prototype images enable the system to recognize whether a displayed bar is horizontal or vertical, irrespective of its position in the image.

Similarly, prototype images starting from Fig. 7 converge to those shown in Fig. 10 after learning using the face image data of three persons shown in Fig. 9. The network after learning can recognize persons invariantly across face views. These face data are similar to the data in Fig. 4, because the dissimilarity between faces of different views of one person is larger than the distance between faces of two persons with the same view. Therefore, view-invariant recognition cannot be learnt without temporal context in-
formation. The values of e^{-alpha ||d - r||^2}, where r is the upper center image in Fig. 10, and of q(j) are plotted in Fig. 11, where the abscissa represents the view direction as in Fig. 9. In the left figure of Fig. 11, the broken, solid and dotted lines correspond to the upper, middle and lower five
Fig. 7. Initial values of prototype images.
Fig. 5. Second example of two-dimensional data.
Fig. 8. Converged values of prototype images.
Fig. 6. Example of image data.
Fig. 9. Example of face images.
Fig. 11. Responses of R neurons and W neurons.
Fig. 10. Prototype images.
images of Fig. 9. The broken, solid and dotted lines in the right figure denote q(1), q(2) and q(3), respectively. Fig. 11 is similar to the physiological data in Fig. 2.

5. Robustification

The probability density of the clusters has been assumed Gaussian until here. We extend the density function to a more robust one for treating data including outliers, and endow the system with a capability similar to attentive vision. We first examine the influence of outlier data on learning with the Gaussian distribution by using the image data shown in Fig. 12, where the rightmost two images are made from the second images in their row by the addition of diagonal noises, which are outlier pixels. The second images are replaced by the rightmost ones in every third presentation of this image set. This learning process resulted in the prototype images shown in Fig. 13, where the outliers remain. This is attributed to the Gaussian distribution, because the prototype images become the mean of the images included in the cluster. The Gaussian density is written as

p(d|i) proportional to e^{-alpha ||d - r_i||^2} = prod_k e^{-alpha (d_k - r_{ik})^2},    (9)

where d_k and r_{ik} are the kth pixel of the datum image d and of the prototype image r_i, respectively. This expression reveals that the classification of the entire image is the result of an "AND" operation over the classification of every pixel, i.e. if and only if d_k is close to r_{ik} at every pixel, p(d|i) is close to 1, but otherwise p(d|i) is close to 0. For instance, in Fig. 12, the left six images become prototypes if learning is done with these six images; however, these prototype images cannot classify the rightmost images, because the memberships of these images in each group become 0.5: the rightmost images are far from the left six in the Euclidean distance, and the Gaussian density becomes almost zero. This is the result of the "AND" operation, because the diagonal pixels differ although all the remaining pixels in the rightmost images coincide with those in the second ones. To alleviate this deficiency of the Gaussian density, we modify it into a more robust form

p(d|i) proportional to prod_k e^{-a(1 - e^{-alpha (d_k - r_{ik})^2})}.    (10)

In contrast to the Gaussian density, which becomes zero when |d_k - r_{ik}| increases infinitely, the function in Eq. (10) approaches a positive value e^{-a}. This property shows that this density is nearly equal to the sum of the Gaussian density and a uniform distribution which represents outlier data. Let the number of outlier pixels with large |d_k - r_{ik}| be M; then the value of Eq. (10) becomes e^{-Ma}. By this value we can count the number of outlier pixels (conversely, the number of inlier pixels), and the classification becomes a majority vote of pixels, as an alternative to the unanimous vote implied by the Gaussian density. In the deterministic annealing, a is fixed and alpha in Eq. (10) is gradually increased.

6. Experiments

The learning based on Eq. (10) was first examined by using the image data in Fig. 12. The obtained prototypes are the images shown in Fig. 13 with the diagonal lines completely removed from the two central figures. All data images in Fig. 12, including the two rightmost images, can be correctly classified by these prototypes. The parameters were set as alpha = 0.1, beta = 2, a = 0.1.

Fig. 12. Example of images with outlier pixels.

Fig. 13. Prototype images.

The image shown in Fig. 14 was tested by using these prototypes. Two cases were compared: one is the case with no prior information, i.e. q^{t-1}(1) = q^{t-1}(2) = 0.5, and Fig. 14 is displayed at time t; the other is the case with prior information, i.e. Fig. 14 is displayed after the presentation of one of the images in Fig. 12. If the Gaussian distribution of Eq. (9) is used, the outputs become q^t(1) = 0.5, q^t(2) = 0.5 for both of these cases, i.e. we cannot decide the group to which the image of Fig. 14 belongs, irrespective of the prior information. On the other hand, if we adopt Eq. (10), the classification result is as follows:

(1) In the case of no prior information, i.e. q^{t-1}(1) = q^{t-1}(2) = 0.5, we get q^t(1) = 0.55, q^t(2) = 0.45; hence Fig. 14 is classified into the group of horizontal lines, although the decision is weak.
(2) In the case with prior information:
(2-1) If Fig. 14 is displayed after the upper-left image in Fig. 12, which belongs to the first class, i.e. q^{t-1}(1) = 0.9, q^{t-1}(2) = 0.1, then we get q^t(1) = 0.73, q^t(2) = 0.27; hence Fig. 14 is judged to belong to the class of horizontal lines. Conversely,
(2-2) If Fig. 14 is displayed after the lower-left image in Fig. 12, which belongs to the second class, i.e. q^{t-1}(1) = 0.1, q^{t-1}(2) = 0.9, then we get q^t(1) = 0.29, q^t(2) = 0.71; hence Fig. 14 is judged to belong to the class of vertical lines.

Fig. 14. Test image.
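The contrast between the Gaussian density of Eq. (9) and the robust density of Eq. (10) is easy to see in code; alpha = 0.1 and a = 0.1 follow the parameter values quoted above.

```python
import numpy as np

def gaussian_log_density(d, r, alpha=0.1):
    """Log of Eq. (9): an 'AND' over pixels -- one bad pixel kills it."""
    return -alpha * np.sum((d - r) ** 2)

def robust_log_density(d, r, alpha=0.1, a=0.1):
    """Log of Eq. (10): each pixel's penalty saturates at `a`, so outlier
    pixels only subtract a bounded amount and classification becomes a
    majority vote of pixels."""
    return -a * np.sum(1.0 - np.exp(-alpha * (d - r) ** 2))

# With M outlier pixels the robust density is about exp(-M * a) times its
# clean-image value, instead of collapsing to (nearly) zero.
```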
Thus, when an image including image parts of both groups, such as Fig. 14, is presented, the presented image is classified into the group to which the previous image belongs. This is an effect similar to attention by temporal prediction. Such an effect was examined by using another example of bottles and toy cars, shown in Fig. 15. Three prototypes were learnt for each group, i.e. every image in Fig. 15 became a prototype. An image shown in Fig. 16 was displayed after the presentation of the lower-right image in Fig. 15; Fig. 16 was then classified into the group of toy cars. Thus, we can classify an object correctly even if the view direction of the object has varied from the previous time and the object is partly occluded by other objects. Robustness against occlusion is attributed to the fact that the classification is performed by a majority vote of pixels.

Fig. 15. Images of three-dimensional objects.

Fig. 16. Test image.

As shown by these examples, although the present classifier deals with an entire image as a datum and extraction of objects is not explicitly done, the system is capable of attending to an object in an image containing multiple objects based on the prior information, and the attended object is classified. In such a classification, the attended region in an image can be specified as follows. At first a common image corresponding to the background is specified for each group by calculating for every pixel the value which the pixel takes when it is included in the background. The value of a pixel varies from image to image. A value common to many images of a group is the value which the pixel takes when it is included in the background of the group. This common value is the most probable one among those the pixel takes; hence it is the mode of the probability density of the value d_k of each pixel in the jth group, which is expressed by a mixture density

p(d_k) proportional to sum_{i in I_j} e^{-a(1 - e^{-alpha (d_k - r_{ik})^2})},    (11)

whose mode is given by the maximum likelihood estimation

arg max_{d_k} p(d_k),    (12)

which is the value this pixel takes most frequently and is the value of the pixel in the background. For instance, the background of the lower three images in Fig. 15 is shown in Fig. 17. Notice that background means, here, an image appearing common to all images in a group, and hence brings no discriminative information useful for classification of each image. This definition differs from the usual meaning of the background, which does not include objects; however, we use the terminology background here for brevity.

Fig. 17. Background of images in the lower row of Fig. 15.

The attended region can be specified by using this background image. At first the group to which an input image belongs can be specified by q(j). Next, the cluster to which the input image belongs can be specified by calculating the value of Eq. (10) for every prototype of the clusters in the group. The image is classified into the cluster with the maximal value of Eq. (10). The input image is then processed on the basis of this membership information. Firstly, the common region between the input image and the background of the group is specified and this common region is removed from the input image. Next, the pixelwise membership e^{-a(1 - e^{-alpha (d_k - r_{ik})^2})} of the input image in each cluster of the group is calculated for the remaining pixels, and a region where this membership takes a value close to one is the portion important for classification of the input image; this is the attended region. The attended region thus obtained is a residue after removal of the background and other objects. For example, for the image in Fig. 14, the attended region after the presentation of the lower-left image in Fig. 12 is the upper vertical line in Fig. 14. For another example, Fig. 16, the attended region after the presentation of the lower-right image in Fig. 15 is the white region in Fig. 18, i.e. the visible portion of the toy car.

Fig. 18. Attended region in Fig. 16.

Finally, let us consider the role of the robust estimation in the temporal prediction based on top-down information. In all of the above models, the magnitude of the input d has been assumed fixed to the value one. Input signals in neural networks usually have a strength s accompanying their feature value d; s = 0 means no input signal. In the case of the Gaussian distribution, introduction of this strength s leads to p(d|i) proportional to s e^{-alpha ||d - r_i||^2}, hence p(i)p(d|i) proportional to s p(i) e^{-alpha ||d - r_i||^2}. Therefore, if no signal is input, i.e. s = 0, no response is output. This shows that the top-down information p(i) cannot produce any response by itself; it can merely modify a response produced by an input d. Let us examine the case where p(d|i) is a robust distribution such as Eq. (10). When ||d - r_i|| increases, the value of p(d|i) in Eq. (10) approaches e^{-Na} (N is the total number of pixels), which is independent of d. Hence, if we write p(d|i) proportional to e^{-Na} + f(||d - r_i||), then f(||d - r_i||) is a function similar to the Gaussian distribution. Since the first term of this expression is independent of the input d, the introduction of the input magnitude gives p(d|i) proportional to e^{-Na} + s f(||d - r_i||), which leads to p(i)p(d|i) proportional to p(i)[e^{-Na} + s f(||d - r_i||)]. This equation reveals that the top-down information p(i) modifies the response s f(||d - r_i||) produced by the input d, and can produce the response p(i)e^{-Na} by itself without input (s = 0). e^{-Na} is a spontaneous response of the "R" neurons. This is a distinctive property of the "R" neurons with the robust distribution compared to the Gaussian distribution, which has no spontaneous response. The spontaneous response in the robust distribution enables the system to produce a response from the top-down signal alone. Visual perception produced by top-down intention is called mental imagery, and active response of V1 neurons has been observed accompanying mental images [11]. Another observation reported that neurons in the anterior temporal cortex responded to the presentation of an image at the previous time with no signal input at the current time [8]. These physiological experiments support the robust distribution model over the Gaussian distribution.

7. Conclusion
An RBF neural network model with feedback paths has been presented and its learning ability has been exemplified for view-invariant pattern recognition. By using robust distributions, the system can reject outlier pixels in images. This rejection is similar to the visual attention process, which is performed in our system without explicit segmentation of objects in images. This attention process is, however, very primitive, and more elaborate functions accompanying object segmentation and motion tracking would improve the performance of the model. Such an improved model would approach the actual attention behavior observed in our visual perception. This advancement needs extraction of structured features from images instead of using raw images as in the present model. Another improvement desired for the model is the incorporation of topology into the neurons. The present model possesses no topology, i.e. the order of the neurons is free. In physiological observations, however, an active point in the cortex moves continuously with variation in the view of an input image. Thus adjacent neurons respond to similar views of objects. Incorporation of lateral connections in addition to feedback paths would yield such a topology preservation property in the model. These extensions of the model are under study.
Appendix A

For simplicity, we remove the summation over t and the logarithm from Eq. (6), which then becomes

max_r sum_{i=1}^{m} p(i) e^{-alpha ||d - r_i||^2}.    (A.1)

This equation is equivalent to

min_{x,r} sum_{i=1}^{m} p(i) [x_i ||d - r_i||^2 + (1/alpha) x_i (ln x_i - 1)].    (A.2)

Note here the change from "max" in Eq. (A.1) to "min" in Eq. (A.2). This equivalence can easily be shown by differentiating Eq. (A.2) with respect to x_i and equating the derivative to zero, which gives x_i = e^{-alpha ||d - r_i||^2}; substitution of this expression into Eq. (A.2) yields Eq. (A.1). Eq. (A.2) is the objective function for fuzzy clustering, and it means that the quantization error ||d - r_i||^2 is minimized for a datum d with large membership x_i, i.e. belonging to r_i. The second term x_i (ln x_i - 1) is an entropy for fuzzification. Thus, the learning in the present model is equivalent to fuzzy clustering, whose annealing process has been proved to have the following property [12]:

Theorem. If alpha is increased monotonically, then the total quantization error sum_t sum_{i=1}^{m} p(i) p(d^t|i) decreases monotonically.
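As a quick numerical check of the equivalence between (A.1) and (A.2), one can minimize the bracketed expression for a single cluster and compare with the closed-form membership; the test values of alpha and the error E below are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# The minimizer of x*E + (1/alpha)*x*(ln x - 1) should be x = exp(-alpha*E).
alpha, E = 0.7, 1.3
obj = lambda x: x * E + (1.0 / alpha) * x * (np.log(x) - 1.0)
res = minimize_scalar(obj, bounds=(1e-9, 1.0), method="bounded")
assert abs(res.x - np.exp(-alpha * E)) < 1e-4
```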
This theorem ensures a high likelihood of global convergence of r_i to the centroid of the ith cluster.

References

[1] S. Ullman, R. Basri, Recognition by linear combination of models, IEEE Trans. Pattern Anal. Mach. Intell. 13 (10) (1991) 992–1006.
[2] T. Poggio, S. Edelman, A network that learns to recognize three dimensional objects, Nature 343 (1990) 263–266.
[3] D.I. Perret, M.W. Oram, The neurophysiology of shape processing, Vision Image Comput. 11 (6) (1993) 317–333.
[4] N.K. Logothetis, J. Pauls, H.H. Bulthoff, T. Poggio, View-dependent object recognition by monkeys, Curr. Biol. 4 (1994) 401–414.
[5] P. Foldiak, Learning invariance from transformation sequences, Neural Comput. 3 (1991) 194–200.
[6] E. Rolls, Learning mechanisms in the temporal lobe visual cortex, Behav. Brain Res. 66 (1995) 177–185.
[7] M. Bartlett, M. Stewart, T. Sejnowski, Unsupervised learning of invariant representations of faces through temporal association, in: J.M. Bower (Ed.), Computational Neuroscience: International Review of Neurobiology Suppl., Academic Press, San Diego, 1996, pp. 317–322.
[8] K. Sakai, Y. Miyashita, Neural organization for the long-term memory of paired associations, Nature 354 (1991) 152–155.
[9] D. Weinshall, S. Edelman, A self-organizing multiple view representation of 3D objects, Biol. Cybernet. 64 (3) (1991) 209–219.
[10] S. Becker, Learning temporally persistent hierarchical representations, Adv. Neural Inform. Process. 9 (1996) 824–830.
[11] S.M. Kosslyn, N.M. Alpert, W.L. Thompson, V. Maljkovic, S.B. Weise, C.F. Chabris, S.E. Hamilton, S.L. Rauch, F.S. Buonanno, Visual mental imagery activates topographically organized visual cortex: PET investigations, J. Cognitive Neurosci. 5 (1993) 263–287.
[12] K. Urahama, Mathematical programming formulations for neural combinatorial optimization algorithms, J. Artif. Neural Networks 2 (4) (1995) 353–364.
About the Author: K. INOUE received the B.S. and M.S. degrees from the Kyushu Institute of Design, Japan, in 1996 and 1998, respectively. He is now a graduate student in the Department of Visual Communication Design, Kyushu Institute of Design. His research interests include neural networks, image processing and computer vision.

About the Author: K. URAHAMA received his B.S., M.S. and Ph.D. in Electronic Engineering from Kyushu University, Japan, in 1976, 1978 and 1981, respectively. From 1981 to 1989 he was with Kyushu University, and from 1989 to 1995 he was an associate professor of electronic engineering at the Kyushu Institute of Technology. He is now a professor of Visual Communication Design at the Kyushu Institute of Design. His research interests include image analysis, pattern recognition and neural networks.
Pattern Recognition 33 (2000) 1675–1681
Recognition of a solid shape from its single perspective image obtained by a calibrated camera Zygmunt Pizlo*, Kirk Loubier Department of Psychological Sciences, Purdue University, West Lafayette, IN 47907-1364, USA Received 24 June 1998; received in revised form 2 March 1999; accepted 7 July 1999
Abstract

A new model-based method for object shape recognition is presented. This method consists in solving a set of linear equations and thus is computationally simple. We show how it leads to a hierarchy of geometrical representations of an object, involving similarity, affine and projective properties. Results of our tests with synthetic images illustrate that this method works reliably in the presence of large amounts of image noise. Finally, we show that the method allows for speeding up the memory search for the correct model, which makes it more efficient than conventional model-based methods. 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Three-dimensional shape; Model-based recognition; Calibrated camera
1. Introduction

One of the fundamental problems in visual perception is that of recognizing the shape of an object given its single retinal (or camera) image. Shape is defined here conventionally as that property of an object which is invariant under rigid motions and size scaling. Thus, shape is an invariant of a similarity transformation, which preserves angles and ratios of distances. To recognize a shape, one has to determine whether a given retinal image could have been produced by a 3-D object (model) with a known shape. Since a single image does not allow unique reconstruction of the shape, one has to perform matching between the shape of a model and the given retinal image. Note that the size of the model is irrelevant in recognition from a single image: if a given retinal image was produced by a given object, it could also have been produced by any of the infinite number of objects having the same shape and differing only in size, with larger objects being proportionally farther away.
* Corresponding author. Tel.: +1-765-494-6930; fax: +1-765-496-1264. E-mail address: [email protected] (Z. Pizlo).
We consider here objects that have distinctive points, and we assume that the correspondence between the object and the image points is known. We further assume that the image was taken by a calibrated camera, that is, a camera whose intrinsic parameters, such as the principal point and focal distance, are known. This case is relevant to both human vision and machine vision, but it has received less attention than the uncalibrated camera [1–4]. An uncalibrated camera is mathematically much more tractable, since its geometry can be represented in the form of linear equations involving homogeneous coordinates [5]. In such a case, elimination of the transformation parameters is equivalent to solving a set of linear equations. For a calibrated camera, elimination of the transformation parameters is much more difficult, because it leads to polynomials of high degree for which analytical solutions are not known [6]. In this paper we describe a method for recognizing the shape of an object from a single image obtained by a calibrated camera. We present this method in the more general framework of recognizing an object up to an arbitrary projective, affine or similarity transformation. Our method is computationally simple: it involves solving a set of linear equations. The main element of the method is matching the affine structure of the model to a perspective image first, and then verifying the
non-linear constraints representing the similarity structure. We present results of testing this method using synthetic images with zero-mean Gaussian noise added to each image point. We also present a real example.
2. Presentation of the new method
m"PHKM,
2.1. Projective vs. similarity structure In this section we show how homogeneous coordinates can be used for shape recognition with a calibrated camera. Let M be the model object and m the perspective image obtained by a camera. The question is whether m could have been produced by M. Let the homogeneous coordinates of a point from the model form a vector M(X, >, Z, ¹), and the homogeneous coordinates of the corresponding image point form a vector m(;, <, S). If m is a camera image of M, the coordinates of these points must satisfy the following equation [7]: m"PM,
M (i.e. its similarity structure) has been recognized. Otherwise, one can conclude that the shape of the object in front of the camera is di!erent from the shape of the model M. Speci"cally, the shape in front of the camera could be a projective transformation of the model shape M. In other words, there exists a 3-D projective transformation K such that
(1)
where P represents the perspective projection of M " to m. We assume "rst that the camera is not calibrated. Given six (or more) non-coplanar points representing M and the corresponding image points, one can use Eq. (1) to estimate the 11 independent parameters of P [7]. This estimation can be performed by minimizing a norm ""Aq"", where A is a 2N;12 matrix whose elements depend on the 3-D and 2-D coordinates of N points, and q is the 12;1 vector with elements of P. If the minimum of this norm is much greater than zero, then there is no P which satis"es Eq. (1) (even approximately). This, in turn, means that m could not have been produced by M. In fact, m could not have been produced by any 3-D projective transformation of M. This can be shown as follows: Let K represent a 3-D projective " transformation. Assume that there exists a camera represented by P so that m"PKM. But this implies that P"PK satis"es Eq. (1), which contradicts our assumption that there is no 3-D to 2-D perspective transformation satisfying Eq. (1). Thus, we conclude that the object in front of the camera, which produced image m, is a non-projective transformation of the model M. If the minimum of the norm ""Aq"" is close to zero, then the image m is a perspective transformation of the model M. In other words, there is a camera (not necessarily the camera that was used to obtain m), which could obtain this image when presented with the model M. From the 11 parameters of matrix P, one can compute all intrinsic and extrinsic parameters of such a camera. If these intrinsic parameters are (approximately) identical to the intrinsic parameters of the camera that actually obtained the image, one can conclude that the shape of the model
(2)
where PH involves intrinsic camera parameters identical to the intrinsic parameters of the given camera. The fact that such K exists is obvious because we can always choose K as a solution of the following equation: PHK"P.
(3)
Note that Eq. (3) represents 12 linear equations with 16 unknowns (elements of K) (more exactly, there are 11 independent equations with 15 unknowns). Clearly, Eq. (3) has an in"nite number of solutions K. Thus, if there is P which satis"es Eq. (1), but which does not represent the given camera, we can conclude that the object in front of the camera is projectively di!erent from the model M. However, we cannot determine this projective transformation uniquely. To summarize, we have shown that Eq. (1) can be used in the case of a calibrated camera. Speci"cally, it allows, by solving a set of linear equations, to decide whether the given image: (i) could have been produced by the model (in such a case, one concludes that the shape of the model has been recognized); (ii) could have been produced by an object, which is a projective transformation of the model; (iii) was produced by an object, which is a non-projective transformation of the model. We want to point out that Eq. (1) does not seem to have been used in this way in the past. One possible reason is that a linear method of estimating P leads to
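For concreteness, a minimal numerical sketch of this linear test is given below. It is our illustration, not the authors' code: the direct-linear-transformation row layout follows the standard formulation in Ref. [7], and the synthetic test data and variable names are ours. The minimum of \|Aq\| over unit vectors q is the smallest singular value of A:

```python
import numpy as np

def min_norm_Aq(M, m):
    """M: (N,3) model points, m: (N,2) image points.
    Builds the 2N x 12 DLT matrix A and returns min ||Aq|| over ||q|| = 1,
    together with the estimated projection matrix P (3x4)."""
    rows = []
    for (X, Y, Z), (u, v) in zip(M, m):
        P4 = [X, Y, Z, 1.0]
        rows.append(P4 + [0.0] * 4 + [-u * c for c in P4])
        rows.append([0.0] * 4 + P4 + [-v * c for c in P4])
    A = np.asarray(rows)                      # 2N x 12
    _, s, Vt = np.linalg.svd(A)
    q = Vt[-1]                                # singular vector of smallest singular value
    return s[-1], q.reshape(3, 4)

# toy test: a synthetic camera and six generic (non-coplanar) points
rng = np.random.default_rng(1)
P_true = rng.normal(size=(3, 4))
M = rng.normal(size=(6, 3)) + np.array([0.0, 0.0, 5.0])
h = (P_true @ np.c_[M, np.ones(6)].T).T
m = h[:, :2] / h[:, 2:]
residual, P_est = min_norm_Aq(M, m)
print(residual)   # ~0: image m is consistent with model M
```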
We want to point out that the given camera image, like any 2-D image of a 3-D scene, could also be obtained from any of an infinite number of non-projective transformations of the model. Such non-projective transformations are obtained by moving any point of the model to an arbitrary position on its projecting ray. Such a transformation of the model may not even be a topological transformation. This implies that even though Eq. (1) does represent a 3-D to 2-D perspective transformation, this transformation is more general than a projective transformation. This fact further implies that a projectivity is not an adequate model of camera image formation [3,4].
results that are quite sensitive to noise [7]. It is possible, however, that for a given set of models such a linear method is good enough; that is, it may allow reliable discrimination among the models despite a poor reconstruction of the camera parameters. Next, we describe a method for recognizing the affine and similarity structure of an object using a calibrated camera.

2.2. Affine vs. similarity structure

In this section, we maintain our notation for the model object and its image as M and m, respectively. Note, however, that the object and image points are expressed in this section by Cartesian, not homogeneous, coordinates. We assume that the origin O(0, 0, 0) of the camera coordinate system coincides with the center of projection of the camera, and that the image plane is at the distance c from O along the line of sight (i.e. m_{i3} = c for all i). Without restricting the generality, we set c to -1. Assume that the object M is in front of the camera at some unknown orientation and distance. This fact is represented by the following equation:

M'_i = R M_i + T,   (4)

where M_i is the ith point of the object in a known position and orientation in the camera coordinate system, and M'_i is the ith point of the object after applying the rotation R and translation T to M_i. The image m'_i of M'_i can be calculated from the rules of perspective projection:

m'_{i1} = -M'_{i1}/M'_{i3},  m'_{i2} = -M'_{i2}/M'_{i3}.   (5)
The recognition problem is equivalent to verifying whether there is a rotation R and translation T such that

m'_{i1} = m_{i1},  m'_{i2} = m_{i2}   (6)

for i = 1, ..., n. Since rigid motion in 3-D has six degrees of freedom, shape recognition involves six free parameters. First, we show that the vector T can, in principle, be eliminated from the shape recognition problem. We rewrite Eq. (4) as

M'_i = R(M_i - M_1) + M'_1.   (7)
Eq. (7) simply says that the rigid motion in 3-D can be performed by first rotating the object around M_1 and then translating it so that M_1 coincides with M'_1. Next, assume that conditions (6) are satisfied for m'_1. This assumption does not restrict the generality, because it merely says that the point M_1 can be translated in 3-D so that its perspective image m'_1 coincides with the image point
m_1, which is always true. From Eqs. (5) and (7) we obtain

M'_i = R(M_i - M_1) + [m_{11} M'_{13}, m_{12} M'_{13}, M'_{13}]^T.   (8)

Note that m'_{i1} and m'_{i2} in Eq. (5) are not affected by a multiplicative scaling of all three coordinates M'_{ij} (j = 1, 2, 3) by the same factor. This means that the distance of the object M' from the observer (represented in Eq. (8) by M'_{13}) is not important, as long as changes in distance are accompanied by proportional changes in the object's size. Since we are interested here in recognizing the shape of an object, but not its size, and since the actual distance is usually unknown, we can set the distance of the object from the observer to any (positive) number. Let "+1" be this assumed distance. To incorporate this assumption, we divide both sides of Eq. (8) by M'_{13}:

M''_i = R'(M_i - M_1) + [m_{11}, m_{12}, 1]^T,   (9)

where M''_i = (1/M'_{13}) M'_i and R' = (1/M'_{13}) R. R' in Eq. (9) represents a rotation followed by size scaling. Note that if M''_i is substituted for M'_i in Eq. (5), these equations will still be satisfied. This means that both M' and M'' give rise to the same retinal image (as stated in the previous paragraph).

The positions of the image points m_i are never known exactly because of noise. This means that Eqs. (6) are never satisfied exactly and, therefore, there is no R' satisfying Eq. (9). Thus, one should apply the least-squares method. That is, one finds the \hat{R}' that minimizes

Q = \sum_i [(m_{i1} - m'_{i1})^2 + (m_{i2} - m'_{i2})^2],   (10)

where the sum is computed over all points. In Eq. (10), m_i is given and m'_i is obtained from Eq. (5) after substituting M''_i from Eq. (9) for M'_i. If there is an \hat{R}' for which Q is sufficiently small, one can conclude that the shape of the model M has been recognized. The difficulty with this approach is that the unknown elements of R' appear both in the numerator and the denominator in Eq. (5) and, as a result, Q is not a linear function of R'. Hence, the
Note that we assume here that m_1 is known exactly. If there is a large error in m_1, our method will work suboptimally, because this error will propagate to all remaining points. In the case of human vision, this assumption is quite reasonable: positional acuity is best at the fixation point (the standard deviation of eye fixation is 2-3 minutes of arc [8]), so fixating the eye at one of the object's points leads to very small errors for the image of this point. However, if this error is not small or is not known, one can keep T in the method. The only implication is that the least-squares solution will involve 11, rather than 9, parameters.
minimization problem can be solved only by iterative methods. We show next that the minimization problem simplifies if recognition involves affine, rather than similarity, structure.

From the rules of image formation it is clear that a point M'_i is on the line connecting the points O and m'_i. This is equivalent to the point M'_i being on all planes defined by the triplets of points O, m'_i and m'_j for any j. Since we do not know the points m'_i and m'_j, but we do know the points m_i and m_j (although not exactly, if noise is present), we expect that a point M''_i is close to all planes p_{ij} defined by the triplets of points O, m_i and m_j for any j. Let the plane p_{ij} be represented by the following equation:

A_{ij} x + B_{ij} y + C_{ij} z + D_{ij} = 0.   (11)

The distance d of a point P_1(x_1, y_1, z_1) from the plane (11) can be computed from the following equation:

d = (A_{ij} x_1 + B_{ij} y_1 + C_{ij} z_1 + D_{ij}) / (A_{ij}^2 + B_{ij}^2 + C_{ij}^2)^{1/2}.   (12)

First note that, since every plane p_{ij} contains the origin of the coordinate system O(0, 0, 0), the constants D_{ij} are zero for all i and j. Second, the denominator in Eq. (12) is the length of the vector (A_{ij}, B_{ij}, C_{ij}), which is a vector normal to the plane p_{ij}. Eq. (12) can be simplified, first by using a vector N_{ij}(N_{ij1}, N_{ij2}, N_{ij3}) normal to p_{ij} whose length is one, and second by incorporating the fact that D_{ij} = 0:

d = N_{ij1} x_1 + N_{ij2} y_1 + N_{ij3} z_1.   (13)

The vector N_{ij} can be found from the cross product of the vectors Om_i and Om_j:

N_{ij} = (Om_i \times Om_j) / \|Om_i \times Om_j\|.   (14)

Now the recognition problem is expressed as follows: is there a matrix \hat{R}' for which the sum Q_1 of squared distances d_{i,ij} of the object's points M''_i from the corresponding planes p_{ij} is sufficiently small? Let

Q_1 = \sum d_{i,ij}^2,   (15)

where

d_{i,ij} = N_{ij1} M''_{i1} + N_{ij2} M''_{i2} + N_{ij3} M''_{i3}.   (16)

Since M''_i is a linear function of the R'_{kl} (see Eq. (9)), Eq. (16) can be written as follows:

d_{i,ij} = x_{i,ij}^T r - y_{i,ij},   (16a)
where r is a vector containing the nine unknown elements of the matrix R', x_{i,ij} is a vector whose coefficients are computed from the vector N_{ij} and from the model's point M_i, and y_{i,ij} is a scalar representing the assumed translation vector T. Let X be a matrix with the x_{i,ij}^T as rows, and let Y be a vector whose elements are the y_{i,ij}. Then, Eq. (15) has the following form:

Q_1 = \sum d_{i,ij}^2 = \|Xr - Y\|^2.   (17)

The recognition problem requires verifying whether min Q_1 is sufficiently small. Finding min Q_1 is a standard least-squares problem whose solution is

\hat{r} = (X^T X)^{-1} X^T Y.   (18)

After substituting \hat{r} from Eq. (18) into Eq. (17), we obtain

min Q_1 = \|X (X^T X)^{-1} X^T Y - Y\|^2.   (19)

If min Q_1 is close to zero, one can conclude that the image m could have been produced by the model M, or by any other model which is equivalent to M under a 3-D affine transformation. The ambiguity with respect to the affine structure is related to the fact that the elements of the vector r are all free parameters. To perform recognition of a similarity structure, one has to verify whether the affine transformation is a rigid motion. A matrix A represents a rigid motion iff

A^T A = I   (20a)

and

det(A) = 1,   (20b)

where I is an identity matrix. Constraints (20) are non-linear with respect to the A_{kl}; therefore, these constraints cannot be incorporated at the stage of solving the minimization problem, because then the resulting equations would not be linear. So, we first solve the unconstrained minimization, whose result is represented by Eq. (19). Since \hat{r} contains estimates of the elements of R', we have to first estimate the scaling factor 1/M'_{13}. This can be done by finding k (an estimate of 1/M'_{13}) which minimizes \sum_{ij} [k (\hat{R}'^T \hat{R}')_{ij} - I_{ij}]^2, where (\hat{R}'^T \hat{R}')_{ij} and I_{ij} are the corresponding elements of the matrices \hat{R}'^T \hat{R}' and I, respectively. It is easy to show that k = [\sum_{ij} (\hat{R}'^T \hat{R}')_{ij} I_{ij}] / [\sum_{ij} (\hat{R}'^T \hat{R}')_{ij}^2]. Now, the estimated matrix \hat{R}' can be re-scaled to eliminate the size scaling factor:

\hat{R} = (1/k) \hat{R}',   (21)

where \hat{R} is an estimate of R. The departure of \hat{R} from orthonormality can be measured by

Q_2 = \sum_{ij} [(\hat{R}^T \hat{R})_{ij} - I_{ij}]^2.   (22)

We use here distances of the object points from the projecting planes, rather than from the projecting lines, because a plane in 3-D has a simpler parametric representation than a line in 3-D.
If a given image m was produced by the model M, both min Q_1 and Q_2 should be close to zero. Otherwise, either min Q_1 or Q_2, or both, should be large and positive. To obtain a single measure of dissimilarity between the model and the image, one can use the following expression:

d = log[Q_2 min Q_1].   (23)
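Putting Eqs. (14)-(23) together, a compact sketch of the computation might look as follows. This is entirely our reconstruction for illustration: the sign convention for the image plane, the row-major vectorization of R', and the toy test are our choices, and the det(R) = 1 check of Eq. (20b) is omitted.

```python
import numpy as np

def recognize(M, m):
    """M: (n,3) model points; m: (n,2) image points, with m[0] assumed exact.
    Returns (minQ1, Q2, d) following Eqs. (15)-(23). Our convention: the image
    plane is placed at z = +1, so the ray of image point (u, v) has direction
    (u, v, 1) and the translation term of Eq. (9) is (m11, m12, 1)."""
    n = len(M)
    rays = np.c_[m, np.ones(n)]                  # directions O m_i
    t = rays[0]                                  # translation term of Eq. (9)
    rows, rhs = [], []
    for i in range(1, n):                        # M''_1 lies on its ray by construction
        dM = M[i] - M[0]
        for j in range(n):
            if j == i:
                continue
            N = np.cross(rays[i], rays[j])
            N /= np.linalg.norm(N)               # unit normal of plane p_ij, Eq. (14)
            # d_{i,ij} = N . (R' dM + t) = x^T r - y, Eq. (16a); r = vec(R') row-wise
            rows.append(np.outer(N, dM).ravel())
            rhs.append(-N @ t)
    X, Y = np.asarray(rows), np.asarray(rhs)
    r = np.linalg.lstsq(X, Y, rcond=None)[0]     # Eq. (18)
    minQ1 = float(np.sum((X @ r - Y) ** 2))      # Eq. (19)
    G = r.reshape(3, 3).T @ r.reshape(3, 3)      # R'^T R'
    k = np.trace(G) / np.sum(G ** 2)             # least-squares scale estimate
    Q2 = float(np.sum((k * G - np.eye(3)) ** 2)) # Eq. (22); Eq. (20b) check omitted
    return minQ1, Q2, np.log(minQ1 * Q2 + 1e-300)  # Eq. (23)

# toy check: an image generated from the model itself gives minQ1, Q2 near 0
rng = np.random.default_rng(2)
M = rng.uniform(-0.5, 0.5, size=(8, 3))
th = 0.7
R = np.array([[np.cos(th), -np.sin(th), 0.0],
              [np.sin(th),  np.cos(th), 0.0],
              [0.0, 0.0, 1.0]])
Mp = (R @ M.T).T + np.array([0.1, -0.2, 6.0])
m = Mp[:, :2] / Mp[:, 2:]
print(recognize(M, m))
```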
Before we describe the simulation results, we establish the smallest number of points for which our new method can be used. Determining whether there exists a pose of the object (i.e. a matrix R and vector T) for which min Q_1 is small involves, in our method, 11 independent parameters. Each image point generates two constraints. Thus, similarly to the projective case, at least six points are needed to obtain a unique solution. In the next section we present simulation results illustrating the use of our method.
3. Testing the new method

We first describe a simulation experiment with a large number of synthetic images, and then we present one real example. In the simulations, each object was a set of eight points generated randomly within the volume of a cube whose side a = 1. Perspective images were computed for the viewing distance M'_{13} = 6, after the object was randomly rotated in 3-D. After an image was computed, the positions of the image points were perturbed by imposing zero-mean Gaussian noise on both image coordinates, with a standard deviation of 1, 2 or 5% of the projected distance of a line segment of length 2a. For each noise level, there were 100 "same" trials, in which the minimization problem was solved for an object in the reference coordinate system and an image of this object. In another set of "different" trials, the minimization problem was solved for an object in the reference coordinate system and an image of another random object.

We first tested this method on "same" trials with no noise on the image. This resulted in min Q_1 = 0 and Q_2 = 0, as well as in accurate estimates of M'_{13} and R. In Fig. 1 we show the results of simulations with noise of 1, 2 and 5%. This figure shows frequency histograms of the dissimilarity measure d. The histograms for "different" trials are shifted to the right relative to the histograms for "same" trials. This means that, by setting a criterion for the smallness of d, shapes can be correctly recognized with probability above chance level even in the case of 5% noise.
It may be more effective to use min Q_1 and Q_2 as two independent measures with two criteria for smallness.
Fig. 1. Frequency histograms of the dissimilarity measure d, illustrating the performance of our method in the shape recognition task for different levels of noise on the image.
It has to be pointed out that the value of d is likely to be affected by the size of the retinal image and by the number of points. These parameters were kept constant in our simulations. In general, however, they are not constant. But since these two parameters are known to the observer, it should be possible to incorporate them in deciding on the criterion for the smallness of d.

In our test involving a real example, we used a table as the standard object and its camera image with 160 x 120 pixels (Fig. 2). The center of the image was assumed to be the principal point. The focal length of the camera was estimated from the image of a segment of known length placed orthogonally to the visual axis of the camera at a known distance. The field of view of the camera was about 50 degrees. The size of the image (in units of the focal distance) was about 2.5 times larger than the sizes of the images in our simulation experiment. The image coordinates of the eight vertices of the table were measured by hand. We obtained min Q_1 = 259 and Q_2 = 0.027. Note that, before we can compare this result with the histograms shown in Fig. 1, we have to take into account the fact that the image size in the current
Fig. 2. A real image used in our experiment (see text for more details).
example was larger than the sizes of the images in our simulation experiment. This difference in the sizes of the images implies that, in this experiment, the size of the backprojected object (i.e. of the object whose coordinates are represented by Eq. (9)) was, on average, 2.5 times larger than the sizes of the backprojected objects in the simulation experiment. As a result, a given mismatch between the shape of the object and the shape of the image leads, in this experiment, to a sum of squared distances Q_1 that is (2.5)^2 times larger than in our simulation experiment. Therefore, in order to be able to compare min Q_1 (and hence the value of d) in the two experiments, we divided min Q_1 from this experiment by (2.5)^2. This resulted in d = 0.11. This value falls well within the range of values representing "same" trials in our simulation experiments. This means that a criterion for the smallness of d derived from the simulation experiment would be likely to lead to a correct recognition in this experiment as well.
4. Summary and conclusions

We presented a method for object recognition from a single perspective image obtained by a calibrated camera. Specifically, the method allows recognizing an object up to (i) a 3-D affine transformation, and (ii) a 3-D similarity transformation. Thus, our method, together with the projective method described in Section 2.1, leads to a hierarchy of structures of the object to be recognized, which is analogous to the hierarchy described by Faugeras for the case of binocular reconstruction [9].

Our method represents a model-based approach, as opposed to an invariant-based approach [6]. Model-based methods have traditionally been considered less interesting because they are computationally intensive: having N models stored in memory, recognition of an object requires, in the worst case, performing N comparisons. Invariant-based methods are much faster, since they involve intrinsic properties of an object (and of its image); as a result, the value of the invariant property serves as an address of the model in memory. It is known, however, that invariants have some fundamental limitations. Specifically, there are no general-case invariants for the 3-D to 2-D transformation, which excludes many solid objects [10,11]. Model-based methods, on the other hand, do not have such limitations; they can be applied to any object. The method proposed here tries to resolve this traditional dichotomy between computational efficiency and generality. Specifically, our model-based method allows for a faster memory search than traditional model-based methods, as explained next (a sketch of the resulting search follows below).

Consider a set of M families, each family having N models, the models being projectively equivalent to one another within a family. Assume that the object to be recognized is a member of one of these families. Clearly, one does not have to perform all M x N comparisons. It is sufficient to apply our method to only one arbitrarily chosen model from each family, and thus perform at most M comparisons. If one finds that min \|Aq\| is close to zero for a given family, one does not have to consider any other family of models. The recognition process may either stop (if a projective structure is sufficient) or proceed to the next step in order to obtain recognition at the next level, corresponding to affine structure. In this case, one would need to check one member from each of the affine-equivalent families of models corresponding to the given projective family. Again, if for a given family min Q_1 is close to zero, one does not have to consider any other family. The recognition process may either stop (if an affine structure is sufficient) or proceed to the level corresponding to similarity structure. In fact, this last step can even be skipped, since the matrix \hat{R} (Eq. (21)) provides the information about the 3-D affine transformation that should be applied to this model in order to obtain (approximately) the similarity structure of the presented object. It follows that one needs to store in memory only one member of each family of 3-D affine-equivalent objects, and recognition of any object from this family is possible regardless of the fact that this family may contain an arbitrarily large number of objects. Clearly, even though our method does not eliminate memory search altogether (one has to store one example for every family of projective and affine equivalent objects), it does speed up the memory search and, furthermore, it allows reducing the size of the memory (since it requires that only one member of each affine family be stored).
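In pseudocode-like Python, the hierarchical search reads as follows. This is entirely our illustration: proj_res and aff_res stand for the min \|Aq\| and min Q_1 computations described above and are passed in as callables, and the family data structure and threshold names are assumptions.

```python
def recognize_hierarchically(image, families, proj_res, aff_res,
                             eps_proj, eps_aff):
    """families[p] is a projective-equivalence family, stored as a list of
    single representatives, one per affine-equivalent subfamily.
    proj_res(image, model) returns min ||Aq||; aff_res(image, model)
    returns min Q1 (see the earlier sketches)."""
    for proj_family in families:
        # one arbitrary member suffices to test the whole projective family
        if proj_res(image, proj_family[0]) > eps_proj:
            continue                          # reject every member at once
        for model in proj_family:             # descend to the affine level
            if aff_res(image, model) < eps_aff:
                return model                  # R-hat then yields the affine map
    return None                               # no family matched
```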
5. Summary

In this paper, the recognition of 3-D objects that have distinctive points is considered. The paper begins with a brief presentation of the use of homogeneous coordinates in the recognition of an object (up to an arbitrary
projective transformation) in the case of an uncalibrated camera. It is shown how this approach can be extended to a calibrated camera in the problem of recognizing an object up to an arbitrary similarity or projective transformation. Next, a new method is presented which allows recognition of an object up to a similarity or affine transformation. This new method consists in minimizing the sum of squared distances of the object's points from the projecting planes formed by pairs of image points and the center of projection of the camera. This minimization problem reduces to solving a set of linear equations and thus is computationally simple. The method first leads to recognition of the affine structure of the object. Then, in order to determine the similarity structure, one has to check the non-linear constraints of an orthonormal matrix representing a 3-D linear transformation. This new method was tested on a set of synthetic images in the presence of image noise, and a real example is presented. In the last section, this new method is compared to conventional model-based and invariant-based methods: the new method (unlike the invariant-based methods) is not object specific; that is, it can be applied to any 3-D object, as long as the object has distinctive points. At the same time, the new method does not require (as conventional model-based methods do) comparing the retinal image to all models stored in memory.

References

[1] Z. Pizlo, A theory of shape constancy based on perspective invariants, Vision Res. 34 (1994) 1637–1658.
[2] Z. Pizlo, A. Rosenfeld, Recognition of planar shapes from perspective images using contour-based invariants, CVGIP: Image Understanding 56 (1992) 330–350.
[3] Z. Pizlo, A. Rosenfeld, I. Weiss, The geometry of visual space: about the incompatibility between science and mathematics, Comput. Vision Image Understanding 65 (1997) 425–433.
[4] Z. Pizlo, A. Rosenfeld, I. Weiss, Visual space: mathematics, engineering and science, Comput. Vision Image Understanding 65 (1997) 450–454.
[5] L.G. Roberts, Machine perception of three-dimensional solids, in: J.T. Tippett, D.A. Berkowitz, L.C. Clapp, C.J. Koester, A. Vanderburgh (Eds.), Optical and Electro-Optical Information Processing, MIT Press, Cambridge, MA, 1965, pp. 159–197.
[6] J.L. Mundy, A. Zisserman, Geometric Invariance in Computer Vision, MIT Press, Cambridge, MA, 1992.
[7] O. Faugeras, Three-Dimensional Computer Vision: A Geometric Viewpoint, MIT Press, Cambridge, MA, 1993.
[8] R.M. Steinman, Effect of target size, luminance and color on monocular fixation, J. Opt. Soc. Am. 55 (1965) 1158–1165.
[9] O. Faugeras, Stratification of three-dimensional vision: projective, affine, and metric representations, J. Opt. Soc. Am. A 12 (1995) 465–484.
[10] J.B. Burns, R. Weiss, E.M. Riseman, View variation of point set and line segment features, Proceedings of the DARPA Image Understanding Workshop, 1990, pp. 650–659.
[11] C.A. Rothwell, D.A. Forsyth, A. Zisserman, J.L. Mundy, Extracting projective structure from single perspective views of 3D point sets, Proceedings of the IEEE International Conference on Computer Vision, 1993, pp. 573–582.
About the Author: ZYGMUNT PIZLO received the M.Sc. degree in electronic engineering from the Technical University of Warsaw, Poland, in 1978, the Ph.D. degree in engineering from the Institute of Electron Technology, Warsaw, in 1982, and the Ph.D. degree in Psychology from the University of Maryland, College Park, in 1991. Currently he is with the Department of Psychological Sciences, Purdue University, West Lafayette, IN. His research interests are in computational vision.

About the Author: KIRK LOUBIER received a B.A. degree in Psychology from Purdue University in December 1997. Currently he is with the Biological Psychiatry Branch of the National Institute of Mental Health. His research interests include bipolar illness and post-traumatic stress disorder, as well as the mathematical modeling of visual perception.
Pattern Recognition 33 (2000) 1683–1699
Planar shape recognition by shape morphing Rahul Singh, Nikolaos P. Papanikolopoulos* Department of Computer Science and Engineering, University of Minnesota, 4-192 EE/CS Building, 200 Union Street SE, Minneapolis, MN 55455, USA Received 16 December 1998; received in revised form 18 June 1999; accepted 18 June 1999
Abstract

A novel method based on shape morphing is proposed for 2D shape recognition. In this framework, the shape of objects is described by using their contour. Shape recognition involves a morph between the contours of the objects being compared. The morph is quantified by using a physics-based formulation. This quantification is used as a dissimilarity measure to find the reference shape most similar to the input. The dissimilarity measure is shown to have the properties of a metric as well as invariance to Euclidean transformations. The recognition paradigm is applicable to both convex and non-convex shapes. Moreover, the applicability of the method is not constrained to closed shapes. Based on the metric properties of the dissimilarity measure, a search strategy is described that obviates an exhaustive search of the template database during recognition experiments. Experimental results on the recognition of various types of shapes are presented. 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Shape recognition; Shape morphing; Content-based retrieval; Pen-based computing
1. Introduction

Shape recognition is a fundamental problem of pattern recognition and machine vision. This problem arises in a variety of contexts, examples of which include retrieval by content from image databases, document image analysis, automated industrial inspection, analysis of medical imagery, as well as vision-based robotics and visual tracking. A large number of shape recognition techniques have been proposed in the literature. Broadly speaking, these may be classified by the shape representation framework and the dissimilarity measure used. Various schemes proposed for shape representation include representation using global features like moments [1], Fourier descriptors [2], autoregressive coefficients [3], texture [4], and color [5,6]. Other representational techniques have employed local features [7], eigenmode representations [8], subspace representations [9], skeletons [10],
* Corresponding author. Tel.: +001-612-625-0163; fax: +001-612-625-0572. E-mail addresses: [email protected] (R. Singh), npapas@cs.umn.edu (N.P. Papanikolopoulos).
part-based descriptions [11], and boundary-based (contour) representations [12–14]. Some examples of the various dissimilarity measures used in shape recognition include deformation-based measures like modal deformation energy [8], applications of standard metrics either directly to geometric shape descriptors [15] or to transformed shape representations [9], and measures defined on non-geometric shape attributes like color [5].

The problem of recognition when shapes are encoded by their contours is interesting for a multitude of reasons. First, contours are easy to obtain and can be used for shape representation regardless of shape convexity or the lack thereof. Second, recognition techniques based on contour analysis have broad applicability, since contours are one of the most commonly used shape descriptors [16]. Finally, due to the fact that contours may be considered as high-order curves, algorithms for contour-based recognition are potentially extendible to more generic shape classes, including cursive words and hand-drawn sketches. In the context of contour-based shape representation, some of the dissimilarity measures that have been used include the sum of squares of the Euclidean distances from each vertex of a polygon to the convex hull of the other polygon [15], the L_2 distance between
the turning functions of two polygons [17], the Hausdorff metric [18], and elastic deformation energy [19,20].

Generally speaking, in order to be effective, a recognition measure should satisfy the following properties proposed by Arkin et al. [17]:

- The measure should be a metric.
- It should be invariant under translation, rotation, and scale change.
- It should be easy to compute.
- It should match intuitive notions of shape resemblance.

From an applied perspective, certain additional properties are desirable in a generic shape recognition technique. These include the applicability of a method regardless of shape convexity, as well as its ability to deal with closed as well as open shapes. The latter requirement is of primary importance in applications like OCR and recognition problems in pen-based computing. Additionally, it contributes to the robustness of a recognition system, since in real images noise and errors in edge linking may lead to non-closure of object contours. Another important property is the ability of a recognition system to handle shape deformations. Many contemporary applications, like content-based retrieval by matching image contours with hand-drawn sketches, recognition of articulate shapes, and handwriting recognition, require a recognition methodology to capture the perceptual similarity of two shapes. Often this means that two shapes have to be placed in the same similarity class even if they are deformed versions of each other. In such cases, many conventional distance measures perform poorly, since the mathematical descriptions of the deformed shapes may not exactly match each other [21].

In the recent literature, different solutions have been proposed to address the above issues. Bookstein [22] proposed the use of thin-plate splines to model shape deformations. Sclaroff et al. [8] describe a closed shape in terms of the eigenvectors of its stiffness matrix; shape similarity is defined as the amount of modal deformation energy required to align two shapes. Yuille et al. [23] have used deformable templates to identify and track facial features. In the area of image retrieval, the QVE system [24] involves computing the correlation (with limited horizontal and vertical shifts) between a user sketch and an edge image in the database. Bimbo et al. [20] propose an elastic matching technique where the degree of matching, along with the deformation energy spent, is used to rank the similarity of hand-drawn sketches with database images. A similar idea has been used by Azencott et al. [19] for the recognition of generic planar shapes.

The method described in this paper attempts to address this problem while conforming to the theoretical and applied criteria mentioned above. It differs from the
techniques mentioned in the previous paragraph in that the identity of a shape is established by morphing its contour to templates stored in a database and using a quantification of the morph as a dissimilarity measure. The quantification is formulated in terms of the stretching and bending of the contours and is invariant to similarity transformations. Unlike modal matching [8], this formulation is not restricted to closed contours, nor does it require extensive a priori shape modeling. Furthermore, the approach does not seek to model deformations based on simple horizontal and vertical shifts during the convolution of the template with the image. While the underlying idea of the proposed method is conceptually similar to the work of Azencott et al. [19] and Bimbo et al. [20] in that it uses deformations for matching shapes, it is (unlike the aforementioned works) invariant to rotations. Additionally, the morph provides, via the synthetic images, an image-plane representation of the shape and pose transformation between the input and the template. In Refs. [25,26] we have shown that these synthetic images can be interleaved with real images of an object and used as visual feedback for an eye-in-hand manipulator. Based on the apparent motion described by the virtual images, the trajectory of the manipulator is controlled to perform positioning [26] and grasping tasks [25]. Thus, the approach described can form the basis of a unified framework for addressing the problem of shape recognition, as well as that of using recognition to control purposive robotic actions.

We organize this paper by looking at shape representation issues in Section 2. The recognition method is described in Section 3. In Section 4, we propose a method to reduce the number of direct shape comparisons required for recognition. The experimental results are presented in Section 5. Finally, in Section 6, the paper is summarized, conclusions are drawn, and future work is outlined.
2. Shape representation and modeling

Shapes are represented in our approach by their contours. This choice is based on the ability of shape contours to effectively capture the visual form, as well as on the applicability of contours to the representation of different types of shapes. The contour of each shape is modeled piecewise by virtual wires. Shape morphing occurs through deformation (stretching and bending) of the artificial wires. In this formulation, the shape recognition problem can be treated as an energy minimization problem, where shape similarity is quantified by the energy consumed in stretching and bending one wire-form contour model to another. Shape morphing is guided by a few key points, which are determined by segmenting the object contour.
2.1. Contour segmentation

Contour segmentation is a well-established area in computer vision and many segmentation algorithms have been proposed. A taxonomy of these techniques can be provided by broadly dividing them into two classes: methods [27–29] that place segmentation points by minimizing some error norm, and methods [27,30–33] based on the identification of perceptually important points (corners). Contour segmentation can, in general, significantly reduce the number of contour points while maintaining a sufficiently accurate shape description. In the current context, the two issues of importance are representational accuracy and representational consistency. For rigid polygonal shapes, both error-based segmentation and dominant point detection techniques can provide adequate representational accuracy. The problem lies in that the results of polygonal approximation may differ (in terms of the number of segmentation points and their placement), especially in the presence of noise and orientation changes. On the other hand, in deformable shapes like hand-drawn line figures, sections of the contour are often characterized by slowly varying curvature. Segmentation of such contours using dominant point detection techniques leads to poor reconstruction [34]. Moreover, small deformations can significantly alter the number of segmentation points obtained with an error-based method. In this work, depending on the problem context, we have used two different segmentation techniques. For problems like the recognition of hand-drawn figures, where shapes from the same class may vary due to deformations, we use the segmentation algorithm proposed by Pavlidis et al. [35]. For the recognition of rigid shapes, we use an algorithm that is based on a modification of the error-based segmentation technique of Ray et al. [29]. The modifications are primarily designed to obtain consistent segmentation results. In the following, we provide a brief description of the above techniques.

The basic idea of the algorithm proposed by Pavlidis et al. [35] is to represent a contour as a succession of high curvature points (corners) and relatively low curvature regions, each of which is represented by a single point called a key low curvature point (see Fig. 1). The actual algorithm consists of two parts. In the first part, corner points are detected by using the algorithm proposed by Brault and Plamondon [30]. These are then interleaved with low curvature points, which are computed by using a criterion that is conjugate to the one used for computing the corners. The primary advantages of this algorithm lie in the consistent segmentation pattern (...corner, key low curvature point, corner,...) and in the automatic identification of the region of support for computing both types of curvature extrema. Fig. 1 shows the segmentation and reconstruction of some hand-drawn shapes by using this technique.
Fig. 1. Segmentation-point placement and B-spline reconstruction using Pavlidis' algorithm [35]. Curvature maxima are indicated by small squares and curvature minima are shown as small discs.
The segmentation of rigid objects is done by using a modified version of the error-based algorithm of Ray et al. [29]. This algorithm produces a piecewise linear approximation of a contour by determining the longest possible line segment that can fit a set of contour points with the smallest possible error. In our modifications, this algorithm is extended by computing the curvature at each segmentation point. Points having extremely low curvature are then suppressed. As a result of this operation, redundant points are removed while the significant corner points are preserved. The segmentation list may, however, still contain points that are due to noise or quantization errors. An additional merging procedure, similar to that suggested by Huang and Wang [13], is then applied to remove any such points. The strategy is based on computing the deviation of each segmentation point from the chord joining its neighbors. If this deviation is less than a predefined threshold, the corresponding point is removed. This merging procedure is repeated until no segmentation point with a deviation less than the threshold remains. In Fig. 2 we provide some examples of the algorithm's performance on rigid shapes at different orientations and positions.
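The merging step translates into a few lines of code. The sketch below is our illustration only; the threshold value and the removal order (smallest deviation first) are our choices, not details given in the paper.

```python
import numpy as np

def merge_segmentation_points(pts, threshold):
    """pts: (n,2) segmentation points along a contour. Repeatedly removes the
    interior point whose deviation from the chord joining its two neighbours
    is smallest, as long as that deviation stays below the threshold."""
    pts = [np.asarray(p, float) for p in pts]

    def deviation(a, b, c):
        # perpendicular distance of b from the chord a-c
        ac, ab = c - a, b - a
        return abs(ac[0] * ab[1] - ac[1] * ab[0]) / (np.hypot(*ac) + 1e-12)

    while len(pts) > 2:
        devs = [deviation(pts[i - 1], pts[i], pts[i + 1])
                for i in range(1, len(pts) - 1)]
        i = int(np.argmin(devs))
        if devs[i] >= threshold:
            break
        del pts[i + 1]          # devs[i] belongs to interior point pts[i+1]
    return np.array(pts)

# the nearly collinear point (1, 0.01) is merged away; the corner (2, 0) stays
print(merge_segmentation_points([(0, 0), (1, 0.01), (2, 0), (3, 2)], 0.05))
```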
3. Shape morphing and recognition

Let S^I and S^T be the input and the target shapes, respectively. The morph of S^I to S^T is a transformation of the shape, pose and other available image attributes (like color or texture) of S^I to those of S^T. The morph is characterized by a sequence of intermediate images that depict this transformation. In the current
formulation, the shape of an object is described by its contour. The morph between two objects is therefore defined as the morph between their respective contours. In Fig. 3, two morphs are shown. In the first case (top row), the morph occurs between two instances of the same object differing in terms of their pose with respect to the camera. The intermediate images synthesized during the morph show, predominantly, the progressive transformation of the input pose to that of the target. In the lower row, we present an example where the morph occurs between two different shapes.

Let the input and the target shapes, as represented by their segmentation points, be denoted as S^I = [S^I_1, ..., S^I_n] and S^T = [S^T_1, ..., S^T_n], respectively. One possible way to morph S^I and S^T is through a cross-dissolve operation on the corresponding segmentation points of the
two contours [36]:

S(t) = u S^I + t S^T
     = [u S^I_1 + t S^T_1, u S^I_2 + t S^T_2, ..., u S^I_n + t S^T_n]
     = [S_1(t), S_2(t), ..., S_n(t)],   (1)

where u = 1 - t and S_i(t) is the ith contour point of the intermediate shape formed at time t. The time parameter t is normalized to the interval [0, 1]. The contours S^I and S^T will, in general, have a different number of segmentation points. Therefore, for a morph as defined by the cross-dissolve operation of Eq. (1) to occur, a point correspondence between the segmentation points in the input and the target is needed, wherein every segmentation point on the input contour corresponds to at least one segmentation point on the target contour and vice versa.
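In code, once the point correspondence is available, the cross-dissolve of Eq. (1) is a one-line blend. This is our illustration; the two example contours are arbitrary.

```python
import numpy as np

def cross_dissolve(S_in, S_tgt, t):
    """Eq. (1): S(t) = (1 - t) S^I + t S^T for corresponding contour points,
    with t normalized to [0, 1]."""
    return (1.0 - t) * np.asarray(S_in, float) + t * np.asarray(S_tgt, float)

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
diamond = [(0.5, -0.2), (1.2, 0.5), (0.5, 1.2), (-0.2, 0.5)]
# five intermediate shapes of the morph
frames = [cross_dissolve(square, diamond, t) for t in np.linspace(0.0, 1.0, 5)]
```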
Fig. 2. Segmentation results for some shapes.
Fig. 3. Shape morphing.
The following formulation for the computation of point correspondences is motivated by the physics-based approach of Sederberg et al. [36]. The cost of a point correspondence is defined as the sum of the stretching and bending required to deform the wire-form contour so as to bring about the required correspondence. The stretching energy is computed for every segment (pair of points) and is defined as

E_s = k_s |(L_T - L_O) - (L_I - L_O)| / [(1 - c_s) L_min + c_s L_max],   (2)

where

L_min = min(L_O, ..., L_I, L_T)

and

L_max = max(L_O, ..., L_I, L_T).

In Eq. (2), E_s denotes the stretching energy spent in the current deformation; L_O, L_I, and L_T denote the segment lengths at the beginning, before the current deformation, and after the current deformation, respectively. The term c_s corresponds to the penalty for segments collapsing to points, and k_s is the stretching stiffness parameter. The bending energy E_b is computed for point triplets and denotes the cost of angular deformation. It is defined as

E_b = k_b |(θ_T - θ_O) - (θ_I - θ_O)|,   (3)

where k_b indicates the bending stiffness, θ_O represents the original angle, and θ_I and θ_T denote the angle before the current deformation and the angle after the current deformation, respectively.

The optimal morph between two contours is determined by the correspondence requiring the least stretching and bending energy. By constraining the deformations at the segmentation points, the following optimal substructure property may be observed: the optimum cost of the point correspondence (S^I_i, S^T_j) equals the optimum cost of the prior point correspondence (S^I_{i-1}, S^T_j), (S^I_{i-1}, S^T_{j-1}) or (S^I_i, S^T_{j-1}), plus the cost of establishing the correspondence (S^I_i, S^T_j). Based on the above, an efficient (O(mn)) dynamic programming scheme can be constructed for
morphing a contour C_a with m points to another contour C_b having n points.

Since the energy computation described above requires a starting point correspondence, we define the optimal morphing between two contours C_a and C_b as

D_morph(C_a, C_b) = min_{Ω} E(C_a, C_b).   (4)

Here, Ω denotes the set of all starting point correspondences between the contours C_a and C_b. D_morph(C_a, C_b) (hereafter called the degree of morphing) denotes the cost of the optimal morph between the contours C_a and C_b.

Since the formulation of the morph is based on a linear cross-dissolve operation (see Eq. (1)), physically valid intermediate shapes in the morph, even between instances of the same object, are not guaranteed unless the input and the target shapes are rotationally aligned. Such a lack of physical validity is expressed by a crossover of the object contours during the intermediate stages of the morph. An example of a morph with physically invalid intermediate images is presented in Fig. 4. Physically invalid morphing is inconsistent in the sense of our formulation, because the deformations caused by the crossovers are due to alignment differences and not shape differences. A solution to this problem can be obtained by warping the contours prior to the application of the cross-dissolve operation. This warp can be defined on the basis of the observation that the point correspondence obtained during the computation of the optimal morph is invariant to translation and rotation of the objects. Based on this correspondence, elongation axes can be computed for each shape. The rotation transformation between the two shapes can be estimated by computing the angle between the elongation axes. Similarly, the translational discrepancy between the shapes may be obtained by computing the vector joining the centroids of the two shapes. The shape morphing process is thus divided into two stages. The sequence of intermediate images generated during the first stage (warping) exhibits a progressive rectification of the rigid transformations (translation and rotation) between the input and the target shapes. This rectification involves an update of the coordinates describing the input contour.
Fig. 4. A physically invalid morph.
The correspondences between the input and the target contours, however, remain unaffected. These correspondences, along with the updated values of the contour points, are then used to rectify the shape deformations between the input and the target using the cross-dissolve operation. We would like to point out that, in our formulation, the cross-dissolve operation is defined only on the geometric description of the shapes (the coordinates of the segmentation points). The recognition process thus consists of the following three steps:

1. Shape recognition: The template closest to the input shape, in terms of the stretching and bending energies, is identified. The correspondences between the segmentation points of the input and the target are determined by using the optimal substructure property described above (a sketch of this dynamic program follows below).
2. Rotational and translational alignment: The elongation axis and the centroid are computed for the input and its corresponding template. Based on the translation vector between the shape centroids and the angle between the elongation axes, the input coordinates are updated to align the shapes.
3. Rectification of deformations: The updated input coordinates, along with the correspondences computed during shape recognition, are used to deform the input contour (by stretching and bending) until it becomes identical to the template.

3.1. Analysis of the recognition technique

Invariance of the dissimilarity measure to translation and rotation follows from the fact that the contours are described in object-centered, rather than absolute, coordinate systems. Invariance to scale changes is obtained by normalizing the contour length. Furthermore, neither the formulation nor the computation of the dissimilarity measure involves any assumptions about the closure or convexity of the contours. The intuition behind the proof of the metric properties of the measure comes from the fact that we model the shape changes as a conservative system. A formal proof of these properties is presented in Appendix A.
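A minimal version of the O(mn) dynamic program is sketched below. This is our illustration only: the local cost is simplified to a squared point distance instead of the full stretching-and-bending energy of Eqs. (2) and (3), and the starting correspondence A[0] <-> B[0] is fixed rather than minimized over Ω as in Eq. (4).

```python
import numpy as np

def correspondence_cost(A, B):
    """A: (m,2) and B: (n,2) segmentation points. Returns the optimal
    correspondence cost under the optimal substructure property: cost(i,j)
    extends cost(i-1,j), cost(i-1,j-1) or cost(i,j-1) by a local cost."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    m, n = len(A), len(B)
    local = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)   # (m, n) local costs
    D = np.full((m, n), np.inf)
    D[0, 0] = local[0, 0]
    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                continue
            prev = min(D[i - 1, j] if i > 0 else np.inf,
                       D[i, j - 1] if j > 0 else np.inf,
                       D[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            D[i, j] = prev + local[i, j]      # every point of A and B is matched
    return D[m - 1, n - 1]

print(correspondence_cost([(0, 0), (1, 0), (1, 1)],
                          [(0, 0), (0.5, 0), (1, 0), (1, 1)]))
```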
4. Reducing shape comparisons by using the triangle inequality and the primordial shape

The shape recognition framework described thus far assumes an exhaustive search of the database, involving the comparison of each template with the input shape in order to find the template closest to the input. Efforts to avoid an exhaustive search of the database in answering a similarity query are needed for the
following two reasons:

1. The relatively high cost of computing the dissimilarity function on-line.
2. The potential increase in the size of the image database.

In this section, we examine the use of reference shapes, in conjunction with the metric properties of the proposed dissimilarity measure, to avoid an exhaustive search of the image database. The basic idea [37–39] lies in determining how the shapes in the database are related to a predetermined reference shape. If, in addition, the similarity between the input (query) shape and this reference shape can be computed, then the templates in the database that are highly dissimilar from the query can be excluded from contention, without resorting to costly on-line comparisons.

Let S^I be the input shape (query). Further, let D = {S^T_1, S^T_2, ..., S^T_n} be the image database. Denote by C_a, C_b, and C_c three arbitrary shapes (contours). Let D_morph(C_a, C_b) be the dissimilarity measure between the shapes C_a and C_b, and let S^T_R be the reference shape, S^T_R ∈ D.

Since D_morph(C_a, C_b) is a metric, it has the following properties:

D_morph(C_a, C_b) ≥ 0,   (5)

D_morph(C_a, C_b) = 0 ⇔ C_a ≡ C_b,   (6)

D_morph(C_a, C_b) = D_morph(C_b, C_a),   (7)

D_morph(C_a, C_b) + D_morph(C_b, C_c) ≥ D_morph(C_a, C_c), for all C_c.   (8)

Given an arbitrary shape C_z in the image database, it follows from the triangle inequality (8) that

D_morph(S^I, C_z) ≥ D_morph(S^I, S^T_R) - D_morph(C_z, S^T_R).   (9)

Let e be some threshold on shape similarity. For instance, e may be selected as the distance between the input shape S^I and the template shape that is closest to the input after a partial search through the database. It follows from Eq. (9) that if

D_morph(S^I, S^T_R) - D_morph(C_z, S^T_R) ≥ e,   (10)

then the comparison between S^I and C_z does not need to be considered. This is because the definition of e guarantees the existence of at least one shape in the database that is closer to S^I than C_z with respect to the dissimilarity measure D_morph(., .). Similarly, we also have from the triangle inequality

D_morph(C_z, S^I) + D_morph(S^I, S^T_R) ≥ D_morph(C_z, S^T_R).   (11)

But from the symmetry of the dissimilarity measure, we have

D_morph(C_z, S^I) = D_morph(S^I, C_z).   (12)
R. Singh, N.P. Papanikolopoulos / Pattern Recognition 33 (2000) 1683}1699
Substituting the above in Eq. (11) and simplifying we get D (S', CX)*D (C , S2)!D (S', S2). KMPNF KMPNF X 0 KMPNF 0 Once again if
(13)
D (C , S2)!D (S', S2)*e, (14) KMPNF 0 KMPNF X 0 then the comparison between S' and CX does not need to be carried out. The constraints in Eqs. (10) and (14) describe the criteria that can be used to exclude shapes in the database without in#uencing the correctness of the recognition process [37]. The comparison of the database shapes with the reference shape, D (C , S2), is computed o!KMPNF X 0 line. The comparison between the input shape and the reference shape, D (S', S2), is done on-line for every KMPNF 0 new input. If during the comparisons, a template shape is found that is closer to the query shape than the best match obtained till that point in the search, then the identity of the best match is updated to this template. The speedup obtained by using the above idea depends on the computation of the threshold e. Computing a good value of e can substantially ameliorate the search complexity. However, the di$culty lies in that computing a good value of e itself involves searching large parts of the image database. A possible solution to these mutually con#icting goals can be obtained by using the idea of a primordial shape. Every shape CY in the database is morphed to a primordial shape S2. The primordial shape . can be, for instance, a point or a line. The value of the morph D (C , S2), then becomes an indicator of KMPNF Y . the complexity of the shape CY. An unknown input S' is "rst morphed to the primordial shape S2. The value of . this morph is used to identify a subset DS of the image database D. The subset DS consists of shapes having the same order of similarity as S' with respect to S2. . Formally, DS"+CY: CY3D, "D (C , S2)!D (S', S2)"(d,. KMPNF Y . KMPNF . (15) In the above equation, d is a parameter the value for which is provided. The reference shape can then be selected either as the primordial shape or as the shape closest to the input S' in terms of the cost of the morph from S2. The search space is pruned by computing the . dissimilarity measure over the shapes in DS. An alternative to providing the parameter d in Eq. (15), is to use a K-nearest neighbor rule to select the templates for which the shape comparison is carried out.
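The pruning logic of Eqs. (10), (14) and (15) can be summarized in a short sketch. The function d_morph below is a placeholder for the dissimilarity measure D_morph (any metric over contours would do), and all names are our own illustration rather than the authors' implementation.

```python
import numpy as np

def prune_and_search(query, templates, d_morph, primordial, delta):
    """Return the best-matching template, skipping comparisons that
    the triangle inequality proves cannot improve the best match."""
    # Off-line (precomputable): distance of every template to the
    # primordial/reference shape.
    ref_dist = [d_morph(t, primordial) for t in templates]
    # On-line: a single morph of the query to the primordial shape.
    q_ref = d_morph(query, primordial)

    # Eq. (15): candidate subset of comparable shape complexity.
    candidates = [i for i, r in enumerate(ref_dist)
                  if abs(r - q_ref) < delta]

    best_i, best_d = None, np.inf
    for i in candidates:
        # Eqs. (10)/(14): |q_ref - ref_dist[i]| is a lower bound on
        # D_morph(query, template_i).
        lower = abs(q_ref - ref_dist[i])
        if lower >= best_d:        # cannot beat the current best match
            continue
        d = d_morph(query, templates[i])   # costly on-line morph
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d
```

The correctness of skipping a template rests entirely on the metric properties (5)-(8); only the speedup, not the final answer, depends on the choice of delta and of the reference shape.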
5. Experimental results

Three different sets of experiments were used to test the performance of the proposed system. In each of the experiments, a different data set was employed to test
the applicability of the method in different domains. In the first experiment the method was used for the recognition of rigid objects. In this case the test images differed from the templates in terms of Euclidean transformations as well as due to poor thresholding and/or partial occlusions. The second experiment was designed to test the recognition performance on shapes which differ from their corresponding templates both by Euclidean transformations and by deformations. The applicability of the method to the recognition of open shapes (handwritten cursive words in an on-line setting) was studied in the final experiment.

5.1. Recognition of rigid objects

The effectiveness of the proposed method was validated in an experiment with 16 objects. An image of each object, taken at an arbitrary location and orientation with respect to the camera, was stored as a template. The imaging plane was assumed to be perpendicular to the optical axis of the camera. In Fig. 5, the objects used in the experiment are shown. For each object, five test images were captured by arbitrarily varying the object pose in terms of translation, rotation, and scale. The object contours were extracted after automatic moment-preserving thresholding. The recognition results are summarized in Table 1. In Fig. 6, examples of some noisy shapes from the test set are shown (bottom row), along with their corresponding templates (top row). For the cases shown in columns (a) and (b), the translation of the camera led the object to move partially out of the imaging region. In the first instance (column (a)), the recognition was not affected. For the instance shown in column (b), a misrecognition occurred. The shape variations shown in columns (c) and (d) were due to poor thresholding. Both instances were identified correctly.

The average values of the dissimilarity measure over the test cases for each shape are presented in Tables 2 and 3. The lowest value of the morph for each test set is underlined. In Table 4, we present the results of pruning the database search based on the use of the triangle inequality and the primordial shape. The test set used in this experiment is the same as the one for which the recognition values are presented in Tables 2 and 3. In this experiment, the value of the morph D_morph(S_i^T, S_P^T) between each template shape S_i^T and the primordial shape S_P^T was computed off-line, along with the morph values between the templates. In the on-line recognition phase, each input was initially morphed to the primordial shape. The value of this morph, D_morph(S^I, S_P^T), was then compared with the values of the morphs D_morph(S_i^T, S_P^T). A selection criterion (K-nearest neighbors, denoted henceforth as k-NN) was used to define the subset of template shapes from which the identity of the input was established using the constraints defined in Eqs. (10) and (14). The initial reference shape S_R^T for each subset of templates was defined to be the template which was closest to the input in terms of the distance from the primordial shape.
Fig. 5. Examples of rigid shapes.

Table 1
Recognition results for rigid objects

Templates   Test shapes   Correctly recognized   Misrecognized   Recognition rate
16          80            79                     1               98.75%
For the test set considered in Table 4, an average of 2.50 shape comparison operations of complexity O(mn) was required for a recognition rate of 100%. In comparison, the general case, where the query is compared to each of the templates, requires 15 such comparisons. In the given experiment, the correct template for each test set was in the list of templates obtained by using the 3-NN rule. This, along with the use of Eqs. (10) and (14), led to the reduction in the number of comparisons. With an increase in the size of the image database, a rigid clustering criterion like k-NN may not guarantee the inclusion of the correct template in the subset of shapes on which the search is performed. The use of flexible clustering criteria is likely to provide more robust results than the use of fixed criteria or thresholds.

5.2. Recognition of deformable shapes

A collection of 15 shapes was used in this experiment. Amongst these shapes were hand-drawn outlines of industrial tools and objects, sketches of natural objects,
Fig. 6. Examples of noisy shapes used in the recognition experiment.
Table 2
Average degree of morphing values (templates 1-8)

Test shape      1         2         3         4         5         6         7         8
1            257.87    748.86    598.12    672.65    710.90    862.61    648.74    510.66
2            955.50    123.77    413.30    705.64    311.31    789.61    576.60    487.28
3            705.86    458.35    218.21    691.19    418.22    810.12    535.62    457.82
4            512.91    561.29    656.64     64.45    302.67   1296.70    125.87    605.22
5            612.82    272.64    414.44    341.25     91.00   1038.10    360.85    448.81
6            997.98    788.57    651.96   1269.90    966.67    249.26   1079.30    804.86
7            579.06    432.03    513.92    207.46    325.58   1044.50     90.73    536.59
8            560.71    392.93    446.17    725.53    411.54    717.71    624.52    146.35
9            454.38    517.94    556.84    119.63    353.85   1169.20    155.61    523.22
10           637.74    622.64    658.76    193.19    474.10   1282.70    199.74    558.12
11           705.45    233.01    325.90    639.79    210.34    767.89    478.65    472.54
12           919.27    265.60    331.16    813.49    390.15    710.22    641.96    543.37
13           761.71    473.08    439.39   1034.40    592.99    534.53    838.42    495.18
14           728.36    347.26    444.69    394.56    377.66    986.59    323.42    412.02
15           815.73    355.02    602.02    404.27    581.37    918.63    278.25    380.51
16           771.66    210.32    324.55    689.60    270.57    864.95    530.77    396.49
and contours from medical images. A database of 60 sample sketches (four per template) was collected from a group of users. The aim of the experiment was to test the ability of the system to retrieve objects having the same visual form as the sketch, irrespective of the shape variations introduced during drawing. The users were shown each template prior to the collection of the respective test samples. However, during the actual drawing phase no references to the template were allowed. This step curtailed excessive drawing variations while ensuring that the recognition problem was not trivialized. No other constraints were imposed on the drawings. Some examples of hand-drawn sketches used as templates (in black borders), and corresponding sample sketches used as test shapes, are shown in Fig. 7. The recognition results for the shapes used in this experiment are reported in Table 5. For most of the test shapes, the proposed system exhibited good tolerance to shape variations. In Fig. 8, we present an example of content-based retrieval based on user sketches. The template is an industrial object and the query image is its hand-drawn sketch. In this particular example, the query image differed from the template in terms of translations, rotations, and deformations. In the figure, the intermediate images of the morph show the rectification of rotations and deformations.
Table 3
Average degree of morphing values (templates 9-16)

Test shape      9        10        11        12        13        14        15        16
1            585.11    686.62    671.31    791.66    744.59    687.49    872.89    753.69
2            677.40    745.02    247.05    231.22    517.49    428.31    443.33    208.07
3            621.80    663.35    301.18    290.54    395.78    485.72    617.66    331.91
4            148.10    215.85    531.18    755.18    988.87    342.45    378.89    609.38
5            375.89    475.28    199.65    381.70    586.18    338.07    568.63    275.92
6           1137.70   1178.50    771.63    628.56    467.02    938.76    924.05    776.21
7            205.02    253.35    422.09    604.21    768.88    288.72    341.82    415.04
8            636.38    636.61    344.73    437.92    399.27    477.11    442.34    353.25
9             39.08    101.14    467.97    761.95    892.65    281.38    410.54    546.85
10           129.43     64.38    589.40    852.05    952.88    257.22    358.76    645.72
11           573.85    631.73    137.77    258.50    405.93    407.27    527.82    197.93
12           815.06    868.29    282.65     92.01    373.43    549.18    569.08    247.04
13           908.83    969.32    420.80    364.95    180.63    687.73    614.99    504.04
14           331.88    240.05    413.58    566.94    674.91     70.94    280.58    420.11
15           425.85    407.73    501.38    553.58    592.91    331.76     72.19    413.18
16           642.01    697.38    223.61    244.67    438.63    379.38    502.87    109.32
Table 4
Performance of the search pruning strategy

Test shape   D_morph(S^I, S_P^T)   Clustering rule   Template subset   Shape identity   Number of comparisons   Correct recognition
1            1465.28               3-NN              {1, 6, 13}        1                2                       Yes
2             499.55               3-NN              {5, 2, 16}        2                3                       Yes
3             927.31               3-NN              {3, 8, 10}        3                3                       Yes
4             781.96               3-NN              {4, 7, 9}         4                3                       Yes
5             485.80               3-NN              {5, 2, 16}        5                2                       Yes
6            1608.85               3-NN              {6, 1, 13}        6                2                       Yes
7             785.70               3-NN              {4, 9, 7}         7                3                       Yes
8             956.54               3-NN              {3, 8, 10}        8                3                       Yes
9             800.26               3-NN              {9, 4, 7}         9                3                       Yes
10            850.76               3-NN              {10, 9, 4}        10               1                       Yes
11            616.07               3-NN              {11, 16, 14}      11               3                       Yes
12            730.78               3-NN              {15, 12, 7}       12               3                       Yes
13           1106.57               3-NN              {13, 8, 3}        13               3                       Yes
14            680.74               3-NN              {14, 15, 12}      14               1                       Yes
15            731.17               3-NN              {15, 12, 7}       15               3                       Yes
16            566.52               3-NN              {16, 11, 5}       16               2                       Yes
Average number of comparisons                                                           2.50
5.3. Recognition of open shapes

In this experiment, the method was used for the recognition of unconstrained cursive handwriting in a user-dependent, on-line setting. Four users participated in the experiment. A vocabulary of ten randomly selected words was used (see Table 6). Each user provided one template sample and four test samples for each word. The data was collected using a WACOM UD-0608R tablet with an inking stylus. Table 7 contains the recognition results for each of the users.

Examples of correct morphing for the words adventure and guard, both taken from user 4, are shown in Fig. 9. In the case of the word guard, the reader may note the poor segmentation of the descender stroke of the letter g. The resultant loss of information in this case is similar to instances where parts of a shape are occluded. Correct recognition in the given example was facilitated by the
Fig. 7. Examples of deformable shapes.
Table 5
Experimental results

Templates   Test shapes   Correct classifications   Misclassifications   Success rate
15          60            56                        4                    93.33%

Table 6
List of words used in the handwriting recognition experiment

adventure   banana     bookshelf   cannon   flywheel
guard       landmark   mad         peach    tooth

Table 7
Test results for cursive words

Users   Reference words   Test words   Correctly recognized   Recognition rate
1       10                40           40                     100.0%
2       10                40           36                     90.0%
3       10                40           37                     92.5%
4       10                40           39                     97.5%
features detected in the rest of the word, as well as by the ability of the recognition method to handle deformations. It is worthwhile to contrast this case with the one in Fig. 10, where a misrecognition occurred. The test sample in this case was written at high speed and with insufficient pressure on the tablet. The curve describing this word therefore consisted of a small number of unevenly distributed points. This caused the segmentation algorithm to perform poorly in terms of capturing important feature points. Consequently, the cost of the morph to the template cannon was lower than that to the correct template (banana), which was the second lowest. The results presented in this section are conceptually similar to the more detailed experiments described in a previous paper [34]. The basic differences lie in the use of a different dissimilarity measure, a different formulation of the shape-morphing process, and in that the preprocessing steps of rotation and slant normalization were omitted in the present case. While the proposed method is invariant to rotations, changes in the slant of letters within a word were treated as deformations during the recognition process.
6. Conclusions and future work

In this paper, we have described a technique for comparing shape similarity based on quantifying the morph of one shape to another. Each shape is represented by a polygonal approximation of its contour. Shape morphing occurs by the stretching and bending of the contours. Quantification of the morph is obtained by computing the incremental energy spent in deforming one shape into the other, as given by a physics-based model. The recognition methodology uses a pruning scheme, based on the mathematical properties of the morph, to significantly reduce the number of on-line shape comparisons during the recognition phase.
Fig. 8. Content-based retrieval by morphing a user sketch to the correct database shape.
Fig. 9. Morphs of the words adventure and guard to their respective templates kept in the database (user 4). The input words are shown with the segmentation points superimposed.
Fig. 10. Morph of the word banana to the template for the word cannon (user 3), which erroneously produced the best match. The input word is shown with the segmentation points superimposed.
The proposed approach has been applied to the recognition of both rigid and deformable shapes in different application domains. The salient properties and advantages of the proposed method include:
1. Invariance of the method to translations, rotations, and scale changes.
2. The method has the properties of a metric.
3. Its applicability to convex as well as non-convex shapes.
4. It can be used for the recognition of open shapes like letters and cursive words.
5. The method has low computational complexity and is intuitive.

Due to its ability to handle deformations, the proposed recognition paradigm can be used in retrieval-by-content systems or in applications like pen-based computing, where on-line recognition of deformable shapes is required. Another significant attribute of the proposed method is that the intermediate images in the morph describe the shape and pose transformation needed to align the input and the target images. The transformations in the morph plane are relative to a static virtual camera. They can, however, be interpreted in terms of a mobile real camera. In such a case the pose transformation described by the virtual images can be used to control the camera motion so that the real views of the object being servoed correspond to the virtual views generated by the morph. Based on this idea, we have obtained promising results in controlling robotic interactions like positioning and grasping by using image morphing [25,26].

An important constraint on the performance of the system is its dependence on the segmentation algorithm. In particular, we have observed that segmentation algorithms that lead to inconsistent point placement may degrade the recognition performance even when the error during shape segmentation is small. We have looked at two segmentation strategies that have reasonable performance in terms of consistent point placement. Better segmentation strategies, in the above sense, will improve the recognition performance. Since the proposed approach is based on matching contour descriptions, it performs poorly for shapes where the primary difference in appearance is due to internal topological features or other attributes like texture or color. Furthermore, since contour description is inherently sensitive to illumination and/or shadowing, the performance of the proposed technique will degrade in the presence of shadows and poorly controlled illumination.

In the current experiments, a single arbitrarily selected image of each object was used as a template. Improvements in the recognition performance can be expected either by optimizing the choice of the template or by using multiple templates for each shape class. Extending the present framework to incorporate image attributes like color and texture, as well as recognition of 3D objects by morphing 3D shapes, are other possible directions for future work.

7. Summary
A novel method based on shape morphing is proposed for 2D shape recognition. In this framework, the shape of objects is described by using their contours. Shape recognition involves a morph between the contours of the objects being compared. The morph is quantified by using a physics-based formulation. This quantification serves as a dissimilarity measure to find the reference shape most similar to the input. The proposed dissimilarity measure is shown to have the properties of a metric as well as invariance to Euclidean transformations. The recognition paradigm is applicable to both convex and non-convex shapes. Moreover, the applicability of the method is not constrained to closed shapes. Based on the metric properties of the dissimilarity measure, a search strategy is described that obviates an exhaustive search of the template database during recognition experiments. Experimental results on the recognition of various types of shapes are presented.

Acknowledgements

Ioannis Pavlidis participated in the initial phase of this research. The proof of the metric properties has benefited from numerous discussions with Soumyendu Raha. The critique of Richard Voyles was instrumental in developing the segmentation algorithm used for rigid shapes. The presentation of this paper has also improved due to the comments provided by the anonymous reviewer. The authors wish to express their gratitude to each of the aforementioned. The research of Rahul Singh on this project was supported by the National Science Foundation through grants CIRI-9410003 and CIRI-9502245.

Appendix A. Proof of the metric property

To prove the metric properties of the dissimilarity measure we consider the stretching energy and the bending energy separately. We have, from the definition of the stretching energy (see Eq. (2)),

E_s = |W_s| = k_s |(L_I − L_O) − (L_T − L_O)| / [(1 − c_s) min(L_O, …, L_I, L_T) + c_s max(L_O, …, L_I, L_T)],   (A.1)

where L_I and L_T denote the segment lengths in the input and the target shapes, L_O the original (undeformed) length, and k_s and c_s the constants of the stretching model.
By assuming a hierarchy of shape complexity, as discussed in this paper and the related references, the shape recognition problem can be solved without an exhaustive search of the shape database. The basic idea lies in considering shapes in the database to be derived, by stretching and bending, from a primordial shape, like a point or a line. Given a query shape, the measure of its dissimilarity from the primordial shape can be used to identify a subset of templates similar to the input. Since the original length L_O and the original angle θ_O are defined to be the length and angle prior to any deformations, they are essentially parameters of the primordial shape. Conceptually, the primordial shape should lie at the origin of the shape hierarchy. Its parameters L_O and θ_O are therefore defined as

L_O ≜ min(L_1, L_2, …, L_k),   (A.2)

θ_O ≜ min(θ_1, θ_2, …, θ_k),   (A.3)

where L_i and θ_i refer to the length and the angle after the i-th deformation, respectively. We base the proof of the metric property on the above construction.

Consider the case of the stretching energy. Let the original length be denoted by L_O. The lengths after deformation (stretching) at the stages i, j, and k are denoted by L_i, L_j, and L_k, respectively. The stretching energy spent in compressing or expanding a wire of length L_i to the length L_j is represented as E_s(i, j). From the formula of the stretching energy (see Eq. (2)), it may be verified trivially that

E_s(i, j) = |W_s(i, j)| ≥ 0   (A.4)

and

E_s(i, j) = E_s(j, i).   (A.5)

To prove the triangle inequality, we consider the following three types of length changes: monotonic decrease, monotonic increase, and non-monotonic length change.

1. Monotonic decrease: Let the lengths of the virtual wire at stages i, j, and k be L_i, L_j, and L_k, respectively. Furthermore, L_i ≥ L_j ≥ L_k (monotone decrease in length). To prove the triangle inequality we need to show that

E_s(i, j) + E_s(j, k) ≥ E_s(i, k).   (A.6)

Expanding each of the above terms by using the formula for the stretching energy, the inequality to be proved becomes

k_s |(L_i − L_O) − (L_j − L_O)| / [(1 − c_s)L_O + c_s L_i]
  + k_s |(L_k − L_O) − (L_j − L_O)| / [(1 − c_s)L_O + c_s L_i]
  ≥ k_s |(L_k − L_O) − (L_i − L_O)| / [(1 − c_s)L_O + c_s L_i].

The above inequality can be simplified to the following form:

[(L_i − L_O) − (L_j − L_O)] / [(1 − c_s)L_O + c_s L_i] + [(L_j − L_O) − (L_k − L_O)] / [(1 − c_s)L_O + c_s L_i]
  ≥ [(L_i − L_O) − (L_k − L_O)] / [(1 − c_s)L_O + c_s L_i].   (A.7)

The right-hand side of the above inequality may be rewritten as

[(L_i − L_O) − (L_k − L_O)] / [(1 − c_s)L_O + c_s L_i]
  ≡ [(L_i − L_O) − (L_j − L_O)] / [(1 − c_s)L_O + c_s L_i] + [(L_j − L_O) − (L_k − L_O)] / [(1 − c_s)L_O + c_s L_i].   (A.8)

By substituting the above in the RHS of (A.7), we obtain an equality, and thus the proof.

2. Monotonic increase: Let the lengths of the virtual wire at stages i, j, and k be L_i, L_j, and L_k, respectively. Then, in this case, L_i ≤ L_j ≤ L_k. To prove the triangle inequality we need to show that

E_s(i, j) + E_s(j, k) ≥ E_s(i, k).   (A.9)

Expanding the above by using the formula for the stretching energy and simplifying for the absolute values, we get

k_s [(L_j − L_O) − (L_i − L_O)] / [(1 − c_s)L_O + c_s L_j] + k_s [(L_k − L_O) − (L_j − L_O)] / [(1 − c_s)L_O + c_s L_k]
  ≥ k_s [(L_k − L_O) − (L_i − L_O)] / [(1 − c_s)L_O + c_s L_k].   (A.10)

Rewriting the right-hand side of the above inequality, we get

k_s [(L_k − L_O) − (L_i − L_O)] / [(1 − c_s)L_O + c_s L_k]
  ≡ k_s [(L_k − L_O) − (L_j − L_O)] / [(1 − c_s)L_O + c_s L_k] + k_s [(L_j − L_O) − (L_i − L_O)] / [(1 − c_s)L_O + c_s L_k].   (A.11)

Substituting the above on the right-hand side of inequality (A.10), we have

k_s [(L_j − L_O) − (L_i − L_O)] / [(1 − c_s)L_O + c_s L_j] + k_s [(L_k − L_O) − (L_j − L_O)] / [(1 − c_s)L_O + c_s L_k]
  ≥ k_s [(L_k − L_O) − (L_j − L_O)] / [(1 − c_s)L_O + c_s L_k] + k_s [(L_j − L_O) − (L_i − L_O)] / [(1 − c_s)L_O + c_s L_k].   (A.12)

The above inequality is valid iff

[(L_j − L_O) − (L_i − L_O)] / [(1 − c_s)L_O + c_s L_j] ≥ [(L_j − L_O) − (L_i − L_O)] / [(1 − c_s)L_O + c_s L_k].   (A.13)

Clearly, the above is true because L_j ≤ L_k. □

3. Non-monotonic length changes: Consider the case where the lengths of the virtual wire at stages i, j, and k are L_i, L_j, and L_k, respectively. Furthermore, let L_k ≤ L_i ≤ L_j. Rewriting the triangle inequality in terms of the stretching energy, we obtain the following inequality, which needs to be proved:

k_s |(L_j − L_O) − (L_i − L_O)| / [(1 − c_s)L_O + c_s L_j] + k_s |(L_k − L_O) − (L_j − L_O)| / [(1 − c_s)L_O + c_s L_j]
  ≥ k_s |(L_k − L_O) − (L_i − L_O)| / [(1 − c_s)L_O + c_s L_i].   (A.14)

For the term on the RHS of the above inequality, we have

k_s |(L_k − L_O) − (L_i − L_O)| / [(1 − c_s)L_O + c_s L_i]
  ≤ k_s |(L_k − L_O) − (L_j − L_O)| / [(1 − c_s)L_O + c_s L_i] + k_s |(L_j − L_O) − (L_i − L_O)| / [(1 − c_s)L_O + c_s L_i].   (A.15)

Substituting the above in the RHS of inequality (A.14), we obtain the following inequality, the validity of which we need to prove:

k_s |(L_j − L_O) − (L_i − L_O)| / [(1 − c_s)L_O + c_s L_j] + k_s |(L_k − L_O) − (L_j − L_O)| / [(1 − c_s)L_O + c_s L_j]
  ≥ k_s |(L_k − L_O) − (L_j − L_O)| / [(1 − c_s)L_O + c_s L_i] + k_s |(L_j − L_O) − (L_i − L_O)| / [(1 − c_s)L_O + c_s L_i].   (A.16)

The above inequality, after simplification for the absolute values and regrouping of terms, reduces to

[(L_j − L_O) − (L_i − L_O)] / [(1 − c_s)L_O + c_s L_j] − [(L_j − L_O) − (L_i − L_O)] / [(1 − c_s)L_O + c_s L_i]
  ≥ [(L_j − L_O) − (L_k − L_O)] / [(1 − c_s)L_O + c_s L_i] − [(L_j − L_O) − (L_k − L_O)] / [(1 − c_s)L_O + c_s L_j].   (A.17)

The above may be further simplified as

[(L_j − L_O) − (L_i − L_O)] [((1 − c_s)L_O + c_s L_i) − ((1 − c_s)L_O + c_s L_j)] / {[(1 − c_s)L_O + c_s L_j][(1 − c_s)L_O + c_s L_i]}
  ≥ [(L_j − L_O) − (L_k − L_O)] [((1 − c_s)L_O + c_s L_i) − ((1 − c_s)L_O + c_s L_j)] / {[(1 − c_s)L_O + c_s L_j][(1 − c_s)L_O + c_s L_i]},   (A.18)

from where it follows that

c_s (L_i − L_j) [(L_j − L_O) − (L_i − L_O)] / {[(1 − c_s)L_O + c_s L_j][(1 − c_s)L_O + c_s L_i]}
  ≥ c_s (L_i − L_j) [(L_j − L_O) − (L_k − L_O)] / {[(1 − c_s)L_O + c_s L_j][(1 − c_s)L_O + c_s L_i]}.   (A.19)

For the above to hold true, we must have

(L_j − L_O) − (L_i − L_O) ≥ (L_k − L_O) − (L_j − L_O),   (A.20)

or, equivalently,

(L_j − L_O) ≥ [(L_k − L_O) + (L_i − L_O)] / 2.   (A.21)

To prove the above, we note that from the initial conditions of the non-monotonic length changes we have L_i ≥ L_k; therefore,

(L_k − L_O) ≤ (L_i − L_O).   (A.22)

Substituting (L_i − L_O) for (L_k − L_O) on the RHS of inequality (A.21), we get (L_j − L_O) ≥ 2(L_i − L_O)/2. Clearly, this inequality is true because L_i ≤ L_j, hence the proof. For the other cases of non-monotonic changes, the triangle inequality may be proved likewise.

The proof of the metric property for the bending energy is based on similar ideas. In the following, we denote the angle before any deformations by θ_O and the angles after bending at the instances i, j, and k by θ_i, θ_j, and θ_k, respectively. The bending energy spent in changing the angle θ_i to θ_j is denoted by E_b(θ_i, θ_j). From the formulation of the bending energy in Eq. (3), it is straightforward to note that

1. E_b(θ_i, θ_j) ≥ 0,
2. E_b(θ_i, θ_j) = E_b(θ_j, θ_i).

For the proof of the triangle inequality, we proceed along lines similar to those followed for the stretching energy, by considering angle changes that are monotonic and non-monotonic. For monotonic changes, we present the proofs for both decreasing and increasing angular changes. For non-monotonic angular changes, we present the proofs for two cases; the other cases may be proved analogously.

1. Monotonically decreasing angle changes: Let the angles at the instances i, j, and k be θ_i, θ_j, and θ_k, respectively. Further, let θ_i ≥ θ_j ≥ θ_k. For the triangle inequality we need to prove that

E_b(θ_i, θ_j) + E_b(θ_j, θ_k) ≥ E_b(θ_i, θ_k)   (A.23)

⇒ k_b |(θ_j − θ_O) − (θ_i − θ_O)| + k_b |(θ_k − θ_O) − (θ_j − θ_O)| ≥ k_b |(θ_k − θ_O) − (θ_i − θ_O)|.   (A.24)

Simplifying, we have

(θ_i − θ_O) − (θ_j − θ_O) + (θ_j − θ_O) − (θ_k − θ_O) ≥ (θ_i − θ_O) − (θ_k − θ_O).   (A.25)

The above gives us an equality, thus proving the validity of the triangle inequality.

2. Monotonically increasing angle changes: For this case we consider the angle changes to have the following relationship: θ_i ≤ θ_j ≤ θ_k. We thus have, for the triangle inequality,

k_b |(θ_j − θ_O) − (θ_i − θ_O)| + k_b |(θ_k − θ_O) − (θ_j − θ_O)| ≥ k_b |(θ_k − θ_O) − (θ_i − θ_O)|.   (A.26)

Taking into consideration the relationships between the angles and accounting for the absolute values, we get

(θ_j − θ_O) − (θ_i − θ_O) + (θ_k − θ_O) − (θ_j − θ_O) ≥ (θ_k − θ_O) − (θ_i − θ_O).   (A.27)

On simplification, we get an equality. □

3. Non-monotonic angle changes:

○ Let θ_i ≤ θ_k ≤ θ_j. The triangle inequality

E_b(θ_i, θ_j) + E_b(θ_j, θ_k) ≥ E_b(θ_i, θ_k)   (A.28)

may then be expanded as

(θ_j − θ_O) − (θ_i − θ_O) + (θ_j − θ_O) − (θ_k − θ_O) ≥ (θ_k − θ_O) − (θ_i − θ_O).   (A.29)

This inequality then simplifies to

(θ_j − θ_O) ≥ (θ_k − θ_O).   (A.30)

Since θ_k ≤ θ_j, the above inequality holds.

○ Consider the case where θ_k ≤ θ_i ≤ θ_j. The triangle inequality then takes the following form:

(θ_j − θ_O) − (θ_i − θ_O) + (θ_j − θ_O) − (θ_k − θ_O) ≥ (θ_i − θ_O) − (θ_k − θ_O).   (A.31)

Simplifying, we have

(θ_j − θ_O) ≥ (θ_i − θ_O),   (A.32)

which is true, since θ_j ≥ θ_i. □
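As a quick numerical illustration of the triangle inequality just proved (our own addition, not part of the original proof), the following sketch evaluates a stretching energy of the form in Eq. (A.1) for random length triples; for simplicity it normalizes all three energies by the same maximal length along the morph path, the case treated in (A.7). The constants k_s and c_s are arbitrary positive values.

```python
import itertools
import random

K_S, C_S = 1.0, 0.5   # assumed positive model constants

def e_s(L_a, L_b, L_O, L_max):
    """Stretching energy between lengths L_a and L_b, normalized by a
    weighted combination of the original and the maximal length."""
    num = abs((L_a - L_O) - (L_b - L_O))
    den = (1 - C_S) * L_O + C_S * L_max
    return K_S * num / den

random.seed(0)
violations = 0
for _ in range(10_000):
    L = sorted(random.uniform(1, 10) for _ in range(3))
    L_O = L[0]                       # original length = minimum, Eq. (A.2)
    L_max = max(L)                   # maximal length along the morph path
    for Li, Lj, Lk in itertools.permutations(L):
        lhs = e_s(Li, Lj, L_O, L_max) + e_s(Lj, Lk, L_O, L_max)
        rhs = e_s(Li, Lk, L_O, L_max)
        violations += lhs + 1e-12 < rhs
print("triangle-inequality violations:", violations)   # expect 0
```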
References
[1] R.L. Anderson, Real-time gray-scale video processing using a moment generating chip, IEEE J. Robotics Automat. 1 (1985) 70-85.
[2] C.C. Lin, R. Chellapa, Classification of partial 2D shapes using Fourier descriptors, IEEE Trans. Pattern Anal. Mach. Intell. 9 (1987) 686-690.
[3] R.L. Kashyap, R. Chellapa, Stochastic models for closed boundary analysis: representation and reconstruction, IEEE Trans. Inform. Theory 27 (5) (1981) 627-637.
[4] A. Pentland, R.W. Picard, S. Sclaroff, Photobook: content-based manipulation of image databases, Int. J. Comput. Vision 18 (3) (1996) 233-254.
[5] M.J. Swain, D.H. Ballard, Color indexing, Int. J. Comput. Vision 7 (1) (1991) 11-32.
[6] T.F. Syeda-Mahmood, Data and model-driven selection using color regions, Int. J. Comput. Vision 21 (1/2) (1997) 9-36.
[7] R.C. Bolles, R.A. Cain, Recognizing and locating partially visible objects: the local-feature focus method, Int. J. Robotics Res. 1 (1982) 57-82.
[8] S. Sclaroff, A.P. Pentland, Modal matching for correspondence and recognition, IEEE Trans. Pattern Anal. Mach. Intell. 17 (6) (1995) 545-561.
[9] H. Murase, S.K. Nayar, Visual learning and recognition of 3-D objects from appearance, Int. J. Comput. Vision 14 (1995) 5-24.
[10] M. Brady, H. Asada, Smoothed local symmetries and their implementation, Int. J. Robotics Res. 3 (1984) 36-61.
[11] T. Phillips, A shrinking technique for complex object decomposition, Pattern Recognition Lett. 3 (1985) 271-277.
[12] J. Chen, J.A. Ventura, Optimization models for shape matching of nonconvex polygons, Pattern Recognition 28 (6) (1995) 863-877.
[13] L. Huang, M.J. Wang, Efficient shape matching through model-based shape recognition, Pattern Recognition 29 (2) (1996) 207-215.
[14] I. Tchoukanov, R. Safaee-Rad, B. Benhabib, K.C. Smith, A new boundary-based shape recognition technique, in: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 1992, pp. 1030-1037.
[15] P. Cox, H. Maitre, M. Minoux, C. Ribeiro, Optimal matching of convex polygons, Pattern Recognition Lett. 9 (1989) 327-334.
[16] P.J. van Otterloo, A Contour-Oriented Approach to Shape Analysis, Prentice-Hall, Hemel Hempstead, 1991.
[17] E.M. Arkin, L.P. Chew, D.P. Huttenlocher, K. Kedem, J.S.B. Mitchell, An efficiently computable metric for comparing polygonal shapes, IEEE Trans. Pattern Anal. Mach. Intell. 13 (3) (1991) 209-216.
[18] W. Rucklidge, Efficient Visual Recognition Using the Hausdorff Distance, Springer, Berlin, 1996.
[19] R. Azencott, F. Coldefy, L. Younes, A distance for elastic matching in object recognition, in: Proceedings of the 13th International Conference on Pattern Recognition, Vol. 1, 1996, pp. 687-691.
[20] A.D. Bimbo, P. Pala, Visual image retrieval by elastic matching of user sketches, IEEE Trans. Pattern Anal. Mach. Intell. 19 (2) (1997) 121-132.
[21] S.E. Sclaroff, Modal matching: a method for describing, comparing, and manipulating digital signals, Ph.D. Thesis, School of Architecture and Planning, Massachusetts Institute of Technology, 1995.
[22] F.L. Bookstein, Principal warps: thin-plate splines and the decomposition of deformations, IEEE Trans. Pattern Anal. Mach. Intell. 11 (6) (1989) 567-585.
[23] A. Yuille, P. Hallinan, Deformable templates, in: A. Blake, A. Yuille (Eds.), Active Vision, MIT Press, Cambridge, MA, 1992, pp. 21-38.
[24] K. Hirata, T. Kato, Query by visual example, content-based image retrieval, in: A. Pirotte, C. Delobel, G. Gottlob (Eds.), Advances in Database Technology-EDBT '92, Springer, Berlin, 1992.
[25] R. Singh, R.M. Voyles, D. Littau, N.P. Papanikolopoulos, Grasping real objects using virtual images, in: Proceedings of the IEEE Conference on Decision and Control, 1998.
[26] R. Singh, R.M. Voyles, D. Littau, N.P. Papanikolopoulos, Pose alignment of an eye-in-hand system using image morphing, in: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 1998.
[27] H. Freeman, L. Davis, A corner-finding algorithm for chain-coded curves, IEEE Trans. Comput. 26 (1977) 297-303.
[28] T. Pavlidis, S.T. Horowitz, Segmentation of plane curves, IEEE Trans. Comput. 23 (1974) 860-870.
[29] B.K. Ray, K.S. Ray, Determination of optimal polygon from digital curve using L1 norm, Pattern Recognition 26 (4) (1993) 505-509.
[30] J.J. Brault, R. Plamondon, Segmenting handwritten signatures at their perceptually important points, IEEE Trans. Pattern Anal. Mach. Intell. 15 (9) (1993) 953-957.
[31] M.A. Fischler, R.C. Bolles, Perceptual organization and curve partitioning, IEEE Trans. Pattern Anal. Mach. Intell. 8 (1) (1986) 100-105.
[32] B.K. Ray, K.S. Ray, An algorithm for detection of dominant points and polygonal approximation of digitized curves, Pattern Recognition Lett. 13 (12) (1992) 849-856.
[33] P. Zhu, P.M. Chirlian, On critical point detection of digital shapes, IEEE Trans. Pattern Anal. Mach. Intell. 17 (8) (1995) 737-748.
[34] I. Pavlidis, R. Singh, N.P. Papanikolopoulos, On-line handwriting recognition using physics-based shape metamorphosis, Pattern Recognition 31 (11) (1998) 1589-1600.
[35] I. Pavlidis, N.P. Papanikolopoulos, A curve segmentation algorithm that automates deformable-model based target tracking, Technical Report TR 96-041, University of Minnesota, 1996.
[36] T.W. Sederberg, E. Greenwood, A physically based approach to 2D shape blending, Comput. Graphics 26 (2) (1992) 25-34.
[37] J. Barros, J. French, W. Martin, P. Kelly, M. Cannon, Using the triangle inequality to reduce the number of comparisons required for similarity-based retrieval, in: SPIE, Storage and Retrieval for Still Images and Video Databases, Vol. 2670, 1996, pp. 392-403.
[38] W.A. Burkhard, R.M. Keller, Some approaches to best-match file searching, Commun. ACM 16 (4) (1973) 230-236.
[39] M. Shapiro, The choice of reference points in best-match file searching, Commun. ACM 20 (5) (1977) 339-343.
About the Author - RAHUL SINGH received his Master of Science in Engineering degree in Computer Science (with excellence) from the Moscow Power Engineering Institute in 1993, the M.S. in Computer Science from the University of Minnesota in 1997, and the Ph.D. in Computer Science from the University of Minnesota in 1999. Currently he is a scientist at Exelixis Inc. in San Francisco, where he works on molecular shape recognition and its applications in the computational prediction of pharmacologically relevant molecular properties. In addition to the above areas, his research interests include computer vision, image morphing, document image analysis, and applications of virtual reality in vision-based robotics.
About the Author - NIKOLAOS P. PAPANIKOLOPOULOS (S'88-M'93) was born in Piraeus, Greece, in 1964. He received the Diploma degree in Electrical and Computer Engineering from the National Technical University of Athens, Athens, Greece, in 1987, the M.S.E.E. in Electrical Engineering from Carnegie Mellon University (CMU), Pittsburgh, PA, in 1988, and the Ph.D. in Electrical and Computer Engineering from Carnegie Mellon University, Pittsburgh, PA, in 1992. Currently, he is an Associate Professor in the Department of Computer Science at the University of Minnesota. His research interests include computer vision, pattern recognition, and robotics. He has authored or coauthored more than 90 journal and conference papers in the above areas. He was a finalist for the Anton Philips Award for Best Student Paper at the 1991 IEEE Robotics and Automation Conference. Furthermore, he was a recipient of the Kritski fellowship in 1986 and 1987. He was a McKnight Land-Grant Professor at the University of Minnesota for the period 1995-1997 and has received the NSF Research Initiation and Early Career Development Awards.
Pattern Recognition 33 (2000) 1701-1712
Fast face detection via morphology-based pre-processing夽 Chin-Chuan Han*, Hong-Yuan Mark Liao, Gwo-Jong Yu, Liang-Hua Chen Institute of Information Science, Academia Sinica, Nankang, Taipei, Taiwan Institute of Computer Science and Information Engineering, National Central University, Chung-Li, Taiwan Department of Computer Science and Information Engineering, Fu Jen University, Taiwan Received 5 November 1998
Abstract An efficient face detection algorithm which can detect multiple faces oriented in any direction in a cluttered environment is proposed. In this paper, a morphology-based technique is first devised to perform eye-analogue segmentation. Next, the previously located eye-analogue segments are used as guides to search for potential face regions. Then, each of these potential face images is normalized to a standard size and fed into a trained backpropagation neural network for identification. In this detection system, the morphology-based eye-analogue segmentation process is able to reduce the background part of a cluttered image by up to 95%. This process significantly speeds up the subsequent face detection procedure because only 5-10% of the regions of the original image remain for further processing. Experiments demonstrate that an approximately 94% success rate is reached, and that the relative false detection rate is very low. 2000 Published by Elsevier Science Ltd on behalf of Pattern Recognition Society. Keywords: Face detection; Backpropagation neural network; Morphological opening/closing operation
1. Introduction Human face detection and recognition have long been difficult research topics. In the last two decades, researchers have devoted much effort to these two problems and have obtained some satisfactory results. Some of these previous efforts were focused on face recognition [1-3]. Turk and Pentland [3] successfully employed the eigenface approach to recognize a human face. However, an accurate and efficient method for human face detection is still lacking. Govindaraju et al. [4,5] presented a system which could locate human faces in photographs of newspapers, but the approximate size
夽 This work was supported by the National Science Council of Taiwan under grant no. NSC86-2213-E-001-023. * Corresponding author. Present address: Applied Research Laboratory, Telecommunication Laboratories, 12, Lane 551, Min-Tsu Road Sec. 5, Yang-Mei, Taoyuan, Taiwan 326, ROC. Tel.: +886-3-4244186; fax: +886-3-4244167. E-mail address:
[email protected] (C.-C. Han).
and expected number of faces must be known in advance. Sirohey [6] utilized an elliptical structure to segment human heads from cluttered images. Yang and Huang [7] utilized a three-level hierarchical knowledge-based method to locate human faces in complex backgrounds. Sung and Poggio [8-10] applied two distance metrics to measure the distance between the input image and the cluster center. Twelve clusters, including six face and six non-face clusters, were trained using a modified k-means clustering technique. A feature vector consisting of 12 values was input into a multi-layer perceptron network for the verification task. Pentland et al. [11-16] applied principal component analysis (PCA) to describe face patterns with lower-dimensional features. They designed a distance function called the distance from feature space (dffs) as a metric to evaluate the difference between the input face image and the reconstructed face image. The system can be considered a special case of Sung and Poggio's system. Face detection can also be achieved by detecting geometrical relationships among facial components such as the nose, eyes, and mouth. Juell and Marsh [17] proposed a hierarchical neural network to detect human
0031-3203/00/$20.00 2000 Published by Elsevier Science Ltd on behalf of Pattern Recognition Society. PII: S 0 0 3 1 - 3 2 0 3 ( 9 9 ) 0 0 1 4 1 - 7
faces in gray scale images. An edge-enhancing pre-processor and four backpropagation neural networks arranged in a hierarchical structure were devised to find multiple faces in a scene. Leung et al. [18-20] combined a set of local feature detectors via a statistical model to find facial components for face locating. Their approach was invariant with respect to translation, rotation, and scale. In addition, they could also handle partial occlusion of faces. Instead of using gray scale images, Sobottka and Pitas [21] and Chen et al. [22-25] located the poses of human faces and facial features in color images. In Ref. [21], the oval shape of a face could be approximated by an ellipse in Hue-Saturation-Value (HSV) color space. Chen et al. [22-25] proposed a skin color distribution function on a perceptually uniform color space to detect face-like regions. The skin color regions in color images were modeled as several 2-D patterns and verified against a built face model by using a fuzzy pattern matching approach. Most of the above-mentioned systems limit themselves to dealing with human faces in frontal view. That is, the orientation problem, which is a potential source of trouble in this type of system, was not seriously considered. Furthermore, some previous approaches slid a small window (20×20) over all portions of an image at various scales. This brute-force search is, no doubt, a time-consuming procedure. Jeng et al. [26] used a geometrical face model to solve the above-mentioned problem. In their method, the geometrical face template was used to precisely locate faces in a cluttered image. The average execution time for locating a face using a Sparc-20 machine was less than 5 s. However, the major drawbacks of their approach are: (1) the size of a face must be larger than 80×80 and (2) the accuracy of using a geometrical face model may be significantly influenced by changes in the lighting conditions. In this paper, an efficient face detection system is proposed. The proposed system consists of three main steps. In the first step, a morphology-based technique is devised to perform eye-analogue segmentation. Morphological operations are applied to locate eye-analogue pixels in the original image. Then, a labeling process is executed to generate eye-analogue segments. In the second step, the previously located eye-analogue segments are used as guides to search for potential face regions. A face region is temporarily marked if a possible geometrical combination of eyes, nose, eyebrows, and mouth exists. Of course, this step is a relatively loose process because we do not intend to miss any candidate faces. Therefore, this step may result in numerous face candidates, including both true faces and false ones. The last step of the proposed system is to perform face verification. Face verification here includes identifying faces and locating their corresponding poses. In this step, every face candidate obtained from the previous step is normalized to a 20×20 image. Then, each of these
normalized potential face images is fed into a trained backpropagation neural network for identification. Among the identified true faces, it is possible that a face is simultaneously covered by multiple windows which are partially overlapped and oriented in various directions. Under these circumstances, the correct pose of the face is located by optimizing a cost function. The proposed face detection technique can locate multiple faces oriented in any direction. Furthermore, the morphology-based eye-analogue segmentation process is able to reduce the background part of a cluttered image by up to 95%. This process significantly speeds up the subsequent face detection procedure because only 5-10% of the regions of the original image remain for further processing. In the experiments, we used 122 cluttered images which contained a total of 130 faces to test the effectiveness of the proposed method. 122 faces (approximately 94%) were successfully located, and the false detection rate was very low. On average, our algorithm required 20 s to finish the detection task on a 512×340 test image. Therefore, the proposed approach is indeed an efficient technique for face detection. The rest of this paper is organized as follows. The extraction of eye-analogue regions using morphological operations is described in Section 2. In Section 3, eye pairs and normalized images are generated based on some geometric rules. A neural-network-based verifier and a dffs function are presented in Section 4 to verify the face regions. Some experimental results showing the validity of our detection system are given in Section 5. Finally, concluding remarks are given in Section 6.
2. Eye-analogue segmentation As mentioned in Refs. [26,27], the eyebrows, eyes, nostrils, and mouth always look darker than the rest of the face. Among these facial features, the degree of darkness of the eyebrows depends on their density and color, and that of the nostrils depends on how they are occluded. As for the shape of the mouth, it frequently varies in appearance and size due to facial expressions. In comparison with the above-mentioned unstable facial features, the eyes can be considered a salient and relatively stable feature of faces. Therefore, in this section we propose a morphology-based method to locate the eye-analogue segments as the first step of our system. The eye-analogue pixels in a cluttered image are segmented first. Then, the small segments of the eye-analogue regions are grouped together using conditional morphological dilation operations and labeled using a traditional labeling process. In 1993, Chow and Li [27] employed a morphological opening residue operation to extract the intensity valleys as potential eye-analogue pixels via a 5×5 circular structuring element. In our approach, we apply morphological
closing and clipped difference operations to find the candidate eye-analogue pixels. Let X be the original image and X^{1/2} be the image at half the original scale. A horizontal structuring element S_h of size 1×7 and a vertical structuring element S_v of size 7×1 are, respectively, operated on X and X^{1/2}. It is known that [27] eye-analogue pixels are located in intensity valleys whose sizes are smaller than a preset value. Therefore, by mixing closing (•), clipped difference (⊖), thresholding (T), and OR (∪) into different operations, the eye-analogue pixels in an image can be located. The set of operations which is able to identify eye-analogue pixels is expressed as follows:

E_1 = T_1(X ⊖ (X • S_h)),
E_2 = T_2(X^{1/2} ⊖ (X^{1/2} • S_h)),
E_3 = T_3(X ⊖ (X • S_v)),
E_4 = T_4(X^{1/2} ⊖ (X^{1/2} • S_v)),
E = E_1 ∪ E_2^2 ∪ E_3 ∪ E_4^2,   (1)
where the superscript 2 of E_2 and E_4 is used to enlarge them to twice their original size. T_1, T_2, T_3, and T_4 are four threshold functions whose values are the average values of the images E_1, E_2, E_3, and E_4, respectively. As mentioned in the previous paragraph, the eye-analogue pixels are located in the intensity valleys. However, when the scene is complex, some pixels of this kind may not be eye-analogue pixels. For instance, text characters on a notice board, frames of windows, edges among a set of text books, etc., are frequently segmented as eye-analogue pixels via the operations in Eq. (1). Basically, these pixels can be considered as noise in facial images. Therefore, we apply the conditional morphological dilation operation at this stage to remove these unwanted pixels from the background. Then, a conventional labeling process is performed to locate the eye-analogue segments. The eye-analogue detection algorithm is described in detail below.

Eye-analogue detection algorithm:

Step 1: Perform a labeling process on image E, and compute a set of geometrical data for each segment, including the lengths of the major and minor axes, the orientation, the center point, and the minimal bounding rectangle.

Step 2: If the length of the major axis of segment i is larger than 0.6N (where N is the width of the smallest face region which we can detect), terminate the conditional dilation operation for segment i, and eliminate segment i from image E. Otherwise, go to the next step.

Step 3: Perform a conditional dilation operation on every segment i using the structuring element
SE = {1_{(x,y)} : (x, y) ∈ Z²}, a set of unit entries at pixel offsets (x, y), where

1. if segment i is a horizontal segment, i.e., the orientation of segment i lies in [−π/8, π/8], the structuring element SE is assigned as the three-pixel horizontal set {1_{(−1,0)}, 1_{(0,0)}, 1_{(1,0)}};
2. if segment i is a left-slant segment, i.e., the orientation of segment i lies in (−π/8, −3π/8], choose the five-pixel left-diagonal set {1_{(−2,2)}, 1_{(−1,1)}, 1_{(0,0)}, 1_{(1,−1)}, 1_{(2,−2)}} as the structuring element SE;
3. if segment i is a right-slant segment, i.e., the orientation of segment i lies in (π/8, 3π/8], the structuring element SE is defined as the five-pixel right-diagonal set {1_{(−2,−2)}, 1_{(−1,−1)}, 1_{(0,0)}, 1_{(1,1)}, 1_{(2,2)}};
4. if segment i is a vertical segment, i.e., the orientation of segment i lies in (−3π/8, −4π/8] or (3π/8, 4π/8], choose the three-pixel vertical set {1_{(0,−1)}, 1_{(0,0)}, 1_{(0,1)}} as the structuring element SE.

Step 4: Repeat Steps 1-3 N/5 times. Here, N/5 is an approximate estimate of the nearest distance between two eyes in the smallest detected face regions.

An example demonstrating the eye-analogue segmentation process is shown in Figs. 1 and 2. Fig. 1(a) is a gray scale image. The profile signals along the dashed line in Fig. 1(a) are shown in Fig. 1(b). After the erosion, dilation, and closing operations, the resultant signals are displayed in Figs. 1(c), (d), and (e), respectively. Finally, the signals obtained by executing the clipped difference operation between the original signal (Fig. 1(a)) and the closing signal (Fig. 1(e)) are depicted in Fig. 1(f). In sum, after performing the operations shown in Eq. (1), the potential eye-analogue regions of Fig. 1(a) are those shown in Fig. 2(a). After performing the conditional dilation and a labeling process on the potential eye-analogue segments in Fig. 2(a), the located eye-analogue segments with bounding rectangles are those shown in Fig. 2(b). One thing to be noted is that the text characters on the notice board (Fig. 2(a)) merged together to form fewer segments. The main advantage of the two above-mentioned processes is that the small non-eye-analogue segments are combined, which reduces the number of potential eye-analogue segments.
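For concreteness, a small sketch of the valley-extraction stage in Eq. (1) follows, using SciPy's grayscale morphology. The specific choices made here (residue computed as closing minus original, thresholds T_i set to the mean residue, nearest-neighbor zoom for the scale changes) are our reading of the paper's operators rather than a verbatim implementation.

```python
import numpy as np
from scipy import ndimage

def eye_analogue_pixels(image):
    """Mark intensity-valley pixels via the residue of grayscale
    closing with 1x7 and 7x1 structuring elements, at the original
    and half scales, then OR the four masks (sketch of Eq. (1))."""
    def valley_mask(img, footprint):
        closed = ndimage.grey_closing(img, footprint=footprint)
        residue = closed - img           # closing residue: large in valleys
        return residue > residue.mean()  # T_i: threshold at average value

    x = image.astype(float)
    x_half = ndimage.zoom(x, 0.5)        # X^(1/2), the half-scale image
    s_h, s_v = np.ones((1, 7)), np.ones((7, 1))

    e1 = valley_mask(x, s_h)
    e3 = valley_mask(x, s_v)
    # E_2 and E_4 are computed at half scale, then enlarged back to the
    # original size (order 0 keeps the masks binary).
    zf = (x.shape[0] / x_half.shape[0], x.shape[1] / x_half.shape[1])
    e2 = ndimage.zoom(valley_mask(x_half, s_h).astype(np.uint8), zf, order=0) > 0
    e4 = ndimage.zoom(valley_mask(x_half, s_v).astype(np.uint8), zf, order=0) > 0
    return e1 | e2 | e3 | e4             # E = union of the four masks
```

The returned binary mask would then be fed to the labeling and conditional-dilation steps of the detection algorithm above.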
3. Finding potential eye pairs from eye-analogue segments

As stated in Section 2, the eyes are considered the primary features in a face image. After performing the segmentation process, each eye-analogue segment can be considered a candidate for one of the eyes. In this section, we propose four matching rules to guide the merging of potential eye-analogue segments into pairs. Then, based on these potential eye pairs, the potential poses of the face regions are determined.
Fig. 1. The morphological operation process. (a) The original image, (b) the profile signals along the dashed line in (a), (c) the signals after performing the erosion operation, (d) the signals after performing the dilation operation, (e) the signals after performing the closing operation, and (f) the signals after performing the closing and clipped difference operations.
Suppose that there are M eye-analogue segments in image E; at most M² potential combinations of face regions are generated via the matched eye pairs. In order to reduce the execution time, a geometrical constraint on a real eye pair is introduced to screen out impossible pairs. In Ref. [17], Juell and Marsh used a 19×11 window to find the location of the eyes. They assumed that the ratio between the width and the height of an eye is roughly 2. In what follows, a set of rules is proposed to merge the previously derived eye-analogue segments into potential eye pairs.

Matching rules

(a) The length ratio between the major and the minor axes of segment i must be smaller than 10. Here, the threshold value of the length ratio is set to 10; it should be larger than 2 to tolerate segmentation errors.
Fig. 2. The eye-analogue segmentation results. (a) The eye-analogue pixels and segments after the closing and clipped difference operations, (b) the eye-analogue segments after the conditional dilation and labeling processes.
(b) The distance between the center points of two eye-analogue segments must be larger than 0.6N. Here, the value 0.6 is a rough estimate of the ratio between the distance between the two pupils and the width of a face, and N is the shortest width of a face. This value can be derived from the training samples.
(c) Each eye must be located by extending a small range from the other eye, as shown in Fig. 3(a). Since two eyes can be located using their base line, they will be oriented co-linearly even if the face is rotated.
(d) The area of segment i should be larger than 0.01N² pixels. Generally, the ratio of the area of an eye pair to the whole face region is about 0.1. Here, the threshold value 0.01 is smaller than 0.1 to tolerate segmentation error.

Once the potential eye pairs are determined, their corresponding face regions can be easily extracted, and each of these faces is normalized to the 20×20 standard size. As shown in Fig. 3(b), (x_i, y_i) and (x_j, y_j) are the two center points of segments i and j, respectively, and (x_1, y_1), (x_2, y_2), (x_3, y_3), and (x_4, y_4) are the four corner points of a normalized face region. Let x_i + x_j = A, x_i − x_j = B, y_i − y_j = C, and y_i + y_j = A'. The coordinates of the four corner points can then be calculated as

x_1 = ½A + (c_2/c_1)B + (c_3/c_1)C,   y_1 = ½A' + (c_2/c_1)C − (c_3/c_1)B,
x_2 = ½A − (c_2/c_1)B + (c_3/c_1)C,   y_2 = ½A' − (c_2/c_1)C − (c_3/c_1)B,
x_3 = ½A + (c_2/c_1)B − (c_4/c_1)C,   y_3 = ½A' + (c_2/c_1)C + (c_4/c_1)B,
x_4 = ½A − (c_2/c_1)B − (c_4/c_1)C,   y_4 = ½A' − (c_2/c_1)C + (c_4/c_1)B,   (2)

where c_1 is the average distance between the two pupils of the eyes and c_2 is one half of the width of a face region. Furthermore, c_3 and c_4 are, respectively, the distances from the base line of the two eyes to the top and the bottom boundaries of a face region (see Fig. 3(b)). From the training samples used in the experiments, c_1, c_2, c_3, and c_4 are, respectively, 12.5, 10, 4, and 16 in a normalized 20×20 face image. Based on these geometrical relations, a face region can be easily rotated and normalized.

In Fig. 3(c), the potential eye pairs which satisfy the matching rules are linked by solid line segments. According to Eq. (2), the potential face regions which cover the potential eye pairs in Fig. 3(c) are extracted and normalized to the 20×20 standard size. These normalized potential face regions are shown in Fig. 3(d).
4. Face verification

In the previous section, we selected a set of potential face regions in an image. These potential face regions are allowed to have different sizes and orientations. However, after the normalization process, all potential faces are normalized to a fixed size, i.e., 20×20, and rotated into frontal position. In this section, we propose a coarse-to-fine face verification process to locate the real positions of faces in an image.
Fig. 3. The extraction of the face region. (a) Matching rule (c), (b) the geometrical relationship of the face region, (c) the potential eye pairs of Fig. 1(c), (d) the potential face regions.
In the coarse verification, a trained backpropagation neural network [28] is applied to decide whether a potential region contains a face. Before the neural network is applied to execute the coarse verification, a preprocessing step similar to that in the work of Sung and Poggio [8] and Rowley et al. [28] has to be performed first. The preprocessing step consists of masking, illumination gradient correction, and histogram equalization. The trained backpropagation net is then applied to filter out the non-face regions. If the neural network outputs a positive response, the potential region may contain a face. Otherwise, this region is eliminated from the list of candidate faces. Basically, this step can filter out part of the non-face regions, and we thus call it coarse verification. In the fine verification process, we apply a cost function, dffs, proposed by Moghaddam and Pentland [14,15] to perform the final selection. In what follows, the detailed procedure, from coarse to fine, is described.
"cation. In the "ne veri"cation process, we apply a cost function, d+s, proposed by Moghaddam and Pentland [14,15] to perform "nal selection. In what follows, the detailed procedure, from coarse to "ne, will be described. To train the neural networks, a technique similar to the one reported in Ref. [28] is adopted. A set of 11 200 face images generated from 700 face samples are collected as the positive samples by randomly, slightly rotating (up to 103), scaling (90}110%), translating (up to half a pixel), and mirroring. As mentioned by Rowley et al. [28], it is di$cult to characterize the prototype of non-face images because of the huge size of non-face images. Therefore, the network training task on non-face images is basically a di$cult one. Sung and Poggio's work [8] conformed this point. They collected 4000 positive (face) samples and 47 500 negative (non-face) samples to train their
network. It is obvious that the latter is much larger than the former. To reduce the number of non-face training samples, Rowley et al. [28] used the bootstrap algorithm to train the network. In their experiment, 16,000 positive and only 8000 negative samples were used. For the non-face training samples, they repeatedly applied a bootstrap algorithm to collect the wanted data from the 146,212,178 subimages available at all locations and scales in the training scenery images. Therefore, significant computation time is spent on training. Here, we modify their bootstrap algorithm as follows:
1. Create an initial set of non-face images by generating 1000 images with random intensities, and apply the preprocessing steps to each of them. Initially, the network's weights are set randomly. Train the neural network to produce an output of 1 for the face samples and -1 for the non-face samples.
2. Run the system on 20 scenery images which contain no faces. Collect the subimages that the network incorrectly identifies as faces, and select 250 of these subimages as non-face samples.
3. Apply the preprocessing steps to the collected face and non-face samples. Then, retrain the network to obtain a new version of the face verifier.

The major difference between our training process and Rowley's lies in the non-face sample collection process: in our system, 5000 non-face images are directly collected from 20 scenery images as negative samples, so only one training phase is needed in our scheme. In the recall process, if the trained neural network outputs a positive response, there may exist a face in the region being checked; if a negative response is received, the region being checked is considered a non-face region. In terms of performance, our neural network may not be as good as Rowley's. However, the coarse verification process using the neural network can filter out a significant number of non-face regions. Although some non-face regions are retained due to the loose constraints used in coarse verification, they can be deleted in the fine verification process. Further, some overlapping candidate faces are retained simultaneously and therefore require further checking. In the fine verification process, an evaluation function proposed by Moghaddam and Pentland [14,15] is applied to eliminate the above-mentioned overlapping detections as well as the previously retained non-face regions. The evaluation function is called the residual reconstruction error, which is defined as follows:

$e = \sum_{i=M+1}^{N} y_i^2 = \|x - \bar{x}\|^2 - \sum_{i=1}^{M} y_i^2$,   (3)

where N is the size of an image x which is to be checked, M is the number of principal components used to reconstruct the original images, $\bar{x}$ is the mean image, and $y_i$ is the projection of $x - \bar{x}$ onto the ith eigenvector. Since the value of e denotes the difference between the input image and the image reconstructed via PCA, one has to choose the region with a positive network response and the locally smallest e value as a face region. Once a face region is confirmed, the last step is to eliminate those regions that overlap with the chosen face region.

5. Experimental results and discussion

A set of experimental results will be used to show the effectiveness and efficiency of the proposed system. 122 test images containing a total of 130 faces were used to test the validity of our system. All the test images were of size 512×339. In these test images, the human faces were oriented in various directions and positioned arbitrarily in cluttered backgrounds. In this research, the minimum size of a face which could be detected was 50×50. Figs. 4(a)-(j) show 10 test images with correct detection results. The bounding rectangle that bounds a face region is used to judge whether a detection is correct or not. Among the successful cases, Fig. 4(b) demonstrates that the proposed method could detect faces with facial expressions. The case shown in Fig. 4(c) shows that the illumination problem could also be handled by our approach. One thing worth noticing is that the test image shown in Fig. 4(f) contains two faces whose orientations differ by nearly 180°; it is obvious that our system worked perfectly in dealing with this kind of problem. For an overall evaluation, 122 faces were detected successfully out of the total of 130 faces; the face detection success rate was therefore roughly 94%. On the other hand, the proposed system also detected a total of 25 false faces in the cluttered backgrounds of the test images. Fig. 5 shows unsuccessful detections, including false faces (the left bounding rectangle of Fig. 5(a)), missed faces (the right-hand side face of Fig. 5(b)), and inclined detected faces (the right bounding rectangle of Fig. 5(a)). There are three possible causes of the mis-detection problem. First, the size of a face to be detected has to be larger than 50×50; if the size of a face region was smaller than 50×50, as shown in Fig. 5(a), the system failed to detect it. The second cause of mis-detection is missing eye pairs: if a potential eye pair which corresponds to a true face is not found, it is impossible to further locate the exact pose of this face. Fortunately, this situation did not happen very often; in our experiments, only the mis-detection of the face on the right-hand side of Fig. 5(b) occurred for this reason. The third cause is partial occlusion: as shown in Fig. 5(c), a partially occluded face was mis-detected. In the future, we shall extend the capability of the system to deal with partially occluded faces as well as to handle faces with sizes smaller than 50×50.

As to execution time, the time required to locate the precise positions of the faces in the test image set depended on the size and complexity of the images. For example, the 512×340 image shown in Fig. 4(e) required less than 18 s to locate the correct face position on a Sun SPARC 20 workstation. For the case shown in Fig. 4(f), the execution time under the same environment was about 23 s.
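To make the fine-verification cost of Eq. (3) concrete, here is a minimal sketch (the names are ours; the paper gives no code). It assumes a PCA basis U with orthonormal columns and a mean face computed from the normalized 20×20 training faces, so that the tail energy equals the total energy minus the energy captured by the first M components.

import numpy as np

def residual_reconstruction_error(x, mean_face, U):
    """Eq. (3): e = sum_{i=M+1..N} y_i^2 = ||x - mean||^2 - sum_{i=1..M} y_i^2.

    x         : flattened candidate window (length N)
    mean_face : mean of the training faces (length N)
    U         : (N, M) matrix of the M leading orthonormal eigenvectors
    """
    centered = x - mean_face
    y = U.T @ centered                  # first M principal coordinates
    return float(centered @ centered - y @ y)

A candidate surviving the neural filter would then be kept only if its e value is the local minimum among overlapping candidates, mirroring the selection rule described above.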
6. Concluding remarks

In this paper, we have proposed an efficient face detection algorithm to find multiple faces in cluttered images.
Fig. 4. Testing examples.
Fig. 4. (Continued.)
In the "rst stage, morphological operations and a labeling process are performed to obtain the eye-analogue segments. Based on some matching rules and the geometrical relationship of the parts of a face, eye-analogue segments are grouped into pairs and used to locate po-
tential face regions. Finally, the potential face regions are veri"ed via a trained neural network, and true faces are determined by optimizing a distance function. Since the morphology-based eye-analogue segmentation process can e$ciently locate potential eye-analogue regions,
Fig. 5. The mis-detected examples.
the subsequent processing has to deal with only 5-10% of the area of the original image. Therefore, the execution time is much less than that of most existing systems. Furthermore, the proposed system can detect faces in arbitrary orientations and scales as long as the face regions are larger than 50×50.
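The precise morphological operator chain is given earlier in the paper; purely as an illustration of the idea (a dark-blob detector followed by labeling, under our own choice of structuring element and thresholds, not the authors' exact operators), the first stage could be sketched as:

import numpy as np
from scipy import ndimage as ndi

def eye_analogue_segments(gray, se=3, thresh=40, min_area=6):
    """gray: 2-D uint8 image. Returns a label image of candidate segments."""
    footprint = np.ones((se, se))
    # Morphological closing fills small dark features; the difference from
    # the original image makes dark blobs (eyes, brows, nostrils) stand out.
    closed = ndi.grey_closing(gray, footprint=footprint)
    dark = closed.astype(int) - gray.astype(int)
    mask = dark > thresh
    labels, n = ndi.label(mask)                      # labeling process
    sizes = ndi.sum(mask, labels, index=np.arange(1, n + 1))
    keep = np.isin(labels, np.flatnonzero(sizes >= min_area) + 1)
    return ndi.label(keep)[0]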
7. Summary

An efficient face detection algorithm which can detect multiple faces in a cluttered environment is proposed. The proposed system consists of three main steps. In the first step, a morphology-based technique is devised to perform eye-analogue segmentation. In a gray-level image, the eyebrows, eyes, nostrils and mouth always look darker than the rest of the face; among these facial features, the eyes are a more salient and stable feature than the other facial organs. Morphological operations are applied to locate eye-analogue pixels in the original image. Then, a labeling process is executed to generate the eye-analogue segments. In the second step, the previously located eye-analogue segments are used as guides to search for potential face regions. Suppose that there are M eye-analogue segments in an image; at most M(M-1)/2 potential face regions are generated from the matched eye pairs. In order to reduce the execution time, geometrical constraints on a real eye pair are introduced to screen out impossible pairs. Once the potential eye pairs are determined, their corresponding face regions can be easily extracted. The last step of the proposed system is to perform face verification. In this step, every face candidate obtained from the previous step is normalized to a standard size. Then, each of these normalized potential face images is fed into a trained backpropagation neural network for identification. After all the true faces are identified, their corresponding poses are located based on guidance obtained by optimizing a cost function. The proposed face detection technique can locate multiple faces oriented in any direction. Furthermore, the morphology-based eye-analogue segmentation process is able to reduce the background part of a cluttered image by up to 95%. This process significantly speeds up the subsequent face detection procedure because only 5-10% of the regions of the original image remain for further processing. Experiments demonstrate
that a success rate of approximately 94% is reached and that the relative false detection rate is very low.

References

[1] M.A. Shackleton, W.J. Welsh, Classification of facial features for recognition, Proceedings of Computer Vision and Pattern Recognition, Lahaina, Maui, Hawaii, June 1991, pp. 573-579.
[2] R. Brunelli, T. Poggio, Face recognition: features versus templates, IEEE Trans. Pattern Anal. Mach. Intell. 15 (10) (1993) 1042-1052.
[3] M. Turk, A. Pentland, Eigenfaces for recognition, J. Cognitive Neurosci. 3 (1) (1991) 71-86.
[4] V. Govindaraju, D.B. Sher, R.K. Srihari, Locating human face in newspaper photographs, Proceedings of Computer Vision and Pattern Recognition, San Diego, CA, June 1989.
[5] V. Govindaraju, S.N. Srihari, D.B. Sher, A computational model for face location, Proceedings of Computer Vision and Pattern Recognition, 1990, pp. 718-721.
[6] S.A. Sirohey, Human face segmentation and identification, Master's Thesis, University of Maryland, 1993.
[7] G. Yang, T.S. Huang, Human face detection in a complex background, Pattern Recognition 27 (1) (1994) 53-63.
[8] K.K. Sung, T. Poggio, Example-based learning for view-based human face detection, Technical Report, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, December 1994.
[9] K.K. Sung, Learning and example selection for object and pattern detection, Ph.D. Thesis, MIT, 1996. Available by anonymous ftp from publications.ai.mit.edu.
[10] K.K. Sung, T. Poggio, Example-based learning for view-based human face detection, IEEE Trans. Pattern Anal. Mach. Intell. 20 (1998) 39-51.
[11] A. Pentland, B. Moghaddam, T. Starner, View-based and modular eigenspaces for face recognition, IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, June 1994, pp. 84-91.
[12] B. Moghaddam, A. Pentland, Face recognition using view-based and modular eigenspaces, Proceedings of Automatic Systems for the Identification and Inspection of Humans, SPIE, July 1994.
[13] B. Moghaddam, A. Pentland, Probabilistic visual learning for object detection, Proceedings of the Fifth International Conference on Computer Vision, Cambridge, MA, June 1995, pp. 786-793.
[14] B. Moghaddam, A. Pentland, An automatic system for model-based coding of faces, Proceedings of IEEE Data Compression Conference, Snowbird, Utah, March 1995.
[15] B. Moghaddam, A. Pentland, A subspace method for maximum likelihood target detection, Proceedings of IEEE International Conference on Image Processing, Washington, DC, October 1995.
[16] T. Darrel, B. Moghaddam, A.P. Pentland, Active face tracking and pose estimation in an interactive room, Proceedings of Computer Vision and Pattern Recognition, San Francisco, CA, June 1996, pp. 67-72.
[17] P. Juell, R. Marsh, A hierarchical neural network for human face detection, Pattern Recognition 29 (5) (1996) 781-787.
[18] T.K. Leung, M.C. Burl, P. Perona, Finding faces in clustered scenes using random labeled graph matching, Proceedings of Computer Vision and Pattern Recognition, Cambridge, MA, June 1995, pp. 637-644.
[19] M.C. Burl, T.K. Leung, P. Perona, Face localization via shape statistics, Proceedings of International Workshop on Automatic Face and Gesture Recognition, 1995.
[20] M.C. Burl, P. Perona, Recognition of planar object classes, Proceedings of Computer Vision and Pattern Recognition, San Francisco, CA, June 1996, pp. 223-230.
[21] K. Sobottka, I. Pitas, Extraction of facial regions and features using color and shape information, Proceedings of the 13th International Conference on Pattern Recognition, Vienna, Austria, August 1996, pp. 421-425.
[22] Q. Chen, H. Wu, M. Yachida, Face detection by fuzzy pattern matching, Proceedings of Computer Vision and Pattern Recognition, Cambridge, MA, June 1995, pp. 591-595.
[23] S. Chen, M. Haralick, Recursive erosion, dilation, opening, and closing transforms, IEEE Trans. Image Process. 4 (1995) 335-345.
[24] H. Wu, Q. Chen, M. Yachida, A fuzzy-theory-based face detector, Proceedings of the 13th International Conference on Pattern Recognition, Vienna, Austria, August 1996, pp. 406-410.
[25] H. Wu, Q. Chen, M. Yachida, Facial feature extraction and face recognition, Proceedings of the 13th International Conference on Pattern Recognition, Vienna, Austria, August 1996, pp. 484-488.
[26] S.H. Jeng, H.Y.M. Liao, C.C. Han, M.Y. Chern, Y.T. Liu, Facial feature detection using geometrical face model: an efficient approach, Pattern Recognition 31 (1998) 273-282.
[27] G. Chow, X. Li, Towards a system for automatic facial feature detection, Pattern Recognition 26 (12) (1993) 1739-1755.
[28] H.A. Rowley, S. Baluja, T. Kanade, Neural network-based face detection, IEEE Trans. Pattern Anal. Mach. Intell. 20 (1998) 23-38.
About the Author: CHIN-CHUAN HAN received the B.S. degree in Computer Engineering from National Chiao-Tung University in 1989, and an M.S. and a Ph.D. degree in Computer Science and Electronic Engineering from National Central University in 1991 and 1994, respectively. From 1995 to 1998, he was a Postdoctoral Fellow in the Institute of Information Science, Academia Sinica, Taipei, Taiwan. He is currently an Assistant Research Fellow in the Applied Research Lab., Telecommunication Laboratories, Chunghwa Telecom Co. His research interests are in the areas of 2D image analysis, computer vision, and pattern recognition.
About the Author: HONG-YUAN MARK LIAO received a B.S. degree in Physics from National Tsing-Hua University in 1981, and an M.S. and a Ph.D. degree in Electrical Engineering from Northwestern University in 1985 and 1990, respectively. He is currently
a Research Fellow in the Institute of Information Science, Academia Sinica, Taipei, Taiwan. He is also a member of the IEEE Computer Society and the International Neural Network Society (INNS). His current research interests include neural networks, computer vision, fuzzy logic, and image analysis.
About the Author: GWO-JONG YU was born in Keelung, Taiwan in 1967. He received the B.S. degree in Information Computer Engineering from Chung-Yuan Christian University, Chung-Li, Taiwan in 1989. He is currently working toward the Ph.D. degree in Computer Science. His research interests include face recognition, statistical pattern recognition and neural networks.
About the Author: LIANG-HUA CHEN received the B.S. degree in Information Engineering from National Taiwan University, Taipei, Taiwan, ROC, in 1983, the M.S. degree in Computer Science from Columbia University, New York, NY, in 1988, and the Ph.D. degree in Computer Science from Northwestern University, Evanston, IL, in 1992. He is currently an Associate Professor in the Department of Computer Science and Information Engineering, Fu Jen University, Taipei. His research interests include computer vision and pattern recognition.
Pattern Recognition 33 (2000) 1713-1726
A new LDA-based face recognition system which can solve the small sample size problem
Li-Fen Chen, Hong-Yuan Mark Liao*, Ming-Tat Ko, Ja-Chen Lin, Gwo-Jong Yu
Department of Computer and Information Science, National Chiao Tung University, Hsinchu, Taiwan; Institute of Information Science, Academia Sinica, Taiwan; Institute of Computer Science and Information Engineering, National Central University, Chung-Li, Taiwan
Received 16 June 1998; received in revised form 21 June 1999; accepted 21 June 1999
Abstract A new LDA-based face recognition system is presented in this paper. Linear discriminant analysis (LDA) is one of the most popular linear projection techniques for feature extraction. The major drawback of applying LDA is that it may encounter the small sample size problem. In this paper, we propose a new LDA-based technique which can solve the small sample size problem. We also prove that the most expressive vectors derived in the null space of the within-class scatter matrix using principal component analysis (PCA) are equal to the optimal discriminant vectors derived in the original space using LDA. The experimental results show that the new LDA process improves the performance of a face recognition system significantly. 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved. Keywords: Face recognition; Feature extraction; Linear discriminant analysis; Linear algebra
1. Introduction

Face recognition has been a very hot research topic in recent years [1-4]. A complete face recognition system includes two steps, i.e., face detection [5,6] and face recognition [7,8]. In this paper, attention will be focused on the face recognition part. In the last 10 years, a great number of successful face recognition systems have been developed and reported in the literature [7-13]. Among these works, the systems reported in Refs. [7,10-13] all adopted the linear discriminant analysis (LDA) approach to enhance the class separability of all sample images for recognition purposes. LDA is one of the most popular linear projection techniques for feature extraction. It finds the set of the most discriminant projection
* Corresponding author. Tel.: +886-2-788-3799 ext. 1811; fax: +886-2-782-4814. E-mail address: [email protected] (H.-Y.M. Liao).
vectors which can map high-dimensional samples onto a low-dimensional space. Using the set of projection vectors determined by LDA as the projection axes, all projected samples will form the maximum between-class scatter and the minimum within-class scatter simultaneously in the projective feature space. The major drawback of applying LDA is that it may encounter the so-called small sample size problem [14]. This problem arises whenever the number of samples is smaller than the dimensionality of the samples. Under these circumstances, the sample scatter matrix may become singular, and the execution of LDA may encounter computational difficulty. In recent years, many researchers have noticed this problem and tried to solve it using different methods. In Ref. [11], Goudail et al. proposed a technique which calculates 25 local autocorrelation coefficients from each sample image to achieve dimensionality reduction. Similarly, Swets and Weng [12] applied the PCA approach to accomplish reduction of image dimensionality. Besides image dimensionality reduction, some researchers have
0031-3203/00/$20.00 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved. PII: S0031-3203(99)00139-9
tried to overcome the computational difficulty directly using linear algebra. Instead of calculating eigenvalues and eigenvectors from an n×n matrix, Fukunaga [14] proposed a more efficient algorithm which calculates eigenvalues and eigenvectors from an m×m matrix, where n is the dimensionality of the samples and m is the rank of the within-class scatter matrix $S_w$. In Ref. [15], Tian et al. used a positive pseudoinverse matrix $S_w^+$ instead of calculating the inverse matrix $S_w^{-1}$. For the same purpose, Hong and Yang [16] tried to add a singular value perturbation to $S_w$ to make it nonsingular. In Ref. [17], Cheng et al. proposed another method based on the principle of rank decomposition of matrices. The above three methods are all based on the conventional Fisher's criterion function. In 1992, Liu et al. [18] modified the conventional Fisher's criterion function and conducted a number of studies [10,18,19] based on the new criterion function. They used the total scatter matrix $S_t$ ($= S_b + S_w$) as the divisor of the original Fisher's function instead of merely using the within-class scatter matrix. They then proposed another algorithm based on the Foley-Sammon transform [20] to select the set of the most discriminant projection vectors. It is known that the purpose of an LDA process is to maximize the between-class scatter while simultaneously minimizing the within-class scatter. When the small sample size problem occurs, the within-class scatter matrix $S_w$ is singular. The theory of linear algebra tells us that it is then possible to find some projection vectors q such that $q^T S_w q = 0$ and $q^T S_b q \neq 0$. Under these special circumstances, the modified Fisher's criterion function proposed by Liu et al. [10] will definitely reach its maximum value, i.e., 1. However, an arbitrary projection vector q satisfying the maximum value of the modified Fisher's criterion cannot guarantee maximum class separability unless $q^T S_b q$ is further maximized. Liu et al.'s [10] approach also suffers from a stability problem, because the eigenvalues determined using their method may be very close to each other; this results in instability of the projection vector determination process. Another drawback of Liu et al.'s approach is that their method still has to calculate an inverse matrix, and calculation of an inverse matrix is usually a bottleneck which reduces efficiency. In this paper, a more efficient, accurate, and stable method is proposed to calculate the most discriminant projection vectors based on the modified Fisher's criterion. For feature extraction, a two-stage procedure is devised. In the first stage, the homogeneous regions of a face image are grouped into the same partition based on geometric characteristics, such as the eyes, nose, and mouth. For each partition, we use the mean gray value of all the pixels within the partition to represent it; therefore, every face image is reduced to a feature vector. In the second stage, we use the feature vectors extracted in the first stage to determine the set of the most discriminant projection axes based on a new LDA process. The proposed new LDA process starts by calculating the projection vectors in the null space of the within-class scatter matrix $S_w$. This null space can be spanned by the eigenvectors corresponding to the zero eigenvalues of $S_w$. If this subspace does not exist, i.e., $S_w$ is nonsingular, then $S_t$ is also nonsingular; under these circumstances, we choose the eigenvectors corresponding to the largest eigenvalues of the matrix $(S_b + S_w)^{-1} S_b$ as the most discriminant vector set. Otherwise, the small sample size problem occurs, in which case we choose the vector set that maximizes the between-class scatter of the transformed samples as the projection axes. Since the within-class scatter of all the samples is zero in the null space of $S_w$, a projection vector satisfies the objective of an LDA process if it maximizes the between-class scatter. A similar concept has been mentioned in Ref. [13]; however, the authors did not show any investigation results, nor did they draw any conclusions concerning the concept. We have conducted a series of experiments and compared our results with those of Liu et al.'s approach [10] and the template matching approach. The experimental results show that our method is superior to both Liu et al.'s approach and the template matching approach in terms of recognition accuracy. Furthermore, we have also shown that our method is better than Liu et al.'s approach in terms of training efficiency as well as stability. This indicates that the new LDA process significantly improves the performance of a face recognition system. The organization of the rest of this paper is as follows: in Section 2, the complete two-phase feature extraction procedure is introduced. Experimental results, including database construction, experiments on the small sample size problem, and comparisons with two well-known approaches, are presented in Section 3. Finally, concluding remarks are given in Section 4.
2. Feature extraction

In this section, we describe in detail the proposed feature extraction technique, which includes two phases: pixel grouping and generalized LDA based on the modified Fisher's function.

2.1. Pixel grouping

According to the conclusion drawn in Ref. [21], a statistics-based face recognition system should base its recognition solely on the "pure" face portion. In order to fulfill this requirement, we have built a face-only database using a previously developed morphology-based filter [6]. Using this morphological filter, the eye-analogue segments are grouped into pairs and used to locate potential face regions. Thus, every constituent of
the face-only database is the face portion containing only the eyes, nose and mouth. Some examples from this face database are shown in Fig. 1. In order to execute pixel grouping, the above-mentioned face-only images are transformed into normalized sizes. Let the training database be comprised of N normalized face-only images of size P×P. We pile up these N images and align them in the same orientation, as shown in Fig. 2. Therefore, we obtain P² N-dimensional vectors whose elements are the gray values of the pixels. These P² N-dimensional vectors are then clustered into m groups using the k-means clustering method, where m is the resolution of the transformed images. After clustering, each image is partitioned into m groups, and each pixel is assigned to one of the groups. For each image, we calculate the average gray value of each group and use these m mean values to represent the whole image. Thus, the P²-dimensional images are now reduced to m dimensions, with m much smaller than P². Fig. 3 shows some examples of the transformed images: the images in the leftmost column are the original images of size 60×60, and the others are the transformed images with increasing resolutions of 2^5, 2^6, 2^7, and 2^8, respectively, from left to right. After pixel grouping, we use the transformed images to execute the second phase, generalized LDA.
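As an illustration of the pixel-grouping phase, the following sketch (ours, with a plain k-means loop; the paper does not specify its k-means implementation) clusters the P² pixel positions by their gray-value vectors across the N training images, and then reduces each image to its m group means.

import numpy as np

def pixel_grouping(images, m, iters=20, seed=0):
    """images: (N, P, P) stack of normalized face-only images."""
    N = images.shape[0]
    V = images.reshape(N, -1).T.astype(float)    # (P*P, N) pixel vectors
    rng = np.random.default_rng(seed)
    centers = V[rng.choice(len(V), size=m, replace=False)].copy()
    for _ in range(iters):                       # plain k-means
        d = ((V ** 2).sum(1)[:, None] - 2.0 * V @ centers.T
             + (centers ** 2).sum(1)[None, :])   # squared distances
        labels = d.argmin(axis=1)
        for k in range(m):
            if np.any(labels == k):
                centers[k] = V[labels == k].mean(axis=0)
    # Each image is represented by the mean gray value of each pixel group.
    feats = np.array([[img.ravel()[labels == k].mean() for k in range(m)]
                      for img in images])
    return feats, labels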
Fig. 2. Illustration of the pixel grouping process. N normalized face images are piled up and aligned in the same orientation. Suppose the image size is P×P; then P² N-dimensional vectors are obtained, and the elements of a vector are the gray values of the corresponding pixel in the N different images.
2.2. Generalized LDA

The purpose of pixel grouping is to reduce the dimensionality of the samples and to extract geometric features; however, it does not take class separability into consideration at all. In the literature [10-12], LDA is a well-known technique for dealing with the class separability problem. LDA can be used to determine the set of the
most discriminant projection axes. After projecting all the samples onto these axes, the projected samples will form the maximum between-class scatter and the minimum within-class scatter in the projective feature space. In what follows, we first introduce the LDA approach and some related works; we then describe our approach in detail.

2.2.1. Conventional LDA and its potential problem

Let the training set comprise K classes, where each class contains M samples. In LDA, one has to determine the mapping

$\tilde{x}_m^k = A^T x_m^k$,   (1)

where $x_m^k$ denotes the n-dimensional feature vector extracted from the mth sample of the kth class, and $\tilde{x}_m^k$ denotes the d-dimensional projective feature vector of $x_m^k$ transformed by the n×d transformation matrix A. One way to find the mapping A is to use Fisher's criterion [22]:
Fig. 1. Examples of normalized face-only images. The top two rows of images are of the same person, and the bottom two rows are of another person.
$F(q) = \frac{q^T S_b q}{q^T S_w q}$,   (2)

where $q \in R^n$, and where $S_b = \sum_{k=1}^{K} (\bar{x}^k - \bar{x})(\bar{x}^k - \bar{x})^T$ and $S_w = \sum_{k=1}^{K} \sum_{m=1}^{M} (x_m^k - \bar{x}^k)(x_m^k - \bar{x}^k)^T$ are the between-class scatter matrix and within-class scatter matrix, respectively, with $\bar{x}^k = (1/M)\sum_{m=1}^{M} x_m^k$ and $\bar{x} = (1/KM)\sum_{k=1}^{K}\sum_{m=1}^{M} x_m^k$.

Fig. 3. Results obtained after performing pixel grouping. The images in the leftmost column are the original images, and the others are the transformed images with increasing resolutions of 2^5, 2^6, 2^7, and 2^8, from left to right.

The column vectors of A can be chosen from the set of $\tilde{q}$'s, where

$\tilde{q} = \arg\max_{q \in R^n} F(q)$.   (3)

After projecting all the $x_m^k$'s (where k = 1,...,K; m = 1,...,M) onto the $\tilde{q}$ axis, the projected samples $\tilde{x}_m^k$ (k = 1,...,K; m = 1,...,M) will form the maximum between-class scatter and the minimum within-class scatter. The vector $\tilde{q}$ is called the optimal discriminant projection vector. According to linear algebra, the $\tilde{q}$'s can be taken as the eigenvectors corresponding to the largest eigenvalues of $S_w^{-1} S_b$. The major drawback of applying the LDA approach is that it may encounter the small sample size problem [14]. The small sample size problem occurs whenever the number of samples is smaller than the dimensionality of the samples. Whenever this happens, the matrix $S_w$ becomes singular, and the computation of $S_w^{-1}$ becomes complex and difficult. Liu et al. seriously addressed this problem in Refs. [10,18,19]. One of their efforts was to propose a modified Fisher's criterion function, $\hat{F}(q)$, to replace the original Fisher's function, F(q). They proved [19] that $\hat{F}(q)$ is exactly equivalent to F(q); that is,

$\arg\max_{q \in R^n} \hat{F}(q) = \arg\max_{q \in R^n} F(q)$.   (4)

In what follows, we directly state two theorems of Ref. [19] which are related to our work.

Theorem 1. Suppose that R is a set in the n-dimensional space, and that for all $x \in R$, $f(x) \geq 0$, $g(x) \geq 0$, and $f(x) + g(x) > 0$. Let $h_1(x) = f(x)/g(x)$ and $h_2(x) = f(x)/(f(x) + g(x))$. Then $h_1(x)$ attains its maximum (including positive infinity) at a point $x_0$ in R iff $h_2(x)$ attains its maximum at $x_0$ [19].

Theorem 2. The Fisher's criterion function F(q) can be replaced by

$\hat{F}(q) = \frac{q^T S_b q}{q^T S_w q + q^T S_b q}$   (5)

in the course of solving the discriminant vectors of the optimal set [19].
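The equivalence asserted by Theorems 1 and 2 is easy to check numerically; a small sketch (our own illustration, using the scatter-matrix definitions above):

import numpy as np

def scatter_matrices(X, labels):
    """X: (K*M, n) samples as rows; labels: class index per row."""
    xbar = X.mean(axis=0)
    n = X.shape[1]
    Sb = np.zeros((n, n))
    Sw = np.zeros((n, n))
    for k in np.unique(labels):
        Xk = X[labels == k]
        mk = Xk.mean(axis=0)
        d = (mk - xbar)[:, None]
        Sb += d @ d.T                  # between-class term for class k
        Dk = Xk - mk
        Sw += Dk.T @ Dk                # within-class term for class k
    return Sb, Sw

def F(q, Sb, Sw):        # Fisher's criterion, Eq. (2)
    return (q @ Sb @ q) / (q @ Sw @ q)

def F_hat(q, Sb, Sw):    # modified criterion, Eq. (5)
    return (q @ Sb @ q) / (q @ Sw @ q + q @ Sb @ q)

# F_hat is a monotone transform of F (F_hat = F / (1 + F)), so both are
# maximized by the same q whenever q^T S_w q > 0.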
From the above two theorems, we know that F(q) and $\hat{F}(q)$ are functionally equivalent in terms of solving for the optimal set of projection axes (or discriminant vectors). Therefore, one can choose either F(q) or $\hat{F}(q)$ to derive the optimal projection axes. In this paper, we propose a new method to calculate the optimal projection axes based on $\hat{F}(q)$. According to the normal process of LDA, the solutions of $\max_{q \in R^n} \hat{F}(q)$ should be the eigenvectors corresponding to the largest eigenvalues of the matrix $(S_b + S_w)^{-1} S_b$. If the small sample size problem occurs at this point, the eigenvectors of $(S_b + S_w)^{-1} S_b$ will be very difficult to compute due to the singularity problem. In order to avoid direct computation of $(S_b + S_w)^{-1} S_b$, Liu et al. [19] suggested deriving the discriminant vectors in the complementary subspace of the null space of $S_t$ ($S_t = S_b + S_w$, the total scatter matrix), where the null space of $S_t$ is spanned by the eigenvectors corresponding to the zero eigenvalues of $S_t$. Since the total scatter matrix $S_t$ is nonsingular in the complementary subspace, it is feasible to follow the normal LDA process to derive the discriminant vectors in this subspace. However, there are still some critical problems associated with this approach. The first problem with Liu et al.'s approach concerns the validity of the discriminant vector set. It is known that the purpose of LDA is to maximize the between-class scatter while minimizing the
within-class scatter simultaneously. In the special case where $q^T S_w q = 0$ and $q^T S_b q \neq 0$, Eq. (5) will definitely reach the maximum value of $\hat{F}(q)$. However, an arbitrary projection vector q satisfying the above conditions cannot guarantee derivation of the maximum $q^T S_b q$ value. Under these circumstances, a correct LDA process cannot be completed, because only the within-class scatter is minimized while the between-class scatter is not surely maximized. The second problem associated with Liu et al.'s approach is the stability problem. In Ref. [23], the author stated that an eigenvector will be very sensitive to small perturbations if its corresponding eigenvalue is close to another eigenvalue of the same matrix. Unfortunately, in Ref. [18], the matrix used to derive the optimal projection vector suffers from this problem; in other words, their optimal projection vector determination process may be severely influenced whenever a small perturbation is added. The third problem associated with Liu et al.'s approach [18] is the singularity problem, because their approach still has to calculate the inverse of the matrix $S_t$. In this paper, we propose a more efficient, accurate, and stable method to derive the most discriminant vectors for LDA based on the modified Fisher's criterion. In the proposed approach, we calculate the projection vectors in the null space of the within-class scatter matrix $S_w$, because the projection vectors found in this subspace make all the projected samples form zero within-class scatter. Furthermore, we will also prove that finding the optimal projection vector in the original sample space is equivalent to calculating the most expressive vector [12] (via principal component analysis) in the above-mentioned subspace. In what follows, we describe the proposed method in detail.

2.2.2. The proposed method

Let the database comprise K classes, where each class contains M distinct samples, and let $x_m^k$ be an n-dimensional column vector which denotes the feature vector extracted from the mth sample of the kth class. Suppose $S_w$ and $S_b$ are, respectively, the within-class scatter matrix and the between-class scatter matrix of the $x_m^k$'s (where k = 1,...,K; m = 1,...,M), and suppose the total scatter matrix is $S_t = S_w + S_b$. According to linear algebra [24] and the definitions of the matrices $S_t$, $S_w$, and $S_b$, rank($S_t$) ≤ rank($S_b$) + rank($S_w$), where rank($S_t$) = min(n, KM-1), rank($S_b$) = min(n, K-1), and rank($S_w$) = min(n, K(M-1)). In this paper, we determine a set of discriminant projection vectors from the null space of $S_w$; therefore, the rank of $S_w$ is the major focus of this research. Suppose the rank of $S_w$ is r, i.e., r = min(n, K(M-1)). If r = n, this implies that K(M-1) ≥ n, hence KM ≥ n + K and KM - 1 ≥ n + K - 1 ≥ n. This inequality means that the rank of $S_t$ is also equal to n. Consequently, if $S_w$ is nonsingular, then $S_t$ is nonsingular, too. Under these circumstances, there will be no singularity problem when the matrix $S_t^{-1} S_b$ is
computed in the normal LDA process. On the other hand, if r is smaller than n, the small sample size problem will occur. For this case, we propose a new method to derive the optimal projection vectors. Fig. 4 illustrates graphically the process of deriving the optimal projection vectors when r < n. In the top part of Fig. 4, V stands for the original sample space, and T represents a linear transformation: $T(x) = S_w x$, $x \in V$. Since the rank of $S_w$ is smaller than the dimensionality of V (r < n), there must exist a subspace $V_1 \subset V$ such that $V_1 = \mathrm{span}\{a_i \mid S_w a_i = 0,\ i = 1,\ldots,n-r\}$. $V_1$ here is called the null space of $S_w$. In the bottom part of Fig. 4, the flow chart of the discriminant vector determination process is illustrated. Let $Q = [a_1, \ldots, a_{n-r}]$. First, all samples X are transformed from V into its subspace $V_1$ through the transformation $QQ^T$. Then, the eigenvectors corresponding to the largest eigenvalues of the between-class scatter matrix $\tilde{S}_b$ (a new matrix formed by the transformed samples) in the subspace $V_1$ are selected as the most discriminant vectors. In what follows, we describe our approach in detail. First of all, Lemma 1 shows the subspace where we can derive the discriminant vectors based on maximizing the modified Fisher's criterion.

Lemma 1. Suppose $V_1 = \mathrm{span}\{a_i \mid S_w a_i = 0,\ a_i \in R^n,\ i = 1,\ldots,n-r\}$, where n is the dimensionality of the samples, $S_w$ is the within-class scatter matrix of the samples, and r is the rank of $S_w$. Let $S_b$ denote the between-class scatter matrix of the samples. Then each $\tilde{q} \in V_1$ which satisfies $\tilde{q}^T S_b \tilde{q} \neq 0$ maximizes the function $\hat{F}(q) = q^T S_b q / (q^T S_w q + q^T S_b q)$.

Proof. 1. Since both $S_b$ and $S_w$ are real symmetric and $q^T S_b q \geq 0$ and $q^T S_w q \geq 0$ for all $q \in R^n$, it follows that $0 \leq q^T S_b q \leq q^T S_w q + q^T S_b q$, and hence $0 \leq \hat{F}(q) \leq 1$. It is obvious that $\hat{F}(q) = 1$ if and only if $q^T S_b q \neq 0$ and $q^T S_w q = 0$.
2. Each $\tilde{q} \in V_1$ can be represented as a linear combination of the set $\{a_i\}$, i.e., $\tilde{q} = \sum_{i=1}^{n-r} \alpha_i a_i$, where $\alpha_i$ is the projection coefficient of $\tilde{q}$ with respect to $a_i$. Therefore, $S_w \tilde{q} = \sum_{i=1}^{n-r} \alpha_i S_w a_i = 0$, which gives $\tilde{q}^T S_w \tilde{q} = 0$.
From 1 and 2, we conclude that each $\tilde{q} \in V_1$ which satisfies $\tilde{q}^T S_b \tilde{q} \neq 0$ maximizes the function $\hat{F}(q)$.
Fig. 4. Illustration of the projection vector set determination process. At the top of the figure, T is a linear transformation from V to W: $T(x) = S_w x$, $x \in V$; $V_1$ is the null space of $S_w$. In the middle of the figure, X stands for the original sample set, and $\tilde{X}$ is the transformed sample feature set of X obtained through the transformation $QQ^T$, where $Q = [a_1, \ldots, a_{n-r}]$, n is the dimensionality of the samples, r is the rank of $S_w$, and $S_w a_i = 0$ for each $a_i$. The most discriminant vectors for LDA can be computed from the between-class scatter matrix, $\tilde{S}_b$, of $\tilde{X}$.
Lemma 1 raises a critical issue related to LDA: when the small sample size problem occurs, an arbitrary vector $\tilde{q} \in V_1$ that maximizes $\hat{F}(q)$ is not necessarily the optimal discriminant vector of LDA, because in that situation $\tilde{q}^T S_b \tilde{q}$ is not guaranteed to reach its maximal value. Therefore, one can conclude that it is not sufficient to derive the discriminant vector set simply based on the modified Fisher's criterion when the small sample size problem occurs. In what follows, Lemma 2 shows that the within-class scatter matrix of all the transformed samples in $V_1$ is a complete zero matrix. Lemma 2 is very important because, once it is proved, determination of the discriminant vector set no longer depends on the total scatter matrix; instead, the discriminant vector set can be derived directly from the between-class scatter matrix.

Lemma 2. Let $QQ^T$ be a transformation which transforms the samples in V into a subspace $V_1$, where $Q = [a_1, \ldots, a_{n-r}]$ is an n×(n-r) matrix, each $a_i$ satisfies $S_w a_i = 0$ for $i = 1,\ldots,n-r$, and the subspace $V_1$ is spanned by the orthonormal set of $a_i$'s. If all the samples are transformed into the subspace $V_1$ through $QQ^T$, then the within-class scatter matrix $\tilde{S}_w$ of the transformed samples in $V_1$ is a complete zero matrix.
Proof. Suppose $x_m^k$ is the feature vector extracted from the mth sample of the kth class, and that the database comprises K classes, each containing M samples. Let $y_m^k$ denote the transformed feature vector of $x_m^k$ under the transformation $QQ^T$; that is, $y_m^k = QQ^T x_m^k$, $\bar{y}^k = QQ^T \bar{x}^k$, and $\bar{y} = QQ^T \bar{x}$, where $\bar{x}^k = (1/M)\sum_{m=1}^{M} x_m^k$ and $\bar{x} = (1/KM)\sum_{k=1}^{K}\sum_{m=1}^{M} x_m^k$. Thus,

$\tilde{S}_w = \sum_{k=1}^{K}\sum_{m=1}^{M} (y_m^k - \bar{y}^k)(y_m^k - \bar{y}^k)^T = \sum_{k=1}^{K}\sum_{m=1}^{M} (QQ^T x_m^k - QQ^T \bar{x}^k)(QQ^T x_m^k - QQ^T \bar{x}^k)^T = QQ^T \Big[\sum_{k=1}^{K}\sum_{m=1}^{M} (x_m^k - \bar{x}^k)(x_m^k - \bar{x}^k)^T\Big] QQ^T = QQ^T S_w QQ^T = 0,   (6)

since $S_w Q = 0$.
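Lemma 2 can also be checked numerically: with Q built from the zero-eigenvalue eigenvectors of $S_w$, the transformed within-class scatter vanishes up to floating-point error (a small check of our own):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(12, 30))           # K*M = 12 samples in n = 30 dims
labels = np.repeat(np.arange(4), 3)     # K = 4, M = 3 -> S_w is singular
n = X.shape[1]
Sw = np.zeros((n, n))
for k in np.unique(labels):
    Dk = X[labels == k] - X[labels == k].mean(axis=0)
    Sw += Dk.T @ Dk
lam, L = np.linalg.eigh(Sw)
Q = L[:, lam <= 1e-10 * lam.max()]      # basis of the null space of S_w
P = Q @ Q.T
print(np.abs(P @ Sw @ P).max())         # ~1e-13: Eq. (6) in practice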
We mentioned earlier that the LDA process is used to determine the set of the most discriminant projection axes for all the samples; after projection, the projected samples form the minimum within-class scatter and the maximum between-class scatter. Lemma 1 tells us that any $\tilde{q} \in V_1$ satisfying $\tilde{q}^T S_b \tilde{q} \neq 0$ maximizes the modified Fisher's criterion $\hat{F}(q)$ to 1. However, Lemma 1 also tells us that we should add another criterion to perform LDA correctly, not just depend on the modified Fisher's criterion. Lemma 2, on the other hand, tells us that the selection of $\tilde{q} \in V_1$ enforces $\tilde{S}_w = 0$; that is to say, $\tilde{S}_t = \tilde{S}_w + \tilde{S}_b = \tilde{S}_b$. Since $\tilde{S}_w$ is consistently equal to 0, we have to select a set of projection axes that maximizes the between-class scatter in $V_1$. From the above two lemmas, we know that maximizing the between-class scatter in $V_1$ is equal to maximizing the total scatter in $V_1$. Under these circumstances, we can apply the principal component analysis (PCA) method [25] to derive the set of the most discriminant projection vectors and fulfill the requirement of LDA. The physical meaning of PCA is to find a set of the most expressive projection vectors such that the projected samples retain the most information about the original samples. The most expressive vectors derived from a PCA process are the l eigenvectors corresponding to the l largest eigenvalues of $\tilde{S}_t$, where $(\sum_{i=1}^{l} \lambda_i / \sum_{i=1}^{n} \lambda_i) \geq p$, n is the dimensionality of the samples, and $\lambda_i$ is the ith largest eigenvalue of $\tilde{S}_t$ (the $\lambda_i$ are in decreasing order). If p = 0.95, a good enough representation is obtained [26].
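For instance, the number l of retained axes can be chosen as the smallest l whose cumulative eigenvalue ratio reaches p; a one-function sketch (ours):

import numpy as np

def num_projection_axes(eigvals, p=0.95):
    """eigvals: eigenvalues sorted in decreasing order."""
    ratios = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.searchsorted(ratios, p) + 1)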
In what follows, we show the proposed method in Theorem 3, based on the above two lemmas.

Theorem 3. Suppose that $Q = [a_1, \ldots, a_{n-r}]$, and that the $a_i$'s are the eigenvectors corresponding to the zero eigenvalues of the within-class scatter matrix $S_w$ in the original feature space V, where n is the dimensionality of the feature vectors and r is the rank of $S_w$. Let $V_1$ denote the subspace spanned by the eigenvectors $a_1, \ldots, a_{n-r}$. If r is smaller than n, the most expressive vector $\tilde{q}$ in $V_1$ obtained through the transformation $QQ^T$ will be the most discriminant vector in V.

Proof. 1. From Lemma 2, we know that the within-class scatter matrix $\tilde{S}_w$ in $V_1$ is a complete zero matrix; thus, the between-class scatter matrix $\tilde{S}_b$ in $V_1$ is equal to the total scatter matrix $\tilde{S}_t$ in $V_1$.
2. The most expressive projection vector $\tilde{q}$ in $V_1$ satisfies $\tilde{q}^T \tilde{S}_b \tilde{q} > 0$. Suppose $S_b = \tilde{S}_b + \hat{S}_b$, where $S_b$, $\tilde{S}_b$, and $\hat{S}_b$ are all real symmetric. Then $\tilde{q}^T S_b \tilde{q} = \tilde{q}^T \tilde{S}_b \tilde{q} + \tilde{q}^T \hat{S}_b \tilde{q} \geq \tilde{q}^T \tilde{S}_b \tilde{q} > 0$, so $\tilde{q}^T S_b \tilde{q} \neq 0$.
3. Hence $\tilde{q}$ is an optimal solution within $V_1$ that maximizes $\hat{F}(q)$. Since the most expressive projection vector $\tilde{q}$ in $V_1$ maximizes the value of $\tilde{q}^T \tilde{S}_b \tilde{q}$, and $\tilde{q}^T S_w \tilde{q} = 0$ is known, we conclude that the most expressive projection vector in $V_1$ is the most discriminant projection vector in V for LDA.

After projecting all the samples onto the projective feature space based on Theorem 3, a Euclidean distance classifier is used to perform classification in the projective feature space.

The proposed algorithm
Input: N n-dimensional vectors.
Output: The optimal discriminant vector set of all N input vectors.
Algorithm:
Step 1. Calculate the within-class scatter matrix $S_w$ and the between-class scatter matrix $S_b$.
Step 2. Suppose the rank of $S_w$ is r. If r = n, then the discriminant set is the set of eigenvectors corresponding to the largest eigenvalues of the matrix $(S_b + S_w)^{-1} S_b$; otherwise, go on to the next step.
Step 3. Perform the singular value decomposition of $S_w$ as $S_w = L \Lambda L^T$, where $L = [l_1, l_2, \ldots, l_n]$.
Step 4. Let $Q = [l_{r+1}, \ldots, l_n]$. (It has been shown in Ref. [24] that the null space of $S_w$ can be spanned by $l_{r+1}, \ldots, l_n$.)
Step 5. Compute $\tilde{S}_b$, where $\tilde{S}_b = QQ^T S_b (QQ^T)^T$.
Step 6. Calculate the eigenvectors corresponding to the set of the largest eigenvalues of $\tilde{S}_b$ and use them to form the most discriminant vector set for LDA.
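A minimal numpy sketch of Steps 1-6 (our own rendering of the algorithm, with an eigendecomposition of the symmetric $S_w$ playing the role of the SVD in Step 3, and a tolerance in place of exactly zero eigenvalues):

import numpy as np

def nullspace_lda(X, labels, n_axes, tol=1e-10):
    """X: (K*M, n) samples as rows; labels: class index per row."""
    n = X.shape[1]
    xbar = X.mean(axis=0)
    Sb = np.zeros((n, n))
    Sw = np.zeros((n, n))
    for k in np.unique(labels):              # Step 1: scatter matrices
        Xk = X[labels == k]
        mk = Xk.mean(axis=0)
        d = (mk - xbar)[:, None]
        Sb += d @ d.T
        Dk = Xk - mk
        Sw += Dk.T @ Dk
    lam, L = np.linalg.eigh(Sw)              # Step 3: S_w = L diag(lam) L^T
    null_mask = lam <= tol * max(lam.max(), 1.0)
    r = n - int(null_mask.sum())             # rank of S_w
    if r == n:                               # Step 2: no small-sample problem
        vals, vecs = np.linalg.eig(np.linalg.solve(Sb + Sw, Sb))
        order = np.argsort(-vals.real)
        return vecs[:, order[:n_axes]].real
    Q = L[:, null_mask]                      # Step 4: null space of S_w
    P = Q @ Q.T
    Sb_tilde = P @ Sb @ P.T                  # Step 5
    vals, vecs = np.linalg.eigh(Sb_tilde)    # Step 6: leading eigenvectors
    return vecs[:, np.argsort(-vals)[:n_axes]]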
3. Experimental results
3.1. Database construction and feature extraction

The facial image database contained 128 persons (classes); for each person, 10 different frontal-view face images were obtained. The process for collecting facial images was as follows: after asking each person to sit down in front of a CCD camera, with a neutral expression and slight head movement in frontal view, a 30-s period was recorded on videotape under well-controlled lighting conditions. Later, a frame grabber was used to grab 10 image frames from the videotape and store them with a resolution of 155×175 pixels. According to the conclusion drawn in Ref. [21], which stated that a statistics-based face recognition system should base its recognition solely on the "pure" face portion, a face-only database was built using a previously developed morphology-based filter [6]. Part of the database is shown in Fig. 1. For pixel grouping, each database image was transformed into a normalized size, 60×60. Then, all 1280 database images (128×10) were piled up and aligned in the same orientation. After this process, 3600 1280-dimensional vectors were obtained. These vectors were then clustered into m groups (where m stands for the required resolution) using the k-means clustering method. For each image, the average gray value of each group was calculated, and these m mean values were used to represent the whole image. Therefore, the dimensionality of each image was reduced from 3600 to m. Since m is a variable which stands for the dimensionality of the feature vectors in the experiments, we designed an
experiment to decide the best value of m for the subsequent experiments. For this experiment, we chose a training database containing 128 persons, with six frontal-view samples per person. For testing purposes, we used a 128-person testing database with 10 samples per person. Since the database was large, the projection vectors for LDA could be computed directly from $S_t^{-1} S_b$. Table 1 shows a set of experimental results obtained by applying different values of m (m = 32, 64, 128, and 256). The data shown in the second column of Table 1 are the numbers of projection axes used at each resolution. The number of projection axes adopted was decided by checking the p value mentioned in Section 2.2.2: when p reached 0.95, the corresponding number of projection axes was used as the maximum number of projection axes. Therefore, for m = 32 and 64, the numbers of projection axes adopted were 29 and 53, respectively. From Table 1, we find that m = 128 was the most suitable number of features in terms of recognition rate and training efficiency. Therefore, in the subsequent sets of experiments, this number (m = 128) was used globally.

3.2. Experiments on the small sample size problem

In order to evaluate how our method interacts with the small sample size problem, including issues such as the number of samples in each class and the total number of classes used, we conducted a set of experiments and show the results in Fig. 5. The horizontal axis in Fig. 5 represents the number of classes used for recognition, and the vertical axis represents the corresponding recognition rate. The '+', '×', and '○' signs in Fig. 5 indicate that there were 2, 3, and 6 samples in each class, respectively. The results shown in Fig. 5 reflect that the proposed approach performed fairly well when the size of the database was small. However, when K (the number of classes) multiplied by M-1 (the number of samples minus one) was close to n (n = 128), the
Table 1
Face recognition results obtained by applying different numbers of features extracted from the images. The training database contains 128 persons, each with six distinct samples.

Number of   Number of projection   Recognition   Training
features    axes used              rate (%)      time (s)
m = 32      29                     95.28          0.3039
m = 64      53                     96.27          1.1253
m = 128     70                     97.54          3.5746
m = 256     98                     97.34         17.1670
Fig. 5. Experimental results obtained using our method under the small sample size problem. The '+' sign means that each class in the database contains two samples; the '×' and '○' signs mean that each class contains three and six samples, respectively.
performance dropped significantly. This phenomenon was especially true for the case where M = 6. In Section 2.2.2, we mentioned that the information for deriving the most discriminant vectors depends on the null space of $S_w$, $V_1$. The dimension of $V_1$, dim($V_1$), is equal to n - (KM - K), where n is equal to 128, K is the number of classes, and M is the number of samples in each class. When M = 6 and K approached 25, K(M-1) was very close to n (128); for example, with K = 25 and M = 6, dim($V_1$) = 128 - 125 = 3. Under these circumstances, the recognition rate dropped significantly (see Fig. 5). The reason this phenomenon emerged was the low value of dim($V_1$): when dim($V_1$) was small, little room was left for deriving the discriminant projection axes, and hence the recognition rate dropped. Inspecting the remaining curve (the '+' sign) in Fig. 5, it is seen that, since there were only two samples in each class, the corresponding recognition rate curve is not as monotonic as those for the cases with three and six samples per class. This part of the experiment provides a good guide for making better decisions regarding the number of samples in each class and the number of classes in a database: when one wants to solve the small sample size problem with good performance, the above experimental results can serve as a useful reference.

3.3. Comparison with other approaches

In order to demonstrate the effectiveness of our approach, we conducted a series of experiments and compared our results with those obtained using two other well-known approaches. Fig. 6 shows the experimental results obtained using our approach, Liu et al.'s approach [10], and the template matching approach. The horizontal and vertical axes in Fig. 6 represent the number of classes in the database and the corresponding
Fig. 6. The experimental results obtained using our method ('+' sign), Liu's method ('×' sign), and template matching ('○' sign). The horizontal axis represents the number of classes in the database, and the vertical axis stands for the recognition rate. (a) Results obtained when each class contains only two samples; (b) results obtained when each class contains three samples; (c) results obtained when each class contains six samples.
recognition rate, respectively. In addition, the '+', '×', and '○' signs shown in Fig. 6 stand for the recognition results obtained using our approach, Liu et al.'s approach, and the template-matching approach, respectively. Furthermore, the data shown in Fig. 6(a)-(c) are the experimental results obtained when each class contained, respectively, 2, 3, and 6 samples. Among the three approaches, the template-matching approach performed recognition based solely on the original feature vectors; therefore, no LDA was involved in that process. From the data shown in Fig. 6, it is obvious that Liu et al.'s approach was the worst. Basically, the most serious problem in Liu's approach was its degraded discriminating capability: although the derived discriminant vectors maximized the modified Fisher's criterion function, the optimal class separability condition, which is the objective of an LDA process, was not necessarily satisfied. Therefore, the projection axes determined by Liu et al.'s approach could not guarantee the best class separability of all the database samples, and it is no wonder that the performance of Liu et al.'s approach was even worse than that of the template-matching approach. On the other hand, our approach was clearly superior because we forced the within-class scatter in the subspace to be zero. This constraint restricted the problem to a small domain; hence, it could be solved in a much easier way. Another advantage of our approach is that we do not need to
compute the inverse matrix. In Liu et al.'s approach [10], computation of the inverse matrix is indispensable. However, since we project all the samples onto an appropriate subspace, the computation of the inverse matrix, which is considered a time bottleneck, can be avoided. Another advantage of our approach over Liu et al.'s is the training time requirement. Fig. 7 shows three sets of experiments; in each set we used a different number of samples per class (2 in (a), 3 in (b), and 6 in (c)). The '+' and '×' signs represent, respectively, the results obtained using our approach and Liu et al.'s approach. From Fig. 7(a)-(c), it is obvious that the training time required by Liu et al.'s approach grew exponentially as the database was augmented. The reason for this outcome is the projection axes determination process: in Liu et al.'s method, the projection axes are determined iteratively, and in each iteration their algorithm has to derive the projection vector in a recalculated subspace. Therefore, their training time is exponentially proportional to the number of classes adopted in the database. In comparison with Liu et al.'s approach, our approach requires a constant time for training, because it only has to calculate the subspace once and then derives all the projection vectors in this subspace. The experimental results shown in Figs. 6 and 7 compare Liu et al.'s approach and ours in terms of accuracy and efficiency. In what follows, we compare our method with Liu et al.'s using another important criterion, namely stability. Table 2 shows a set of experimental results regarding the stability test between our method and Liu et al.'s.
Fig. 7. Training time required by our method ('+' sign) and Liu's method ('×' sign). The horizontal axis represents the number of classes in the database, and the vertical axis stands for the training time. (a) Results obtained when each class contains only two samples; (b) results obtained when each class contains three samples; (c) results obtained when each class contains six samples.
Table 2
Stability test executed during the derivation of the first optimal projection vector. The training database comprised 10 classes, where each class contains three samples. The second and fourth columns give the orientation difference (in degrees) between the current optimal projection vector and the projection vector derived in the previous iteration.

            Our method                            Liu's method
Iteration   Orientation       Recognition        Orientation       Recognition
            difference (deg)  rate (%)           difference (deg)  rate (%)
1           -                 90.56              -                 90.56
2           0.0006            90.56              92.3803           90.56
3           0.0005            90.56              98.7039           90.56
4           0.0005            90.56              127.1341          90.56
5           0.0005            90.56              100.4047          90.56
6           0.0006            90.56              94.8684           90.56
7           0.0006            90.56              97.4749           88.33
8           0.0007            90.56              77.8006           90.56
9           0.0006            90.56              99.7971           90.56
10          0.0006            90.56              75.0965           90.56

In this set of experiments, we tried to compute the first optimal projection vector in 10 iterations. The leftmost column of Table 2 indicates the iteration number. The second and fourth columns of Table 2 give the orientation difference (in degrees) between the current optimal projection vector and the projection vector derived in the previous iteration; the data in the second column were obtained by applying our method, while the data in the fourth column were obtained by applying Liu et al.'s method. Theoretically, the optimal projection vector determined from the same set of data should stay the same, or change only slightly, over 10 consecutive iterations. From Table 2, it is obvious that the projection vector determined by our method was very stable during the 10 consecutive iterations. On the other hand, the projection vector determined by Liu et al.'s method changed significantly between every two consecutive iterations. Linear algebra [23] tells us that an eigenvector will be very sensitive to small perturbations if its corresponding eigenvalue is close to another eigenvalue of the same matrix. Table 3 shows the eigenvalues obtained by our method and by Liu et al.'s: the eigenvalues obtained by our method are quite different from each other, whereas the eigenvalues obtained by Liu et al.'s method are almost the same. These data confirm that our method is much more stable than Liu et al.'s. Another important issue which needs to be discussed is the influence of the reserved percentage of dim(V) on the recognition rate. Since the construction of V is the most time-consuming task in our approach, we would like to show empirically that, by using only part of the space V, our approach can still obtain good recognition results.
Table 3
The eigenvalues used to derive the first optimal projection vector. The left column lists the eigenvalues determined using our method; the right column lists those determined using Liu et al.'s method.

Eigenvalues (our method)    Eigenvalues (Liu's method)
3.31404839e+04              1.00000000e+00
2.39240384e+04              1.00000000e+00
1.67198579e+04              1.00000000e+00
1.01370563e+04              1.00000000e+00
6.88308959e+03              1.00000000e+00
7.41289737e+03              1.00000000e+00
2.70253079e+03              1.00000000e+00
5.53323313e+03              1.00000000e+00
3.46817376e+03              1.00000000e+00
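The perturbation sensitivity quoted from [23] can be checked numerically. The following Python sketch is ours, not part of the paper: the matrices and the perturbation size are illustrative. It perturbs two symmetric matrices, one with well-separated eigenvalues (as in our method) and one with nearly identical eigenvalues (as in Liu et al.'s), and reports how far the leading eigenvector rotates.

    import numpy as np

    rng = np.random.default_rng(0)

    def leading_eigvec_rotation(eigenvalues, eps=1e-6):
        # Angle (degrees) between the leading eigenvector of diag(eigenvalues)
        # and that of a slightly perturbed copy of the same matrix.
        A = np.diag(np.asarray(eigenvalues, dtype=float))
        E = rng.standard_normal(A.shape)
        B = A + eps * (E + E.T) / 2.0          # small symmetric perturbation
        _, va = np.linalg.eigh(A)
        _, vb = np.linalg.eigh(B)
        cos = abs(va[:, -1] @ vb[:, -1])       # compare the leading eigenvectors
        return np.degrees(np.arccos(np.clip(cos, 0.0, 1.0)))

    # Well-separated eigenvalues: the eigenvector barely moves (~0 degrees).
    print(leading_eigvec_rotation([3.3e4, 2.4e4, 1.7e4]))
    # Nearly identical eigenvalues: the eigenvector rotates erratically.
    print(leading_eigvec_rotation([1.0, 1.0, 1.0]))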
Fig. 8 illustrates the influence of the reserved percentage of dim(V) on the recognition rate when the number of classes is changed. The '*', '×' and '○' signs indicate that there were 10, 20 and 30 classes in the database, respectively. In all of the above-mentioned cases, each class contained three distinct samples. From the three curves shown in Fig. 8, it is obvious that by reserving only 10% of dim(V), the recognition rate could still reach 94%. Fig. 9, on the other hand, illustrates the influence of the reserved percentage of dim(V) on the recognition rate when the number of samples in each class is changed. The '*', '×' and '○' signs indicate that there were 2, 3, and 6 samples in each class, respectively. From Fig. 9, we can see that by reserving only 10% of dim(V), the recognition rate could always reach 91%. Moreover, the results shown in Fig. 8 reflect that the information retained in the space V (the null space of S_w) was more sensitive to the number of classes. This means that when more classes are contained in the database, a higher percentage of V should be reserved to obtain good recognition results. On the other hand, Fig. 9 shows that the information about the same person was uniformly distributed over the null space of S_w; therefore, the percentage of dim(V) did not influence the recognition results very much.
Fig. 8. Illustration of the influence of the reserved percentage of dim(V) on the recognition rate. The '*', '×', and '○' signs mean that there are 10, 20, and 30 classes in the database, respectively. Each class contains three distinct samples. This figure shows that the information contained in the null space of S_w was more sensitive to the number of classes in the database.

Fig. 9. Illustration of the influence of the reserved percentage of dim(V) on the recognition rate. The '*', '×', and '○' signs mean that there are two, three, and six samples in each class, respectively. The database comprised 10 classes. This figure shows that the information for the same person was uniformly distributed over the null space of S_w; therefore, the percentage of dim(V) did not influence the recognition results very much.

4. Concluding remarks
In this paper, we have proposed a new LDA-based face recognition system. It is known that the major drawback of applying LDA is that it may encounter the small sample size problem, in which case the within-class scatter matrix S_w becomes singular. We have applied a theory from linear algebra to find projection vectors q such that q^T S_w q = 0 and q^T S_b q ≠ 0. Under these special circumstances, the modified Fisher's criterion function proposed by Liu et al. [10] reaches its maximum value, i.e., 1. However, we have found that an arbitrary projection vector q satisfying the maximum value of the modified Fisher's criterion cannot guarantee maximum class separability unless q^T S_b q is further maximized. Therefore, we have proposed a new LDA process which starts by calculating the projection vectors in the null space of the within-class scatter matrix S_w. If this subspace does not exist, i.e., S_w is nonsingular, then a normal LDA process can be used to solve the problem. Otherwise, the small sample size problem occurs, and we choose as projection axes the vector set that maximizes the between-class scatter of the transformed samples. Since the within-class scatter of all the samples is zero in the null space of S_w, the projection vector that satisfies the objective of an LDA process is the one that maximizes the between-class scatter. The experimental results have shown that our method is superior to Liu et al.'s approach [10] in terms of recognition accuracy, training efficiency, and stability.
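For concreteness, the following Python sketch (ours, under simplifying assumptions: the 1e-8 rank tolerance and all names are illustrative, and the authors' actual numerical procedure may differ) traces the process just described: build S_w and S_b, take a basis Q of the null space of S_w, and pick the directions in that subspace that maximize the between-class scatter.

    import numpy as np

    def null_space_lda(X, y, tol=1e-8):
        # Schematic version: find projection vectors q with q^T S_w q = 0,
        # then maximize q^T S_b q among them.
        classes = np.unique(y)
        mu = X.mean(axis=0)
        d = X.shape[1]
        Sw = np.zeros((d, d))
        Sb = np.zeros((d, d))
        for c in classes:
            Xc = X[y == c]
            mc = Xc.mean(axis=0)
            Sw += (Xc - mc).T @ (Xc - mc)
            Sb += len(Xc) * np.outer(mc - mu, mc - mu)
        w, V = np.linalg.eigh(Sw)
        Q = V[:, w < tol * w.max()]            # basis of the null space of S_w
        if Q.shape[1] == 0:
            raise ValueError("S_w is nonsingular: use ordinary LDA instead")
        # Inside null(S_w) the within-class scatter vanishes, so the LDA
        # objective reduces to maximizing the between-class scatter.
        wb, U = np.linalg.eigh(Q.T @ Sb @ Q)
        order = np.argsort(wb)[::-1][: len(classes) - 1]
        return Q @ U[:, order]                 # projection vectors as columns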
References
[1] R. Chellappa, C. Wilson, S. Sirohey, Human and machine recognition of faces: a survey, Proc. IEEE 83 (5) (1995) 705-740.
[2] D. Valentin, H. Abdi, A. O'Toole, G. Cottrell, Connectionist models of face processing: a survey, Pattern Recognition 27 (9) (1994) 1209-1230.
[3] R. Brunelli, T. Poggio, Face recognition: features versus templates, IEEE Trans. Pattern Anal. Mach. Intell. 15 (10) (1993) 1042-1052.
[4] A. Samal, P. Iyengar, Automatic recognition and analysis of human faces and facial expressions: a survey, Pattern Recognition 25 (1) (1992) 65-77.
[5] S.H. Jeng, H.Y. Mark Liao, C.C. Han, M.Y. Chern, Y.T. Liu, Facial feature detection using geometrical face model: an efficient approach, Pattern Recognition 31 (3) (1998) 273-282.
[6] C.C. Han, H.Y. Mark Liao, G.J. Yu, L.H. Chen, Fast face detection via morphology-based pre-processing, Pattern Recognition 1999, to appear.
[7] H.Y. Mark Liao, C.C. Han, G.J. Yu, H.R. Tyan, M.C. Chen, L.H. Chen, Face recognition using a face-only database: a new approach, Proceedings of the Third Asian Conference on Computer Vision, Hong Kong, Lecture Notes in Computer Science, Vol. 1352, 1998, pp. 742-749.
[8] B. Moghaddam, A. Pentland, Probabilistic visual learning for object representation, IEEE Trans. Pattern Anal. Mach. Intell. 19 (7) (1997) 696-710.
[9] M. Turk, A. Pentland, Eigenfaces for recognition, J. Cognitive Neurosci. 3 (1) (1991) 71-86.
[10] K. Liu, Y. Cheng, J. Yang, Algebraic feature extraction for image recognition based on an optimal discriminant criterion, Pattern Recognition 26 (6) (1993) 903-911.
[11] F. Goudail, E. Lange, T. Iwamoto, K. Kyuma, N. Otsu, Face recognition system using local autocorrelations and multiscale integration, IEEE Trans. Pattern Anal. Mach. Intell. 18 (10) (1996) 1024-1028.
[12] D. Swets, J. Weng, Using discriminant eigenfeatures for image retrieval, IEEE Trans. Pattern Anal. Mach. Intell. 18 (8) (1996) 831-836.
[13] P.N. Belhumeur, J.P. Hespanha, D.J. Kriegman, Eigenfaces vs. fisherfaces: recognition using class specific linear projection, IEEE Trans. Pattern Anal. Mach. Intell. 19 (7) (1997) 711-720.
[14] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, New York, 1990.
[15] Q. Tian, M. Barbero, Z.H. Gu, S.H. Lee, Image classification by the Foley-Sammon transform, Opt. Eng. 25 (7) (1986) 834-840.
[16] Zi-Quan Hong, Jing-Yu Yang, Optimal discriminant plane for a small number of samples and design method of classifier on the plane, Pattern Recognition 24 (4) (1991) 317-324.
[17] Y.Q. Cheng, Y.M. Zhuang, J.Y. Yang, Optimal Fisher discriminant analysis using the rank decomposition, Pattern Recognition 25 (1) (1992) 101-111.
[18] K. Liu, Y. Cheng, J. Yang, A generalized optimal set of discriminant vectors, Pattern Recognition 25 (7) (1992) 731-739.
[19] K. Liu, Y.Q. Cheng, J.Y. Yang, X. Liu, An efficient algorithm for Foley-Sammon optimal set of discriminant vectors by algebraic method, Int. J. Pattern Recog. Artif. Intell. 6 (5) (1992) 817-829.
[20] D.H. Foley, J.W. Sammon, An optimal set of discriminant vectors, IEEE Trans. Comput. 24 (1975) 281-289.
[21] L.F. Chen, H.Y.M. Liao, C.C. Han, J.C. Lin, Why a statistics-based face recognition system should base its recognition on the pure face portion: a probabilistic decision-based proof, Proceedings of the 1998 Symposium on Image, Speech, Signal Processing and Robotics, The Chinese University of Hong Kong, September 3-4, 1998 (invited), pp. 225-230.
[22] A. Fisher, The Mathematical Theory of Probabilities, Macmillan, New York, 1923.
[23] G.W. Stewart, Introduction to Matrix Computations, Academic Press, New York, 1973.
[24] B. Noble, J.W. Daniel, Applied Linear Algebra, Prentice-Hall, Englewood Cliffs, NJ, 1988.
[25] R.C. Gonzalez, R.E. Woods, Digital Image Processing, Addison-Wesley, Reading, MA, 1992.
[26] A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Prentice-Hall, Englewood Cliffs, NJ, 1988.
About the Author: LI-FEN CHEN received the B.S. degree in computer science from the National Chiao Tung University, Hsin-Chu, Taiwan, in 1993, and has been a Ph.D. student in the Department of Computer and Information Science at National Chiao Tung University since 1993. Her research interests include image processing, pattern recognition, computer vision, and wavelets.

About the Author: MARK LIAO received his B.S. degree in physics from the National Tsing-Hua University, Hsin-Chu, Taiwan, in 1981, and the M.S. and Ph.D. degrees in electrical engineering from Northwestern University in 1985 and 1990, respectively. He was a research associate in the Computer Vision and Image Processing Laboratory at Northwestern University during 1990-1991. In July 1991, he joined the Institute of Information Science, Academia Sinica, Taiwan, as an assistant research fellow. He was promoted to associate research fellow and then research fellow in 1995 and 1998, respectively. Currently, he is the deputy director of the same institute. Dr. Liao's current research interests are in computer vision, multimedia signal processing, wavelet-based image analysis, content-based image retrieval, and image watermarking. He was the recipient of the Young Investigators' Award of Academia Sinica in 1998, the best paper award of the Image Processing and Pattern Recognition Society of Taiwan in 1998, and the paper award of the same society in 1996. Dr. Liao served as the program chair of the International Symposium on Multimedia Information Processing (ISMIP), 1997. He also served on the program committees of the International Symposium on Artificial Neural Networks, 1994-1995; the 1996 International Symposium on Multi-technology Information Processing; and the 1998 International Conference on Tools for AI. Dr. Liao is an Associate Editor of the IEEE Transactions on Multimedia (1998-2001) and the Journal of Information Science and Engineering. He is a member of the IEEE Computer Society and the International Neural Network Society (INNS).

About the Author: JA-CHEN LIN was born in 1955 in Taiwan, Republic of China. He received his B.S. degree in computer science in 1977 and his M.S. degree in applied mathematics in 1979, both from the National Chiao Tung University, Taiwan. In 1988 he received his Ph.D. degree in mathematics from Purdue University, USA. In 1981-1982, he was an instructor at the National Chiao Tung University. From 1984 to 1988, he was a graduate instructor at Purdue University. He joined the Department of Computer and Information Science at National Chiao Tung University in August 1988, and is currently a professor there. His recent research interests include pattern recognition and image processing. Dr. Lin is a member of the Phi-Tau-Phi Scholastic Honor Society.
About the Author: MING-TAT KO received a B.S. and an M.S. in mathematics from the National Taiwan University in 1979 and 1982, respectively. He received a Ph.D. in computer science from the National Tsing Hua University in 1988, and then joined the Institute of Information Science as an associate research fellow. Dr. Ko's major research interests include the design and analysis of algorithms, computational geometry, graph algorithms, real-time systems and computer graphics.

About the Author: GWO-JONG YU was born in Keelung, Taiwan, in 1967. He received the B.S. degree in Information Computer Engineering from the Chung-Yuan Christian University, Chung-Li, Taiwan, in 1989. He is currently working toward the Ph.D. degree in Computer Science. His research interests include face recognition, statistical pattern recognition and neural networks.
Pattern Recognition 33 (2000) 1727-1740
Off-line Arabic signature recognition and verification
M.A. Ismail*, Samia Gad
Department of Computer Science, Faculty of Engineering, University of Alexandria, Alexandria 21544, Egypt
Received 30 September 1998; received in revised form 5 November 1998; accepted 5 November 1998
Abstract

Off-line signature recognition and verification is an important part of many business processes. It can be used in many applications such as cheques, certificates, contracts and historical documents. In this paper, a system of two separate phases for signature recognition and verification is developed. A recognition technique is developed based on a multistage classifier and a combination of global and local features. New algorithms for signature verification based on fuzzy concepts are also described and tested. It is concluded from the experimental results that each of the proposed techniques performs well on different counts. 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Off-line recognition; Signature verification; Feature extraction; Fuzzy similarity; Local centers of gravity; Critical points
1. Introduction

Signatures are a special case of handwriting in which special characters and flourishes occur. In many cases the signature may be unreadable; as a result, the signature is handled as an image. Signatures are subject to two types of forgery, simple and simulated. In the first type, the forger has no previous knowledge of the signature, and the style of the forged signature differs from the original; simple forgeries are therefore easily identified. In the second type, the forger knows the signature well and has the ability to simulate or copy it, so the simulated signature is very similar to the original one, making the forgery much more difficult to verify [1]. On-line handwriting recognition means that the machine recognizes the handwriting as the user writes [2]. It requires a transducer that captures the signature as it is written. Off-line handwriting recognition, in contrast, is performed after the writing is complete. The data are captured at a later time by using an optical scanner to convert the image into a bit pattern. Because far more information can be extracted from dynamic or on-line
* Corresponding author. Tel.: +203-596-0052; fax: +203-597-1853. E-mail address: [email protected] (M.A. Ismail).
signatures, much less attention has been paid to off-line processing. The on-line system produces time information such as acceleration (speed of writing), retouching, pressure and pen movement. It already achieves recognition and verification rates of 100%, so little of value can be added in this field. Most of this information is lost in the off-line system; however, other useful factors which can be used to differentiate the handwriting of one person from another still exist. Off-line signature processing remains important since it is required in office automation (OA) systems. It is used for the validation of cheques, credit cards, contracts, historical documents, etc. Since the signature is processed as an image, there is no great difference between Arabic signatures and other signatures, but experiments have shown that some features are not effective in Arabic signature recognition and verification. The purpose of the recognition process is to identify the writer of a given sample, while the purpose of the verification process is to confirm or reject the sample. Machine recognition and verification of signatures is a very special and difficult problem. The difficulty arises from the following [3]:

• The complexity of signature patterns and the wide variation in the patterns of a single person (i.e., there is no ideal signature shape for any one person).
0031-3203/00/$20.00 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved. PII: S 0 0 3 1 - 3 2 0 3 ( 9 9 ) 0 0 0 4 7 - 3
• The forged signatures produced by professional forgers may be very similar to the original; even a well-trained and careful eye may not be able to detect the difference.
• The different conditions under which the actual signing takes place may seriously affect the quality of the signature.
• The existence of a large number of signatures in the database requires a rapid and efficient searching method.
• In practice, while samples are collected and classified for each genuine signature, at the very beginning no pre-established class is created for forged signatures.

While human beings are trained to recognize all symbols and patterns regardless of type, considerable research has been done to find algorithms that recognize not only numerals and letters but also special symbols and signatures. The most recent research published on this subject [4,5] is based on neural net classifiers. In this paper, the problem of signature recognition is separated from signature verification; the two are treated as separate and consecutive phases. Successful verification is therefore highly dependent on successful recognition, and the features used during the recognition phase are not the same as those used later during the verification phase. In a way which is not well understood at present, the human brain has a remarkable ability to assign a grade of membership to a given object without knowing how it arrived at the said grade. This fact is the basis of the work presented in this paper: signatures are recognized and verified according to a grade of membership in the set of genuine signatures. The recognition phase is based on a three-stage classifier with two modifications. The verification phase applies fuzzy concepts in decision making. The remainder of the paper is divided into five sections. In Section 2, the signature database is described. In Section 3, preprocessing steps and feature extraction for the recognition process are discussed. The recognition classifier is presented in Section 4. In Section 5, the verification features and algorithms are described. Finally, Section 6 concludes the work and illustrates possible future extensions.
2. Signature data

A set of signature data consisting of 220 true samples and 110 forged samples was used. True samples were obtained from 22 persons. Every signer was asked to sign 10 times using common types of pens (fountain pen or ball pen) without significant rotation (i.e., rotation invariant). Since it was very difficult to find professional forgers, volunteers were asked to simulate the true samples of all persons. They were allowed to practice many times and to correct their mistakes in the final version of the forgery samples. All samples were written in a limited space (3×2 in) and horizontally oriented on a white sheet of paper. The 10 signatures collected from each person were used as follows: six were selected at random for system learning, and the remaining four were used for system testing, in addition to five forged samples. Out of the six learning samples, the two nearest to the mean were selected to represent the local features of the genuine signature. The signatures were scanned into the computer in black and white at a resolution of 200 dots per inch. These binary images constituted the raw data for system development and evaluation.

3. Preprocessing and feature extraction

Any image-processing application suffers from noise such as touching line segments, isolated pixels and smeared images. This noise may cause severe distortions in the digital image and hence ambiguous features and a correspondingly poor recognition and verification rate. Therefore, a preprocessor is used to remove noise. Preprocessing techniques eliminate much of the variability of signature data. During this stage some features are extracted to be used in the subsequent recognition process. Indeed, a perfect preprocessing system would make the signatures of the same person uniform, removing as much noise as possible and preparing the resulting data for feature extraction and classification, thus improving the performance of the recognition and verification system. The primary concern is to keep the main characteristics of the signatures unchanged. Since the existing techniques for separating the signature from a noisy background have shown a high percentage of successful separation [1], we assume that the signatures have already been extracted from the background. The preprocessing and extraction are performed in the following order.

3.1. Area filter
This filter removes small dots and isolated pixels (Fig. 1). This must be done because, when the signature includes dots, the signer usually makes no effort to place them in their correct positions. While these dots usually do not affect global features, they should be removed to prevent them from interfering with the local features. A simple algorithm is proposed to extract a dot even if it is enclosed in a circle: overlapped runlengths are merged into one rectangle, and if the area of the resulting rectangle is less than one hundredth of the signature area, the rectangle is deleted.
Fig. 1. Area filter.
3.2. Translation

This step maps the signature to the origin point at the upper left corner. It calculates the width (W), height (H) and area (A) of the signature and the total number of black pixels (T).

3.3. Extraction of the circularity feature

This is defined as the ratio A/C, where A is the area of the signature and C is the area of the smallest circle that surrounds the signature and has the same central point as the signature [6]. This step also calculates the radius of the circle (Srad).

3.4. Normalization

This step scales the signature to the standard size, which is the mean size of the learning samples.

3.5. Image enhancement

Smoothing operations reduce the peaks and holes existing in the shape of the signature. They are performed using the neighborhood-averaging technique.

3.6. Obtaining the partial histogram and the centers of gravity [7]

3.6.1. Vertical projection
The image is projected on the vertical axis:

P_v[x] = Σ_y black.pixel(x, y),   (1)

where the sum runs over y = 1, …, m and m is the image width.

3.6.2. Horizontal projection
The image is projected on the horizontal axis:

P_h[y] = Σ_x black.pixel(x, y),   (2)

where the sum runs over x = 1, …, n and n is the image height.

3.6.3. Centers of gravity
The vertical center of gravity C_v is obtained from the vertical projection as

C_v = Σ_x x·P_v[x] / Σ_x P_v[x],   (3a)

and the horizontal center of gravity C_h is obtained from the horizontal projection as

C_h = Σ_y y·P_h[y] / Σ_y P_h[y].   (3b)

Similar signatures have central points of gravity which are approximately the same for their similar segments or parts. The idea of localizing global features can be very useful in overcoming the disadvantages of global and local features while benefiting from the advantages of both. From experimental results, it has been found that the global center of gravity alone is not enough [7]. Hence, the image was divided into four parts, with four centers of gravity, whereupon the results of recognition were slightly improved (80.1%). The image was then split into 16 parts in order to localize the center-of-gravity feature. The partial histograms of the individual parts are obtained and centers of gravity are calculated for each, as shown in Fig. 2. These local centers of gravity are used during the recognition process. This technique improved the results significantly (89.1%) (see the code sketch below).

3.7. Extraction of the global baseline (BSL)

The global baseline corresponds to the maximum point (peak) of the smoothed global vertical projection curve (Fig. 3):

P_m = max{P_v[x]},   (4)

where x = 1, …, n.
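To make Eqs. (1)-(3b) and the 16-window localization concrete, here is a small Python sketch. It is ours, not the authors' code: the function name, the grid parameter and the binary-image convention (1 = black pixel) are assumptions.

    import numpy as np

    def local_centers_of_gravity(img, grid=4):
        # Split a binary image into grid x grid windows and compute each
        # window's center of gravity from its projections, Eqs. (1)-(3b).
        h, w = img.shape
        centers = []
        for bi in range(grid):
            for bj in range(grid):
                win = img[bi * h // grid:(bi + 1) * h // grid,
                          bj * w // grid:(bj + 1) * w // grid]
                pv = win.sum(axis=0)           # vertical projection
                ph = win.sum(axis=1)           # horizontal projection
                if pv.sum() == 0:              # window contains no ink
                    centers.append((np.nan, np.nan))
                    continue
                cv = (np.arange(len(pv)) * pv).sum() / pv.sum()   # Eq. (3a)
                ch = (np.arange(len(ph)) * ph).sum() / ph.sum()   # Eq. (3b)
                centers.append((cv, ch))       # window-local coordinates
        return centers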
3.8. Extraction of the upper limit (UL)

This is the maximum difference between the smoothed curve of the vertical projection and the approximated curve of the same projection above the baseline.

3.9. Extraction of the lower limit (LL)

This is the maximum difference between the smoothed curve of the vertical projection and the approximated curve of the same projection under the baseline.

3.10. Thinning

An important approach for representing the structural shape of the signature is to reduce it to its skeleton by using a thinning algorithm. This eliminates the effect of different line thicknesses resulting from the use of different writing pens. There are many thinning algorithms that can be used to obtain the skeleton of the signature. In this paper an algorithm developed by Zhang and Suen [8] is used.

3.11. Calculation of the global slant

This feature is defined as follows: given a black pixel p(i, j) in the thinned image, the black pixels p(i−1, j−1), p(i−1, j), p(i−1, j+1) and p(i, j+1) are negatively (NS), vertically (VS), positively (PS) and horizontally (HS) slanted pixels, respectively. This slant feature is measured on the whole signature and then normalized with respect to the total number of black pixels to obtain a more stable and representative slant.
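A direct way to realize this count is sketched below in Python. The sketch is ours: the function name is illustrative, and it assumes a binary thinned image.

    import numpy as np

    def global_slant(skel):
        # For every black pixel p(i, j) of the thinned image, test the
        # four named neighbours and normalize the counts by the total
        # number of black pixels.
        s = np.pad(skel.astype(bool), 1)       # zero border keeps indices valid
        i, j = np.nonzero(s)
        total = max(len(i), 1)                 # guard against an empty image
        counts = {
            "NS": int(s[i - 1, j - 1].sum()),  # negatively slanted
            "VS": int(s[i - 1, j].sum()),      # vertically slanted
            "PS": int(s[i - 1, j + 1].sum()),  # positively slanted
            "HS": int(s[i, j + 1].sum()),      # horizontally slanted
        }
        return {k: v / total for k, v in counts.items()}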
4. Signature recognition
Fig. 2. Central points of gravity, partial histogram.
The recognition process classifies a given sample as belonging to one of the known writers in the database. This section deals with the recognition process and consists of three subsections: feature extraction, classification and results.

4.1. Feature extraction

Feature extraction plays a very important role in all pattern recognition systems. It is preferable to extract those features which will enable the system to correctly discriminate one class from the others. In general, the features are classified into two main groups:
Fig. 3. BSL, UL, and LL.
(a) Global features, which describe or identify the signature as a whole (i.e., the global characteristics of the signature), e.g. the width and the baseline. Although any distortion of an isolated region of the signature will result in small changes to every global feature, global features are less sensitive to signature variation. They are also less sensitive to the effect of noise.

(b) Local features, which represent a portion or a limited region of the signature, e.g. critical points and gradients. Local features are sensitive to noise and even to small distortions, but they are not affected by other regions of the signature. They are computationally expensive; however, they are much more accurate.

It is believed that a suitable combination of global and local features will produce more distinctive and effective features, and that the idea of localizing global features will allow the system to avoid the major drawbacks of both and to benefit from the advantages of both. Hence, with some modifications to existing techniques, the following steps were implemented:

(1) Global features were used in the first and second stages of the classifier.
(2) The center-of-gravity feature was localized in 16 square windows for use in the third stage of the classifier.
(3) Computationally expensive local features were used in the verification phase.
A set of features was extracted during the preprocessing steps. By computing certain measurements (ratios) on this set, a description of the signature was obtained. Six ratios were calculated to represent the global feature vector of the input signature pattern, as follows (see the code sketch at the end of this subsection):

The width-to-height ratio = W/H.
The circularity ratio = A/(π · Srad²).
The intensity ratio = T/A.
The relative position of the baseline = BSL/H.
The relative position of the lower limit = LL/H.
The relative position of the upper limit = (H − UL + 1)/H.

4.2. Classification

Previous studies showed that a single-stage classification algorithm generally did not yield a low error rate. Hence, many researchers have turned their attention to the use of more complex structured features. Of the many papers presented in the field of Chinese character recognition and handwriting recognition systems, most used multistage classifiers during recognition in order to achieve better results. Even if the image representation is carefully chosen, the time complexity of a sequential search in a large data set is still relatively high. Since the amount of data for this application is very large, a hierarchical recognition approach is required. Therefore, a multistage classifier is used in which a preclassification stage for a group of similar-slant signatures is applied first. Then, a recognition scheme is applied to resolve individual identification within a group. In the second stage, the distances between the global feature vector of the input sample and the mean of each class in the group are computed and compared sequentially to select the best three candidates. Finally, in the third stage, the local center points are used to choose the best candidate or to decide if the sample is not recognized. The decision is based on the corresponding threshold of each candidate class.

4.2.1. First stage

Preclassification is the most important stage in signature recognition. The success of recognition depends heavily upon finding the group that matches the input signature. From standard deviation results, the slant feature appears powerful enough for use in a first-stage classifier. Experiments have shown that it is quite effective to divide the data into four groups (negatively (NS), vertically (VS), positively (PS) and horizontally (HS) slanted) as a first stage, resulting in 99% accuracy. It has been found from the real-life data set that it is very difficult for a person to change the slant of his or her hand while signing or even writing. In addition, while negative slanting is predominant in Latin signatures, it does not exist in Arabic signatures.
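Assembled in code, the global feature vector amounts to a handful of arithmetic ratios. The sketch below is ours; the argument names simply mirror the measurements defined during preprocessing (Section 3).

    import math

    def global_feature_vector(W, H, A, T, Srad, BSL, UL, LL):
        # The six global ratios of the recognition feature vector.
        return [
            W / H,                       # width-to-height ratio
            A / (math.pi * Srad ** 2),   # circularity ratio
            T / A,                       # intensity ratio
            BSL / H,                     # relative position of the baseline
            LL / H,                      # relative position of the lower limit
            (H - UL + 1) / H,            # relative position of the upper limit
        ]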
4.2.2. Second stage

In accordance with the information provided by the global feature vector, the input pattern is matched sequentially with all classes in the appropriate slant group to get the key code of the best three candidate classes. This matching is done by comparing the global feature vector of the input pattern to the representative global feature vector of each signature class. The global features of each class are represented by a single vector [9]. This vector may simply be the statistical mean of the available learning samples known to belong to the class; note that the use of the mean as a class prototype is less sensitive to noise. The mean of any global feature j is calculated using Eq. (5) in order to eliminate the minimum and maximum values of the feature:

M_j = (Σ_i F_ji − min_i(F_ji) − max_i(F_ji)) / (n − 2),   (5)

where i = 1, …, n runs over the learning samples. The candidate is chosen if it is nearer to the input pattern. This nearness is judged by the Euclidean distance, with each feature weighted by the inverse of its standard deviation:

d = [(1/n) Σ_{i=1}^{n} ((F_i − μ_i)/σ_i)²]^{1/2},   (6)

where n is the number of features used, F_i is the measured value of the ith feature for the unknown sample x, μ_i is the mean of the ith feature computed over the genuine-signature learning set, and σ_i is the standard deviation computed on the same set. To accelerate the sequential search process, the partial-sum approach is used [10], in which the distance calculation for a pattern is terminated if its accumulated sum exceeds the third running minimum found (see the code sketch following Section 4.2.3).

4.2.3. Third stage

The unknown input signature is expected to belong to one of the three resulting candidate classes. The third stage tests the local features of the three candidates. There is no need to test the rest of the group, because a signature which fails the global test will surely fail the local test. Two ideal forms (prototypes) are taken for each person to represent the local features; because of the wide variation in the signatures of one person, the mean may not be a good representation in this case. The classifier computes the distance between the input pattern and the two prototypes of each of the three candidate classes. Then it places the input pattern in the same class as its nearest neighbor. If the minimum distance between the input pattern and the best candidate is greater than the threshold of the candidate class, then the pattern is judged to be unrecognizable. Otherwise, the input pattern belongs to the class of the best candidate.
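A minimal Python sketch of Eq. (6) with the partial-sum cutoff follows. The sketch is ours; the interface and the squared-threshold comparison are illustrative.

    import math

    def weighted_distance(f, mu, sigma, best_so_far=math.inf):
        # Eq. (6) with the partial-sum cutoff: stop accumulating once the
        # running sum can no longer beat the current third-best distance.
        n = len(f)
        acc = 0.0
        for fi, mi, si in zip(f, mu, sigma):
            acc += ((fi - mi) / si) ** 2
            if acc / n > best_so_far ** 2:
                return math.inf            # pruned: cannot be a candidate
        return math.sqrt(acc / n)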
Table 1
Recognition results

                     Correct classification   Incorrect classification   Rejected
Without threshold    95.00%                   5.00%                      0.00%
With threshold       91.82%                   1.36%                      6.82%
This method is very useful because it rejects many forged samples without going through the verification test. On the other hand, if no threshold exists, then the input pattern must be assigned to the class with the minimum distance (nearest neighbor). Proper selection of threshold values is important to the success of the recognition process. The maximum distance between the mean and the learning patterns of each class can be considered an acceptable threshold for that class. When one threshold was applied to the entire data set, as in previous related work, the recognition rate was very low (nearly 78%). The performance of the system depends on correct threshold adjustment, which is a difficult task. Some people's signatures are very precise and uniform, with few differences between input samples; such patterns require a small threshold in order to capture forgeries. Other people's signatures are imprecise or unstable, with great variations between input samples; in such cases, the threshold must be high enough to allow genuine samples to pass while capturing forged ones. Therefore, an individual threshold was taken for each class and saved within the feature vector.

4.3. Experimental results

Although the discrimination results were good, some defects exist in this method. The major defect is excessive memory usage, caused by the need to save the local feature information of two prototypes for every class (person). Adding a threshold decreases the recognition rate in exchange for decreasing the incorrect classification rate. However, this is acceptable because the rejected signatures can be checked manually (Table 1).
5. Signature verification

The manual verification of signatures has several drawbacks. The speed of verification is relatively low, and the correct verification rate is affected by the technical expertise and the current mood of the verifier. As a result, there is a real need for the automatic verification of signatures. The goal of an automatic signature verification (ASV) system is to confirm or invalidate the presumed identity of the signer from information obtained during the execution of the signature. A good verification system is expected to satisfy the following requirements:

(1) Reliability: it should detect forgeries if there is adequate distinction between the input sample and the original pattern.
(2) Adaptability: it should identify the genuine signature even with slight variations.
(3) Practicality: it should be possible to implement it as a real-time system.

A verification system based on fuzzy concepts is proposed in this paper.

5.1. Feature extraction

As true samples and forgeries are very similar in most cases, it is very important to extract an appropriate feature set for discriminating between genuine and forged signatures. Any forgery must contain some deviations from the original model, while a genuine sample is rarely identical to the model although there must be similarities between them. Such similarities make it possible to determine the features needed for the verification process. Based on experimental observations, some local features were chosen to form a primary feature set. These features are effective in the verification process, showing a relatively high rate of correctness, since they give more importance to pixel positions and are less sensitive to noise. Each proposed feature was tested independently; then different combinations of these features were formed and tested to select the best feature set according to (1) the optimum number of representative points, in order to minimize the image database, and (2) the best verification results. Though computationally exhaustive, the selected features are very sensitive to any variation in the signature. These primary features are: central line features, corner line features, central circle features, corner curve features and critical point features. Together they produce signature snapshots taken from different angles. These features are explained briefly in the next subsections. (Note: the extracted points are marked by small circles.)
5.1.1. Central line features

The main purpose of feature extraction is to represent the pattern using a sufficient number of points containing as much information as possible. We propose a method to locate the desired representative pixels in the image in a simpler way than slope-change notation (gradient vector) and line segments. Instead of searching for the points over the entire image, the search can be done along fixed routes.
In this feature, a net of 24 central lines is superimposed on the signature image. The intersection points between the central lines and the corresponding points on the signature are then found. The centroid of the signature is taken as the origin, and the search proceeds from the origin point to the border of the image. It is convenient to convert the results into polar form such that all intersection points on the same line are defined by the line number and their radial displacements. Assume that the origin is (x₀, y₀) and the intersection point coordinates are (x_i, y_i). The radial displacement L_i is given by

L_i = √((x_i − x₀)² + (y_i − y₀)²).   (7)

This radial displacement (length) is used as a similarity measure, where the tolerance (ΔL) is the displacement between two consecutive points on the same line. As the angle between any two consecutive lines increases, a smaller tolerance should be used, since the number of significant points decreases (i.e., if an intersection point is found on line j, then a displacement ΔL is taken to start the search for the next point). It is very useful to use the symmetry property of the lines with respect to the horizontal axis (h), the vertical axis (v), and the diagonal axes (d1) and (d2) in order to reduce the computations required to find line points. The image is divided into eight parts and the line points are calculated for one-eighth of the image only. The symmetry property is then applied to get the intersection points in the other seven parts, if they exist. All eight points have the same radial displacement and different coordinates. Assuming that the cartesian coordinates of point p1 in Fig. 4a are (x, y), the coordinates of the other seven points are: p2 = (y, x), p3 = (y, −x), p4 = (−x, y), p5 = (−x, −y), p6 = (−y, −x), p7 = (−y, x), p8 = (x, −y).

5.1.2. Corner line features

Corner line features are defined in the same way as central line features, but the origin point is at the upper left corner. This avoids the high-intensity central region that central line features suffer from (i.e., a large number of points per pattern), since most signatures contain few line segments in the corners. The symmetry property of the lines with respect to the diagonal axis (d) is applied to reduce computations. The coordinates of the two symmetrical points are p1 = (x, y) and p2 = (y, x), as shown in Fig. 5.
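The eightfold symmetry is easy to state in code. A tiny Python sketch (ours; the function name is illustrative), for a point found in the first eighth of the image with the centroid as origin:

    def octant_symmetry(x, y):
        # From one intersection point (x, y) found in the first eighth of
        # the image, all eight symmetric candidates share the radial
        # displacement of Eq. (7), since each is a reflection of (x, y).
        return [(x, y), (y, x), (y, -x), (-x, y),
                (-x, -y), (-y, -x), (-y, x), (x, -y)]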
Fig. 4. (a) Central lines. (b) Central line features.
in the center of the retina is the most visually sensitive part of the eye and that the visual region is a circle. Central annular windows are used in a way that is very close to the human vision system. These circular windows are drawn over the signature in fixed increments. The intersection points between the signature and the kth circle circumference are then found. The kth circumference is divided into eight parts, and the points on one-eighth are calculated. The rest of the points are again obtained using symmetry, as shown in Fig. 6. It is enough to represent each point by its polar angle. To translate the results to polar representation, assume that the origin is (x₀, y₀), the intersection point is (x_i, y_i), and θ_i is the polar angle, given by

θ_i = tan⁻¹((y_i − y₀)/(x_i − x₀)).   (8)

This polar angle (θ) is used as a similarity measure, where the tolerance (Δθ) is the angle between any two consecutive points on the same track. The tolerance (Δθ) is a function of the radius of the first circle (i.e., if an intersection point is found on circle j, then an angle Δθ is taken to start the search for the next point).
Fig. 5. (a) Corner lines. (b) Corner line features.
Fig. 6. (a) Central circle. (b) Central circle features.
The origin point is the centroid of the signature, and the direction of searching is counter-clockwise in the first eighth of the image. The eight symmetrical points have the following coordinates: p1 = (x, y), p2 = (y, x), p3 = (y, −x), p4 = (−x, y), p5 = (−x, −y), p6 = (−y, −x), p7 = (−y, x), p8 = (x, −y). The corresponding angles are calculated as follows: θ1, θ2 = 90° − θ1, θ3 = 90° + θ1, θ4 = 180° − θ1, θ5 = 180° + θ1, θ6 = 270° − θ1, θ7 = 270° + θ1, θ8 = 360° − θ1.

5.1.4. Corner curve features

In order to identify the image clearly it is necessary to zoom in (i.e., increase the number of central annular windows) or change the point of view (i.e., the vision angle).
The corner curve features are based on the same concept as the central circle features, except that the origin point is the upper left corner (or any other corner). The snapshot is therefore taken from a different angle, allowing the detection of more intersection points. The symmetry property of the points is considered with respect to the diagonal axis (d), as shown in Fig. 7. The coordinates of the two symmetrical points are p1 = (x, y) and p2 = (y, x); the corresponding angles are θ1 and θ2 = 90° − θ1. However, the corner curve features do not improve the results significantly, because the central circle features alone are powerful enough.

5.1.5. Critical point features

For the purpose of verification, it has been noted that some dominant points on the signature are rich in information content, and they are sufficient to verify the forgery of a signature. The idea of dominant or critical points has been applied successfully in the field of signature verification.
Fig. 8. Search around a pixel.
Fig. 7. (a) Corner curves. (b) Corner curve features.
From a structural point of view, critical points can be classified into two types: end points and intersection points. This approach is close to human intuition. A simple algorithm is proposed to extract these critical points, based on the points within high-intensity regions. A square is drawn around a candidate point to detect how many times a line intersects the square: one crossing means that the point is an end point, two crossings mean that the point lies on a line segment, and more than two crossings mean that the point is an intersection point. Pixels in the image plane are identified by their (x, y) coordinates, and the dissimilarity measure is the Euclidean distance between corresponding points. Matching one point against all the remaining points in the model image is computationally exhaustive, so a refinement is required. The points are divided into 16 regions according to their positions (i.e., the image is divided into 4×4 squares, with each square containing a subset of the points). Each point in the input pattern is compared to the points in its own region in the prototype; if it does not match any point there, it is compared to the points in the neighboring regions only, not in all regions.
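One way to realize the square test is to walk the border of the square and count the black runs that cross it. The Python sketch below is ours, not the authors' implementation; the window radius r and the run-counting strategy are illustrative.

    import numpy as np

    def classify_point(skel, i, j, r=2):
        # Walk the border of the (2r+1)-square centred on pixel (i, j) of
        # the thinned image and count the black runs crossing it:
        # 1 run -> end point, 2 -> line point, >2 -> intersection point.
        s = np.pad(skel.astype(int), r + 1)    # zero border keeps slices valid
        i, j = i + r + 1, j + r + 1
        top = s[i - r, j - r:j + r + 1]
        right = s[i - r:i + r + 1, j + r][1:]
        bottom = s[i + r, j + r:j - r - 1:-1][1:]
        left = s[i + r:i - r - 1:-1, j - r][1:-1]
        border = np.concatenate([top, right, bottom, left])   # closed loop
        runs = np.count_nonzero((np.roll(border, 1) == 0) & (border == 1))
        if runs == 1:
            return "end point"
        return "intersection point" if runs > 2 else "line point"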
Fig. 9. Critical point features.
5.1.6. Advantages and disadvantages of the features

While matching between images using critical points requires an exhaustive combinatorial search, the other features are constrained by the route (line or curve) number, where the search is performed route by route (Fig. 8). However, critical point features produce the best verification results when used alone (Fig. 9). Corner line features and central circle features have a reasonable number of representative points. Since central line features suffer from a high-intensity central region and corner curve features give the lowest verification results, the appropriate discriminating feature set consists of a combination of the corner line features, central circle features and critical point features.
5.1.7. Search complexity

(1) Critical point features:
• Without eliminating the matching points, the search complexity is n, where n is the average number of points per image.
• If the matching points are eliminated, the search complexity is n(n+1)/2.
• For the best case after refinement, the search complexity is n/16, where the total number of regions is 16.
• For the worst case after refinement, the search complexity is 8(n/16), where the number of neighboring regions is 8.

(2) For the other features, the search complexity is m·x(x+1)/2, where x is the average number of points per line (or curve), and m is the number of lines (or curves) per image. Table 2 presents the average number of points.

5.2. Application of fuzzy concepts in the verification system

Since much of the uncertainty in decision-making derives from the fuzziness of the problem and the similarity between genuine and forged samples, applying fuzzy concepts in this field in order to arrive at a certainty factor is quite effective. Instead of a sharp threshold that separates forged from genuine samples, the use of fuzzy feature definitions can improve performance. The proposed system assigns a degree of certainty to the signature type, while existing systems employ what are essentially sharp threshold methods. Once points of interest are selected, the system assigns fuzzy grades to these points depending on their degrees of matching. Various fuzzy rules are then used to judge the type of signature read (forged or genuine).

5.2.1. Similarity measure

Consider an unknown pattern X represented by n points. It has been identified as a member of class C during the recognition phase. Let R1 and R2 be the reference vectors of the two prototypes of this class.
Fig. 10. Fuzzy set for the distance.
The pattern X can be assigned two grades of membership to this class. These grades are derived using the method described below.

5.2.1.1. Best-fit method. This method finds the best fit between any input point on route j (line or curve) and all the points on the corresponding route j for prototype i in class C. It then assigns a grade to this point according to the degree of matching, as shown in Fig. 10, using four fuzzy states based on the distance (i.e., the distance between the radial displacements, the difference between the polar angles, or the distance between two points in the x-y plane) within the range (1, 12). Each state is described by a word: 'Match', 'Near', 'Mid' and 'Far'. These are the elements of the fuzzy set. The membership grade functions have a trapezoidal shape. After measuring the distances between the points of the input pattern and the points of the model pattern, there will be n1 points in the state 'Match', n2 points in the state 'Near', n3 points in the state 'Mid', and n4 points in the state 'Far'. These linguistic values are then converted into conventional numerical values v1, v2, v3 and v4. To calculate the grade G_i, Eq. (9), which is similar to a center-of-gravity calculation, is applied:

G_i = (n1·v1 + n2·v2 + n3·v3 − n4·v4) / ((n1 + n2 + n3 + n4)·v1).   (9)
Table 2
Average number of points

Feature           n    m    x
Central lines     48   24   2.0
Corner lines      38   15   2.5
Central circles   39   10   3.9
Corner curves     17   10   1.7
Critical points   14   *    *
Using this method, points that may not completely match are not eliminated from consideration. The penalty concept is effective, since ideal matching is not to be found. If a match is found, the matching points are not deleted from the list, as a closer match may be found between the point in the model pattern (or prototype) and another point in the input sample. This distance grade is used to determine the degree of similarity or dissimilarity between the input sample and the model.
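The grading step can be sketched in a few lines of Python. The sketch is ours: the bucket edges standing in for the trapezoidal membership functions and the numerical values v1-v4 are illustrative, since they are not listed here.

    def fuzzy_grade(distances, v=(4, 3, 2, 1), edges=(3, 6, 9)):
        # Bucket the point-to-point distances into the four fuzzy states
        # Match/Near/Mid/Far and combine the counts as in Eq. (9); 'Far'
        # points act as a penalty on the grade.
        v1, v2, v3, v4 = v
        n = [0, 0, 0, 0]
        for d in distances:
            if d <= edges[0]:
                n[0] += 1                      # 'Match'
            elif d <= edges[1]:
                n[1] += 1                      # 'Near'
            elif d <= edges[2]:
                n[2] += 1                      # 'Mid'
            else:
                n[3] += 1                      # 'Far'
        n1, n2, n3, n4 = n
        return (n1*v1 + n2*v2 + n3*v3 - n4*v4) / ((n1 + n2 + n3 + n4) * v1)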
5.2.2. Verification decision rules

It is not difficult to detect the difference between the input sample and the model sample, but it is very difficult to decide whether it arises from a genuine signer or a forger. Fuzzy rules allow computers to simulate the type of human knowledge known as common sense, which exists mainly in the form of statements that are usually, but not always, true. One fuzzy rule can replace many conventional rules. When a non-conservative expert was asked to make a decision (i.e., forged or genuine) based on the feature set, he said: 'the signature is genuine if it passes the first feature test (match with prototype 1 AND prototype 2) OR it passes the second feature test (match with prototype 1 AND prototype 2) OR it passes the third feature test (match with prototype 1 AND prototype 2)'. To translate this human reasoning into conventional mathematical reasoning, for each premise expression connected by AND, take the minimum of the truth of the expression (i.e., the degree of membership of the minimum value in the expression); for each premise expression connected by OR, take the maximum of the truth of the expression (i.e., the degree of membership of the larger value in the expression). See rule 4 below. A conservative expert, on the other hand, said: 'the signature is genuine if it passes the first feature test (nearly match with prototype 1 AND prototype 2) AND it passes the second feature test (nearly match with prototype 1 AND prototype 2) AND it passes the third feature test (nearly match with prototype 1 AND prototype 2)'. See rule 2 below. Other rules were formed in a similar manner. The rules are:

Rule 1: Maximum over features of grade j (average grade of the two prototypes).
Rule 2: Minimum over features of grade j (average grade of the two prototypes).
Rule 3: Maximum over features of grade j (minimum grade of the two prototypes).
Rule 4: Minimum over features of grade j (maximum grade of the two prototypes).
Rule 5: Average over features of grade j (maximum grade of the two prototypes).
Rule 6: Average over features of grade j (minimum grade of the two prototypes).

Here j = 1 for the corner line features, 2 for the central circle features and 3 for the critical point features. There are two types of errors. Type I: accepting a forged signature as genuine. Type II: rejecting a genuine signature as forged. Any of the above rules can be used to make the final decision, depending on the type of error to be minimized. As a further step, it is possible to base the decision on the results of more than one rule by connecting them with logical ORs or ANDs.
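Translated into code, the six rules reduce to min/max/average combinations. The sketch below is ours; the data layout (one pair of prototype grades per feature) is an assumption.

    def combine_grades(grades, rule):
        # grades[j] = (grade against prototype 1, grade against prototype 2)
        # for feature j: 0 = corner lines, 1 = central circles,
        # 2 = critical points. AND maps to min, OR to max.
        avg = [sum(g) / 2 for g in grades]
        mn = [min(g) for g in grades]
        mx = [max(g) for g in grades]
        return {1: max(avg), 2: min(avg),      # rules 1 and 2
                3: max(mn), 4: min(mx),        # rules 3 and 4
                5: sum(mx) / len(mx),          # rule 5
                6: sum(mn) / len(mn)}[rule]    # rule 6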
Fig. 11. The experimental results, with degree of certainty more than 85%. Without threshold; with threshold.
5.2.3. Experimental results and discussion

Of the total of 330 handwritten signatures, 132 samples were used to train the system and the remaining 198 samples were used to test its performance. The samples were divided randomly into the learning set and the testing set; all forged samples were included in the testing set. Fig. 11 shows the correct verification rate obtained through the experiments. The experimental (learning and testing) results reveal the following interesting facts about the system:

• All the features used for verification had almost the same degree of effectiveness whether they were used to compare the sample to the prototype or vice versa.
• Central line features contain the largest number of points per image. Some of these points are useless because they are concentrated in the central region.
• Corner curve features are not highly effective in the verification process.
• It is quite sufficient to use the corner line features, central circle features and critical point features in the discriminating feature set.
• An average of 98% overall verification certainty was achieved.
• The best-fit method produced better results than existing methods because each point was given a grade according to its degree of matching.
• The decision to accept or reject was controlled by the grade of the signature.
• The most powerful features were the critical point features, although they are computationally exhaustive.
• The thinning algorithm may reduce the reliability of the system because all the features are based on the thinned image. The thinning algorithm also constitutes an overhead on the system.
• The proposed algorithm is efficient compared with existing techniques of signature recognition and verification, particularly the techniques which use transformation functions such as the Hadamard transform and Zernike moments.
• The error rate was partially due to unavoidable defects in the preprocessing phase.
• Rules 1 and 3 reduced the average rate of false rejection, while rules 2 and 4 reduced the average rate of false acceptance.
• The rate of correctness in rejecting forged samples is relatively high using rules 2 and 4, even when the input sample contains some noise.
• The speed of recognition and verification is fast enough for real-time processing (nearly 9 s on a 486 PC at 66 MHz).
• Finally, the system does not give a clear-cut decision. Instead, the result is a degree of certainty that the input sample is genuine or forged.
6. Conclusions and suggestions for future extensions

Lately, a great deal of effort has been focused on the investigation of automatic signature recognition and verification methods. A reliable automatic signature verification system would be of great use in many application areas, including law enforcement, security control, etc. Generally, it can be done in two ways: off-line and on-line. The two methods differ in the form in which the input data are captured. Because it is difficult to extract individual features from static images or to detect imitations, off-line signature recognition and verification is usually more difficult than the on-line equivalent. However, there is still a demand for off-line systems.

6.1. Conclusions

In this paper, an effective approach for signature recognition is introduced. The problem of simulated-signature verification in off-line systems is also treated, using fuzzy concepts in the decision-making process. The experiments resulted in a recognition rate of 95% and a verification rate of 98%. They also demonstrated the efficiency and robustness of the proposed system. Several techniques were used to improve the overall performance of the recognition and verification systems:

• The system was divided into two major phases, the recognition phase and the verification phase, which were handled as two separate parts.
• A multistage classifier was used during the recognition process.
• A suitable combination of global and local features was formed for the recognition phase.
• The center-of-gravity feature was localized in 16 square windows and used as a similarity measure in the third stage of the classifier.
• An individual threshold was taken for each class and stored within the feature vector of its class.
• Two ideal forms were taken for each person and used as prototypes for his/her class.
• Computationally expensive local features, like the critical point features, were used during the verification phase.
• The best feature set consisted of corner line features, central circle features and critical point features.
• Fuzzy concepts were used in the verification decision rules.
7. Summary

Off-line signature recognition and verification is an important part of many business processes. It can be used in many applications like cheques, certificates, contracts
and historical documents. Since each organization may have a huge number of signatures, a method is required that allows efficient recognition and verification with a high rate of correctness. In this paper, a system of two separate phases for signature recognition and verification is developed. At first the signature database is described, then preprocessing steps and feature extraction for the recognition process are discussed. A suitable combination of global and local features is used to produce more distinctive and effective features by combining the advantages of both. A recognition technique is developed based on a multistage classifier, in which a preclassification stage for a group of similar slant signatures is applied first. Then, a recognition scheme is applied to resolve individual identification within a group. In the second stage, the distances between the global feature vector of the input sample and the mean of each class in the group are computed and compared sequentially, in order to select the best three candidates. Finally, in the third stage, the local center points are used to choose the best candidate or to decide that the signature cannot be recognized. This decision is based on the corresponding threshold of each candidate class. New algorithms for signature verification based on fuzzy concepts are also described and tested. The critical points feature is used in this phase. A set of fuzzy rules is used to make a decision with a degree of certainty. An average of 98% overall verification confidence was achieved. It is concluded from the experimental results that each of the proposed techniques provides outstanding performance on several counts. In addition, the overall performance of the recognition and verification algorithm is efficient compared with existing techniques, especially those techniques which use transformation functions.
Acknowledgements

The authors are grateful to Dr. Amani Saad and Eng. Abir Leheta for their help in reviewing the manuscript.
References

[1] M. Ammar, Y. Yoshida, T. Fukumura, Off-line preprocessing and verification of signatures, Int. J. Pattern Recognition Artif. Intell. 2 (1988) 589–602.
[2] C.C. Tappert, C.Y. Suen, T. Wakahara, The state of the art in on-line handwriting recognition, IEEE Trans. Pattern Anal. Mach. Intell. 12 (8) (1990) 787–807.
[3] G.A. Auda, Arabic handwritten signature recognition using NN, M.Sc. Thesis, Cairo University, 1990.
[4] R. Bajaj, S. Chaudhury, Signature verification using multiple neural classifiers, Pattern Recognition 30 (1) (1997) 1–7.
[5] K. Huang, H. Yan, Off-line signature verification based on geometric feature extraction and neural network classification, Pattern Recognition 30 (1) (1997) 9–17.
[6] S. Di Zenzo, M. Del Buono, M. Meucci, A. Spirito, Optical recognition of hand-printed characters of any size, position and orientation, IBM J. Res. Develop. 36 (3) (1992) 487–501.
[7] Y. Qi, B.R. Hunt, Signature verification using global and grid features, Pattern Recognition 27 (12) (1994) 1621–1629.
[8] R.C. Gonzalez, P. Wintz, Digital Image Processing, Addison-Wesley, Reading, MA, 1984.
[9] M.A. Ismail, Lecture Notes in Pattern Recognition, Alexandria University, Alexandria, 1993.
[10] N.B. Venkateswarlu, P.S.V.S.K. Raju, Fast ISODATA clustering algorithm, Pattern Recognition 25 (3) (1992) 335–342.
[11] C. Nadal, C.Y. Suen, Applying human knowledge to improve machine recognition of confusing handwritten numerals, Pattern Recognition 26 (3) (1993) 381–389.
Further reading

M. Ammar, Y. Yoshida, T. Fukumura, Feature extraction and selection for simulated signature verification, Computer Recognition and Human Production of Handwriting, World Scientific, Singapore, 1989, pp. 61–76.
M. Ammar, Y. Yoshida, T. Fukumura, Structural description and classification of signature images, Pattern Recognition 23 (7) (1990) 697–710.
R. Sabourin, R. Plamondon, Preprocessing of handwritten signatures from image gradient analysis, Proceedings of the Eighth International Conference on Pattern Recognition, 1986.
A. Khotanzad, Y.H. Hong, Invariant image recognition by Zernike moments, IEEE Trans. Pattern Anal. Mach. Intell. 12 (5) (1990) 489–497.
M. Yasuhara, M. Oka, Signature verification experiment based on nonlinear time alignment, IEEE Trans. System Man Cybernet. (1987) 403–424.
X. Liu, S. Tan, V. Srinivasan, S.H. Ong, W. Xie, Fuzzy pyramid-based invariant object recognition, Pattern Recognition 27 (5) (1994) 741–756.
G. Schwartz, Fuzzy fundamentals, IEEE Spectrum (1992) 58–61.
L. Zadeh, Making computers think like people, IEEE Spectrum (1984) 26–32.
S.K. Pal, D.K.D. Majumder, Fuzzy Mathematical Approach to Pattern Recognition, Wiley, New York, 1986.
A. Kandel, Fuzzy Techniques in Pattern Recognition, Wiley, New York, 1982.
I. Yoshimura, M. Yoshimura, Writer identification based on arc pattern transformation, Proceedings of the Eighth International Conference on Pattern Recognition, 1988.
M. Ammar, Y. Yoshida, Signature analysis by computer, Fourth International Conference of the Graphonomics Society, 1989.
M. Ammar, Y. Yoshida, A new effective approach for automatic off-line verification of signatures by using pressure features, Proceedings of the Eighth International Conference on Pattern Recognition, 1986.
J.J. Brault, R. Plamondon, Segmenting handwritten signatures at their perceptually important points, IEEE Trans. Pattern Anal. Mach. Intell. 15 (9) (1990) 953–957.
L. Yang, B.K. Widjaja, R. Prasad, Application of hidden Markov models for signature verification, Pattern Recognition 28 (2) (1995) 161–170.
M.K. Brown, Preprocessing techniques for cursive script word recognition, Pattern Recognition 16 (5) (1983) 447–458.
H.D. Chang, J.F. Wang, Preclassification for handwritten Chinese character recognition by a peripheral shape coding method, Pattern Recognition 26 (5) (1993) 711–719.
C. Lam, Signature recognition through spectral analysis, Pattern Recognition 22 (1989).
About the Author: M.A. ISMAIL received the B.Sc. (honors) and the M.S. degrees in Computer Science from the University of Alexandria, Egypt, in 1970 and 1974, respectively, and the Ph.D. degree in Electrical Engineering from the University of Waterloo, Canada, in 1980. He is a Professor of Computer Science and Vice Dean for Education and Student Affairs, Faculty of Engineering, University of Alexandria, Alexandria, Egypt. He has taught computer science and engineering at the University of Waterloo, Canada, the University of Petroleum and Minerals (UPM), Saudi Arabia, the University of Windsor, Canada, and the University of Michigan, Ann Arbor. His research interests include pattern analysis and machine intelligence, data structures and analysis, medical computer science, and nontraditional databases.

About the Author: SAMIA OMAR received the B.S. degree in Computer Science from Alexandria University, Egypt, in 1993 and the M.S. degree in 1996 from Alexandria University. Her interests are in pattern recognition and image processing.
Pattern Recognition 33 (2000) 1741–1748
Feature reduction for classification of multidimensional data

H. Brunzell*, J. Eriksson
Department of Electrical Engineering, The Ohio State University, 205 Dreese Labs, 2015 Neil Avenue, Columbus, OH 43210, USA
Department of System Research and Development, Ericsson Microwave Systems, SE-431 84 Mölndal, Sweden

Received 10 December 1998; received in revised form 18 June 1999; accepted 18 June 1999
Abstract

This paper addresses the problem of classifying multidimensional data with relatively few training samples available. Classification is often performed based on data from measurements or ratings of objects or events. These data are called features. It is sometimes difficult to determine if all features are necessary for the classifier. Since the number of training samples needed to design a classifier grows with the dimension of the features, a way to reduce the dimension of the features without losing any essential information is needed. This paper presents a new method for feature reduction and compares it with some methods presented earlier in the literature. The new method is found to have a more stable and predictable performance than the other methods. 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Feature selection; Feature extraction; Classification; Multivariate data; Pattern recognition; Discriminant analysis
1. Introduction

The classification problem is, loosely speaking, the problem of, given an observation, determining from which class, out of a number of possible classes, the observation stems. The observed variables that are used to determine the class membership are often called features. Examples of features are physical attributes such as length and weight, or measurements of voltage, current, or chemical contents. A number of such attributes are stored in a feature vector. The number of different attributes is the dimension of the features. Observing the length, weight, and age of a population of humans will, for example, give three-dimensional feature vectors. The features can also be the result of a transformation of measured data, such as a Fourier transform. The properties of the different classes can be known, but more often they have to be estimated from observations. This paper starts with an introduction in Section 2 to some different parametric classification methods. The
* Corresponding author. Tel.: +1-614-688-4329; fax: +1-614-292-7596. E-mail address: [email protected] (H. Brunzell).
methods are called parametric since each class is represented by a set of parameters, e.g. mean and covariance, instead of the original data itself. One of the most common classification methods is the quadratic classifier (or QDA, for quadratic discriminant analysis), which is a special case of the Bayesian classifier. When the dimension of the features is high, it is of great interest to reduce this dimension. Section 3 presents an overview of methods for dimension reduction previously presented in the literature together with a new method. Section 4 presents seven datasets that are used for comparison of the different dimension reduction methods. The results of the comparison are discussed in Section 5.
2. Classification methods

The somewhat loose description of the classification problem given in the introduction can be tightened up in the following formulation. Suppose that there are $c$ possible classes $\pi_1, \pi_2, \ldots, \pi_c$, and that an arbitrary data point belongs to class $\pi_i$ with a priori probability $P_i$. The data points are $d$-dimensional vectors. An observation $x$ is assumed to be a random vector with a multivariate probability density function $p(x|\pi_i)$ when $x$ is known to
0031-3203/00/$20.00 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved. PII: S 0 0 3 1 - 3 2 0 3 ( 9 9 ) 0 0 1 4 2 - 9
belong to class $\pi_i$, $i = 1, \ldots, c$. We wish to find a decision rule $\hat{\pi}(x)$ that tells us to which class $x$ should be classified. The Bayesian classifier described in the following section is optimal in the sense that it maximizes the class a posteriori probability.
2.1. Bayesian classifier

Let $\lambda(\pi_i|\pi_j)$ be a measure of the loss or penalty incurred when the decision $\hat{\pi}(x) = \pi_i$ is made and the true class is $\pi_j$. The goal of statistical decision theory is to devise the decision rule $\hat{\pi}$ in such a way that the average loss per decision is minimized. Now, let the loss function be

$$\lambda(\pi_i|\pi_j) = \begin{cases} 0, & i = j, \\ 1, & i \neq j, \end{cases}$$  (1)
i.e., no loss for a correct classification and a unit loss for a classification error. With this loss function the decision rule can be derived to be

$$\hat{\pi}^*(x) = \pi_i \quad \text{if } P(\pi_i|x) > P(\pi_j|x), \ \forall j, \ j \neq i.$$  (2)

The a posteriori probabilities in Eq. (2) can be rewritten using Bayes' rule as

$$\hat{\pi}^*(x) = \pi_i \quad \text{if } \frac{p(x|\pi_i)P_i}{p(x|\pi_j)P_j} > 1, \ \forall j, \ j \neq i.$$  (3)
2.2. Quadratic classifier

As a special case, consider classifying a sample to one of two multinormal distributions, i.e., $x \sim N(\mu_i, \Sigma_i)$, $i = 1$ or 2:

$$p(x|\pi_i) = \frac{1}{\sqrt{|2\pi\Sigma_i|}} \, e^{-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)}.$$  (4)
The decision rule can then be found explicitly. Since the ratio in Eq. (3) is positive and the logarithm function is monotonic, we can take the logarithm of both sides:

$$\hat{\pi}^*(x) = \pi_1 \quad \text{if } 2\ln\frac{P_1}{P_2} + \ln\frac{|\Sigma_2|}{|\Sigma_1|} - (x-\mu_1)^T \Sigma_1^{-1}(x-\mu_1) + (x-\mu_2)^T \Sigma_2^{-1}(x-\mu_2) > 0$$

or equivalently, if

$$\ln|\Sigma_2| - 2\ln P_2 + (x-\mu_2)^T \Sigma_2^{-1}(x-\mu_2) > \ln|\Sigma_1| - 2\ln P_1 + (x-\mu_1)^T \Sigma_1^{-1}(x-\mu_1).$$  (5)

For the general case of $c$ classes we can define a distance function as

$$d_i(x) = \ln|\Sigma_i| - 2\ln P_i + (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i).$$  (6)

The sample $x$ is classified to the class that minimizes $d_i(x)$.
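To make the classifier concrete, the sketch below (an illustration, not code from the paper) evaluates the distance function of Eq. (6) for each class and picks the minimizer; the class means, covariances, and priors are assumed to have been estimated from training data:

```python
import numpy as np

def quadratic_classify(x, means, covs, priors):
    """Classify x to the class minimizing
    d_i(x) = ln|S_i| - 2 ln P_i + (x - m_i)^T S_i^{-1} (x - m_i)  (Eq. (6))."""
    dists = []
    for m, S, P in zip(means, covs, priors):
        diff = x - m
        # Mahalanobis term; solve() avoids forming the explicit inverse
        maha = diff @ np.linalg.solve(S, diff)
        dists.append(np.linalg.slogdet(S)[1] - 2.0 * np.log(P) + maha)
    return int(np.argmin(dists))
```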
3. Feature reduction by linear transformations

There are many different ways to reduce the dimension of the feature vectors. One obvious way is simply to discard some of the features. This will almost inevitably degrade the performance of the classifier. In this paper we will instead study linear transformations of the features. A linear transformation has the form

$$x \to y = T^T x,$$  (7)

where $T$ is a $d \times n$ matrix, $d$ is the original feature dimension, and $n$ is the reduced dimension. As will be shown later, it is possible in some cases to select $T$ such that no performance loss results. Before the method of this paper is presented, a brief overview of some previously suggested methods is given.

3.1. A brief introduction to feature reduction methods

The method of discarding features plus some general results on properties of admissible feature reduction methods are given in Ref. [1]. Fisher's linear discriminant function, introduced in Ref. [2] for the two-class problem, can be seen as a way to reduce the feature dimension to one. Fisher's method finds the vector $a$ (corresponding to $T$ in Eq. (7)) that minimizes the scatter within classes, while maximizing the scatter between classes. The within-class scatter matrix is defined as

$$W = \sum_{i=1}^{c} P_i \Sigma_i$$  (8)

and the between-class scatter matrix as

$$B = \sum_{i=1}^{c} P_i (\mu_i - \bar{\mu})(\mu_i - \bar{\mu})^T,$$  (9)

where

$$\bar{\mu} = \sum_{i=1}^{c} P_i \mu_i.$$  (10)

Fisher thus maximized the function

$$J = \frac{a^T B a}{a^T W a}.$$  (11)

His solution was $a = (\Sigma_1 + \Sigma_2)^{-1}(\mu_1 - \mu_2)$. It can easily be shown that $a$ is an eigenvector of $W^{-1}B$.
The canonical variables [3] can be seen as an extension of Fisher's method to the case where the number of classes is greater than two and the reduced dimension is greater than one. The canonical variables are the eigenvectors corresponding to the $n$ largest eigenvalues of $W^{-1}B$. Several other methods exist that are extensions of Fisher's linear discriminant function [4,5]. Ref. [5] compares four different feature reduction methods: the Fisher discriminant plane, the Fisher–Fukunaga–Koontz transform, the Fisher-radius plane transform, and the Fisher-variance plane. The method that is the topic of the present paper is also closely related to Fisher's method. For the two-class problem the criterion in Eq. (11) is equal to the Mahalanobis distance between the two classes. This fact will be used in the following section to extend the method to multiple classes. It is also possible to use principal component analysis (PCA) [3] or variations thereof [6] for feature reduction.
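As a concrete reading of Eqs. (8)–(11), the following sketch (illustrative only; class means, covariances, and priors are assumed to be given) computes the scatter matrices and returns the canonical variables as the leading eigenvectors of $W^{-1}B$:

```python
import numpy as np

def canonical_variables(means, covs, priors, n):
    """Return the n leading eigenvectors of W^{-1}B (Eqs. (8)-(10))."""
    mu_bar = sum(P * m for P, m in zip(priors, means))          # Eq. (10)
    W = sum(P * S for P, S in zip(priors, covs))                # Eq. (8)
    B = sum(P * np.outer(m - mu_bar, m - mu_bar)
            for P, m in zip(priors, means))                     # Eq. (9)
    # W^{-1}B is not symmetric in general; use a general eigensolver
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs[:, order[:n]].real
```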
3.2. Mahalanobis-based linear transformation

The choice of optimality criterion when designing a feature dimension reduction method is closely connected to the choice of classifier. We will here assume the use of the quadratic classifier described in Section 2.2. The obvious optimality criterion is of course to minimize the number of misclassifications. The main problem with the design is that, given a number of classes in a multidimensional space, there is no practically useful relation between the class configurations and the expected error probability. All such relations involve multidimensional integrals of probability distributions along curves defined by the decision rules of the classifier. There are, however, ways to estimate an upper bound on the error of the quadratic classifier. For the two-class case it is possible to derive distance measures between the classes, such as the Chernoff distance and the Bhattacharyya distance [7], that can be used to estimate an upper bound on the error. We will here introduce an error bound based on the same ideas, but with another distance function. This error bound is then used as an optimality criterion when designing the feature reduction method. Consider the case of classifying to one of two classes with mean vectors $\mu_i$ and covariance matrices $\Sigma_i$, $i = 1, 2$. Let $\Sigma = P_1\Sigma_1 + P_2\Sigma_2$, where $P_1$ and $P_2$ are the a priori probabilities. A measure of separation between these classes is the Mahalanobis distance [7], defined as

$$\Delta = (\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2).$$  (12)

From this measure an upper bound on the expected error probability, $\epsilon$, can be estimated as

$$\epsilon \leq \frac{2 P_1 P_2}{1 + P_1 P_2 \Delta}.$$  (13)
Assuming that all classes are equally probable, we propose a separation measure for the general case of $c$ classes defined as

$$\Delta = \sum_{1 \leq i < j \leq c} (\mu_i - \mu_j)^T (\Sigma_i + \Sigma_j)^{-1} (\mu_i - \mu_j).$$  (14)
This separation measure is proportional to the arithmetic mean of the Mahalanobis distances between all classes. To simplify the notation, let $m_{ij} = \mu_i - \mu_j$ and $C_{ij} = \Sigma_i + \Sigma_j$. The following proposition, which states how the transformation should be chosen, is inspired by Eriksson [8], where the Mahalanobis-based transformation is derived for a completely different application.

Proposition 1 (Mahalanobis-based dimension reduction). Let $m_{ij}$ and $C_{ij}$ be defined as above, and assume that $d > c(c-1)/2$. Introduce the matrix

$$U = [\,C_{12}^{-1} m_{12} \ \cdots \ C_{ij}^{-1} m_{ij} \ \cdots \ C_{c-1,c}^{-1} m_{c-1,c}\,] \quad \text{for } 1 \leq i < j \leq c.$$  (15)
The separation measure

$$\Delta = \sum_{1 \leq i < j \leq c} m_{ij}^T C_{ij}^{-1} m_{ij}$$  (16)

is preserved by the transformation

$$x \to y = T^T x$$  (17)

if $T$ satisfies the condition

$$\mathcal{R}(T) \supseteq \mathcal{R}(U).$$  (18)

Proof. Consider one term, $m^T C^{-1} m$, in the separation measure, where the indices have been dropped for ease of notation. Let $W = C^{1/2} T$ and apply the transformation:

$$m^T C^{-1} m \to m^T T (T^T C T)^{-1} T^T m = m^T C^{-1/2} W (W^T W)^{-1} W^T C^{-1/2} m = m^T C^{-1/2} P_W C^{-1/2} m,$$  (19)

where $P_W$ is the projection matrix onto the column space of $W$. Preservation of the separation measure leads to the following condition:

$$m^T C^{-1/2} P_W C^{-1/2} m = m^T C^{-1} m,$$  (20)

which is satisfied if

$$P_W C^{-1/2} m = C^{-1/2} m.$$  (21)

This is equivalent to the condition

$$\mathcal{R}(W) = \mathcal{R}(C^{1/2} T) \supseteq \mathcal{R}(C^{-1/2} m).$$  (22)
This condition can be rewritten as

$$\mathcal{R}(T) \supseteq \mathcal{R}(C^{-1} m).$$  (23)
This condition is trivially satisfied for $T$ as in Eq. (18), since $U$ contains all $C_{ij}^{-1} m_{ij}$. □

Note that the dimension of the reduced feature space is determined by the number of columns in $T$. An obvious choice for $T$ would be $T = U$, but the matrix $U$ has $c(c-1)/2$ columns, where $c$ is the number of populations. This means that if $c(c-1)/2 > d$, a reduction is not possible. Therefore, it is often of practical interest to use a suboptimal transformation. The suboptimal transformation can be constructed by performing a singular value decomposition (SVD) of $U$,

$$U = Q S V^T.$$  (24)

The transformation matrix $T$ is then taken as the first $n$ columns of $Q$, where $n$ is the desired dimension, i.e., $T$ consists of the $n$ principal left singular vectors of $U$. Though this approximation resembles principal component analysis (PCA) [3], the interpretation is completely different. PCA performs a least-squares approximation of the original features in a lower dimension, while the suboptimal transformation is an approximation of the optimal transformation matrix. This means that PCA preserves the energy of the features, while the proposed transformation preserves the separation between the classes. In the case where the covariance matrices of the feature vectors are equal, it is easy to see that the rank of $U$ is only $c - 1$. Every difference in the covariance matrices will increase the rank of $U$. However, practical experience shows that the effective rank of $U$ is only slightly larger than $c$, indicating that a significant dimension reduction is possible in most cases. For the two-class problem, the proposed Mahalanobis-based transformation reduces to $T = U = (\Sigma_1 + \Sigma_2)^{-1}(\mu_1 - \mu_2)$, which is exactly Fisher's linear discriminant function.

The above-derived method resembles a method presented by Young et al. [9] (also restated in Ref. [10]). The authors there form a matrix $M$ as

$$M = [\,\mu_1 - \mu_2 \,|\, \mu_1 - \mu_3 \,|\, \cdots \,|\, \mu_1 - \mu_c \,|\, \Sigma_1 - \Sigma_2 \,|\, \Sigma_1 - \Sigma_3 \,|\, \cdots \,|\, \Sigma_1 - \Sigma_c\,].$$  (25)

This matrix is of size $d \times (c-1)(d+1)$. The transformation matrix $T$ is then found by a low-rank approximation using an SVD, just as above. This method is also derived to minimize the Bayes error of a classifier. Though the formulations of the two methods resemble each other, they will in general not yield the same results.
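A direct implementation of Proposition 1 together with the SVD-based rank reduction of Eq. (24) might look as follows (a sketch of our reading of the method, not the authors' code; `means` and `covs` are assumed to be lists of the estimated class parameters):

```python
import numpy as np
from itertools import combinations

def mahalanobis_transform(means, covs, n):
    """Mahalanobis-based linear transformation (MLT):
    stack the columns C_ij^{-1} m_ij of Eq. (15) into U and keep
    the n principal left singular vectors of U (Eq. (24))."""
    cols = []
    for i, j in combinations(range(len(means)), 2):
        m_ij = means[i] - means[j]
        C_ij = covs[i] + covs[j]
        cols.append(np.linalg.solve(C_ij, m_ij))
    U = np.column_stack(cols)                    # d x c(c-1)/2
    Q, s, Vt = np.linalg.svd(U, full_matrices=False)
    return Q[:, :n]                              # columns of T; y = T^T x
```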
4. Simulations on various datasets

To evaluate the feature reduction method presented in this paper we compare the classification results for the quadratic classifier (Eq. (6)). The error probabilities are estimated using the leave-one-out method [11]. The leave-one-out method uses $N - 1$ out of $N$ data samples to design the classifier, i.e., to estimate the means and covariances of the classes. Then, the $N$th sample is classified. This procedure is repeated $N$ times such that all samples have been classified. First, the error probability using the full data dimension is estimated. Then the data are reduced to two dimensions and the error probability is again estimated. The evaluation is made using seven different datasets. The reason for choosing these specific datasets is that they are available for public download from various ftp sites. All the datasets have been studied earlier by other researchers, making the results easy to compare and reproduce. For example, the WINE, IRIS, LUNG CANCER, and GLASS datasets are studied in Ref. [5]. Below follows a brief description of the data. For further details we refer to the original sources.

• WINE: These data are the quantities of 13 constituents derived from a chemical analysis of three types of wine. The wines are grown in the same region in Italy, but derived from three different cultigens. The data were originally introduced in Ref. [12], and have been further studied in, e.g., Ref. [5]. The dataset consists of 178 samples.
• FISH: These data are six measurements on six types of fish caught in a lake in Finland [13]. The original dataset consisted of seven types of fish, but one type was excluded since only six samples existed of that type. The six measurements are weight, three different length measures, height and width. After some incomplete samples have been removed, 152 samples remain.
• IRIS: These are Fisher's classical data from measurements on three different species of Iris (Iris setosa, I. virginica and I. versicolor) [2]. The measurements on the flowers are petal width, petal length, sepal width and sepal length, and the dataset consists of 150 samples.
• LUNG CANCER: These are observations of three types of lung cancer [14]. Fifty-six parameters are observed, but only thirty-two samples are available. This makes it practically impossible to draw any general conclusions about a classification method based on these data. Still, some interesting observations can be made while analyzing the data.
The FISH dataset is available from gopher://jse.stat.ncsu.edu:70/11/jse/data, the others from ftp://ftp.ics.uci.edu/pub/machine-learning-databases/.
• GLASS: This dataset is originally from D. German. The measurements are the chemical constitution of glass manufactured by two different processes. The variables are the weight percent of eight different solids (Na, Mg, Al, Si, K, Ca, Ba and Fe) plus the refractive index of the glass. Eighty-seven measurements are made on glass manufactured by the first process (float process) and seventy-six on glass from the second process (non-float process).
• BCW: BCW stands for Breast Cancer Wisconsin; these data were collected by Dr. William H. Wolberg. These data are nine different attributes of breast tumors, graded from 1 to 10, and grouped into two classes: benign (444 instances) or malignant (239 instances). Compared to the original dataset, sixteen samples have been removed due to missing data.
• IONOSPHERE: These data are radar measurements of the ionosphere from Johns Hopkins University [15]. There are thirty-four variables in the dataset, corresponding to the real and imaginary parts of seventeen pulses output from a correlation receiver. The data are grouped into two classes: "good" returns, which show evidence of some type of structure in the ionosphere, and "bad" returns, which do not.
Fig. 1. Histogram for first variable of GLASS dataset.
For some of these datasets the number of data points is low compared to the number of variables. This can cause problems when inverting the estimated covariance matrices in the distance function of Eq. (6). To avoid this problem, the condition numbers of the estimated covariance matrices are calculated, and if too large the estimates are regularized. The regularization is made by adding a small value times the identity matrix, i.e., $\hat{\Sigma} = \hat{\Sigma} + \epsilon I$. The threshold on the condition number is set to 10, and $\epsilon$ is set to $10^{-1}$.
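Combining the leave-one-out procedure above with this regularization gives the following sketch (illustrative only; the default constants mirror the values quoted above, and `quadratic_classify` refers to the earlier sketch):

```python
import numpy as np

def loo_error(X, y, cond_max=10.0, eps=0.1):
    """Leave-one-out error for the quadratic classifier (Eq. (6)),
    regularizing ill-conditioned covariance estimates with eps*I."""
    classes = np.unique(y)
    n_err = 0
    for n in range(len(X)):
        mask = np.arange(len(X)) != n                  # hold out sample n
        means, covs, priors = [], [], []
        for c in classes:
            Xc = X[mask & (y == c)]
            S = np.cov(Xc, rowvar=False)
            if np.linalg.cond(S) > cond_max:           # regularize
                S = S + eps * np.eye(S.shape[0])
            means.append(Xc.mean(axis=0))
            covs.append(S)
            priors.append(len(Xc) / mask.sum())
        pred = classes[quadratic_classify(X[n], means, covs, priors)]
        n_err += pred != y[n]
    return n_err / len(X)
```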
5. Results

The quadratic classifier is derived under the assumption that the features are normally distributed. None of the datasets used in this paper fulfill this assumption. As an example see Fig. 1, where a histogram of the first variable of class number one in the GLASS dataset is plotted. In the same plot a normal distribution with the same mean and variance is also plotted. Since the feature reduction transformation creates linear combinations of the original (independent) features, the transformed features will often be better approximated by a normal distribution. This can be seen in Fig. 2, where the dimen-
Central Research Establishment, Home Office Forensic Science Service, Aldermaston, Reading, Berkshire RG7 4PN.
University of Wisconsin Hospitals, Madison, Wisconsin, USA.
Fig. 2. Histogram for first variable of GLASS dataset after reduction to two variables.
sion of the GLASS data has been reduced to 2 and a histogram of the first feature is plotted. The feature is still not normally distributed, but the approximation is at least better. Now, on to the classification results. The datasets are summarized in Table 1. All classification results from the feature reduction methods are summarized in Table 2. Columns 2–5 are the results using the Mahalanobis-based linear transformation (MLT), the canonical variables (CV), the method due to Young, Marco and Odell (YMO), and principal component analysis (PCA). Columns 6–9 are taken from Ref. [5], and show the classification results when the feature dimension is reduced
Table 1
Datasets and correct classifications (in %) using the full feature dimension (d)

Dataset        d    c   N     QDA
WINE           13   3   178   99.4%
FISH           6    6   152   99.3%
IRIS           4    3   150   97.3%
GLASS          9    2   163   62.6%
BCW            9    2   683   95.0%
IONOSPHERE     34   2   351   82.9%
LUNG CANCER    56   3   32    50%
to 2 using the Fisher discriminant plane (FDP), the Fisher–Fukunaga–Koontz transform (FF), the Fisher-radius plane transform (FR), and the Fisher-variance plane (FV). The datasets WINE, IRIS, GLASS, and LUNG CANCER were classified in Ref. [5] using QDA in the full data dimension. The results for the first three datasets agree perfectly with the ones obtained in the present paper, but the result for the LUNG CANCER data differs. In the present paper, 50% of the data were correctly classified, but in Ref. [5] only 31.2% were correctly classified. This is an effect of the low number of samples in this dataset, resulting in a high variance of the error estimates. The low number of data also leads to all covariance estimates being singular. This, in turn, means that the choice of regularization plays an important role. A version of QDA called regularized discriminant analysis (RDA) was also used in Ref. [5]. This method achieved a correct classification rate of 62.5%. It is the authors' opinion that the LUNG CANCER data cannot be used to evaluate any classification system. The dataset contains only thirty-two observations of fifty-six variables. As long as the variables are not linearly dependent, it is possible to project the three classes onto three distinct points in one dimension. To elaborate further, arrange the data in a $56 \times 32$ matrix, $X$, where each column is one observation and each row one variable. Let the vector $y$ contain indices that show the class membership
Fig. 3. Correct classifications in % for dataset FISH.
for each of the thirty-two samples, e.g., $\{1, 2, 3\}$. The equation

$$T^T X = y$$  (26)

then has at least one solution for $T$. Ref. [14] uses this dataset to design and evaluate a classifier. For most of the datasets, a dimension reduction using the MLT or the canonical variables results in better classification results than using the full dimension. Apart from the LUNG CANCER data (from which no inference can be made), it is only for the FISH dataset that the performance degrades. To gain insight into why this happens, the classification result versus feature dimension for the FISH dataset is plotted in Fig. 3 for the MLT. The plot shows that the classification results are good when three dimensions or more are used in the classifier, but if the dimension is reduced below that, the correct classification rate drops radically. This can be seen as an indication that the "inherent dimensionality" of the discrimination problem is three. For the datasets with two classes (GLASS, BCW, and IONOSPHERE), the classification results are actually
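This observation is easy to check numerically; in the minimal sketch below (illustrative only, with random data standing in for the LUNG CANCER matrix), the underdetermined system of Eq. (26) is solved exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((56, 32))              # d=56 variables, N=32 samples
y = rng.integers(1, 4, size=32).astype(float)  # arbitrary class indices

# Solve X^T t = y for t (Eq. (26), with T a single column).
t, *_ = np.linalg.lstsq(X.T, y, rcond=None)
print(np.allclose(X.T @ t, y))                 # True: classes separate exactly
```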
Table 2
Correct classifications in %

Dataset        MLT     CV      YMO     PCA     FDP     FF      FR      FV
WINE           99.4%   99.4%   75.8%   75.8%   88.2%   99.4%   93.8%   74.1%
FISH           89.5%   94.7%   54.6%   73.0%   -       -       -       -
IRIS           98.0%   97.3%   96.7%   96.7%   97.3%   96.0%   70.7%   98.0%
GLASS          69.3%   68.1%   65.0%   62.0%   62.0%   58.9%   62.0%   68.1%
BCW            97.1%   96.6%   96.5%   96.6%   -       -       -       -
IONOSPHERE     82.9%   83.5%   82.9%   60.4%   -       -       -       -
LUNG CANCER    43.7%   40.6%   53.1%   71.9%   59.4%   37.5%   46.9%   62.5%
Fig. 4. Correct classifications in % for dataset BCW.

Fig. 6. Scatter plot for WINE data in two dimensions (PCA).
the classes cannot be projected into one dimension without mixing up the classes. The three classes are quite nicely separated in two dimensions, though. Fig. 6 shows the same type of plot, but with the feature dimension reduced using PCA. The classes are considerably more mixed up.
6. Conclusions
Fig. 5. Scatter plot for WINE data in two dimensions (MLT).
improved if the dimension is reduced to one (70.6, 97.2 and 83.5%, respectively). The reason for this is probably that the features become more normally distributed, unimportant (or noisy) features are removed, and the parameter estimates improve. This phenomenon is illustrated in Fig. 4, where the classification results for the BCW dataset using MLT are plotted versus feature dimension. The WINE dataset, which is a three-class problem, cannot be reduced to fewer than two dimensions. Fig. 5 shows a scatter plot of the data in two dimensions. The three different classes are indicated with different symbols (*, +, and ◊). Ellipses centered at the mean of each class, with radii of two standard deviations, are also plotted in the same figure. It is clear from the figure that
As mentioned earlier, the WINE, IRIS and GLASS datasets are studied in Ref. [5]. The dimension is there reduced to two using various methods. The best results from the different methods in that investigation are almost identical to the ones obtained in the present paper using the proposed Mahalanobis-based feature transformation. One big difference is, however, that the best results in that paper were not obtained with one single method on all three datasets. This indicates that the method presented here performs well under widely varying conditions, without strong constraints on the data.
7. Summary

A difficulty often encountered in classification problems is how to select relevant features. The features are often observations or measurements of objects or events. It is sometimes difficult to determine, before the experiment is performed, which of these measurements or observations are necessary for a good classification. One way to extract the relevant features is to apply a linear transformation to the original feature vectors. Several suggestions for how the linear transformation should be
selected have been proposed previously in the literature. Many of them are extensions of Fisher's linear discriminant function, proposed by Fisher in 1936. The new method presented in the present paper can also be viewed as an extension of Fisher's linear discriminant function. The linear transformation is derived by minimizing a criterion function, namely an upper bound on the Bayes classification error. The new method is compared to some previously proposed methods. The comparison is based on seven different datasets. Some of these datasets are well known to the pattern recognition community, and are also available for download from various ftp sites. The new method is found to perform better than, or as well as, any of the old methods. It should, however, be noted that none of the old methods performs well on all datasets. Our conclusion is therefore that the new method presented in this paper achieves good results regardless of the type of data, while some previously presented methods do not.
References

[1] L. Devroye, L. Györfi, G. Lugosi, A Probabilistic Theory of Pattern Recognition, Applications of Mathematics: Stochastic Modelling and Applied Probability, Springer, New York, 1996.
[2] R.A. Fisher, The use of multiple measurements in taxonomic problems, Ann. Eugenics 7 (1936) 179–188.
[3] K.V. Mardia, J.T. Kent, J.M. Bibby, Multivariate Analysis, Academic Press, New York, 1979.
[4] I.D. Longstaff, On extensions to Fisher's linear discriminant function, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-9 (2) (1987) 321–325.
[5] S. Aeberhard, O. de Vel, D. Coomans, Comparative analysis of statistical pattern recognition methods in high dimensional settings, Pattern Recognition 27 (8) (1994) 1065–1077.
[6] J. Duchene, S. Leclercq, An optimal transformation for discriminant and principal component analysis, IEEE Trans. Pattern Anal. Mach. Intell. 10 (6) (1988) 978–983.
[7] P.A. Devijver, J. Kittler, Pattern Recognition: A Statistical Approach, Prentice-Hall, Englewood Cliffs, NJ, 1982.
[8] J. Eriksson, Data reduction in sensor array processing using parameterized signals observed in colored noise, Technical Report CTH-TE-46, Department of Applied Electronics, Chalmers University of Technology, 1996.
[9] D.M. Young, V.R. Marco, P.L. Odell, Quadratic discrimination: some results on optimal low-dimensional representation, J. Statist. Plann. Inference 17 (1987) 307–319.
[10] G.J. McLachlan, Discriminant Analysis and Statistical Pattern Recognition, Wiley, New York, 1992.
[11] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd Edition, Academic Press, San Diego, 1990.
[12] M. Forina et al., PARVUS: an extendible package for data exploration, classification and correlation, Technical Report, Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 16147 Genoa, Italy.
[13] P. Brofeldt, Bidrag till kännedom om fiskbeståndet i våra sjöar, Finlands Fiskeriet Band 4, Meddelanden utgivna av fiskeriföreningen i Finland, Helsingfors, Finland, 1917.
[14] Z. Hong, J. Yang, Optimal discriminant plane for a small number of samples and design method of classifier on the plane, Pattern Recognition 24 (4) (1991) 317–324.
[15] V.G. Sigillito, S.P. Wing, L.V. Hutton, K.B. Baker, Classification of radar returns from the ionosphere using neural networks, Technical Report 10, Johns Hopkins APL Technical Digest, 1989.
About the Author: HÅKAN BRUNZELL was born in 1968. He received his Master of Science degree in Electrical Engineering in 1994 from Chalmers University of Technology in Gothenburg, Sweden. In 1998 he received his Ph.D. in Signal Processing, also from Chalmers University. His interests are in statistical signal processing and pattern recognition. The main current research topic is detection and classification of buried landmines, a project carried out in cooperation with the Swedish Defence Research Establishment. He is currently with the Information Processing Systems Lab at Ohio State University, Columbus, Ohio.

About the Author: JONNY ERIKSSON was born in Sjövik, Sweden, in 1965. He received the M.S. degree in electrical engineering from Chalmers University of Technology, Gothenburg, Sweden, in September 1994. In December 1996, he received the Technical Licentiate degree in signal processing from Chalmers University of Technology. His research interests include model-based signal processing and sensor array processing with applications to airborne radar. He is currently with the Department of System Research and Development at Ericsson Microwave Systems, Mölndal, Sweden.
Pattern Recognition 33 (2000) 1749–1758
An improved maximum model distance approach for HMM-based speech recognition systems Q.H. He , S. Kwong*, K.F. Man, K.S. Tang South China University of Technology, People's Republic of China Department of Computer Science, City University of Hong Kong, 83 Tatchee Ave, Kowloon, Hong Kong, People's Republic of China Received 28 January 1999; received in revised form 21 June 1999; accepted 21 June 1999
Abstract

This paper proposes an improved maximum model distance (IMMD) approach for HMM-based speech recognition systems, based on our previous work [S. Kwong, Q.H. He, K.F. Man, K.S. Tang, A maximum model distance approach for HMM-based speech recognition, Pattern Recognition 31 (3) (1998) 219–229]. It defines a more realistic model distance for HMM training, and utilizes the limited training data in a more effective manner. Discriminative information contained in the training data is used to improve the performance of the recognizer, and the HMM parameter adjustment rules are derived in detail. Theoretical and practical issues concerning this approach are also discussed and investigated in this paper. Both isolated word and continuous speech recognition experiments showed that a significant error reduction could be achieved by IMMD when compared with the maximum model distance (MMD) criterion and other training methods using the minimum classification error (MCE) and the maximum mutual information (MMI) approaches. 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.
1. Introduction

Hidden Markov models have been proven to be one of the most successful statistical modeling methods in the area of speech recognition, especially for continuous speech recognition [1,2]. The most difficult problem in applying HMMs is how to obtain an HMM model for each basic recognition unit (which may be a subword, word or phrase) based on limited training data. Besides the maximum likelihood (ML) estimation approach, some other approaches have been proposed to solve the training problem of HMMs, such as the maximum mutual information (MMI) criterion [3,4], the minimum discrimination information (MDI) criterion [5] and minimum classification error (MCE) [6,7]. The MMI approach assumes that a word model is given and attempts to find the set of HMMs in which the sample average of the mutual information with respect to the given word model is
* Corresponding author. Tel.: +852-2788-7704; fax: +852-2788-8614. E-mail address: [email protected] (S. Kwong).
maximum. The MDI approach is performed by joint minimization of the discrimination information measure over all probability densities (PDs) of the source that satisfy a given set of moment constraints, and all PDs of the model from the given parametric family. The expected performance of the MDI approach for HMMs is as yet unknown, since it has not been fully implemented or studied, and no simple robust implementation of the procedure is known. Both the MMI and MDI modeling approaches aim indirectly at reducing the error rate of the recognizer. In either case, however, it is difficult to show theoretically that the modeling approach results in a recognition scheme that minimizes the probability of error [8]. MCE differs from the distribution estimation approaches, such as ML, in that the recognition error is expressed in the computational steps in such a way that it leads to the minimization of recognition errors. It was claimed that, in general, a significant reduction of the recognition error rate could be achieved by MCE compared with the traditional ML method [7]. The main problem of this approach is how to select a proper error function that can incorporate the recognition operation and performance in a functional form.
0031-3203/00/$20.00 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved. PII: S 0 0 3 1 - 3 2 0 3 ( 9 9 ) 0 0 1 4 4 - 2
We proposed a maximum model distance (MMD) criterion for training HMMs in Ref. [9], which could automatically focus on those training data that are important for discriminating acoustically similar words from each other. Both speaker-dependent and multi-speaker experiments demonstrated that the MMD approach could significantly reduce the number of recognition errors compared with the ML approach. The main disadvantage of MMD is that it pays the same attention to all competitive words during the estimation of the model parameters of a labeled word, which does not reflect reality: different competitive words should have different impact in the recognition phase of the system. Through further study, we adopted a more suitable model distance definition, which makes different competitive words play different roles in the training phase. The HMM parameter adjustment formulations are derived in Section 2, and theoretical and practical issues concerning this approach in speech recognition are investigated. Theoretical analysis and experimental results demonstrate that IMMD is superior to MMD in terms of recognition rate.
2. The improved maximum model distance approach

For simplicity of notation, we assume that the task is to recognize a vocabulary of $M$ acoustic units (a unit refers to any legible lexical unit such as a phoneme, subword, word or phrase), and an HMM model is constructed for each word. Let $\Lambda = \{\lambda_l, l = 1, \ldots, M\}$ be the model set, where $\lambda = (\pi, A, B)$ represents an HMM with $N$ states.
2.1. Maximum model distance approach

For any pair of HMMs $\lambda_l$ and $\lambda_h$, Juang and Rabiner [10] proposed a probabilistic distance measure

$$D(\lambda_l, \lambda_h) = \lim_{T_l \to \infty} \frac{1}{T_l} \left\{ \log P(O^l|\lambda_l) - \log P(O^l|\lambda_h) \right\},$$  (1)

where $O^l = (o_1^l o_2^l o_3^l \cdots o_{T_l}^l)$ is a sequence of observation symbols generated by $\lambda_l$. Petrie's limit theorem [11] guarantees the existence of such a distance measure and ensures that $D(\lambda_l, \lambda_h)$ is nonnegative. We generalized this definition to any observation sequence of finite length:

$$D(\lambda_l, \lambda_h) = \frac{1}{T_l} \left\{ \log P(O^l|\lambda_l) - \log P(O^l|\lambda_h) \right\}.$$  (2)

Furthermore, we defined a model distance measure $D(\lambda_l, \Lambda)$ between model $\lambda_l$ and the model set $\Lambda$ as

$$D(\lambda_l, \Lambda) = \frac{1}{T_l} \left[ \log P(O^l|\lambda_l) - \frac{1}{M-1} \sum_{h, h \neq l} \log P(O^l|\lambda_h) \right].$$  (3)
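Given the per-model log-likelihoods of a training token, the distance of Eq. (3) is a one-line computation; the sketch below is illustrative only (the log-likelihoods are assumed to come from a standard forward-algorithm scorer):

```python
import numpy as np

def model_distance(loglik, l, T):
    """D(lambda_l, Lambda) of Eq. (3).
    loglik[h] = log P(O^l | lambda_h) for all M models,
    T = length of the observation sequence O^l."""
    competitors = np.delete(np.asarray(loglik, dtype=float), l)
    return (loglik[l] - competitors.mean()) / T
```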
The maximum model distance (MMD) criterion is to find the entire model set $\Lambda$ such that the model distance is maximized:

$$\Lambda_{\mathrm{MMD}} = \arg\max_{\Lambda} \sum_{l=1}^{M} D(\lambda_l, \Lambda).$$  (4)

2.2. Limitation of $D(\lambda_l, \Lambda)$

Usually, the classifier/recognizer operates under the following decision rule:

$$C(O) = C_l \quad \text{iff} \quad P(O|\lambda_l) = \max_h P(O|\lambda_h).$$  (5)

In Eq. (3), all competitors of word $\lambda_l$ are considered with the same level of importance, which in general is not consistent with the decision rule in the recognition phase. Assume that $O$ is labeled as $\lambda_o$; the competitors of $\lambda_o$ can be classified into two clusters, one being $S_1 = \{\lambda_h: P(O|\lambda_h) \geq P(O|\lambda_o)\}$ and the other $S_2 = \{\lambda_h: P(O|\lambda_h) < P(O|\lambda_o)\}$. If $S_1$ is not empty, then an incorrect decision is made. The goal of any classifier design is to achieve a minimum error probability. Therefore, different competitors of $\lambda_o$ should have different impact in the training phase. If the influences of competitors on the system performance are considered in the training (learning) mode of a recognizer, then the competitors of $\lambda_o$ in $S_1$ should be considered more seriously. In other words, utterance $O$ should have more influence on the models of $S_1$ than on those of $S_2$. The aim is to reduce the size of $S_1$ to zero, which finally makes the decision correct.

2.3. The improved MMD approach
When we take the above considerations into account and combine them with the basic principle of discriminative training [12], a new definition of $D(\lambda_l, \Lambda)$, which relaxes the shortcoming of the definition in Eq. (3), is given as follows:

$$D(\lambda_l, \Lambda) = \frac{1}{T_l} \left[ \log P(O^l|\lambda_l) - \log \left( \frac{1}{M-1} \sum_{h, h \neq l} \left( P(O^l|\lambda_h) \right)^{\eta} \right)^{1/\eta} \right],$$  (6)

where $\eta$ is a positive number. When $\eta$ approaches $\infty$, the term in the bracket becomes $\max_{h, h \neq l} P(O^l|\lambda_h)$, i.e. only the top competitor is considered. When searching for the classifier parameters $\Lambda$, one can realize different weight distributions among the competitors of $\lambda_l$ by varying the value of $\eta$:

$$D(\Lambda) = \sum_{l=1}^{M} D(\lambda_l, \Lambda) = \sum_{l=1}^{M} \frac{1}{T_l} \log P(O^l|\lambda_l) - \sum_{l=1}^{M} \frac{1}{\eta T_l} \log \sum_{h, h \neq l} \left( P(O^l|\lambda_h) \right)^{\eta} - \sum_{l=1}^{M} \frac{1}{\eta T_l} \log \frac{1}{M-1}.$$  (7)
Since $D(\Lambda)$ is a smooth and differentiable function of the model parameter set $\Lambda$, traditional optimization procedures like the gradient scheme can be used to find the optimal solution of Eq. (4). The parameter adjustment rule is

$$\tilde{\Lambda}_{n+1} = \Lambda_n + \epsilon_n U_n \nabla D(\Lambda)|_{\Lambda = \Lambda_n},$$  (8)

where $\tilde{\Lambda}$ is used to distinguish from $\Lambda$, which satisfies the stochastic constraints on the HMM model parameters, such as $\sum_j a_{ij} = 1$ ($i = 1, 2, \ldots, N$). $\epsilon_n$ is a small positive number that satisfies certain stochastic convergence constraints [13]. $U_n$ can be an identity matrix or a properly designed positive-definite matrix, and $\nabla D(\Lambda)$ is the gradient vector of the target function with respect to the parameter set $\Lambda$. From Eq. (7), we get

$$\frac{\partial D(\Lambda)}{\partial \lambda_l} = \frac{1}{T_l} \frac{1}{P(O^l|\lambda_l)} \frac{\partial P(O^l|\lambda_l)}{\partial \lambda_l} - \sum_{i, i \neq l} \frac{R_i^l}{T_i} \frac{1}{P(O^i|\lambda_l)} \frac{\partial P(O^i|\lambda_l)}{\partial \lambda_l},$$  (9)
where

$$R_i^l = \frac{P^{\eta}(O^i|\lambda_l)}{\sum_{h, h \neq i} P^{\eta}(O^i|\lambda_h)},$$

and this weight makes the difference between the IMMD and the MMD approaches. If we let $P_{lk} = P(O^k|\lambda_l)$, then $R_k^l$ can be rewritten as $R_k^l = (P_{lk}/P_{kk})^{\eta} / \sum_{h, h \neq k} (P_{hk}/P_{kk})^{\eta}$. It can be seen that $P_{lk}/P_{kk}$ has a close relationship with the model distance $D(\lambda_k, \lambda_l)$ (refer to Eq. (2)). The term $P_{lk}/P_{kk}$ can be interpreted as the similarity between models $\lambda_l$ and $\lambda_k$ measured on the observation sequence $O^k$. Based on the above observations, $R_k^l$ can be explained as a relative similarity measure between $\lambda_l$ and $\lambda_k$ against all competitors of $\lambda_k$. If $\lambda_l$ is more similar to $\lambda_k$ than all other competitors of $\lambda_k$, then the probability of mis-recognizing $O^k$ as $\lambda_l$ is high. Also, if $R_k^l$ is larger than the other $R_k^h$ ($h \neq l, k$), then the token $O^k$ labeled as $\lambda_k$ has much influence on the model parameter adjustment of $\lambda_l$, which benefits the distinguishing ability of $\lambda_l$. Thus, this training procedure automatically focuses on those training data that are important for distinguishing between acoustically similar words.

It can also be seen that when $\eta$ approaches $\infty$, $R_k^l$ is equal to either 0 or 1. When $O^k$ makes $\lambda_l$ the top competitor of $\lambda_k$, then $R_k^l = 1$; in all other cases, $R_k^l$ equals 0:

$$R_k^l = \begin{cases} 1, & P(O^k|\lambda_l) = \max_h P(O^k|\lambda_h), \\ 0, & P(O^k|\lambda_l) \neq \max_h P(O^k|\lambda_h) \end{cases} \qquad (h = 1, 2, \ldots, M, \ h \neq k).$$
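Because the weights involve large powers of small probabilities, $R_i^l$ is best evaluated from log-likelihoods as a softmax; the sketch below is an illustration under that assumption, not code from the paper:

```python
import numpy as np

def competitor_weights(loglik, i, eta):
    """R_i^l for all l != i, given loglik[h] = log P(O^i | lambda_h).
    Computed as a softmax over eta-scaled competitor log-likelihoods,
    which stays numerically stable even for large eta."""
    idx = [h for h in range(len(loglik)) if h != i]   # competitors
    z = eta * np.asarray([loglik[h] for h in idx])
    z -= z.max()                                      # guard against overflow
    w = np.exp(z)
    return dict(zip(idx, w / w.sum()))                # R[l] = R_i^l
```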
This means that during the re-estimation of model parameters, IMMD not only improves the ability of $\lambda_l$ to distinguish its own tokens from others, but also reduces the possibility of $\lambda_l$ becoming the top competitor of its competitors. This improves the discriminative ability of the entire model set $\Lambda$. When $\eta$ approaches 0, $R_k^l$ approaches $1/(M-1)$, and IMMD degenerates to MMD. At this point, we can summarize the relationship between IMMD and MMD as follows.

• Both IMMD and MMD consider the discrimination effect of the competitive tokens on the parameter estimation of model $\lambda_l$. MMD considers all the competitive tokens at the same level, but IMMD weights the contribution of each competitive token to $\lambda_l$ by $R_k^l$ (relative influence). This is in fact a more reasonable way to utilize the given data than the MMD approach.
• MMD is a special case of IMMD, obtained by setting $\eta$ to 0.

In the case of a discrete HMM with $N$ states and $K$ distinct discrete observations, for model $\lambda_l$ we can derive the following parameter adjustment rules by using a proper positive-definite matrix sequence $U_n$ (for the details of the derivation, please refer to Appendix A):
$$\tilde{\pi}_i^{n+1} = \pi_i^n + \epsilon_n \left[ \bar{\gamma}_1^l(i) - \sum_{h, h \neq l} R_h^l \, \bar{\gamma}_1^h(i) \right], \quad i = 1, 2, \ldots, N,$$  (10a)

$$\tilde{a}_{ij}^{n+1} = a_{ij}^n + \epsilon_n \left[ \bar{\xi}_{ij}^l - \sum_{h, h \neq l} R_h^l \, \bar{\xi}_{ij}^h \right], \quad i, j = 1, 2, \ldots, N,$$  (10b)

$$\tilde{b}_j^{n+1}(k) = b_j^n(k) + \epsilon_n \left[ \bar{\gamma}_{jk}^l - \sum_{h, h \neq l} R_h^l \, \bar{\gamma}_{jk}^h \right], \quad j = 1, 2, \ldots, N, \ k = 1, 2, \ldots, K,$$  (10c)

where we use $\tilde{x}$ to distinguish from the $x$ that satisfies the stochastic constraints of the HMM: (i) $\sum_i \pi_i = 1$, (ii) $\sum_j a_{ij} = 1, \forall i$, (iii) $\sum_k b_j(k) = 1, \forall j$. Here $\bar{\gamma}_1^h(i) = P(q_1 = i|O^h, \lambda_l)/T_h$ is the expected frequency in state $i$ of $\lambda_l$ at time $t = 1$ in $O^h$ normalized by $T_h$; $\bar{\xi}_{ij}^h = (1/T_h) \sum_{t=1}^{T_h - 1} P(q_t = i, q_{t+1} = j|O^h, \lambda_l)$ is the expected
number of transitions from state $i$ to state $j$ of $\lambda_l$ in $O^h$ normalized by $T_h$; and $\bar{\gamma}_{jk}^h = (1/T_h) \sum_{t=1}^{T_h} P(q_t = j|O^h, \lambda_l)\,\delta(o_t, v_k)$ is the expected number of times of being in state $j$ of $\lambda_l$ and observing symbol $v_k$ in $O^h$, normalized by $T_h$. Comparing Eq. (10) with the corresponding equations in Ref. [1], the difference is in the use of an adaptive weighting factor for each acoustically similar utterance of $O^l$ instead of the constant factor $1/(M-1)$. The advantages of this adaptive weighting factor have been discussed above. To satisfy the stochastic constraints of the HMM, Eq. (10) should be normalized with $x_i = \tilde{x}_i / \sum_i \tilde{x}_i$.
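As an illustration of Eq. (10a) together with the normalization rule, the sketch below performs one IMMD step for the initial-state probabilities; the expected-count arrays and the weights $R_h^l$ are assumed to be produced by a forward–backward pass, and the small positive floor is a practical guard of our own, not part of the paper:

```python
import numpy as np

def immd_step_pi(pi, gamma1, R, l, eps):
    """One update of the initial-state vector of model lambda_l (Eq. (10a)).
    gamma1[h] = array of normalized expected state frequencies at t=1
    for utterance O^h scored against lambda_l; R[h] = R_h^l."""
    grad = gamma1[l].copy()
    for h, w in R.items():                 # competitors h != l
        grad -= w * gamma1[h]
    pi_new = pi + eps * grad
    pi_new = np.maximum(pi_new, 1e-12)     # assumed floor to keep pi_i > 0
    return pi_new / pi_new.sum()           # renormalize: sum_i pi_i = 1
```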
In the case of a continuous HMM with $N$ states, each state output distribution is a finite mixture of the form

$$b_j(o) = \sum_{k=1}^{K} c_{jk} N(o, \mu_{jk}, \Sigma_{jk}), \quad 1 \leq j \leq N,$$  (11)
where $o$ is the observation vector being modeled and $c_{jk}$ is the mixture coefficient for the $k$th mixture in state $j$, which satisfies the following stochastic constraint:

$$\sum_{k=1}^{K} c_{jk} = 1, \quad c_{jk} \geq 0, \quad 1 \leq j \leq N.$$  (12)
$N(\cdot)$ is any log-concave or elliptically symmetric density. Usually $N(\cdot)$ denotes a normal distribution with mean vector $\mu_{jk} = [\mu_{jkl}]_{l=1}^{L}$ and covariance matrix $\Sigma_{jk}$ for the $k$th mixture component in state $j$. For the sake of simplicity, $\Sigma_{jk}$ is assumed to be diagonal, i.e. $\Sigma_{jk} = [\sigma_{jkl}^2]_{l=1}^{L}$. Since $P(O|\lambda) = \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$, based on Eq. (11) we get

$$\frac{\partial P(O|\lambda)}{\partial c_{jk}} = \frac{P(O|\lambda)}{c_{jk}} \sum_t \gamma_t(j, k),$$  (13)

$$\frac{\partial P(O|\lambda)}{\partial \mu_{jkl}} = \frac{P(O|\lambda)}{\sigma_{jkl}^2} \sum_t \gamma_t(j, k)\,(o_{tl} - \mu_{jkl}),$$  (14)

$$\frac{\partial P(O|\lambda)}{\partial \sigma_{jkl}} = \frac{P(O|\lambda)}{\sigma_{jkl}} \sum_t \gamma_t(j, k) \left[ \left( \frac{o_{tl} - \mu_{jkl}}{\sigma_{jkl}} \right)^2 - 1 \right],$$  (15)
where $\gamma_t(j, k)$ is the probability of being in state $j$ at time $t$ with the $k$th mixture component accounting for $o_t$, i.e.,

$$\gamma_t(j, k) = \left[ \frac{\alpha_t(j)\beta_t(j)}{\sum_{j=1}^{N} \alpha_t(j)\beta_t(j)} \right] \left[ \frac{c_{jk} N(o_t, \mu_{jk}, \Sigma_{jk})}{\sum_{k=1}^{K} c_{jk} N(o_t, \mu_{jk}, \Sigma_{jk})} \right] = \begin{cases} \dfrac{1}{P(O|\lambda)}\, \pi_j\, \beta_1(j)\, c_{jk} N(o_1, \mu_{jk}, \Sigma_{jk}), & t = 1, \\[2ex] \dfrac{1}{P(O|\lambda)} \displaystyle\sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, \beta_t(j)\, c_{jk} N(o_t, \mu_{jk}, \Sigma_{jk}), & t > 1. \end{cases}$$  (16)
For model $\lambda_l$, from Eqs. (8) and (9) we can derive the following parameter adjustment rules by designing a proper positive-definite matrix sequence $U_n$:
$$\tilde{c}_{jk}^{n+1} = c_{jk}^n + \epsilon_n \left[ \bar{\gamma}^l(j, k) - \sum_{h, h \neq l} R_h^l \, \bar{\gamma}^h(j, k) \right],$$  (17)

$$\tilde{\mu}_{jkl}^{n+1} = \mu_{jkl}^n + \epsilon_n \left[ \Delta\bar{o}_l^{\,l} - \sum_{h, h \neq l} R_h^l \, \Delta\bar{o}_l^{\,h} \right],$$  (18)

$$\tilde{\sigma}_{jkl}^{n+1} = \sigma_{jkl}^n + \epsilon_n \sigma_{jkl}^n \left[ \bar{u}_{jkl}^{\,l} - \sum_{h, h \neq l} R_h^l \, \bar{u}_{jkl}^{\,h} \right],$$  (19)

where

$$\bar{\gamma}^h(j, k) = \frac{1}{T_h} \sum_{t=1}^{T_h} \gamma_t^h(j, k), \qquad \Delta\bar{o}_l^{\,h} = \frac{1}{T_h} \sum_{t=1}^{T_h} \gamma_t^h(j, k)\,(o_{tl}^h - \mu_{jkl}), \qquad \bar{u}_{jkl}^{\,h} = \frac{1}{T_h} \sum_{t=1}^{T_h} \gamma_t^h(j, k) \left[ \left( \frac{o_{tl}^h - \mu_{jkl}}{\sigma_{jkl}} \right)^2 - 1 \right].$$
The re-estimation formulas for $a_{ij}$ and $\pi_i$ are identical to those used for the discrete observation densities (Eq. (10)).

2.4. Computation complexity analysis

HMMs with continuous probability density functions will be used in our experiments, so the computation complexity of this case is analyzed. By introducing the weight factor $R_k^l$, IMMD needs extra computation to evaluate $R_k^l$. When calculating the contributions of $O^k$ to $\lambda_l$, $R_k^l$ must be estimated first. According to the definition of $R_k^l$, $M - 1$ forward calculations are needed to compute $P(O^k|\lambda_h)$ ($h = 1, \ldots, M$, $h \neq l$). It may seem obvious that IMMD will have much higher computation complexity than MMD. However, this is not the case, since the computation of $P(O^k|\lambda_h)$ ($h = 1, \ldots, M$) is also needed in MMD. MMD requires $M$ calculations of forward and backward variables in order to give a re-estimation of the entire model set $\Lambda$, but each model is trained sequentially. IMMD re-estimates the model set parameters in a simultaneous mode; $M$ calculations of forward and backward variables are needed to give a re-estimation of the entire model set $\Lambda$, and only two additions and one multiplication are needed to calculate one $R_k^l$. There are $M(M-1)$ values of $R_k^l$ to be computed for one re-estimation of the model set. Thus, a total of $2M(M-1)$ additions and $M(M-1)$ multiplications are required to calculate all $R_k^l$. In addition, extra multiplications on the order of $MNKD$ are needed during the model parameter estimation for those $R_k^l$ in the set $S_1$, where $D$ is the dimension of the feature vector. In total, on the order of $MNKD$ extra basic computations are needed for IMMD to re-estimate the model
parameters compared with MMD, which is only a small portion of the entire computation requirement, as explained below. Besides the calculations of the forward variable $\alpha_t(i)$ and the backward variable $\beta_t(i)$, on the order of $MNKD\bar{T}$ calculations are required to re-estimate the parameters of an HMM model, where $\bar{T}$ is the average length of the feature vector sequences (in frames). Therefore, the additional calculation complexity of IMMD is only $1/N\bar{T}$ of that of MMD, which can be ignored, because $N\bar{T}$ is usually larger than 100 under the typical setting $N = 5$ and $\bar{T} = 30$. If the calculations of the forward variable $\alpha_t(i)$ and the backward variable $\beta_t(i)$ are taken into account, the share of the additional calculation is even less than $1/N\bar{T}$ of the total. In conclusion, the additional computation complexity of IMMD compared with MMD is not a serious problem.

In principle, IMMD uses not only the labeled data, but also all the competitive data to estimate the parameters of model $\lambda_l$. It therefore has much higher computation complexity than the ML approach, about $M$ times that of ML, where $M$ is the number of models or basic recognition units, which usually is not a small number. For example, there are 60 distinct phonemes in the TIMIT database. Fortunately, several approaches have been designed to reduce the computation of IMMD without adversely affecting its performance. For example, the method used in Ref. [1] gives a hybrid training of ML and MMD. Here we give another method, based on the fact that most tokens are recognized correctly during the training procedure and the top likelihood of a token measured on the model set $\Lambda$ is much higher than the others:

(1) Initialize the model set $\Lambda$, then perform the following operations repeatedly until the re-estimation procedure converges.
(2) Define the competitive model set $\Omega_l$ of $O^l$: if $\log P(O^l|\lambda_h) > \log P(O^l|\lambda_l) - \ell$, then model $\lambda_h$ is a competitive model of token $O^l$ (see the sketch after this list).
(3) Calculate the contribution of every token $O^l$ to its own model and to every model in $\Omega_l$. Usually, the size of $\Omega_l$ is much less than $M$.
(4) Re-estimate $\lambda_l$ with the adjustment equations of IMMD.

In order to save some computation time, steps 3 and 4 can be repeated several times after each execution of step 2. This saves the time needed to define the competitive model set $\Omega_l$ of each $O^l$, which is usually done through forward calculation or the Viterbi algorithm; step 2 needs $M$ forward or Viterbi calculations to define $\Omega_l$ for each $O^l$. Adopting the above application procedure of IMMD, the computation complexity of IMMD was reduced to 4–5 times that of ML, as shown in our experiments.
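Step 2 of the procedure reduces, in code, to a thresholded comparison of log-likelihoods; the sketch below is illustrative (the set symbol $\Omega_l$ and margin $\ell$ follow the reconstruction above, and `loglik` is assumed to hold forward-algorithm scores of one token against all models):

```python
def competitive_set(loglik, l, margin):
    """Step 2: Omega_l = models whose log-likelihood on token O^l
    comes within `margin` of the labeled model's log-likelihood."""
    return [h for h in range(len(loglik))
            if h != l and loglik[h] > loglik[l] - margin]
```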
2.5. Extension to multiple observation sequences

Left-to-right HMMs are commonly used in speech recognition; hence, we cannot make reliable estimates of all model parameters from a single observation sequence, because the transient nature of the states within the model allows only a small number of observations for any state. To have sufficient data to make reliable estimates of all model parameters, one has to use multiple observation sequences. The modification of the re-estimation procedure is straightforward and is stated as follows. Let $O^{l} = [O^{l,1}, O^{l,2}, \ldots, O^{l,C_l}]$ be the training data labeled to model $\lambda_l$, i.e. the training data of $\lambda_l$ consists of $C_l$ observation sequences. The distance $D(\lambda_l, \Lambda)$ is redefined as
$$D(\lambda_l,\Lambda)=\frac{1}{C_l}\sum_{c=1}^{C_l}\frac{1}{T_{lc}}\left\{\log P(O_c^l\mid\lambda_l)-\log\left[\frac{1}{M-1}\sum_{h=1,h\neq l}^{M}\big(P(O_c^l\mid\lambda_h)\big)^{\eta}\right]^{1/\eta}\right\}.\tag{20}$$

Thus the modified re-estimation formulas for the model λ_l become

$$\pi_i^{n+1}=\pi_i^n+\varepsilon_n\left[\frac{1}{C_l}\sum_{c=1}^{C_l}\bar{\gamma}_1^{lc}(i)-\sum_{h=1,h\neq l}^{M}\frac{1}{C_h}\sum_{c=1}^{C_h}R_{hc}^{l}\,\bar{\gamma}_1^{hc}(i)\right],\quad i=1,2,\ldots,N,\tag{21a}$$

$$a_{ij}^{n+1}=a_{ij}^n+\varepsilon_n\left[\frac{1}{C_l}\sum_{c=1}^{C_l}\bar{\xi}_{ij}^{lc}-\sum_{h=1,h\neq l}^{M}\frac{1}{C_h}\sum_{c=1}^{C_h}R_{hc}^{l}\,\bar{\xi}_{ij}^{hc}\right],\quad i,j=1,2,\ldots,N,\tag{21b}$$

$$c_{jk}^{n+1}=c_{jk}^n+\varepsilon_n\left[\frac{1}{C_l}\sum_{c=1}^{C_l}\bar{\gamma}^{lc}(j,k)-\sum_{h=1,h\neq l}^{M}\frac{1}{C_h}\sum_{c=1}^{C_h}R_{hc}^{l}\,\bar{\gamma}^{hc}(j,k)\right],\tag{21c}$$

$$\mu_{jkd}^{n+1}=\mu_{jkd}^n+\varepsilon_n\left[\frac{1}{C_l}\sum_{c=1}^{C_l}\Delta\bar{o}_{jkd}^{lc}-\sum_{h=1,h\neq l}^{M}\frac{1}{C_h}\sum_{c=1}^{C_h}R_{hc}^{l}\,\Delta\bar{o}_{jkd}^{hc}\right],\tag{21d}$$

$$\sigma_{jkd}^{n+1}=\sigma_{jkd}^n+\varepsilon_n\,\sigma_{jkd}^n\left[\frac{1}{C_l}\sum_{c=1}^{C_l}\upsilon_{jkd}^{lc}-\sum_{h=1,h\neq l}^{M}\frac{1}{C_h}\sum_{c=1}^{C_h}R_{hc}^{l}\,\upsilon_{jkd}^{hc}\right],\tag{21e}$$

where all the intermediate variables are defined as before, but measured on λ_l with O^h_c instead of O^h.
3. Experimental results

To evaluate the performance of IMMD, isolated word recognition and continuous phoneme recognition experiments were carried out on the TIMIT database.
Table 1
Experimental results of isolated word recognition (error rates)

                MCE      MMI      ML       MMD      IMMD     Error reduction
Closed test     2.03%    0.82%    2.14%    1.73%    0.84%    51.44%
Open test       2.39%    2.38%    2.43%    2.06%    1.69%    17.96%
3.1. Isolated word recognition

For the isolated word experiment, a database of 21 words, "all, an, ask, carry, dark, don't, greasy, had, in, like, me, oily, rag, she, suit, that, to, wash, water, year, your", was used. These words were extracted from sentences SA1 and SA2 of TIMIT, the most common sentences of TIMIT; each has 630 utterances spoken by 630 speakers from the eight major dialect regions of the United States. In our experiments, 244 utterances of each sentence were used: 160 for HMM training and the other 84 for the open test. All utterances were parameterized using 12 mel-frequency cepstral coefficients (MFCC) and 12 delta-cepstral coefficients. For each word, the training data were extracted from the 488 utterances according to the time-alignment transcription provided by TIMIT. Twenty-one context-independent word models were used; each model has six states and three mixtures per state.

Table 1 shows the experimental results of the recognizers trained with IMMD, MMD and ML. For performance comparison, the error rates of recognizers trained with MCE and MMI are also listed in the left columns of Table 1. The results indicate that the proposed improved maximum model distance approach is superior to the original MMD, especially on the closed testing set, achieving a 17.96% error reduction for the open test and 51.44% for the closed test. Meanwhile, the experiment provides further support for the conclusion, reached in Ref. [9] on the basis of discrete HMMs, that MMD is superior to ML. All these results are expected, because IMMD utilizes more of the discriminative information in the given training data than MMD does, and MMD utilizes the training data more effectively than ML does. MMI has closed-test performance similar to that of IMMD, but its open-test performance is only close to that of ML. Although the HMMs trained by IMMD approximate the distribution of the training data with high precision, this does not carry over fully to open, unseen tokens because of the limited training data. This is a common problem for any statistically optimal method, i.e., a mismatch between the finite training data and the infinite test data always exists, which is
Table 2
Experimental results of continuous phoneme recognition

Training method     1-%Corr    1-%Acc
MCE                 13.87%     19.19%
MMI                 14.34%     19.41%
ML                  14.81%     19.78%
MMD                 14.02%     18.95%
IMMD                13.15%     18.04%
Error reduction      6.21%      4.80%
another open problem in improving the robustness of HMM-based recognizers.

3.2. Continuous phoneme recognition

The continuous phoneme recognition experiment is to recognize all the phonemes of the TIMIT database, which consists of 60 phonemes excluding silence. In the TIMIT database, no silence appears between any two sequential words, so we did not consider silence in the experiments. The end-points of utterances are marked by the labeled time transcription in TIMIT. The experiment uses 600 phonetically balanced utterances (including 23 147 phonemes); 400 utterances (including 15 688 phonemes) are used for training. All utterances were parameterized using 12 mel-frequency cepstral coefficients (MFCC) and 12 delta-cepstral coefficients. Sixty context-independent phone models were trained, each with three states and five mixtures per state. The acoustic models were trained using a bootstrapping technique that iterates two steps: the first uses the existing set of phone segments to train the acoustic models of the recognition system, and the second uses these acoustic models to perform forced recognition-segmentation of the training utterances, with the phonetic transcription given, resulting in a new set of phone segments. The initial segmentation of the training utterances was a simple uniform segmentation.

Let Cor, Del, Ins, Sub and W be, respectively, the number of correct phonemes, deletions, insertions,
Table 3
Experimental results as a function of the weight factor η

η                       0.06     0.12     0.25     0.5      1.0      1.5      2.5      5.0
Closed rec. rate (%)    98.77    98.89    99.02    99.07    99.16    99.04    99.06    98.92
Open rec. rate (%)      97.68    97.74    97.74    97.80    97.91    97.85    97.80    97.68
substitutions, and the total number of phonemes in the test speech, with Cor = W − Del − Sub. The percentage of phonemes correctly recognized is given by %Cor = (Cor/W) × 100% and the accuracy by %Acc = ((Cor − Ins)/W) × 100%. Table 2 shows the results of the recognizers trained with IMMD, MMD, ML, MCE and MMI. In recording the errors, only the top recognized phoneme string is used. The results again confirm that IMMD is superior to MMD, achieving a 6.21% error reduction in correct recognition rate and a 4.80% error reduction in accuracy. Meanwhile, the results in Table 2 suggest that IMMD is an attractive alternative for HMM training compared with algorithms such as ML, MMI and MCE.

3.3. Effect of the weight factor η

It was stated earlier that different values of η affect the distribution of the weighting factors R_k^l (k = 1, 2, ..., M, k ≠ l), which control the contributions of each competitive utterance of word l, and finally affect the performance of the recognizer. We investigated the influence of η on isolated word recognition; the results are shown in Table 3. η = 1.0 gives the best performance, and whether η moves above or below 1.0, the system performance degrades. A reasonable explanation is that η = 1.0 makes each utterance provide a natural contribution to its competitive models, which provides the optimal distribution for the training data. Another phenomenon is that the recognition rate changes slowly as η changes. We checked the likelihoods of the competitive models and found that the likelihood of the top competitive model is usually much higher than those of the other competitive models. This means that only a few of the R_k^l have useful values, i.e., values far above 0.0, and most of them are close to 0.0. Therefore, an utterance affects only a few top competitive models. Within the investigated range of η, the influence of each utterance on its competitive models changes only slightly for the values of η listed in Table 3. Therefore, only slight changes in system performance were observed in the experiments as η moves above or below 1.0, but the trend of the change is predictable.
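For concreteness, the scoring definitions above amount to the following few lines; the counts in the commented call are hypothetical (chosen only so that W matches the 23 147 test phonemes quoted above).

```python
def phoneme_scores(cor, dele, ins, sub):
    """%Cor and %Acc as defined above, with W = Cor + Del + Sub the total
    number of phonemes in the reference transcription."""
    W = cor + dele + sub
    return 100.0 * cor / W, 100.0 * (cor - ins) / W

# Hypothetical counts, for illustration only:
# phoneme_scores(cor=19750, dele=1200, ins=400, sub=2197)
```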
4. Conclusion

We have shown that the maximum model distance (MMD) criterion is superior to maximum likelihood because it uses some discriminative information to train each HMM [9]. However, MMD regards all competitive models as having the same importance when considering their contributions to the model re-estimation procedure. This is not entirely appropriate, since some competitive models may not be real competitors because their likelihoods are much lower than that of the labeled model. To achieve the best performance, different competitors should receive different levels of attention according to their competitive ability against the labeled model. This paper gave a solution to this problem, and a more reasonable HMM model distance was proposed; we call the method improved MMD (IMMD). HMM parameter re-estimation formulas were derived, from which we concluded that the improved MMD approach utilizes the training data more effectively than MMD. In fact, MMD is a special case of IMMD obtained by letting the weight factor η approach 0. The computation complexity of IMMD was also discussed; we showed that IMMD's complexity is comparable to that of MMD. Both the isolated word and continuous speech recognition experiments showed that a significant error reduction can be achieved through the proposed approach. In isolated word recognition, IMMD provided a 51.44% error reduction on the closed test and 17.96% on the open test; in continuous phoneme recognition, IMMD reduced the error in correct recognition rate by 6.21% and the error in accuracy by 4.80%. To address the limitations of HMMs, many extended models have been proposed in recent years, such as the segmental model [14,2] and the stochastic trajectory model [15]. Although IMMD is designed for standard HMM training, it could easily be extended to handle these extended HMM models.
Acknowledgements This work is supported in part by the City University of Hong Kong Strategic Grant 7000754, City University
of Hong Kong Direct Allocation Grant 7100081 and the National Natural Science Foundation of China Project 69881001.
Appendix A
A.1. Notations

Setting $\alpha_1(i)=\pi_i b_i(o_1)$, the forward variable $\alpha_t(i)=P(o_1 o_2\cdots o_t, q_t=i\mid\lambda)$ $(1\le t\le T)$ can be calculated with

$$\alpha_{t+1}(j)=\sum_{i=1}^{N}\alpha_t(i)\,a_{ij}\,b_j(o_{t+1}),\quad 1\le t\le T-1,\ 1\le j\le N.\tag{22}$$

Similarly, setting $\beta_T(i)=1$ for all $i$, the backward variable $\beta_t(i)=P(o_{t+1}o_{t+2}\cdots o_T\mid q_t=i,\lambda)$ can be calculated with

$$\beta_t(i)=\sum_{j=1}^{N}a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j),\quad T-1\ge t\ge 1,\ 1\le i\le N.\tag{23}$$

Then

$$P(O\mid\lambda)=\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)\tag{24a}$$
$$=\sum_{i=1}^{N}\alpha_t(i)\,\beta_t(i).\tag{24b}$$

Two other probability variables are used in deriving Eq. (10):

$$\gamma_t(i)=P(q_t=i\mid O,\lambda)=\frac{\alpha_t(i)\,\beta_t(i)}{P(O\mid\lambda)}\tag{25}$$

and

$$\xi_t(i,j)=P(q_t=i,\,q_{t+1}=j\mid O,\lambda)=\frac{\alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)}{P(O\mid\lambda)}.\tag{26}$$

A.2. Derivation of Eq. (10)

To derive Eq. (10), the key problem is to calculate the derivatives of $P(O\mid\lambda)$ according to Eqs. (8) and (9). Differentiating Eqs. (24a) and (24b), we get

$$\frac{\partial P}{\partial \pi_i}=b_i(o_1)\,\beta_1(i)=\frac{\alpha_1(i)\,\beta_1(i)}{\pi_i}=\frac{P(O\mid\lambda)\,\gamma_1(i)}{\pi_i},\tag{27}$$

$$\frac{\partial P}{\partial a_{ij}}=\sum_{t=1}^{T-1}\alpha_t(i)\,b_j(o_{t+1})\,\beta_{t+1}(j)=\frac{1}{a_{ij}}\sum_{t=1}^{T-1}\alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)=\frac{P(O\mid\lambda)}{a_{ij}}\sum_{t=1}^{T-1}\xi_t(i,j),\tag{28}$$

$$\frac{\partial P}{\partial b_{jk}}=\frac{1}{b_{jk}}\left[\sum_{t=1}^{T-1}\sum_{i=1}^{N}\alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)\,\delta(o_{t+1},v_k)+\pi_j\,b_j(o_1)\,\beta_1(j)\,\delta(o_1,v_k)\right]$$
$$=\frac{1}{b_{jk}}\left[\sum_{t=1}^{T-1}\alpha_{t+1}(j)\,\beta_{t+1}(j)\,\delta(o_{t+1},v_k)+\alpha_1(j)\,\beta_1(j)\,\delta(o_1,v_k)\right]=\frac{P(O\mid\lambda)}{b_{jk}}\sum_{t=1}^{T}\gamma_t(j)\,\delta(o_t,v_k).\tag{29}$$

From Eq. (9), we get

$$\frac{\partial D(\Lambda)}{\partial\pi_i^l}=\frac{1}{T_l}\frac{1}{P(O^l\mid\lambda_l)}\frac{\partial P(O^l\mid\lambda_l)}{\partial\pi_i^l}-\sum_{h=1,h\ne l}^{M}\frac{R_h^l}{T_h}\frac{1}{P(O^h\mid\lambda_l)}\frac{\partial P(O^h\mid\lambda_l)}{\partial\pi_i^l}=\frac{1}{\pi_i^l}\left[\bar{\gamma}_1^l(i)-\sum_{h=1,h\ne l}^{M}R_h^l\,\bar{\gamma}_1^h(i)\right],\tag{30}$$

where $\bar{\gamma}_1(i)=\gamma_1(i)/T$ is the expected frequency in state $i$ at time $t=1$ in $O$, normalized by the length of the observation sequence;

$$\frac{\partial D(\Lambda)}{\partial a_{ij}^l}=\frac{1}{T_l}\frac{1}{P(O^l\mid\lambda_l)}\frac{\partial P(O^l\mid\lambda_l)}{\partial a_{ij}^l}-\sum_{h=1,h\ne l}^{M}\frac{R_h^l}{T_h}\frac{1}{P(O^h\mid\lambda_l)}\frac{\partial P(O^h\mid\lambda_l)}{\partial a_{ij}^l}=\frac{1}{a_{ij}^l}\left[\frac{1}{T_l}\sum_{t=1}^{T_l-1}\xi_t^l(i,j)-\sum_{h=1,h\ne l}^{M}\frac{R_h^l}{T_h}\sum_{t=1}^{T_h-1}\xi_t^h(i,j)\right]=\frac{1}{a_{ij}^l}\left[\bar{\xi}_{ij}^l-\sum_{h=1,h\ne l}^{M}R_h^l\,\bar{\xi}_{ij}^h\right],\tag{31}$$

where $\bar{\xi}_{ij}=(1/T)\sum_{t=1}^{T-1}\xi_t(i,j)$ is the expected number of transitions from state $i$ to state $j$ in $O$, normalized by the length of the observation sequence; and

$$\frac{\partial D(\Lambda)}{\partial b_{jk}^l}=\frac{1}{T_l}\frac{1}{P(O^l\mid\lambda_l)}\frac{\partial P(O^l\mid\lambda_l)}{\partial b_{jk}^l}-\sum_{h=1,h\ne l}^{M}\frac{R_h^l}{T_h}\frac{1}{P(O^h\mid\lambda_l)}\frac{\partial P(O^h\mid\lambda_l)}{\partial b_{jk}^l}=\frac{1}{b_{jk}^l}\left[\bar{\gamma}_{jk}^l-\sum_{h=1,h\ne l}^{M}R_h^l\,\bar{\gamma}_{jk}^h\right],\tag{32}$$

where $\bar{\gamma}_{jk}=(1/T)\sum_{t=1}^{T}\gamma_t(j)\,\delta(o_t,v_k)$ is the expected number of times in state $j$ observing symbol $v_k$ in $O$, normalized by the length of the observation sequence.

If we design the positive-definite matrix $U_n$ as a diagonal matrix in which the elements corresponding to $\pi_i$, $a_{ij}$ and $b_j(k)$ are $\pi_i^n$, $a_{ij}^n$ and $b_j^n(k)$, respectively (which ensures that $U_n$ is positive definite), then we obtain the adjustment rule by substituting Eqs. (30)-(32) into Eq. (8):

$$\pi_i^{n+1}=\pi_i^n+\varepsilon_n\left[\bar{\gamma}_1^l(i)-\sum_{h=1,h\ne l}^{M}R_h^l\,\bar{\gamma}_1^h(i)\right],\quad i=1,2,\ldots,N,\tag{33a}$$

$$a_{ij}^{n+1}=a_{ij}^n+\varepsilon_n\left[\bar{\xi}_{ij}^l-\sum_{h=1,h\ne l}^{M}R_h^l\,\bar{\xi}_{ij}^h\right],\quad i,j=1,2,\ldots,N,\tag{33b}$$

$$b_{jk}^{n+1}=b_{jk}^n+\varepsilon_n\left[\bar{\gamma}_{jk}^l-\sum_{h=1,h\ne l}^{M}R_h^l\,\bar{\gamma}_{jk}^h\right],\quad j=1,2,\ldots,N,\ k=1,2,\ldots,M.\tag{33c}$$
References

[1] Y. Gotoh, M.M. Hochberg, H.F. Silverman, Efficient training algorithms for HMM's using incremental estimation, IEEE Trans. Speech Audio Process. 6 (6) (1998) 539-548.
[2] M. Ostendorf, V.V. Digalakis, O.A. Kimball, From HMM's to segment models: a unified view of stochastic modeling for speech recognition, IEEE Trans. Speech Audio Process. 4 (5) (1996) 360-378.
[3] L.R. Bahl, P.F. Brown, P.V. de Souza, R.L. Mercer, Maximum mutual information estimation of hidden Markov model parameters for speech recognition, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Tokyo, Japan, April 1986, pp. 49-52.
[4] Nam Soo Kim, Chong Kwan Un, Deleted strategy for MMI-based HMM training, IEEE Trans. Speech Audio Process. 6 (3) (1998) 299-303.
[5] Y. Ephraim, A. Dembo, L.R. Rabiner, A minimum discrimination information approach for hidden Markov modeling, IEEE Trans. Inform. Theory 35 (5) (1989) 1001-1003.
[6] W. Chou, C.H. Lee, B.H. Juang, F.K. Soong, A minimum error rate pattern recognition approach to speech recognition, Int. J. Pattern Recognition Artif. Intell. 8 (1) (1994) 5-31.
[7] Biing-Hwang Juang, Wu Chou, Chin-Hui Lee, Minimum classification error rate methods for speech recognition, IEEE Trans. Speech Audio Process. 5 (3) (1997) 257-265.
[8] Y. Ephraim, L.R. Rabiner, On the relations between modeling approaches for speech recognition, IEEE Trans. Inform. Theory 36 (2) (1990) 372-380.
[9] S. Kwong, Q.H. He, K.F. Man, K.S. Tang, A maximum model distance approach for HMM-based speech recognition, Pattern Recognition 31 (3) (1998) 219-229.
[10] B.H. Juang, L.R. Rabiner, A probabilistic distance measure for hidden Markov models, AT&T Tech. J. 64 (2) (1985) 391-408.
[11] T. Petrie, Probabilistic functions of finite state Markov chains, Ann. Math. Statist. 40 (1) (1969) 97-115.
[12] L. Rabiner, B.H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, NJ, 1993 (Chapter 5).
[13] P.C. Chang, B.H. Juang, Discriminative template training for dynamic programming speech recognition, Proceedings of ICASSP-92, Vol. I, San Francisco, March 1992, pp. 493-496.
[14] M. Ostendorf, S. Roukos, A stochastic segment model for phoneme-based continuous speech recognition, IEEE Trans. Acoust. Speech Signal Process. 37 (12) (1989) 1857-1869.
[15] Y.F. Gong, Stochastic trajectory modeling and sentence searching for continuous speech recognition, IEEE Trans. Speech Audio Process. 5 (1) (1997) 33-44.
About the Author: S. KWONG received his B.Sc. and M.A.Sc. degrees in electrical engineering from the State University of New York at Buffalo, USA, and the University of Waterloo, Canada, in 1983 and 1985, respectively. In 1996, he obtained his Ph.D. from the University of Hagen, Germany. From 1985 to 1987, he was a diagnostic engineer with Control Data Canada, where he designed diagnostic software to detect manufacturing faults in the VLSI chips of the Cyber 430 machine. He later joined Bell Northern Research Canada as a Member of Scientific Staff, where he worked on both the DMS-100 voice network and the DPN-100 data network projects. In 1990, he joined the City University of Hong Kong as a lecturer in the Department of Electronic Engineering. He is currently an Associate Professor in the Department of Computer Science. His research interests are in genetic algorithms, speech processing and recognition, data compression and networking.

About the Author: QIANHUA HE received the B.S. degree from Hunan Normal University, Changsha City, China, in 1987, the M.S. degree from Xi'an Jiaotong University, Xi'an City, China, in 1990, and the Ph.D. degree from the South China University of Technology (SCUT), Guangzhou City, China, in 1993. From May 1994 to April 1996 he was a research assistant, and from July 1998 to June 1999 a senior research assistant, at the City University of Hong Kong. He is now an associate professor in the Department of Electronic Engineering of SCUT, where he teaches graduate courses and does research in speech processing and evolutionary algorithms.
About the Author: DR. TANG obtained his B.Sc. from the University of Hong Kong in 1988, and both his M.Sc. and Ph.D. from the City University of Hong Kong in 1992 and 1996, respectively. Prior to his doctorate programme, which started in September 1993, he worked in Hong Kong industry for over five years. He joined the City University of Hong Kong as a Research Assistant Professor in 1996. He is a member of the IFAC Technical Committee on Optimal Control (Evolutionary Optimisation Algorithms) and a member of the Intelligent Systems Committee of the IEEE Industrial Electronics Society. His research interests include evolutionary algorithms and chaotic theory.

About the Author: K.F. MAN was born in Hong Kong and obtained his Ph.D. in Aerodynamics from Cranfield Institute of Technology, UK, in 1983. After some years working in the UK aerospace industry, he returned to Hong Kong in 1988 to join the City University of Hong Kong, where he is currently an Associate Professor in the Department of Electronic Engineering. He is also a Concurrent Research Professor with the South China University of Technology, Guangzhou, China. Dr. Man is an Associate Editor of the IEEE Transactions on Industrial Electronics and a member of the Administrative Committee of the IEEE Industrial Electronics Society. He serves on the IFAC technical committees on Real-time Software Engineering and on Algorithms and Architectures for Real-time Control. His research interests include active noise control, chaos and nonlinear control systems design, and genetic algorithms.
Pattern Recognition 33 (2000) 1759-1770

Classification of temporal sequences via prediction using the simple recurrent neural network
Lalit Gupta*, Mark McAvoy, James Phegley
Department of Electrical Engineering, Southern Illinois University, Carbondale, IL 62901, USA
Received 9 February 1999; received in revised form 28 June 1999; accepted 28 June 1999
Abstract

An approach to classify temporal sequences using the simple recurrent neural network (SRNN) is developed in this paper. A classification problem is formulated as a component prediction problem and two training methods are described to train a single SRNN to predict the components of temporal sequences belonging to multiple classes. Issues related to the selection of the dimension of the context vector and the influence of the context vector on classification are identified and investigated. The use of a different initial context vector for each class is proposed as a means to improve classification and a classification rule which incorporates the different initial context vectors is formulated. A systematic method in which the SRNN is trained with noisy exemplars is developed to enhance the classification performance of the network. A 4-class localized object classification problem is selected to demonstrate that (a) a single SRNN can be trained to classify real multi-class sequences via component prediction, (b) the classification accuracy can be improved by using a distinguishing initial context vector for each class, and (c) the classification accuracy of the SRNN can be improved significantly by using the distinguishing initial context vector in conjunction with the systematic re-training method. It is concluded that, through the approach developed in this paper, the SRNN can robustly classify temporal sequences which may have an unequal number of components. 2000 Published by Elsevier Science Ltd on behalf of Pattern Recognition Society.

Keywords: Recurrent neural network; Prediction; Classification; Temporal sequence
1. Introduction

This paper focuses on developing an approach to classify, via prediction, real-valued temporal sequences using recurrent neural networks. A temporal sequence consists of a series of components (scalars or vectors) with an inherent ordering. Examples of real temporal sequences include the feature vectors derived in speech [1-5] and object recognition problems [6-11]. The classification of temporal sequences tends to be complex and challenging because the process not only involves the classification of components but must also take into account the ordering of the components. The complexity increases even further when the temporal sequences have similar
* Corresponding author. Tel.: +1-618-536-2364; fax: +1-618-453-7972.
intra-class and inter-class components, because each component must be classified with respect to its local context within a sequence. Additional complexity is introduced when temporal sequences from different classes have an unequal number of components.

One of the most frequently used neural networks for pattern classification is the multilayer perceptron (MLP), a feedforward network trained to produce a spatial output pattern in response to an input spatial pattern. The mapping performed is static; therefore, the network is inherently not suitable for processing temporal patterns. Additionally, the dimensions of the pattern vectors across all classes must be equal; therefore, the MLP is also not suitable for classifying temporal patterns which may have unequal durations. Attempts have been made to use the MLP to classify temporal patterns by transforming the temporal domain into a spatial domain [1,2,12]. This approach has been used in
0031-3203/00/$20.00 2000 Published by Elsevier Science Ltd on behalf of Pattern Recognition Society. PII: S 0 0 3 1 - 3 2 0 3 ( 9 9 ) 0 0 1 4 9 - 1
developing time-delay neural networks [1], in which the units integrate activities from adjacent time-delayed vectors, which allows each vector to be weighted separately in time. The backpropagation algorithm is used to train the network given the target values for the output units at various times. The time-delay network has been applied to problems in speech recognition [1,2] and time series prediction [13]. In this approach, however, past events are not stored and have no influence on the network output; therefore, contextual classification of features is not possible.

In other attempts related to automatic target recognition [6], MLPs have been proposed for the temporal classification of sequences of local features. The responses of the network to a sequence of local feature vectors are stored in a response array in the order they occur. A dynamic alignment procedure is applied to the response array to compensate for temporal variations in the sequence of local test feature vectors while maintaining the inherent ordering of the feature vectors. In a related approach, the input sequence is optimally aligned with the input of a feedforward network using dynamic alignment [4].

An alternate neural network approach is to use recurrent networks, which have memory to encode past history. Several forms of recurrent networks have been proposed, and they may be classified as fully recurrent or partially recurrent networks. In fully recurrent networks, any unit may be connected to any other unit in the network, and individual units may be input units, output units, or both. Examples of fully recurrent networks include networks trained using (a) backpropagation through time, (b) recurrent backpropagation, and (c) real-time recurrent learning rules. Backpropagation through time is a training method for fully recurrent networks which allows backpropagation with weight sharing to be used to train an unfolded feedforward, non-recurrent version of the original network [14,15]. Once trained, the weights from any layer of the unfolded net are copied into the recurrent network, which, in turn, is used for the temporal mapping task. Relatively few applications of this technique exist due to its inefficiency in handling long sequences. In networks trained using recurrent backpropagation, the backpropagation algorithm is extended directly to train fully recurrent neural networks whose units are assumed to have continually evolving states. The extension was originally restricted to learning static mappings [16,17] and was subsequently modified for the storage of dynamic patterns [18]. Examples of applications of these networks include learning limit cycles in 2-D space [18] and time series prediction [19]. Backpropagation through time and recurrent backpropagation are off-line training methods. For long sequences or sequences of unknown duration, real-time recurrent learning is employed to perform on-line training, i.e., the weights are updated while the network is running rather than at the end of the presented sequence
[20]. To save computation and memory requirements, the error is minimized at each time step instead of at the end of the sequence. This method allows recurrent networks to learn tasks that require retention of information over time periods having fixed or indefinite duration.

In partially recurrent networks, partial recurrence is created by feeding back delayed hidden-unit outputs or the outputs of the network as additional input units [21-24]. One example of such a network derived from the MLP is the simple recurrent neural network (SRNN), in which hidden-unit outputs delayed by one time unit are fed back as context units using fixed unity weights [23,24]. In this network, the current output is a function of the current externally applied input and the hidden-layer outputs from the previous cycle, which are fed back as context inputs into the network. Numerous studies have been conducted to demonstrate temporal processing in SRNNs. Examples include grammatical inference studies [23,24], recognition and generation of strings from finite state machines [25,26], speech recognition [5], and interval and rate invariance studies [27]. Most SRNN-related studies involve processing noise-free binary temporal sequences with orthonormal components [23-26] or fixed-duration feature vectors having low dimensions [5].

This paper focuses on the SRNN due to its relatively simple and well-defined structure, its simplicity in training, and its widespread use in prediction applications. The prediction capabilities of the SRNN have been investigated in Refs. [28,29], where it was shown that a single SRNN can be trained to robustly predict, in context, the components of real temporal sequences belonging to different classes. It was also shown in Ref. [29] that the SRNN is tolerant to substitution, insertion and deletion errors in the components of a temporal sequence. The goal of this paper is to extend the capabilities of the SRNN by developing an approach to robustly classify temporal sequences via component prediction. Therefore, a classification problem is formulated as a component prediction problem and investigations are designed to systematically:

(a) develop effective methods to fully train a single SRNN to predict the real components of long multi-class temporal sequences which may have an unequal number of components;
(b) develop a classification approach to assign a test sequence to a pattern class based upon the prediction of the components of the test sequence;
(c) improve the classification performance of the SRNN by using a distinguishing initial context vector;
(d) evaluate the classification accuracy of the resulting SRNN classifier as a function of noise in the test sequences;
(e) improve the classification accuracy of the SRNN by systematically re-training the network with noisy training sets.
2. Simple recurrent neural network architecture

The specific architecture of the SRNN used in this paper is shown in Fig. 1. The network consists of two layers of neurons: the output layer and the context layer. The input to the network consists of a concatenation of the externally applied input X(n) and the context input H(n−1), which is the prior output of the context layer. The network is fully interconnected between the output and context layers and between the context layer and the concatenated inputs. If W^{HY}, W^{XH} and W^{HH} represent the interconnection matrices between the units in the output and context layers, the context layer and the external input, and the context layer and the context input, respectively, then the output of the ith unit in the output layer is given by

$$y_i(n)=\frac{1}{1+\exp[-W_i^{HY}H(n)]}\quad\text{for } i=1,2,\ldots,N\tag{1}$$

and the output of the jth unit in the context layer is given by

$$h_j(n)=\frac{1}{1+\exp\{-[W_j^{HH}H(n-1)+W_j^{XH}X(n)]\}}\quad\text{for } j=1,2,\ldots,H,\tag{2}$$

where $W_i^{HY}$ is the ith row of the interconnection matrix $W^{HY}$, $W_j^{HH}$ is the jth row of $W^{HH}$, and $W_j^{XH}$ is the jth row of $W^{XH}$.

Fig. 1. The SRNN architecture.

3. The classification via prediction approach

In general, the network can be trained to output an N_1-element vector in response to an N_2-element input vector using the backpropagation training algorithm. Therefore, the network can be trained on a prediction task in which the input is a component of a temporal sequence and the output is a prediction of the next component in the sequence. The dimensions N_1 and N_2 are equal in such prediction tasks. For the C-class problem, let a temporal sequence from class ω_i, i = 1, 2, ..., C, be represented by

$$S_i=\{s_{ik}(j)\},\quad k=1,2,\ldots,K_i,\ j=1,2,\ldots,J,$$

where s_ik is the kth component in S_i, K_i is the number of components in S_i, and J is the dimension of each component. Then, a prediction task can be designed in which the network is trained to predict component s_{i,k+1} given the input component s_ik.

3.1. Classification via prediction

The SRNN can be used to classify temporal sequences by first freezing the interconnection weights after fully training the network and formulating a classification rule based on the prediction results obtained from a test sequence. For example, if the input test sequence is S_i, the network will predict the components of S_i in their inherent ordering. The discrepancy between the network output and the expected network output for each sequence class can be used to classify the test feature vector to the pattern class with the smallest discrepancy. For example, if D_i is the discrepancy between a test sequence T and S_i, the nearest-neighbor rule can be used to assign the test sequence to the class ω_{i*}, where i* is given by

$$i^*=\arg\min_i\,[D_i],\quad i=1,2,\ldots,C.\tag{3}$$

The discrepancy computation, which is external to the SRNN, is easily determined when the numbers of temporal feature vectors belonging to all C classes are equal. However, determining the discrepancy for unequal numbers, especially when context-sensitive classification is required, is not an easy task. Because classification is based on the prediction of components which have the same dimension J within and across classes, the dimension of the input vector does not depend on the number of components K_i of the temporal sequences. Therefore, the approach developed in this paper assumes that the temporal sequences are periodic and can be circularly extended. Each sequence S_i = {s_ik(j)} is circularly extended to form a sequence G_i = {g_ik(j)} which has the same number of components K = max_i(K_i). That is, the sequences are extended to have the same number of components as that of the longest sequence. The network is trained and tested using components of the extended vectors. The discrepancy between T and G_i can be computed as

$$D_i=(1/K)\sum_{k=1}^{K}[g_{ik}-\hat{t}_k]^2,\tag{4}$$

where $\hat{t}_k$ is the network output when the input is component $t_{k-1}$. Hereafter, predicted components will be represented using the symbol $\hat{}$.
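A minimal sketch of Eqs. (1)-(4) follows; the weight matrices are assumed to have been trained already, and the wrap-around term of the circular extension is omitted for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def srnn_predict(W_xh, W_hh, W_hy, X, h0):
    """Run the SRNN of Eqs. (1)-(2) over the K x J input sequence X,
    starting from the initial context vector h0; returns the K outputs."""
    h, Y = h0, []
    for x in X:
        h = sigmoid(W_hh @ h + W_xh @ x)    # context layer, Eq. (2)
        Y.append(sigmoid(W_hy @ h))         # output layer, Eq. (1)
    return np.array(Y)

def classify(T_seq, G_list, W_xh, W_hh, W_hy, h0):
    """Nearest-neighbour rule of Eqs. (3)-(4): the output at step k is the
    prediction of component k+1, so it is compared with g_{i,k+1}."""
    Y = srnn_predict(W_xh, W_hh, W_hy, T_seq, h0)
    D = [np.mean((G_i[1:] - Y[:-1]) ** 2) for G_i in G_list]
    return int(np.argmin(D))
```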
3.2. Classification rule to incorporate different initial context vectors

The initial value stored in the context units is denoted by the vector Φ. The standard rule is to assign an initial value equal to 0.5 to the elements of Φ [23-27]. Because no justification is provided for selecting this initial value, it is presumed that 0.5 is selected because it is the mid-point of the range of values taken by the context units. It has been shown that the prediction of the beginning (first few) components of a sequence is poor when Φ = 0.5 is used [28,29]. This is because the same context is used to predict the beginning components of different sequences; that is, not enough distinguishing context is built into the network for the accurate prediction of the beginning components of different sequences using Φ = 0.5. This adverse effect increases when the dimension of the context vector, relative to the input vector, is increased, because the context vector then has a greater influence on the prediction. Using Φ = 0.5, a test sequence T is presented once to the network input and the discrepancies between the network output T̂ and the sequences from all C classes are computed using Eq. (4). The test sequence is assigned to a class using Eq. (3).

Instead of using Φ = 0.5, it is proposed that a unique Φ_i be used for each class ω_i. If the dimension of the context vector is H, then Φ_i is selected as the first H elements of the sequence G_i; that is, Φ_i is the first H elements of the vector formed by concatenating the components of G_i. Using this approach, the vector Φ_i is used as the initial context vector during training and testing. It is also used to separate sequences during training.

During testing, a test sequence T is hypothesized to belong to class ω_i and is tested using Φ_i as the initial context vector. If the network output using Φ_i for the initial context vector is denoted by T̂_i = {t̂_ik(j)}, the discrepancy between T̂_i and G_i is computed as

$$D_i=(1/K)\sum_{k=1}^{K}[g_{ik}-\hat{t}_{ik}]^2.$$

The testing of the sequence is repeated for all C classes using the corresponding initial context vectors Φ_i, i = 1, 2, ..., C; that is, each test sequence is presented to the network input C times. The network output for each presentation is compared only with the expected output of the hypothesized class to determine the discrepancy for the hypothesized class. The motivation for developing this approach is to force the network to give correct prediction results for sequence G_i by using the correct initial context vector Φ_i and to give poor predictions when incorrect initial context vectors are used. That is, by using Φ_i, the discrepancy for the correct class of the input is decreased and the discrepancy for the incorrect classes is increased.

4. SRNN training methods

In order to accurately classify temporal sequences via prediction, the prediction of each component must be accurate. Therefore, the network should be fully trained to satisfy a defined convergence criterion for each predicted component. In order to train an SRNN, a decision has to be made whether to model the composite training vector made up of the sequences from all C classes as a finite-duration vector or as a circular vector. The end-components do not wrap around for finite-duration vectors; for circular vectors, the end-components wrap around so that the last and first components form an input-output pair. The context unit vector is reset to Φ in order to separate sequences when the composite training vector is modelled as a finite-duration vector, whereas the vector is not reset when it is modelled as a circular vector. For the multi-class classification problem, two distinct training methods can be formulated, viz., 'class-incremental training' and 'combined classes training'. It should be noted that in both training methods the network is trained to predict the sequences by presenting the network with pairs of components, not with entire sequences.

4.1. Class-incremental training

Training the network to predict the components of sequences modelled as finite-duration training vectors using class-incremental training proceeds as follows. The network is first trained, component by component, to predict the components of class ω_1 using the training vector [Φ, G_1; Φ, G_1; ...], where Φ, G_1 = [(Φ, g_11), g_12, ..., g_1K]. The resulting fully trained network is represented by N(ω_1). The network N(ω_1) is retrained component by component using the class-augmented training vector [Φ, G_1; Φ, G_2; Φ, G_1; Φ, G_2; ...] to give the network N(ω_1, ω_2) trained to predict the components of classes ω_1 and ω_2. This class-incremental procedure is repeated until the network N(ω_1, ω_2, ..., ω_C), trained to predict the components of all C classes, is obtained. Convergence during training is tested for each input-output pair of components by comparing the squared difference between the component g_ik and its prediction ĝ_ik with a convergence threshold τ. Therefore, in the network N(ω_1, ω_2, ..., ω_C), training terminates only when

$$[g_{ik}-\hat{g}_{ik}]^2\le\tau\quad\text{for all } k \text{ and } i.$$

The class-incremental training method can also be used to train the network given a set of training exemplars $\tilde{G}_i^{q}=\{\tilde{g}_{ik}^{q}(j)\}$, $q=1,2,\ldots,Q_i$, for each class i, where Q_i is the number of exemplars in the ith class. For this case, given an input $\tilde{g}_{ik}^{q}$, the network is trained to output the centroid $\bar{g}_{i,k+1}$ computed from the exemplar set. The network is trained using

$$[\Phi,\tilde{G}_1^1;\Phi,\tilde{G}_1^2;\ldots;\Phi,\tilde{G}_1^{Q_1};\Phi,\tilde{G}_1^1;\Phi,\tilde{G}_1^2;\ldots]$$

to give the fully trained network N(ω_1) for class ω_1. The fully trained networks N(ω_1, ω_2), N(ω_1, ω_2, ω_3), ..., N(ω_1, ω_2, ..., ω_C) are systematically obtained using the class-incremental training procedure. If the training vectors are modelled as circular vectors, the systematic class-incremental training procedure is exactly the same except that Φ is used only once at the beginning and is not used to separate sequences or classes. For example, to obtain the fully trained network N(ω_1, ω_2) from N(ω_1), the network is trained using [Φ, G_1; G_2; G_1; G_2; ...].

4.2. Combined classes training

In this method of training, the network N(ω_1, ω_2, ..., ω_C) is directly obtained by training the network component by component with concatenated sequences of the form

$$[\Phi,G_1;\Phi,G_2;\ldots;\Phi,G_C;\Phi,G_1;\Phi,G_2;\ldots;\Phi,G_C;\ldots]$$

if the training vectors are modelled as finite-duration vectors, or with concatenated sequences of the form

$$[\Phi,G_1;G_2;\ldots;G_C;G_1;G_2;\ldots;G_C;\ldots]$$

if the training vectors are modelled as circular vectors. Similarly, when training with exemplar sets, the network N(ω_1, ω_2, ..., ω_C) is directly obtained by training the network with concatenated sequences of the form

$$[\Phi,\tilde{G}_1^1;\Phi,\tilde{G}_1^2;\ldots;\Phi,\tilde{G}_C^{Q_C};\Phi,\tilde{G}_1^1;\ldots]$$

if the training vectors are modelled as finite-duration vectors, and

$$[\Phi,\tilde{G}_1^1;\tilde{G}_1^2;\ldots;\tilde{G}_C^{Q_C};\tilde{G}_1^1;\ldots]$$

if the training vectors are modelled as circular vectors. The test for convergence is the same as the test used in class-incremental training.

5. A localized object classification problem

In order to demonstrate the classification via prediction approach, a localized object classification problem, which is of considerable interest with both neural network and conventional methods [6-11,30-33], was selected. In localized classification, an object is represented by a set of local feature vectors which characterize parts of the object; the composite feature vector consisting of local feature vectors is therefore a temporal sequence. Temporal sequences for the four simulated two-dimensional objects labelled F_1, F_2, F_3 and F_4 in Fig. 2 were derived using the localized contour sequence (LCS) representation, which has been shown to be an effective representation for extracting local boundary-based features [7]. In the LCS, each pixel on the boundary is represented by the perpendicular Euclidean distance between the boundary pixel and the chord connecting the end-points of an odd-numbered window centered on the boundary pixel. A local feature is defined by the segment of the LCS between two specified or randomly selected points on the object boundary. The LCSs of the four objects shown in Fig. 2, normalized to take values between 0 and 1, are shown in Fig. 3. For the experiments conducted in this paper, the components were selected as the 20-point segments between the dashed vertical lines in the figures. The circularly extended sequences are shown in Fig. 4. The resulting extended temporal sequence from object F_i, i = 1, 2, 3, 4, consisting of a sequence of components, is denoted using the notation for extended sequences, i.e., by G_i = {g_ik(j)}. The temporal properties of the four extended sequences are demonstrated by using a representation of the form

G_1: {g, g, g, g, g, g, g, a, b, g, g, g, g, g, g, g};
G_2: {b, g, g, b, g, g, g, g, a, b, b, g, g, b, g, g};
G_3: {b, b, c, g, g, g, g, g, c, g, g, g, g, b, b, b};
G_4: {c, g, g, g, d, g, g, g, g, g, d, g, g, g, b, g}.

In this representation, features that are unique are denoted by g (i.e., g_ik) and features that are similar are denoted using a, b, .... The occurrence of similar features within and across classes is clearly evident in the representations. For example, the feature b occurs 5 times in sequences G_2 and G_3 and also occurs in all four classes. Additionally, the occurrence of adjacent pairs of the form (a, b) and (a, j) clearly shows the need for the prediction of segments b and j to be context sensitive.

In order to analyze the classification performance of the network in noise, a noisy sequence G̃_i is generated by adding zero-mean Gaussian noise n_i with variance σ_i² directly to the noise-free samples of G_i. That is, the noisy sequence is given by

$$\tilde{G}_i=\{\tilde{g}_{ik}(j)\}=\{g_{ik}(j)+n_{ik}(j)\},$$

where n_ik(j) is a sample drawn from the density function

$$p_{n_i}(n)=(2\pi\sigma_i^2)^{-1/2}\exp[-n^2/(2\sigma_i^2)].$$

The variance σ_i² is specified to generate a noisy sequence G̃_i from class ω_i with a specified mean-square signal-to-noise ratio SNR_i; that is,

$$\mathrm{SNR}_i=\Big\{[1/(JK)]\sum_{k=1}^{K}\sum_{j=1}^{J}[g_{ik}(j)]^2\Big\}\Big/\sigma_i^2.$$

Examples of noisy sequences G̃_i are shown in Fig. 5.

Fig. 2. Simulated objects.
Fig. 3. Localized contour sequences of the objects shown in Fig. 2.
Fig. 4. Extended temporal sequences.
Fig. 5. Examples of noisy temporal sequences: (a) SNR = 4 dB; (b) SNR = 0 dB.
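A small helper that realizes this noise model is sketched below; the dB conversion reflects the SNR values quoted in the tables that follow.

```python
import numpy as np

def add_noise(G, snr_db, rng=None):
    """Corrupt an extended sequence G (a K x J array) with zero-mean
    Gaussian noise whose variance follows the mean-square SNR definition
    above: SNR = mean(g^2) / sigma^2, with the SNR quoted in dB."""
    rng = np.random.default_rng() if rng is None else rng
    snr = 10.0 ** (snr_db / 10.0)
    sigma = np.sqrt(np.mean(G ** 2) / snr)
    return G + rng.normal(0.0, sigma, size=G.shape)
```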
6. Classification experiments

Several experiments were designed to investigate issues related to network dimensions and network retraining, and to evaluate the classification accuracy of the network using Φ = 0.5 and Φ_i. The classification accuracy of the SRNN is defined as the percentage of tested sequences that are classified correctly. In order to obtain accurate measures of the classification accuracy, 10 independent networks
with randomly initialized weights were trained. For each class, 100 noisy sequences were generated using a fixed SNR and tested on each network. The classification accuracy was estimated by averaging the results across the 10 networks. Although both training methods led to network convergence, the combined classes training method was used because it generally led to faster convergence.

6.1. Retraining with noisy exemplars

Several investigations have focused on including various forms of synaptic noise to improve the generalization capability of an MLP [34-36]. While those investigations focused on adding noise to the interconnection weights of the network, the approach developed in this paper to improve the performance of an SRNN is similar to the re-training method described for the MLP in Ref. [37]. That is, an SRNN fully trained with noise-free exemplars is systematically re-trained with gradually increasing levels of noise in the training set in order to enhance the generalization capability of the network.

6.2. Network dimension

From Eqs. (1) and (2), it is clear that the output of the network at any given time n is a function of the externally applied input X(n) and the context input H(n−1). It can therefore be expected that the context which is fed back into the context inputs has an influence on the prediction capabilities of the network. The fed-back context is clearly a function of the dimension of the context input vector, which is the same as that of the context layer. Therefore, the effect of the context on the prediction capabilities can be investigated by studying the prediction error of the network as a function of the dimension of the context vector. This issue has been investigated empirically in several studies [26,29,38]; however, no general guidelines have been established to determine the number of context units for a given problem.
Given that the dimension of each component was 20, the dimensions of the input and output units in the SRNN were selected to be 20. Through a systematic set of preliminary prediction experiments, in which networks with the number of context units as a variable parameter were trained using noise-free sequences and tested using noisy sequences, it was found that a minimum dimension of 20 was required for convergence given a convergence threshold τ = 0.005. The network did not converge for this dimension when noisy sequences with SNR = 10 dB were used in training. The network did converge when the dimension was increased to 30; however, the prediction performance was poor. The dimension had to be increased further when higher levels of noise were added to the training sequences: forty context units were needed for convergence and satisfactory performance when the network was trained using noisy sequences with SNR = 6 dB. These preliminary experiments showed that, in order to aid network convergence, the dimension of the context vector must be increased if the SNR of the training set is decreased.

6.3. Classification experiments using Φ = 0.5 and Φ_i

The first set of experiments was designed to test the classification robustness of the network using Φ = 0.5 for the initial context vector. The dimension of the context vector was varied as 40, 45, and 50. The network was initially trained with the noise-free sequences and tested on noisy sequences with varying SNRs. The results are shown in row 1 of Tables 1-3, where the symbol '*' is used to denote the noise-free training case. Each classification accuracy result presented in the tables is an average computed across (4 × 100) × 10 = 4000 tests. The tables also show the number of epochs required for training convergence; an epoch is defined as a single presentation of the composite training vector during training. The SRNN was then systematically re-trained with gradually increasing levels of noise in the training set. The noise-free trained network was retrained with a low-noise (SNR = 10 dB) exemplar training set and was retested using the original set of noisy test sequences.
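The re-training protocol just described can be summarized as the following sketch, reusing the `add_noise` helper from Section 5; `network` and `train_fn` are placeholders for the SRNN weights and the backpropagation loop, which are not specified here.

```python
def retrain_schedule(network, train_fn, clean_exemplars, snrs_db=(10, 8, 6)):
    """Rows 1-4 of Tables 1-6: train on noise-free exemplars first, then
    keep re-training the same weights on exemplar sets of decreasing SNR."""
    network = train_fn(network, clean_exemplars)           # row 1: '*'
    for snr in snrs_db:                                    # rows 2-4
        noisy = [add_noise(G, snr) for G in clean_exemplars]
        network = train_fn(network, noisy)                 # weights carry over
    return network
```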
Table 1
Classification accuracy (%) using Φ = 0.5 with 40 context nodes; columns give the test SNR in dB

Training SNR (dB)     10       8        6        4        2        0        Epochs
*                     99.95    99.45    97.70    93.10    85.05    77.15     21 141
10                    99.90    99.80    99.40    98.30    95.60    89.70     46 620
8                     71.45    71.05    71.25    71.50    67.70    62.90     96 446
6                     62.20    60.10    59.45    58.80    53.15    53.55    124 283
Table 2
Classification accuracy (%) using Φ = 0.5 with 45 context nodes; columns give the test SNR in dB

Training SNR (dB)     10       8        6        4        2        0        Epochs
*                    100.00   100.00    99.20    96.40    89.25    81.10     16 185
10                   100.00   100.00    99.95    99.70    98.45    94.50     35 319
8                     99.80    99.55    99.10    98.60    95.45    90.65     55 445
6                     90.45    86.85    85.35    84.45    77.70    73.65     71 837
Table 3
Classification accuracy (%) using Φ = 0.5 with 50 context nodes; columns give the test SNR in dB

Training SNR (dB)     10       8        6        4        2        0        Epochs
*                     99.90    99.70    98.15    93.60    85.45    77.65     19 403
10                   100.00   100.00    99.95    99.75    97.85    93.40     35 771
8                     93.90    93.25    92.40    90.50    85.95    81.50     58 996
6                     70.85    69.75    70.40    69.95    68.75    66.35     71 051
Table 4
Classification accuracy (%) using Φ_i with 40 context nodes; columns give the test SNR in dB

Training SNR (dB)     10       8        6        4        2        0        Epochs
*                    100.00    99.90    98.95    96.55    89.05    81.75     24 562
10                   100.00   100.00    99.95    99.80    98.70    94.85     42 730
8                    100.00   100.00   100.00    99.80    98.55    94.85     59 772
6                    100.00   100.00   100.00    99.90    99.30    95.90     66 206
Table 5
Classification accuracy (%) using Φ_i with 45 context nodes; columns give the test SNR in dB

Training SNR (dB)     10       8        6        4        2        0        Epochs
*                    100.00    99.95    99.50    96.75    91.00    83.00     17 768
10                   100.00   100.00    99.85    99.40    97.70    94.30     38 159
8                    100.00   100.00    99.90    99.60    97.75    94.85     50 341
6                    100.00   100.00    99.95    99.95    99.10    95.40     56 950

These results are shown in row 2 of Tables 1-3. The network was re-trained further with a training exemplar set with a higher noise level (SNR = 8 dB) and retested on the same set of noisy test sequences; the results are shown in row 3 of Tables 1-3. Results for additional re-training using SNR = 6 dB and retesting are shown in row 4 of Tables 1-3.
The second set of experiments was conducted in exactly the same manner as the first set, except that Φ_i was used for the initial context vector. The training and test sequences were identical in both sets of experiments. The results using Φ_i are shown in Tables 4-6. From the noise-free training results it was observed that, for both Φ = 0.5 and Φ_i, the performance improved
Table 6
Classification accuracy (%) using Φ_i with 50 context nodes; columns give the test SNR in dB

Training SNR (dB)     10       8        6        4        2        0        Epochs
*                    100.00    99.80    99.20    96.55    89.45    81.65     17 682
10                   100.00   100.00    99.95    99.85    99.10    96.80     30 715
8                    100.00   100.00   100.00    99.95    99.40    97.35     38 642
6                    100.00   100.00   100.00   100.00    99.50    98.00     44 898
when the dimension of the context vector was increased from the minimum of 40 to 45. However, the performance did not improve when the dimension was increased to 50, because the input tended to have a smaller influence on the prediction when the context vector dimension was increased. That is, there is a trade-off between increasing the dimension to aid convergence and maintaining a smaller dimension for improved performance. It is also clear that the results using Φ_i are superior to those using Φ = 0.5 for the noise-free training case. The results using the systematic retraining method showed that the classification accuracy of the SRNN improved dramatically when the network was retrained with a low-noise (SNR = 10 dB) training set. There was a slight further improvement when the network was re-trained using higher-noise training sets with Φ_i; however, the performance dropped when Φ = 0.5 was used.
7. Conclusions

This paper focused on developing an approach to classify multi-class temporal sequences using the SRNN. By formulating the classification problem as a component prediction problem, the dimensions of the externally applied input vector and the output layer were not dependent on the number of components in the sequences. In order to simplify the classification of sequences with an unequal number of components, the sequences were circularly extended to have the same number of components as that of the longest sequence. Two training methods were outlined to train a single SRNN to predict the components of multi-class sequences. Issues related to the selection of the context vector dimension and the influence of the initial context vector on the classification accuracy were identified and investigated. The use of a different initial context vector for each class was proposed as a means to improve classification, and a classification rule which incorporates the different initial context vectors was formulated. A systematic re-training method was proposed to enhance the classification performance of the SRNN. Temporal sequences were derived from the localized contour sequence representations of four simulated objects. Through the methods developed it was shown that:

(a) A single SRNN can be trained to predict the components of multi-class sequences. Given that the four temporal sequences had several similar intra-class and inter-class components, each component was predicted with respect to its local context in a given sequence; therefore, each component was also classified with respect to its local context.
(b) The classification accuracy of the SRNN can be increased by using a distinguishing initial context vector instead of the standard Φ = 0.5.
(c) The classification accuracy of the SRNN can be improved significantly by combining the distinguishing initial context vector with the systematic re-training method.

In general, the method developed can be applied to sequences which can be segmented into a series of fixed-duration components. Because a single network is trained, it is likely that the time for convergence will increase with an increase in the number of classes as well as in the number of components in each class. It is also likely that the performance will deteriorate when the number of classes and the number of components in each class are increased. However, the SRNN, which is a partially recurrent network derived from the MLP, offers a viable alternative for classifying temporal sequences whose properties make them unsuitable for classification using the MLP.
References

[1] K.J. Lang, A.H. Waibel, A time-delay neural network architecture for isolated word recognition, Neural Networks 3 (1990) 23-43.
[2] A. Waibel, Modular construction of time delay neural networks for speech recognition, Neural Comput. 1 (1989) 39-46.
[3] J.L. Elman, D. Zipser, Learning the hidden structure of speech, J. Acoust. Soc. Amer. 83 (1988) 1615-1626.
[4] B.R. Kammerer, W.A. Kupper, Experiments for isolated-word recognition with single- and two-layer perceptrons, Neural Networks 3 (6) (1990) 693-706.
[5] W.Y. Chen, Y.F. Liao, S.H. Chen, Speech recognition with hierarchical recurrent neural networks, Pattern Recognition 28 (6) (1995) 795-805.
[6] L. Gupta, J. Wang, A. Charles, P. Kisatsky, Three-layer perceptron based classifiers for the partial shape classification problem, Pattern Recognition 27 (1) (1994) 91-97.
[7] L. Gupta, T. Sortrakul, A. Charles, P. Kisatsky, Robust automatic target recognition using a localized boundary representation, Pattern Recognition 28 (10) (1995) 1587-1598.
[8] N. Ayache, O.D. Faugeras, HYPER: A new approach for the recognition and positioning of two-dimensional objects, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-8 (1) (1986) 44-54.
[9] L. Gupta, K. Malakapalli, Robust partial shape classification using invariant breakpoints and dynamic alignment, Pattern Recognition 23 (10) (1990) 1103-1111.
[10] L. Gupta, A.M. Upadhye, Non-linear alignment of neural net outputs for partial shape classification, Pattern Recognition 24 (10) (1991) 943-948.
[11] T.F. Knoll, R.C. Jain, Recognizing partially visible objects using feature indexed hypotheses, IEEE J. Robotics Automat. RA-2 (1) (1986) 3-13.
[12] T.J. Sejnowski, C.R. Rosenberg, NETtalk: a parallel network that learns to read aloud, Technical Report JHU/EECS-86/01, Johns Hopkins University, Baltimore, January 1986.
[13] A.S. Lapedes, R. Faber, How neural networks work, in: Dana Z. Anderson (Ed.), Neural Information Processing Systems, American Institute of Physics, New York, 1988, pp. 442-456.
[14] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning internal representations by error propagation, in: D.E. Rumelhart, J.L. McClelland (Eds.), Parallel and Distributed Processing, Vol. I, MIT Press, Cambridge, MA, 1986 (Chapter 8).
[15] P.J. Werbos, Backpropagation through time: what it does and how it does it, Proceedings of the IEEE 78 (1990) 1550-1560.
[16] F.J. Pineda, Generalization of backpropagation to recurrent and higher order neural networks, in: Dana Z. Anderson (Ed.), Neural Information Processing Systems, American Institute of Physics, New York, 1988, pp. 602-611.
[17] L.B. Almeida, Backpropagation in perceptrons with feedback, in: R. Eckmiller, C. von der Malsburg (Eds.), Neural Computers, Springer, Berlin, 1988, pp. 199-208.
[18] B.A. Pearlmutter, Learning state space trajectories in recurrent neural networks, Neural Comput. 1 (2) (1989) 263-269.
[19] A.M. Loger, E.M. Corwin, W.J.B. Oldham, A comparison of recurrent neural network algorithms, Proceedings of the IEEE International Conference on Neural Networks, San Francisco, Vol. II, 1993, pp. 1129-1134.
[20] R.J. Williams, D. Zipser, A learning algorithm for continually running fully recurrent neural networks, Neural Comput. 1 (2) (1989) 270-280.
[21] M.I. Jordan, Supervised learning systems with excess degrees of freedom, Massachusetts Institute of Technology, COINS Technical Report 88-27, May 1988.
[22] R.F. Port, Representation and recognition of temporal patterns, Connection Sci. 1-2 (1990) 15-76.
[23] J.L. Elman, Finding structure in time, Cognitive Sci. 14 (1990) 179-211.
[24] J.L. Elman, Distributed representations, simple recurrent networks, and grammatical inference, Mach. Learning 7 (2/3) (1991) 19-121.
[25] A. Cleeremans, D. Servan-Schreiber, J.D. McClelland, Finite state automata and simple recurrent networks, Neural Comput. 1 (3) (1989) 372-381.
[26] J. Ghosh, V. Karamcheti, Sequence learning with recurrent networks: analysis of internal representations, Science of Artificial Neural Networks, SPIE Vol. 1710 (1992) 449-460.
[27] D. Wang, X. Liu, S.C. Ahalt, On temporal generalization of simple recurrent networks, Neural Networks 9 (7) (1996) 1099-1118.
[28] L. Gupta, M. McAvoy, Investigating the prediction capabilities of the simple recurrent neural network on temporal sequences, Pattern Recognition (1999), in press.
[29] R. Blake, Analysis of sequence prediction in recurrent neural networks, M.S. Thesis, Department of Electrical Engineering, Southern Illinois University, Carbondale, 1996.
[30] M.W. Koch, R.L. Kashyap, Using polygons to recognize and locate partially occluded objects, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-9 (4) (1987) 483-494.
[31] J.W. Gorman, O.R. Mitchell, F.P. Kuhl, Partial shape recognition using dynamic programming, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-10 (2) (1988) 257-266.
[32] H.C. Liu, M.D. Srinath, Partial shape classification using contour matching in distance transformation, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-12 (11) (1990) 1072-1079.
[33] J.L. Turney, T.N. Mudge, R.A. Volz, Recognizing partially occluded parts, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-7 (4) (1985) 410-421.
[34] S. Judd, P.W. Munro, Nets with unreliable hidden nodes learn error-correcting codes, in: S.J. Hanson, J.D. Cowan, C.L. Giles (Eds.), Advances in Neural Information Processing Systems, Vol. 5, Morgan Kaufmann, San Mateo, CA, 1993, pp. 89-96.
[35] A.F. Murray, P.J. Edwards, Enhanced MLP performance and fault tolerance resulting from synaptic weight noise during training, IEEE Trans. Neural Networks 5 (1994) 792-802.
[36] K.-C. Jim, C.L. Giles, B.G. Horne, An analysis of noise in recurrent neural networks: convergence and generalization, IEEE Trans. Neural Networks 7 (6) (1996) 1424-1438.
[37] L. Gupta, M.R. Sayeh, R. Tammana, A neural network approach to robust shape classification, Pattern Recognition 23 (6) (1990) 563-568.
[38] C.L. Giles, C. Omlin, Pruning recurrent neural networks for improved generalization performance, IEEE Trans. Neural Networks 5 (5) (1994) 848-851.
About the Author: LALIT GUPTA received the B.E. (Hons.) degree in electrical engineering from the Birla Institute of Technology and Science, Pilani, India (1976), the M.S. degree in digital systems from Brunel University, Middlesex, England (1981), and the Ph.D. degree in electrical engineering from Southern Methodist University, Dallas, Texas (1986). Since 1986, he has been with the Department of Electrical Engineering, Southern Illinois University at Carbondale, where he is currently an Associate Professor. His research interests include neural networks, computer vision, pattern recognition, and digital signal processing. Dr. Gupta serves as an associate editor for Pattern Recognition and is a member of the Pattern Recognition Society, the Institute of Electrical and Electronics Engineers, and the International Neural Network Society.

About the Author: MARK MCAVOY received the B.S. degree in electrical engineering from the University of Illinois at Urbana-Champaign (1991), the M.S. degree in electrical engineering from Southern Illinois University, Carbondale (1993), and the Ph.D. degree in engineering science (electrical engineering) from Southern Illinois University, Carbondale (1998). He is currently at the Neuro-Imaging Laboratory, Washington University, St. Louis. His research interests include neural networks, pattern recognition, digital signal processing, and electrophysiology.

About the Author: JAMES PHEGLEY received the A.S. degree from Lewis and Clark Community College, Godfrey, Illinois (1987), the B.S. degree in mechanical engineering from the University of Illinois at Urbana-Champaign (1989), the B.S. degree in electrical engineering from Southern Illinois University at Edwardsville (1993), and the M.S. degree in electrical engineering from Southern Illinois University at Edwardsville (1997). He is currently pursuing a Ph.D. degree in electrical engineering at Southern Illinois University at Carbondale. Mr. Phegley has worked in industry as a control systems engineer. His research interests include pattern recognition and digital signal processing. He is a member of the Institute of Electrical and Electronics Engineers, the Signal Processing Society, and Eta Kappa Nu.