Pattern Recognition 33 (2000) 1119–1133
A technique for image magnification using partitioned iterative function system

Suman K. Mitra, C.A. Murthy, Malay K. Kundu*
Machine Intelligence Unit, Indian Statistical Institute, 203, B.T. Road, Calcutta 700035, India

Received 9 September 1998; accepted 29 April 1999
Abstract

A new technique for image magnification using the theory of fractals is proposed. The technique is designed assuming the self-transformability property of images. In particular, the magnification task is performed using the fractal code of the image instead of the original one, resulting in a reduction in memory requirement. To generate the fractal codes, a Genetic Algorithm with an elitist model is used, which greatly reduces the search for self-similarities in the given image. The article presents both the theory and the implementation of the proposed method. A simple distortion measure and a similarity criterion based on just noticeable difference have also been proposed to judge the quality of the magnified image. A comparison with one of the most popular magnification techniques, the nearest-neighbor technique, is made. © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Image magnification; Iterated function system (IFS); Partitioned iterative function system (PIFS); Genetic Algorithms (GAs)
1. Introduction

Image magnification is ideally a process which virtually increases image resolution in order to highlight implicit information present in the original image that is not evident as such. It can be looked upon as a scale transformation. Image magnification is used in various applications such as matching of images captured by different sensors (having different capturing resolutions), satellite image analysis [1,2], medical image display, etc. Normally, the image is represented in the form of a two-dimensional array of pixel values (matrix form), and it requires a large memory space. The memory requirement for storage, or the bandwidth requirement for transmission, is greatly reduced when coding schemes are used. The actual requirement (memory/bandwidth) depends on the size of the image and the method used for coding.
* Corresponding author.
E-mail addresses: [email protected] (S.K. Mitra), [email protected] (C.A. Murthy), [email protected] (M.K. Kundu).
Conventional magnification is generally performed on an image represented in the form of a matrix (normal form). Before applying a magnification technique, any coded image has to be converted into normal form through a decoding process, which incurs some computational cost. So it would be beneficial if the magnification could be done during decoding itself. Moreover, the bandwidth requirement of an image transmission system would be reduced further if an image of smaller size were transmitted and a magnified version generated at the receiving end. With this problem in mind, an attempt is made to propose a new magnification technique which can be applied directly to the coded version of the image. Fractal image coding is one of the efficient approximate image coding techniques currently available. In image coding, the reconstructed image produced is usually subjectively very close to the original image. The codes of an image must implicitly carry all the spatial information associated with the image. Besides the spatial information, the fractal codes carry information about the self-similarities present in the image. This self-similarity property is also exploited in the proposed image magnification technique, which we call the fractal image magnification technique.
Fractal geometry has recently come into the limelight due to its uses in various scientific and technological applications, especially in the field of computer-based image processing. It is being successfully used for image data representation [3,4] and as an image processing tool [5,6]. In this connection, the use of the iterated function system (IFS) and the Collage theorem [3] has shown a remarkable improvement in the quality of processing compared to that obtained using existing image processing techniques. A fully automated fractal image compression scheme for digital images, known as the partitioned iterative function system (PIFS), was first proposed by Jacquin [7]. The basic idea of fractal image compression, i.e., of finding the fractal codes of an image, is to approximate small blocks of the given image, called range blocks, from large blocks of the same image, called domain blocks. Thus, to find the fractal codes for a given image, a mathematical transformation for each range block is to be found which, when applied to an appropriate domain block, gives rise to an approximation of the range block. This set of transformations, obtained by partitioning the whole image, is called a partitioned iterative function system (PIFS). In this scheme the self-similarity of the image blocks is obtained locally, so the scheme is also known as a local iterative function system [8]. Several researchers have suggested different algorithms with different motivations to obtain the PIFS for a given image. We have already suggested a faster algorithm to obtain a PIFS using genetic algorithms (GAs) [9,10]. GAs [11–13] are optimization algorithms modeled on biological evolutionary processes; they reduce the search space and time significantly. In the present work an attempt is made to use fractal codes as the input to an image magnification system. Some of the popular techniques for digital magnification of images are nearest-neighbor, bilinear and bicubic interpolation. All these techniques are based on surface interpolation. In interpolation techniques, global information is usually ignored; local or semi-global information is generally exploited. In the proposed scheme, the magnification task is performed using fractal codes, where both local and semi-global information are used. The scheme is essentially a decoding scheme for fractal codes which gives rise to a magnified version of the original image. The article reports initial results of magnification by factors that are multiples of two. The technique uses fractal codes obtained by a GA-based technique [9,10]. A comparison with the nearest-neighbor image magnification method is also reported. In magnification techniques, distortion due to blocking, which is a local phenomenon, is very common. The most widely used distortion measure is the mean squared error (MSE) or some
other form of it. MSE is a global measure which fails to properly account for the local distortion due to blocking, yet blocking effects are highly visible to the human visual system. So, to quantify the global and the local distortions simultaneously, a new distortion measure (fidelity criterion) is introduced. In the process of magnification, the magnified image should be visually similar to the original one. Besides visual judgment, we propose here a similarity criterion based on just noticeable difference (JND). As the sizes of the magnified image and the original image are different, the similarity between them cannot be judged by inspecting the pixel values alone; hence the JND-based scheme is proposed. The theory and key features of IFS, magnification using PIFS, and GAs are outlined in Section 2. The methodology of using fractal codes for magnification of a given image is described in Section 3. A new fidelity criterion to judge the performance of the proposed algorithm is discussed in Section 4. The JND-based similarity criterion is discussed in Section 5. Section 6 presents the implementation and the results. Discussion and conclusions are provided in Section 7.
2. Theory and basic principles

The detailed mathematical description of the IFS theory, the Collage theorem and other relevant results are available in [3,14,15]. Only the salient features are discussed here. The theory of IFS in image coding and of PIFS in image magnification is described in the following subsections. The basic principle of Genetic Algorithms is also described.

2.1. Theoretical foundation of IFS

Let I be a given image which belongs to the set X. Generally, X is taken as a collection of compact sets. Our intention is to find a set F of affine contractive maps for which the given image I is an approximate fixed point. The fixed point, or attractor, A of the set of maps F is defined by

$$\lim_{n \to \infty} F^n(J) = A, \quad \forall J \in X, \qquad \text{and} \qquad F(A) = A,$$

where $F^n(J)$ is defined as $F^n(J) = F(F^{n-1}(J))$, with $F^1(J) = F(J)$, $\forall J \in X$. Also, the set of maps F is contractive, i.e.,

$$d(F(J_1), F(J_2)) \le s\, d(J_1, J_2), \quad \forall J_1, J_2 \in X \ \text{and} \ 0 \le s < 1. \tag{1}$$
Here d is called the distance measure and s is called the contractivity factor of F. Let

$$d(I, F(I)) \le \epsilon, \tag{2}$$

where $\epsilon$ is a small positive quantity. Now, by the Collage theorem [3], it can be shown that

$$d(I, A) \le \frac{\epsilon}{1 - s}. \tag{3}$$

Here, (X, F) is called an iterated function system and F is called the set of fractal codes for the given image I.
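As a concrete toy illustration of Eqs. (1)–(3), not taken from the paper, the following sketch iterates a set of affine contractive maps (the classic Sierpinski maps, chosen here purely as an example with contractivity s = 0.5) from an arbitrary starting point; the iterates approach the attractor A regardless of the initial set, which is exactly the fixed-point property the coding scheme relies on.

```python
import random

# Three affine contractive maps (contractivity s = 0.5); their unique
# fixed point A is the Sierpinski triangle.
MAPS = [
    lambda p: (0.5 * p[0],        0.5 * p[1]),
    lambda p: (0.5 * p[0] + 0.5,  0.5 * p[1]),
    lambda p: (0.5 * p[0] + 0.25, 0.5 * p[1] + 0.5),
]

def iterate_ifs(n_points=5000, n_iter=50, seed=0):
    """Approximate the attractor by the 'chaos game': repeatedly apply
    randomly chosen maps; after enough iterations the orbit lies close
    to A irrespective of the starting point J."""
    rng = random.Random(seed)
    points = []
    p = (rng.random(), rng.random())   # arbitrary starting point J in X
    for _ in range(n_points):
        for _ in range(n_iter):
            p = rng.choice(MAPS)(p)
        points.append(p)
    return points

attractor = iterate_ifs()
print(len(attractor), "points near the attractor A")
```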
2.1.1. Image coding using PIFS

Let I be a given image of size $w \times w$ whose gray level values lie in the range [0, g]; thus the given image I is a subset of $\mathbb{R}^3$. The image is partitioned into n nonoverlapping squares of size $b \times b$, and this partition is represented by $R = \{R_1, R_2, \ldots, R_n\}$. Each $R_i$ is called a range block, where $n = (w/b) \times (w/b)$. Let D be the collection of all possible blocks of size $2b \times 2b$ within the image support, and let $D = \{D_1, D_2, \ldots, D_m\}$. Each $D_j$ is called a domain block, with $m = (w - 2b) \times (w - 2b)$. Let

$$F_j = \{ f : D_j \to \mathbb{R}^3;\ f \text{ is an affine contractive map} \}.$$

Now, for a given range block $R_i$, let $f_{ij} \in F_j$ be such that

$$d(R_i, f_{ij}(D_j)) \le d(R_i, f(D_j)) \quad \forall f \in F_j,\ \forall j.$$

Now let k be such that

$$d(R_i, f_{ik}(D_k)) = \min_j \{ d(R_i, f_{ij}(D_j)) \}. \tag{4}$$

Also, let $f_{ik}(D_k) = \hat{R}_{ik}$. Our aim is to find $\hat{R}_{ik}$ for each $i \in \{1, 2, \ldots, n\}$; in other words, for each range block $R_i$ we are to find the appropriately matched domain block $D_k$ and the appropriately matched map $f_{ik}$. Thus $W_i = \{D_k, f_{ik}\}$ is called the fractal code for $R_i$, and the set $W = \{W_i,\ i = 1(1)n\}$ is called the PIFS of the given image I. Fig. 1 illustrates the mapping of domain blocks to range blocks.

Fig. 1. Mapping for a PIFS scheme.
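Eq. (4) amounts to an exhaustive search. As a hedged sketch (ours, not the authors' implementation), restricting each $F_j$ to the eight block isometries followed by an affine gray-level map, the best code for one range block can be found as below; it anticipates the contraction-by-averaging and least-squares fit described in Section 2.2.

```python
import numpy as np

# The eight isometries of a square block used in PIFS coding:
# identity, three rotations, two mirror flips, transpose, anti-transpose.
ISOMETRIES = [
    lambda B: B, lambda B: np.rot90(B, 1),
    lambda B: np.rot90(B, 2), lambda B: np.rot90(B, 3),
    lambda B: np.fliplr(B), lambda B: np.flipud(B),
    lambda B: B.T, lambda B: np.rot90(B, 2).T,
]

def contract(D):
    """Average 2x2 neighborhoods so a 2b x 2b domain block matches a
    b x b range block pixel for pixel (first technique of Section 2.2)."""
    return 0.25 * (D[0::2, 0::2] + D[1::2, 0::2]
                   + D[0::2, 1::2] + D[1::2, 1::2])

def best_code(image, R, b):
    """Exhaustive version of Eq. (4) for the range block R: try every
    domain location and isometry, fit gray levels by least squares,
    and keep the minimum-distance candidate."""
    w = image.shape[0]
    r = R.ravel()
    best = None
    for row in range(w - 2 * b):
        for col in range(w - 2 * b):
            Dc = contract(image[row:row + 2 * b, col:col + 2 * b])
            for iso, t in enumerate(ISOMETRIES):
                d = t(Dc).ravel()
                s, o = (np.polyfit(d, r, 1) if d.std() > 0
                        else (0.0, r.mean()))
                err = float(np.mean((r - (s * d + o)) ** 2))
                if best is None or err < best[0]:
                    best = (err, row, col, iso, s, o)
    return best  # (mse, row, col, isometry, scale, offset)
```

The cubic cost of this loop is precisely what motivates the GA-based search of Section 3.2.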
2.2. Image magnification using PIFS

The affine contractive transformation $f_{ik}$ is constructed using the fact that the gray values of the range block are a scaled and shifted version of the gray values of the domain block. The transformation $f_{ij}$, defined on $\mathbb{R}^3$, is such that $f_{ij}(D_j)$ approximates $R_i$. $f_{ij}$ consists of two parts, one for the spatial information and the other for the information of gray values. The first part indicates which pixel of the range block corresponds to which pixel of the domain block. The second part finds the scaling and shift parameters for mapping the set of pixel values of the domain block to the range block. The first part denotes shuffling the pixel positions of the domain block and can be achieved by using any one of the eight transformations (isometries) on the domain blocks [7,9]. Once the first part is fixed, the second part is an estimation of the set of gray values of the range block from the set of values of the transformed domain block; these estimates can be obtained by least-square analysis of the two sets of values [9,10]. Moreover, the size of the domain blocks is double that of the range blocks, but the least-square (straight-line) fitting needs point-to-point correspondence. To overcome this, one constructs contracted domain blocks such that the number of pixels in a contracted domain block equals that of a range block. The contracted domain blocks are obtained by one of two techniques. In the first technique, the average values of four neighboring pixels in a domain block are taken as the pixel values of the contracted domain block [7]. In the other scheme, contracted domain blocks are constructed by taking pixel values from every alternate row and column of the domain blocks [9,10]. In the present article we have adopted the first one. Thus $f_{ik}$ can be looked upon as a composition of two transformations, $f_{ik} = t_{ik} C$, where C is the contraction operation and $t_{ik}$ is the transformation for rows, columns and gray values. Here we have $I = \bigcup_{i=1}^{n} R_i$, and using Eq. (2) we have

$$d\Bigl(\bigcup_{i=1}^{n} R_i,\ \bigcup_{i=1}^{n} \hat{R}_{ik}\Bigr) \le \epsilon. \tag{5}$$

Now, let M be a magnification operator such that

$$d\Bigl(\bigcup_{i=1}^{n} R_i,\ \bigcup_{i=1}^{n} M(R_i)\Bigr) \le \epsilon_1, \tag{6}$$

where $\epsilon_1$ is a small positive quantity. Now by Eqs. (5) and (6) we have

$$d\Bigl(\bigcup_{i=1}^{n} R_i,\ \bigcup_{i=1}^{n} M(\hat{R}_{ik})\Bigr) \le \epsilon_2, \tag{7}$$
where $\epsilon_2$ is a small positive quantity. Again, we have $\hat{R}_{ik} = f_{ik}(D_k) = t_{ik} C(D_k)$. So,

$$d\Bigl(\bigcup_{i=1}^{n} R_i,\ \bigcup_{i=1}^{n} M t_{ik} C(D_k)\Bigr) \le \epsilon_2. \tag{8}$$

Now, reconstruction of images using the operator M should be an inverse of the contraction operation using the operator C. So, by Eqs. (7) and (8),

$$d(M(\hat{R}_{ik}),\ t_{ik}(D_k)) \le \epsilon_3. \tag{9}$$

Hence, by Eq. (7),

$$d\Bigl(\bigcup_{i=1}^{n} R_i,\ \bigcup_{i=1}^{n} t_{ik}(D_k)\Bigr) \le \epsilon_4. \tag{10}$$

Both $\epsilon_3$ and $\epsilon_4$ are small positive quantities. Thus, from Eq. (10) it is clear that there is no need to construct the magnification operator M explicitly: only the second part of the fractal codes has to be applied to the (uncontracted) domain blocks to get an image which is very close to the given image I but of double its size. A sketch of this decoding step is given below.
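To make the consequence of Eq. (10) concrete, here is a minimal sketch (our own illustration, not the authors' code) of how a stored code per range block, consisting of a domain-block location, one of the eight isometries, and least-squares scale/offset (s, o), could be replayed against the full-size domain blocks to emit a 2x magnified image. The tuple layout of the codes is our assumption; `ISOMETRIES` is the table from the sketch in Section 2.1.1.

```python
import numpy as np

def magnify2x(image, codes, b=2):
    """Replay stored codes against the *uncontracted* domain blocks.
    codes: one tuple (row, col, iso, s, o) per b x b range block, in
    row-major order (field layout is our assumption). Each code fills a
    2b x 2b output block, so a w x w input yields a 2w x 2w image,
    as Eq. (10) suggests."""
    w = image.shape[0]
    out = np.zeros((2 * w, 2 * w))
    per_row = w // b
    for idx, (row, col, iso, s, o) in enumerate(codes):
        i, j = divmod(idx, per_row)
        D = image[row:row + 2 * b, col:col + 2 * b]
        block = s * ISOMETRIES[iso](D) + o      # t_ik applied without C
        out[2*b*i:2*b*(i+1), 2*b*j:2*b*(j+1)] = np.clip(block, 0, 255)
    return out
```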
2.3. Genetic Algorithms

Genetic Algorithms (GAs) [11–13] are highly parallel and adaptive search and machine learning processes based on the natural selection mechanism of biological genetic systems. The parallelism of GAs depends upon the machine used for the computations. GAs help to find a globally optimal solution without getting stuck at local optima, as they deal with multiple points (called chromosomes) simultaneously. To solve an optimization problem, GAs start with a chromosomal (structural) representation of the parameter set. The parameter set is coded as a string of finite length, called a chromosome or simply a string. Usually, the chromosomes are strings of 0's and 1's; if the length of a string is l, the total number of possible chromosomes is $2^l$. To find a near-optimal solution, three basic genetic operators are exploited in GAs: (i) selection, (ii) crossover and (iii) mutation. In the selection procedure, the objective (fitness) function value of each individual string determines its selection into the next mating pool. We have used the elitist model of GAs, in which the worst string in the present generation is replaced by the best string of the previous generation. The crossover operation on a pair of strings emulates the mating process of a natural genetic system. This process occurs very often in natural genetic systems, and thus a high probability is assigned to the occurrence of this operation. In the mutation operation, every bit of every string is replaced by the complementary character (i.e. 0 by 1 and 1 by 0) with some probability. Usually, a low probability is assigned to the mutation operation, and its occurrence is guided by this probability. We have used a varying mutation probability scheme [16] to guide the mutation operation in the present work. Starting from the initial population (of strings), a new population is created using the three genetic operators described above; this entire process is called an iteration. In GAs, a considerable number of iterations are performed to find the optimal solution. The string which possesses the optimal fitness value among all the strings is called the optimal string. The optimality of the fitness value is problem dependent: for a minimization problem, the lowest fitness value is taken as optimal, and for a maximization problem, the maximum fitness value is selected. The convergence of GAs to an optimal solution, as the number of iterations increases, is assured [17]. The methodologies to obtain magnified images from fractal codes are discussed below.
3. Methodology

So far we have discussed how to apply the fractal codes, or PIFS, to get a magnified image which is double the size of the given image. On successive applications of the proposed algorithm, magnification by factors 4, 8, 16, etc. can also be achieved. But the first task is to obtain the fractal codes, or PIFS, for a given image.

3.1. Construction of PIFS for magnification

The size of the range blocks plays an important role in image compression as well as magnification. If small blocks are taken, the finer details of the image are preserved and restored in the decompressed image, but the compression ratio will be lower. On the other hand, more compression will be achieved, at the cost of finer image details, if larger range blocks are considered. Thus a trade-off has to be made to get a good-quality decompressed image as well as a considerable amount of compression. But the main task in magnification is only to restore all the image details, and almost no emphasis is given to the amount of compression achieved. So, in this case, small range blocks are considered to keep track of every minute detail of the original image. In the proposed algorithm, to obtain the fractal codes of small range blocks of a given image, the blocks are first classified into two groups using a simple classification scheme [10]. The groups are formed according to the variability of the pixel values in the blocks. If the variability of a block is low, i.e., if the variance of the pixel values in the block is below a fixed value, called the threshold, we call the block a smooth-type range block; otherwise, we call it a rough-type range block. The threshold value separating the two types of range blocks is obtained from
the valley in the histogram of the variances of the pixel values of the blocks. All the pixel values in a smooth-type range block are replaced by the mean of its pixel values, so it is enough to store only the discretized mean value. On the other hand, for each rough-type range block, the appropriately matched domain block as well as the appropriately matched transformation among the eight possible isometric transformations [7] have to be searched out. To solve this search problem, a GA-based technique [9,10] is adopted; the GA finds a near-optimal solution much faster than the exhaustive search technique [10].

3.2. GA to find PIFS

The parameters to be searched using the GA are the location (starting row and starting column) of a domain block and its eight possible isometric transformations [7]. The first is realized as two integer values between 1 and w−2b, and the second can take any value between 1 and 8. Binary strings of length l are introduced to represent the parameter set. Here l is chosen in such a way that the set of $2^l$ binary strings exhausts the whole parameter space. A string thus encodes the location and the isometric transformation of a domain block. In fractal coding we are to find an appropriate domain block and an appropriate transformation for each range block; in other words, we are to find the appropriate string for each range block. Out of the $2^l$ strings, a few are selected randomly to start the GA. Starting with the initial mating pool and using the three basic operations, a new population is generated in each iteration of the GA. After a large number of iterations, the GA will produce a near globally optimal solution. To obtain the appropriate string, we calculate the fitness of each string in each iteration; the mean squared error (MSE) is used as the fitness function of a string. In each mating pool, the strings first undergo the crossover operation pairwise, and the mutation operation is then applied to each bit of each string. Using the fractal codes, or PIFS, obtained by the GA, a magnification of order two is achieved. A sketch of the chromosome decoding and fitness evaluation follows; the next subsection then describes the technique for successive magnification.
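The following sketch (ours, with assumed bit widths and a modulo mapping into the valid range) shows how such a chromosome could encode the search parameters and how the MSE fitness of Section 3.2 could be evaluated for one rough range block; `contract` and `ISOMETRIES` are taken from the sketch in Section 2.1.1.

```python
import numpy as np

def decode_string(bits, w, b):
    """Split a binary chromosome into (row, col, isometry index).
    Bit widths are an assumption: n bits per coordinate, 3 bits for
    the isometry; out-of-range values are folded back by modulo."""
    n = (w - 2 * b - 1).bit_length()
    row = int(bits[:n], 2) % (w - 2 * b)
    col = int(bits[n:2 * n], 2) % (w - 2 * b)
    iso = int(bits[2 * n:2 * n + 3], 2)
    return row, col, iso

def fitness(bits, image, R, b=2):
    """MSE (to be minimized) of the candidate the chromosome encodes:
    contract the named domain block, apply the named isometry, fit the
    gray levels by least squares, and compare with the range block R."""
    row, col, iso = decode_string(bits, image.shape[0], b)
    d = ISOMETRIES[iso](contract(image[row:row + 2*b,
                                       col:col + 2*b])).ravel()
    r = R.ravel()
    s, o = np.polyfit(d, r, 1) if d.std() > 0 else (0.0, r.mean())
    return float(np.mean((r - (s * d + o)) ** 2))
```

A standard elitist GA then drives these two routines: strings are selected into the mating pool according to fitness, crossed over pairwise with high probability, mutated bitwise with a low (here varying [16]) probability, and the best string of each generation replaces the worst of the next.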
3.3. Successive magnification

In the case of successive applications of the algorithm, i.e., to obtain magnification of order more than two, the fractal codes need not be computed afresh. The fractal codes used in a step (of order a multiple of two) are obtained by modifying the fractal codes obtained in the previous step. In particular, the transformations $t_{ik}$ are modified using the image already obtained in the previous step. The locations of the appropriately matched domain blocks are kept fixed in all the steps; only the size of the domain blocks is increased in the modification process. Thus, in particular, only the gray level transformation in $t_{ik}$ is to be modified. The gray level transformations are obtained using the least-square technique, in which a straight line is fitted to two sets of gray level values, one from the range block and the other from the contracted domain block. In the successive magnification scheme these two sets are enlarged. The enlarged sets are divided into several parts, and a separate straight line is fitted to each part using the same least-square technique. The sets are divided into 4 parts for magnification by a factor of 4, into 16 parts for magnification by a factor of 8, and so on. So the number of fractal codes in a step becomes 4 times larger than its counterpart in the previous step. It is therefore enough to find the fractal codes for magnification by a factor of 2; in the other cases these codes are modified accordingly, and the modified codes are then used for achieving magnification greater than two. The next subsection discusses magnification by any order.

3.4. Magnification by any order

In this article, the fractal image magnification algorithm is implemented with magnification factors which are multiples of 2. In practice, however, one may need to magnify the given image by other factors. One way of performing magnification of order k is to select domain blocks which are k times larger than the range blocks. But if k is large, i.e. if the size of the domain blocks is very large compared to that of the range blocks, the similarity patterns between range blocks and domain blocks will not be observed, and true magnification will not be possible. To avoid this, magnifications of order 4, 8 and 16 are obtained by modifying the fractal codes of range block size 2×2 and using the magnified image obtained in the previous step (e.g. magnification of order 4 is obtained using the twice-magnified image and the fractal codes). Similarly, to obtain magnification of order 3, one has to consider domain blocks which are three times larger than the range blocks; by modifying these codes, magnification of orders which are multiples of three can be obtained. On the other hand, magnification by factors which are not multiples of two or three can be achieved by considering the normalized distance between the range blocks and their matched domain blocks in the original image. The normalized distances are stored in the PIFS codes for each range block. Now, to achieve magnification of order k from the PIFS codes, the location of the matched domain block for each range block may be obtained by multiplying the normalized distances by k. Once the matched domain block is fixed, the rest is the same as magnification by a factor of two. While performing
the magnification task, if the distance between the range block and the domain block turns out to be fractional, one has to discretize it. The PIFS codes introduce some loss of information in the reconstructed image, so the quality of the magnified image will decrease with increasing k. This problem can be handled by splitting the magnification-by-k task into several steps and modifying the PIFS codes in each step. Though some loss would be incurred, the overall gain is in terms of storage requirements, since the codes, instead of the original image, are used for the task. In the next section, we discuss the evaluation criteria used to judge the performance of the proposed algorithm.
4. Fidelity criterion

It is necessary to judge the performance of the proposed fractal-based image magnification algorithm in order to understand its properties and usefulness. For this purpose one has to measure the distortion between the given image and the reconstructed image. The most widely used distortion measure is the peak signal-to-noise ratio (PSNR), which is a function of the mean squared error (MSE). MSE or PSNR examines the similarity between two images. But MSE is a size-dependent measure, i.e., the two images under consideration must have the same size. Moreover, it is a global measure, being the average of pixel-to-pixel differences. It does not accurately indicate large and significant local distortions due to blocking or blurring [20], as it deals with average distortions; yet blocking effects are highly visible to the human visual system. So one has to think of a size-independent measure which reflects local as well as global distortions and judges the performance of the magnification methodology. A new fidelity criterion, whose behavior is also similar to that of visual judgment, is introduced to find the distortion between the given image and the magnified reconstructed image. The overall performance of the proposed algorithm is evaluated using this new distortion measure.

4.1. Edge-based distortion measure

The images obtained from codes usually exhibit specific artifacts such as blocking, ringing and blurring. These artifacts are reflected most prominently in the high-frequency component of the image and are very noticeable to the human visual system [18]. So, in our proposed distortion measure, we have tried to measure the dissimilarities in the edge pattern. For simplicity, only the vertical and horizontal edges have been considered.
Both images are first partitioned into blocks proportional to their respective sizes, in such a way that both images contain an equal number of blocks. The error is then measured blockwise, and finally the average error is computed. To detect the edges of each block we use the scheme suggested by Ramamurthi et al. [19]. The edge blocks consist of values 0 and 1, where 1 represents the presence of an edge. Now, it is expected that the original and the magnified blocks should have the same type of edge distribution; in other words, the expected run of 1's present in both blocks should be the same when normalized by their respective sizes. Thus the distortion measure is defined by the difference between the normalized expected run of 1's in the given image and that in the magnified reconstructed image. The vertical and horizontal edges are considered separately and then averaged to give the final distortion measure of a block. The algorithm of the proposed distortion criterion is described below.

4.1.1. Description of the algorithm

Step 1: Partition the images $I_1$ and $I_2$ (with the size of $I_1$ less than the size of $I_2$) into square blocks such that the number of partitions is the same in both images. Let $p_1$ and $p_2$ (with $p_1 < p_2$) be the sizes of the square blocks for the images $I_1$ and $I_2$, respectively. Let these blocks be $B_{11}, B_{12}, \ldots, B_{1n}$ and $B_{21}, B_{22}, \ldots, B_{2n}$.

Step 2: From $B_{ij}$, $i = 1, 2$ and $j = 1, 2, \ldots, n$, compute the gradient matrices, or edge images. Let $G^h_{ij}$ and $G^v_{ij}$ be, respectively, the horizontal and the vertical gradient matrices. The elements of the gradient matrices are all either 0 or 1, and they are defined as follows:

$$G^h_{ij}(m, n) = \begin{cases} 0 & \text{if } \dfrac{|g_{mn} - g_{m,n+1}|}{(g_{mn} + g_{m,n+1})/2} < T,\\[2mm] 1 & \text{otherwise}, \end{cases}$$

$$G^v_{ij}(m, n) = \begin{cases} 0 & \text{if } \dfrac{|g_{mn} - g_{m+1,n}|}{(g_{mn} + g_{m+1,n})/2} < T,\\[2mm] 1 & \text{otherwise}. \end{cases}$$

Here $g_{mn}$ is the gray level value of the (m, n)th pixel in a block, and T is a prefixed threshold value.

Step 3: Find the expected run of 1's, in both horizontal and vertical directions, in both $G^h_{ij}$ and $G^v_{ij}$. Let L be the random variable denoting the length of a run of 1's in a particular gradient matrix in a particular direction. Compute $E^{hh}_{ij}(L)$, $E^{hv}_{ij}(L)$, $E^{vh}_{ij}(L)$, $E^{vv}_{ij}(L)$. Here the expected run of 1's is defined as

$$E(L) = \sum_k k \cdot \frac{\text{number of times a run of length } k \text{ appears}}{\text{total number of runs (of all possible lengths) present}}.$$
Now compute

$$E_{ij}(L) = \frac{E^{hh}_{ij}(L) + E^{hv}_{ij}(L) + E^{vh}_{ij}(L) + E^{vv}_{ij}(L)}{4}.$$

Step 4: Normalize $E_{ij}(L)$ by the respective block size:

$$\bar{E}_{ij}(L) = \frac{E_{ij}(L)}{p_1 \times p_1} \ \text{if } i = 1, \qquad \bar{E}_{ij}(L) = \frac{E_{ij}(L)}{p_2 \times p_2} \ \text{if } i = 2.$$

Step 5: Compute the final error measure E between the two given images $I_1$ and $I_2$, defined as

$$E = \frac{1}{n} \sum_{j=1}^{n} \bigl| \bar{E}_{1j}(L) - \bar{E}_{2j}(L) \bigr|.$$

A compact sketch of Steps 1–5 follows; the next section then discusses the JND-based similarity criterion.
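This is our own rendering of the algorithm; the block count, the threshold T and the run-counting details follow the description above, but any unstated choices (e.g. the default values) are assumptions.

```python
import numpy as np

def gradient_matrices(block, T=0.2):
    """Step 2: binary horizontal/vertical edge maps. A pixel pair is an
    edge (1) when its relative difference reaches the threshold T."""
    b = block.astype(float) + 1e-9          # avoid division by zero
    gh = (np.abs(b[:, :-1] - b[:, 1:]) / ((b[:, :-1] + b[:, 1:]) / 2)) >= T
    gv = (np.abs(b[:-1, :] - b[1:, :]) / ((b[:-1, :] + b[1:, :]) / 2)) >= T
    return gh.astype(int), gv.astype(int)

def expected_run(G, axis):
    """Step 3: E(L) = sum_k k * (#runs of length k) / (#runs), counting
    runs of 1's along rows (axis=0) or columns (axis=1)."""
    runs, length = [], 0
    for line in (G if axis == 0 else G.T):
        for v in list(line) + [0]:          # sentinel closes trailing runs
            if v:
                length += 1
            elif length:
                runs.append(length); length = 0
    return sum(runs) / len(runs) if runs else 0.0

def block_energies(I, n_side):
    """Steps 1-4: per-block average of the four expected runs,
    normalized by the block area."""
    p = I.shape[0] // n_side
    E = np.zeros((n_side, n_side))
    for bi in range(n_side):
        for bj in range(n_side):
            blk = I[bi*p:(bi+1)*p, bj*p:(bj+1)*p]
            gh, gv = gradient_matrices(blk)
            e = (expected_run(gh, 0) + expected_run(gh, 1)
                 + expected_run(gv, 0) + expected_run(gv, 1)) / 4
            E[bi, bj] = e / (p * p)
    return E

def edge_distortion(I1, I2, n_side=8):
    """Step 5: mean absolute blockwise difference of normalized runs."""
    return float(np.mean(np.abs(block_energies(I1, n_side)
                                - block_energies(I2, n_side))))
```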
5. JND-based similarity criterion

Just noticeable difference (JND) measures the amount of change in the gray value of a pixel in comparison with its surrounding pixels. Usually, JND is used to evaluate the edges present in an image [22]. Here we propose a similarity measure based on JND to judge the similarity between two images of unequal size; in particular, the proposed measure judges the performance of the proposed PIFS-based image magnification technique. JND is basically the difference in contrast of an object from its background, and it plays an important role in the human visual system. The human visual system (HVS) model [20] deals mainly with three factors: the luminance level, the spatial frequency and the signal content. Of these, the perceived luminance is a nonlinear function of the incident luminance. According to Weber's law [21], if the luminance $L + \Delta L$ of an object is just noticeably different from its background luminance L, then $\Delta L / L = \text{constant}$. Therefore, the just noticeable difference $\Delta L$ increases with increasing L. In the present case, we have developed a criterion based on JND to judge the similarity between two images which are unequal in size. First, the average values of $\Delta L$ are computed from both images; next, the percentage of similarity is computed using these average values. We call this percentage of similarity based on JND the JND similarity. Note that JND similarity → 100 implies complete similarity between the two images. The computation of L and $\Delta L$ is carried out as described in Ref. [22].
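The paper defers the computation of L and ΔL to Ref. [22]; the sketch below is therefore only our guess at the spirit of the measure, using a local mean for L, an absolute deviation for ΔL, and a ratio of the two image averages as the similarity percentage. All of these choices are assumptions, not the authors' definitions.

```python
import numpy as np

def average_delta_L(I, r=1):
    """Assumed proxy: for each pixel, L = mean of its (2r+1)x(2r+1)
    neighborhood and Delta L = |pixel - L|; return the image-wide
    average of Delta L."""
    I = I.astype(float)
    pad = np.pad(I, r, mode='edge')
    h, w = I.shape
    L = np.zeros_like(I)
    for di in range(2 * r + 1):          # brute-force local mean
        for dj in range(2 * r + 1):
            L += pad[di:di + h, dj:dj + w]
    L /= (2 * r + 1) ** 2
    return float(np.mean(np.abs(I - L)))

def jnd_similarity(I1, I2):
    """Percentage similarity from the two average Delta L values;
    100 means the averages coincide (assumed form of the measure)."""
    a, b = average_delta_L(I1), average_delta_L(I2)
    return 100.0 * min(a, b) / max(a, b) if max(a, b) > 0 else 100.0
```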
Table 1
The results obtained in terms of distortion of the image magnification algorithms

         MF = 2             MF = 4             MF = 8
         Distortion         Distortion         Distortion
Image    Fractal    NN      Fractal    NN      Fractal    NN
Lena     1.18       1.62    1.37       3.11    1.24       6.15
LFA      2.32       2.37    2.85       4.59    2.43       9.11

MF = magnification factor; NN = nearest-neighbor.
Table 2
The results obtained in terms of similarity of the image magnification algorithms

         MF = 2             MF = 4             MF = 8
         Similarity (%)     Similarity (%)     Similarity (%)
Image    Fractal    NN      Fractal    NN      Fractal    NN
Lena     63.47      64.39   57.75      49.06   54.50      44.38
LFA      82.05      68.22   72.46      51.83   64.40      47.03

MF = magnification factor; NN = nearest-neighbor.
6. Implementation and results

To find the fractal codes for a given image, the search must be made over all possible domain blocks as well as the eight possible isometric transformations [7]. To reduce the search space and time, a Genetic Algorithm is used instead of exhaustive search. The search space reduction is achieved because near-optimal solutions are usually satisfactory and, intuitively, solutions far from the expected one are rejected in a probabilistic manner; this is why GAs perform well in optimization problems. The good performance of GAs in finding the fractal codes of a given image has already been shown by Mitra et al. [10]; the results are quite satisfactory, and at least a 20-fold reduction of the search space is achieved. For the specific implementation of the proposed algorithm, a part of the original "Lena" image (Fig. 2) is treated as the original input image. The input image is a 128×128, 8 bit/pixel image. The GA-based technique [10] is applied to generate the fractal codes. Moreover, a simple classification scheme [10] for range blocks has been adopted to retain the regions where the gray level variation is minimal. In the classification scheme, the range blocks are grouped into two classes, viz. "smooth" and "rough". Every pixel value of a smooth range block is replaced by the average of all its pixel values; for each rough-type range block, the GA-based technique [10] is used to find the fractal codes. For the magnification algorithm, small range blocks of size 2×2 are considered for the computation of fractal codes. The compression ratio is reduced by considering small range blocks, but the finer details of rough-type range blocks are retained; since the main aim of a magnification task is to magnify the image while keeping all the image details, we have considered small range blocks. Using the obtained fractal codes, in the way described in Section 3, an image of size 256×256 is reconstructed. The reconstructed image is a two-times magnified version of the original image. It is found to be very close to the original image, as judged by the error measure and the similarity measure described in Sections 4 and 5, respectively. The fractal codes are then modified stepwise, as described in Section 3.3, to get images which are 4 times and 8 times larger than the original one. In each step, the error is measured with respect to the previous step, and the similarities of the magnified images are judged successively by the JND-based similarity criterion. The proposed algorithm is also compared with the nearest-neighbor technique for image magnification in terms of the proposed distortion measure and similarity measure. Nearest-neighbor is the simplest method of digital magnification: given an image of size w×w, to magnify it by a factor k, every pixel in the new image is assigned the gray value of the pixel in the original image which is nearest to it. This is equivalent to repeating the gray values k×k times to obtain the magnified image (a sketch is given below). The resultant image, for large magnification factors, will have prominent block-like structures due to lack of smoothness. The other techniques of digital image magnification are basically interpolation methodologies based on linear, bilinear, cubic or bicubic interpolation [23–26]. The proposed algorithm has also been implemented on a "low flying aircraft" (LFA) image of size 128×128 with gray level values in the range 0–255; the other parameters of the algorithm are kept the same as in the case of the "Lena" image. All the results obtained are presented in Tables 1 and 2. The original and decoded images of "Lena" are shown in Figs. 2 and 3, respectively. Figs. 4, 6 and 8 are, respectively, the two-times, four-times and eight-times magnified images of "Lena" using the proposed fractal-based technique, while Figs. 5, 7 and 9 are, respectively, the two-times, four-times and eight-times magnified images of "Lena" using the nearest-neighbor technique. Figs. 10 and 11 are, respectively, the original and decoded images of "LFA". The results of fractal-based magnification of the "LFA" image are shown in Figs. 12–14, while Fig. 15 is the eight-times magnified image of "LFA" using the nearest-neighbor technique. From Table 1, it is evident that, in terms of the proposed error criterion, the performance of the proposed fractal-based image magnification scheme is better than that of the nearest-neighbor technique.
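For reference, the nearest-neighbor baseline described above amounts to pixel replication; a minimal sketch:

```python
import numpy as np

def nearest_neighbor_magnify(image, k):
    """Magnify by an integer factor k: each source pixel's gray value is
    repeated over a k x k patch of the output."""
    return np.repeat(np.repeat(image, k, axis=0), k, axis=1)

# Example: an 8x8 ramp becomes 32x32 with visible k x k blocks.
small = np.arange(64, dtype=np.uint8).reshape(8, 8)
big = nearest_neighbor_magnify(small, 4)
print(big.shape)   # (32, 32)
```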
Fig. 2. Original "Lena" image.
Fig. 3. Decoded "Lena" image.
The results presented in Table 2 show that, in terms of the similarity criterion, the nearest-neighbor technique is better than the fractal-based technique for magnification by a factor of two, but the latter is better for magnification factors greater than two. Comparing Figs. 4 with 5, 6 with 7 and 8 with 9 visually,
Fig. 4. Two-times magnified "Lena" using the fractal technique.
Fig. 6. Four-times magnified "Lena" using the fractal technique.
Fig. 5. Two-times magnified "Lena" using nearest-neighbor.
one can find that some ringing and blurring are present in the case of the nearest-neighbor technique for magnification of order more than two. On the other hand, in the case of the proposed fractal-based magnification, a few blocking effects are observed.
Fig. 7. Four-times magnified "Lena" using nearest-neighbor.

7. Discussion and conclusions
The most important advantage of the proposed fractal image magnification technique is that it utilizes the coded (fractal) version of the input image instead of the original image. It is therefore cost effective in terms of storage space and time, as no decoding needs to be performed at the receiving end when the codes are transmitted. Another advantage of fractal image magnification is that it magnifies the image by expanding the fractal codes, or transformations, which may be looked upon as independent of image resolution. The only error involved
with it is the problem of discretization. Thus the structure and the shape of the image remain almost the same. In a sense, it is like interpolation, resulting in a sharper expanded image. Other image magnification schemes use pixel replication to expand the image; pixel replication makes an image blocky, blurry and patchy beyond a certain extent of expansion.
Fig. 8. Eight-times magnified "Lena" using the fractal technique.
The size of the range blocks plays a vital role in fractal image compression and fractal image magnification (cf. Section 3.1); in particular, the two tasks pull in opposite directions from the point of view of range block size. So one can think of an optimal range block size for which good-quality magnified images can be reconstructed from the fractal codes while, at the same time, a considerable amount of compression (in terms of compression ratio) is achieved. To address this problem, one can consider quadtree partitioning of the images instead of square partitioning while generating the fractal codes [27]. Another scheme for obtaining the PIFS codes of
an image, suggested by Thomas et al. [28], can also be adopted in this connection. In this scheme they considered irregularly shaped range blocks; the matched domain blocks are then just scaled and magnified versions of these irregularly shaped range blocks. In the present article we have introduced a new distortion measure, or fidelity criterion, to judge the performance of the proposed algorithm. There are other methods, namely non-parametric statistical tests [29], for the same purpose. The common tests for examining the degree of association between two distributions
Fig. 9. Eight-times magnified "Lena" using nearest-neighbor.
Fig. 10. Original "LFA" image.
Fig. 11. Decoded "LFA" image.
Fig. 12. Two-times magnified "LFA" using the fractal technique.
whose distribution functions are unknown are the sign test, the Wald–Wolfowitz run test, the Wilcoxon test and the Kolmogorov–Smirnov test. Another evaluation criterion, based on fractal dimension, has already been suggested by Lalitha et al. [30]. But the most important feature to be considered while examining the distortion between two images is the edge distribution of the images, as edges are very sensitive to human eyes; neither the fractal-based evaluation criteria nor the statistical tests take care of distortions present in the edges. The main advantage of the proposed error criterion is that it does take care of distortions in the edges. Thus, one important task in implementing the proposed distortion measure is to find the proper edges in the images. We have used a very simple technique for edge detection, though more sophisticated techniques may be employed. The other measure proposed for judging the performance of the proposed fractal-based image magnification technique is the JND-based similarity criterion. This measure also takes account of the distribution of edges, as JND is basically the change in contrast of an object with respect to its background. One disadvantage of this similarity measure, however, is that it deals with changes in pixel values while ignoring the edge pattern.
8. Summary

Image magnification is a process which increases image size, keeping all the image details unaffected, in order to highlight implicit information present in the image that is not evident as such. Image magnification is used in various applications such as satellite image analysis and medical image display. Generally, the image is represented in the form of a two-dimensional array of pixel values (matrix form), and it requires a huge memory space. The memory requirement for storage is greatly reduced by using some form of image compression. But for performing image
Fig. 13. Four-times magnified "LFA" using the fractal technique.
processing tasks, the coded image is usually brought back to its normal form. It is increasingly the case that the coded form of the image is used, instead of the normal form, as the input for different image processing tasks. With this aim in mind, an attempt is made in the present article to propose a new magnification technique which can be applied directly to the coded form of the image. In particular, the proposed algorithm uses the coded form of the image obtained by a fractal image compression technique. Fractal-based image compression has recently become very popular, and many techniques are available in the literature for finding the fractal code of an image. The encoding process involves computing a set of linear contractive maps from the target image. In the decoding process, the obtained set of maps is applied to an arbitrary image in an iterative way, resulting in an image which is very close to the target image. The set of maps is called the fractal code, or partitioned iterative function system (PIFS) code, of the image; in the iterative sequence, the PIFS code converges to a fixed image which is very close to the target image. The computational task of finding the PIFS code of an image is usually time consuming, but we have used a cost-effective Genetic Algorithm (GA)-based technique to find the PIFS code for the proposed magnification technique. Conventional magnification techniques are basically interpolation methodologies based on linear, bilinear, cubic or bicubic interpolation. In the proposed algorithm, the magnified version of an image is obtained by reconstruction from the fractal code of that image; no magnification operator such as interpolation is needed.
Fig. 14. Eight-times magnified "LFA" using the fractal technique.
Only the fractal code, in other words the set of affine contractive maps, is needed to magnify the image. We have proposed here the technique for magnification of orders which are multiples of two; the technique can be extended to magnification by any order. We have also proposed two techniques to judge the performance of the proposed magnification algorithm. The main task here is to measure the distortion or similarity between the given image and the magnified image. The commonly used distortion measure is the peak signal-to-noise ratio (PSNR), which is a function of the mean squared error (MSE). The MSE is a global measure and is also image-size dependent. The proposed tech-
niques are not only image-size independent but also utilize both global and local information. The first technique is a distortion measure based on the edge distribution of the images and indicates the influence of artifacts such as blocking, blurring and ringing which may appear due to magnification. The other is a similarity measure based on the just noticeable difference (JND), which is the change in luminance of an object pixel with respect to its background pixels. The overall performance of the proposed magnification technique is found to be satisfactory both qualitatively and quantitatively. A comparison with one of the most popular magnification techniques, the nearest-neighbor technique, is made.
Fig. 15. Eight-times magnified "LFA" using nearest-neighbor.
Acknowledgements

The authors gladly acknowledge Prof. S.K. Pal and Dr. S.N. Sarbadhikari for their helpful suggestions and encouragement during the course of this work. Dr. Murthy acknowledges the Center for Multivariate Analysis, Pennsylvania State University, University Park, PA 16802, USA, for the academic and financial assistance received in carrying out this work.
References

[1] P.S. Chavez Jr., Digital merging of Landsat TM and digitised NHAP data for 1:24 000 scale image mapping, Photogrammetric Eng. Remote Sensing 52 (1986) 1637–1646.
[2] R. Welch, M. Ehlers, Merging multi-resolution SPOT HRV and Landsat TM data, Photogrammetric Eng. Remote Sensing 53 (1987) 301–303.
[3] M.F. Barnsley, V. Ervin, D. Hardin, J. Lancaster, Solution of an inverse problem for fractals and other sets, Proceedings of the National Academy of Sciences (USA) 83 (1986).
[4] A.E. Jacquin, Fractal theory of iterated Markov operators with applications to digital image coding, Ph.D. thesis, Georgia Institute of Technology, August 1989.
[5] A.P. Pentland, Fractal based description of natural scenes, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-6 (6) (1984) 661–674.
[6] J.M. Keller, S. Chen, R.M. Crownover, Texture description and segmentation through fractal geometry, Comput. Vision Graphics Image Processing 45 (1989) 150–166.
[7] A.E. Jacquin, Image coding based on a fractal theory of iterated contractive image transformations, IEEE Trans. Image Processing 1 (1) (1992) 18–30.
[8] M.F. Barnsley, L.P. Hurd, Fractal Image Compression, A.K. Peters, MA, 1993.
[9] S.K. Mitra, C.A. Murthy, M.K. Kundu, Fractal based image coding using genetic algorithm, in: P.P. Das, B.N. Chatterji (Eds.), Pattern Recognition, Image Processing and Computer Vision: Recent Advances, Narosa Publishing House, New Delhi, 1995, pp. 86–91.
[10] S.K. Mitra, C.A. Murthy, M.K. Kundu, Technique for fractal image compression using genetic algorithm, IEEE Trans. Image Processing 7 (4) (1998) 586–593.
[11] D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, Reading, MA, 1989.
[12] L. Davis, Handbook of Genetic Algorithms, Van Nostrand Reinhold, New York, 1991.
[13] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, Springer, Berlin, 1992.
[14] G.A. Edgar, Measure, Topology, and Fractal Geometry, Springer, New York, 1990.
[15] J. Feder, Fractals, Plenum Press, New York, 1988.
[16] S. Bandyopadhyay, C.A. Murthy, S.K. Pal, Pattern classification with genetic algorithms, Pattern Recognition Lett. 16 (8) (1995) 801–808.
[17] D. Bhandari, C.A. Murthy, S.K. Pal, Genetic algorithm with elitist model and its convergence, Int. J. Pattern Recognition Artificial Intell. 10 (6) (1996) 731–747.
[18] S.A. Karunasekera, N.G. Kingsbury, A distortion measure for blocking artifacts in images based on human visual sensitivity, IEEE Trans. Image Processing 4 (6) (1995) 713–724.
[19] B. Ramamurthi, A. Gersho, Classified vector quantization of images, IEEE Trans. Commun. COM-34 (1986) 1105–1115.
[20] S. Daly, The visible differences predictor: an algorithm for the assessment of image fidelity, in: SPIE Conference on Human Vision, Visual Processing and Digital Display III, San Jose, CA, 1992, pp. 2–15.
[21] K.R. Boff, J.E. Lincoln, Engineering Data Compendium: Human Perception and Performance, AAMRL, Wright-Patterson AFB, OH, 1988.
[22] M.K. Kundu, S.K. Pal, Edge detection based on human visual response, Int. J. Systems Sci. 19 (12) (1988) 2523–2542.
[23] R.C. Gonzalez, R.E. Woods, Digital Image Processing, Addison-Wesley, Reading, MA, 1993.
[24] R.G. Keys, Cubic convolution interpolation for digital image processing, IEEE Trans. Acoust. Speech Signal Processing 29 (1981) 1153–1160.
[25] S.K. Park, R.A. Schowengerdt, Image reconstruction using parametric cubic convolution, Comput. Vision Graphics Image Processing 23 (1983) 258–272.
[26] R.A. Schowengerdt, S.K. Park, R. Gray, Topics in two-dimensional sampling and reconstruction of images, Int. J. Remote Sensing 5 (1984) 333–347.
[27] Y. Fisher, E.W. Jacobs, R.D. Boss, Fractal image compression using iterated transforms, in: J.A. Storer (Ed.), Image and Text Compression, Kluwer Academic Publishers, Dordrecht, 1992, pp. 35–61.
[28] L. Thomas, F. Deravi, Region-based fractal image compression using heuristic search, IEEE Trans. Image Processing 4 (6) (1995) 832–838.
[29] C.R. Rao, Linear Statistical Inference and its Applications, Wiley Eastern Limited, New Delhi, 1965.
[30] L. Lalitha, D.D. Majumder, Fractal based criteria to evaluate the performance of digital image magnification techniques, Pattern Recognition Lett. 9 (1989) 67–75.
About the Author: SUMAN K. MITRA was born in Howrah, India, in 1968. He received his B.Sc. and M.Sc. degrees in Statistics from the University of Calcutta, India. He is currently a Senior Research Fellow in the Machine Intelligence Unit of the Indian Statistical Institute, Calcutta. His research interests include Image Processing, Fractals, Pattern Recognition and Genetic Algorithms.

About the Author: C.A. MURTHY was born in Ongole, India, in 1958. He received his M.Stat. and Ph.D. degrees from the Indian Statistical Institute, Calcutta. He is currently an Associate Professor in the Machine Intelligence Unit of the Indian Statistical Institute. He visited Michigan State University, East Lansing, in 1991–1992 for six months, and the Pennsylvania State University, University Park, for 18 months in 1996–1997. His fields of interest include Pattern Recognition, Image Processing, Fuzzy Sets, Neural Networks, Fractals and Genetic Algorithms.

About the Author: M.K. KUNDU received his B.Tech., M.Tech. and Ph.D. (Tech.) degrees in Radio Physics and Electronics from the University of Calcutta in 1972, 1974 and 1991, respectively. In 1976, he joined the process automation group of the research and development division at the Tata Iron and Steel Company as an assistant research engineer. In 1982, he joined the Indian Statistical Institute, Calcutta, where he is currently a Professor in the Machine Intelligence Unit. He was Head of the Machine Intelligence Unit from September 1993 to November 1995. During 1988–1989, he was at the A.I. Laboratory of the Massachusetts Institute of Technology, Cambridge, USA, as a visiting scientist under a UN Fellowship. He visited the INRIA Laboratory and the International Center for Pure and Applied Mathematics (ICPAM) at Sophia Antipolis, France, in 1990 and again in 1993 under the UNESCO–INRIA–CIMPA fellowship program. He was a guest faculty member at the Department of Computer Science, Calcutta University, from 1993 to 1995. His current research interests include image processing, image data compression, computer vision, genetic algorithms, neural networks and conjoint image representation. He received the J.C. Bose memorial award from the Institution of Electronics and Telecommunication Engineers (IETE), India, in 1986. He is a life fellow of the IETE and a member of IUPRAI (Indian section of IAPR) and ISFUMIP (Indian section of IFSA).
Pattern Recognition 33 (2000) 1135–1146
Active models for tracking moving objects

Dae-Sik Jang, Hyung-Il Choi*
School of Computing, Soongsil University, 1-1, Sangdo-5 Dong, Dong-Jak Ku, Seoul 156-743, South Korea

Received 11 March 1998; received in revised form 12 April 1999; accepted 12 April 1999
Abstract

In this paper, we propose a model-based tracking algorithm which can extract the trajectory information of a target object by detecting and tracking a moving object from a sequence of images. The algorithm constructs a model from the detected moving object and matches the model with successive image frames to track the target object. We use an active model which characterizes regional and structural features of a target object such as shape, texture, color, and edgeness. Our active model can adapt itself dynamically to an image sequence so that it can track a non-rigid moving object. Such adaptation is made under the framework of energy minimization. We design an energy function which embodies structural attributes of a target as well as its spectral attributes. We apply a Kalman filter to predict motion information; the predicted motion information is used very efficiently to reduce the search space in the matching process. © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Tracking; Active model; Energy minimization; Kalman filter
1. Introduction

The problem of tracking moving objects has received a good deal of attention over the last few years, and various types of approaches to this topic can be found in the related literature [1–4]. Many of them utilize some form of prediction to reduce the area where the target is searched for; a Kalman filter may be applied for this purpose [5]. The tracking process can then be interpreted as a search for the target within the reduced scope. This process is usually embodied through model matching, and such an approach is often called a model-based method. Since a model-based approach is able to exploit specific attributes of target objects, it shows significant advantages over methods which utilize generic image features. The superiority becomes apparent especially in complex situations in which there are multiple moving objects or where there may be large motion of an object from one frame to the next [5].
* Corresponding author. Tel.: +82-2-820-0679; fax: +82-2-822-3622.
E-mail address: [email protected] (H.-I. Choi).
The major difficulty in a tracking problem is dealing with inter-frame changes of moving objects. It is clear that the image shape of a moving object may undergo deformation, since a new aspect of the object may become visible or the actual shape of the object may change. Thus a model needs to evolve from one time frame to the next, capturing the changes in the image shape of an object as it moves. In this paper, we address the problem of tracking non-rigid objects in complex scenes, including the case where other moving objects are present. We use an active model which characterizes regional and structural features of a target object such as shape, texture, color, and edgeness. Our active model can adapt itself dynamically to an image sequence; such adaptation is made under the framework of energy minimization. We design an energy function which embodies structural attributes of a target as well as its spectral attributes. Fig. 1 shows the overall procedure of our method. Our tracking algorithm has two main modules: a prediction module and an updating module. The prediction module estimates the motion parameters of a target; the estimates are used to limit the possible areas of the target in successive frames. The updating module accounts for inter-frame changes of a target.
Fig. 1. Overall flow of our tracking method.
It first seeks the best match for the old model through template matching. The best match is then incrementally transformed until the state of minimal energy is reached; the state of minimal energy, in turn, yields the updated model for the next frame. Our active model has some similarity to a snake model [6]: both models incrementally evolve to reflect inter-frame changes under the framework of energy minimization. However, a snake model does not involve target-specific information; it operates on generic evidence of image features. Our model, on the other hand, explicitly incorporates target-specific information, as will be seen later, in its energy function. Furthermore, our active model can characterize regional attributes as well as boundary attributes of a target.
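The paper does not give the filter equations at this point. As a hedged illustration of how the prediction module could limit the search area, here is a standard constant-velocity Kalman filter sketch; the state layout, noise levels and the search-window rule are our assumptions, not the authors'.

```python
import numpy as np

# State x = [px, py, vx, vy]: position and velocity of the target center.
F = np.array([[1, 0, 1, 0],      # constant-velocity transition (dt = 1)
              [0, 1, 0, 1],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],      # we observe position only
              [0, 1, 0, 0]], dtype=float)
Q = 0.1 * np.eye(4)              # assumed process noise
R = 2.0 * np.eye(2)              # assumed measurement noise

def predict(x, P):
    """Time update: predicted state and covariance for the next frame."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    """Measurement update with the matched target position z."""
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)          # Kalman gain
    x = x + K @ (z - H @ x)
    P = (np.eye(4) - K @ H) @ P
    return x, P

def search_window(x, P, n_sigma=3):
    """Assumed rule: center the template search on the predicted position,
    with half-width proportional to the positional uncertainty."""
    cx, cy = x[:2]
    half = n_sigma * np.sqrt(np.diag(P)[:2])
    return (cx - half[0], cx + half[0]), (cy - half[1], cy + half[1])
```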
2. Structure of active models

We represent a target in the form of a labeled graph. The nodes of the graph are determined at fiducial patches of the target image. Depending on the position of a patch, a node is called either a boundary node or an internal node. A node is labeled with an n-dimensional feature vector which describes the associated patch. The feature vector includes such attributes as color, edgeness, texture and the geometric structure among neighboring nodes. The edges of the graph are formed based on the concept of neighborhood. In order to define the fiducial patches on which nodes are formed, a target should first be determined. This initialization step can be done manually by a user or automatically by some identification module. Let us suppose that a target is somehow determined, as in Fig. 2(a). We first find the LER (least enclosing rectangle) of the target, and
partition the LER into cells of uniform size. The size of a cell can be varied depending on the size of a target. In this paper, we set the size of each cell to be 5;5 pixels. Among the partitioned cells, we select boundary cells and internal cells, and associate them to nodes of a model graph. The boundary cells are those which lie at the boundary of a target object and border a background as in Fig. 2(b), and the internal cells are those which are inside of a target. Boundary cells can be easily detected by following the border of a target. To determine internal cells, we do segmentation of a target into several sub-regions by exploiting color information of pixels belonging to a target. For example, we may utilize color histograms or some clustering algorithm for this purpose. Once segments of a target are formed, we take center cells of each segment as internal cells. The selected cells become nodes of a model graph, and they are linked to neighboring nodes. We consider three types of neighbors based on regional adjacency. The "rst is between boundary nodes. Boundary nodes are linked to their adjacent boundary nodes. The second is between internal nodes. Internal nodes are linked to each other when their associated sub-regions are adjacent. The third is between a boundary node and an internal node. Each internal node is linked to its closest boundary node. But, this type of linkage does not involve the internal nodes whose associated sub-regions do not touch boundary cells. Fig. 3 shows one example of a model graph where nodes are linked as was illustrated. We assign a feature vector to each node of a model graph. The feature vector is to re#ect regional characteristics of an associated cell as well as structural characteristics among its neighboring nodes. Four types of features are considered for this purpose; color feature, texture
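As an illustration of the cell-selection step, the sketch below (not the authors' code) partitions the LER of a binary target mask into 5×5 cells and separates boundary cells from interior candidates. The helper name partition_cells is ours, and the colour-based sub-segmentation the paper uses to pick internal cells is simplified away: every interior cell is returned as a candidate.

```python
import numpy as np

CELL = 5  # cell size used in the paper (5x5 pixels)

def partition_cells(mask):
    """Partition the LER of a binary target mask into CELL x CELL cells
    and classify the cells lying entirely inside the target as boundary
    cells (adjacent to background) or interior candidates."""
    ys, xs = np.nonzero(mask)
    ler = mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    h, w = ler.shape[0] // CELL, ler.shape[1] // CELL
    full = np.zeros((h, w), bool)
    for i in range(h):
        for j in range(w):
            cell = ler[i * CELL:(i + 1) * CELL, j * CELL:(j + 1) * CELL]
            full[i, j] = cell.all()  # cells partially outside the target are discarded
    padded = np.zeros((h + 2, w + 2), bool)  # background border outside the LER
    padded[1:-1, 1:-1] = full
    boundary, internal = [], []
    for i in range(h):
        for j in range(w):
            if not full[i, j]:
                continue
            nb = padded[i:i + 3, j:j + 3]  # 8-neighbourhood of cell (i, j)
            (internal if nb.all() else boundary).append((i, j))
    return boundary, internal

# toy usage: a filled 40x30 rectangle as the "target"
m = np.zeros((60, 50), bool)
m[10:50, 10:40] = True
b, itn = partition_cells(m)
print(len(b), "boundary cells,", len(itn), "interior candidates")
```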
Fig. 2. Boundary cells and interior cells.
Fig. 3. Example of a model graph.

We assign a feature vector to each node of the model graph. The feature vector reflects the regional characteristics of the associated cell as well as the structural characteristics among its neighboring nodes. Four types of features are considered for this purpose: a color feature, a texture feature, an edge feature, and a shape feature.

As for color features, we use hue (H) and saturation (S) of the HSI color space. Intensity (I) is not considered because it is desirable that features be robust to variations of brightness [16]. We compute the mean and standard deviation of these values over the pixels belonging to a cell. The computed values represent the color features of the associated cell:

\mu_H^i = \frac{1}{N}\sum_{k=x-w}^{x+w}\sum_{l=y-w}^{y+w} H(k,l), \qquad
\mu_S^i = \frac{1}{N}\sum_{k=x-w}^{x+w}\sum_{l=y-w}^{y+w} S(k,l),

\sigma_H^i = \sqrt{\frac{1}{N}\sum_{k=x-w}^{x+w}\sum_{l=y-w}^{y+w} (H(k,l)-\mu_H^i)^2}, \qquad
\sigma_S^i = \sqrt{\frac{1}{N}\sum_{k=x-w}^{x+w}\sum_{l=y-w}^{y+w} (S(k,l)-\mu_S^i)^2},

w = (W-1)/2,   (1)

where H(k,l) and S(k,l) are hue and saturation values, W is the width of a cell, and N is the number of pixels in a cell.

As for texture features, we use Gabor wavelet coefficients because they have been shown to be effective in characterizing texture information [7]. Gabor wavelets are biologically motivated convolution kernels which have the shape of plane waves restricted by a Gaussian envelope. Each wavelet g_{mn}(x,y) can be generated by appropriate dilations and rotations of a mother wavelet g(x,y) through the generating function

g(x,y) = \frac{1}{2\pi\sigma_x\sigma_y}\exp\left[-\frac{1}{2}\left(\frac{x^2}{\sigma_x^2}+\frac{y^2}{\sigma_y^2}\right)+2\pi j W x\right],

g_{mn}(x,y) = a^{-m} g(x',y'), \quad a > 1, \; m, n \text{ integers},

x' = a^{-m}(x\cos\theta + y\sin\theta), \qquad y' = a^{-m}(-x\sin\theta + y\cos\theta),   (2)

where m and n represent dilation and rotation indices, respectively, \theta = n\pi/K, and K and S are the total numbers of rotations and scales, respectively. In this paper, we set K and S to 4 and 3, respectively. Gabor wavelet coefficients are computed by convolving the intensity values of a cell with the wavelet kernels; the computed values represent the texture features of the associated cell:

G_{mn}^i = \sum_{k=x-w}^{x+w}\sum_{l=y-w}^{y+w} I(k,l)\, g_{mn}(k-x, l-y),   (3)

where I(k,l) is the intensity value of a pixel.
As for edge features, we compare the mean intensity value of a cell with the mean intensity values of its neighboring cells. For this comparison, we apply a Laplacian operator as in Eq. (4); Fig. 4 shows the Laplacian mask overlaid onto the cell of interest and its eight neighboring cells. The feature of edgeness is useful for characterizing boundary nodes.

Fig. 4. Cell mask for computing edgeness.

E_i = -M_{r_{x-1,y-1}} - M_{r_{x,y-1}} - M_{r_{x+1,y-1}} - M_{r_{x-1,y}} + 8 M_{r_{x,y}} - M_{r_{x+1,y}} - M_{r_{x-1,y+1}} - M_{r_{x,y+1}} - M_{r_{x+1,y+1}},   (4)

where M_{r_{x,y}} denotes the mean intensity of the cell at position (x,y).

As for shape features, we define a measure which reflects the geometric disposition depicted by neighboring nodes. When a node N_i is linked to n neighboring nodes N_{k_1}, ..., N_{k_n}, we represent each linkage as a vector V_{i,k_j} from the center coordinate of N_i to that of N_{k_j}. By taking their average, we get a reference vector V_i^R. The reference vector is used to estimate the average value of the differentials; the estimate is made by evaluating the average of the differences between the reference vector and every V_{i,k_j}:

V_i^R = \frac{1}{n}(V_{i,k_1} + \cdots + V_{i,k_n}),

S_i = \frac{1}{n}(|V_{i,k_1} - V_i^R| + \cdots + |V_{i,k_n} - V_i^R|).   (5)

The shape measure has its smallest value, 0, when the nodes are evenly spaced on a straight line, and larger values as the nodes are scattered more asymmetrically. It is a coordinate-independent measure, as the same value is obtained under translation and rotation; this is a desirable characteristic for model matching. The computed four types of features characterize each node in the form of a feature vector as in Eq. (6):

F(C_i) = [\mu_H^i, \mu_S^i, \sigma_H^i, \sigma_S^i, G_{00}^i, \ldots, G_{mn}^i, S_i, E_i]^T.   (6)

3. Energy functional of active models

Our model graph is to be matched against a new input frame in order to reliably find a target object. We interpret the matching process as an energy minimizing process: we devise a matching metric in the form of an energy function so that the best match is found when the energy function is minimized. Our active model is an example of a more general deformable model. It adapts itself to an image by means of energy minimization, and the adapted graph then becomes the new model for the next frame. We define the energy function of a model as the weighted sum of differences between the feature values of the model and the feature values extracted from the input image. The feature values of the input image are computed at the cells where the nodes of the model graph are positioned; the size of these cells is the same as was used when the feature values of the model were computed. Our energy function has four terms, color energy, texture energy, edge energy, and shape energy:

E^*_{model} = \sum_{i \in M} (\alpha E_{color}(C_i) + \beta E_{gabor}(C_i) + \gamma E_{shape}(C_i) + \delta E_{edge}(C_i)).   (7)

In (7), the summation is over all the nodes in the model graph, and \alpha, \beta, \gamma and \delta are weights that balance the relative strengths of the terms. Each term of Eq. (7) is defined as a Euclidean distance between corresponding feature values. Eqs. (8)-(11) show how these terms are computed, where the subscript i denotes a node index, and the superscripts M and I denote the model and the image, respectively:

E_{color}(C_i) = \|C_i^M - C_i^I\|,   (8)
E_{gabor}(C_i) = \|G_i^M - G_i^I\|,   (9)
E_{edge}(C_i) = |E_i^M - E_i^I|,   (10)
E_{shape}(C_i) = |S_i^M - S_i^I|.   (11)

The color energy involves the means and standard deviations of hue and saturation values in the form of a feature vector C = [\mu_H, \mu_S, \sigma_H, \sigma_S]. In Eq. (8), C_i^M is the color feature vector contained in the ith node of the model graph, and C_i^I is the color feature vector extracted from the image cell on which the ith node lies. This energy term encourages nodes to move toward locations whose color features are close to those of the target. The above formula reflects not only the difference between the directions of two vectors, but also the difference in magnitude. If the difference in magnitude has a wide dynamic range, only the relative difference of directions can be considered, by normalizing the two vectors before taking the difference. Eq. (12) shows one example of such a normalization, which bounds the values in the interval [0, 2]:

E_{color}(C_i) = \left\| \frac{C_i^M}{\|C_i^M\|} - \frac{C_i^I}{\|C_i^I\|} \right\|.   (12)

The texture energy involves the Gabor wavelet coefficients in the form of a feature vector G^i = [G_{00}^i, \ldots, G_{mn}^i]^T, and the edge energy involves the edgeness of nodes and image cells. The texture energy leads the nodes of the graph to
locations whose textural characteristics agree with those of the model. This energy is especially useful for maintaining the internal structure of the model graph. The edge energy forces boundary nodes to stay at the boundary of the target and internal nodes to remain inside the target. The shape energy reveals the structural change of the model graph during the matching process. Since the degrees of nodes do not vary during the minimization, the tendency is for nodes to retain their geometric structures with their adjacent nodes. This tendency may suppress the adaptability of the model, but it has the effect of preventing nodes from drifting widely around; furthermore, we presume that a target does not change its shape rapidly from one frame to the next. The behavior of our active model is similar to that of a snake model in the sense that both work under the principle of energy minimization, but our model differs from a snake model in the energy terms involved. While a snake model embodies energy forces in terms of the connectivity of nodes and image features, our model embodies energy forces in terms of differences between model feature values and image feature values. In other words, a snake model does not contain model-specific features; it relies on generic image features. For example, a snake model does not say that edgeness has to be in some range, it simply prefers stronger edgeness. Our active model, on the other hand, embeds the features of a target object in its energy forces: it explicitly specifies that feature values have to lie in certain ranges.
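A compact sketch of the energy terms of Eqs. (7)-(12) follows. The node representation (a dict of feature arrays) and the unit weights are illustrative assumptions; only the distance computations mirror the equations.

```python
import numpy as np

ALPHA, BETA, GAMMA, DELTA = 1.0, 1.0, 1.0, 1.0  # weights of Eq. (7), values illustrative

def node_energy(mn, im):
    """Per-node energy, Eqs. (8)-(11); each node holds 'color' (Eq. (1)) and
    'gabor' (Eq. (3)) vectors plus scalar 'edge' (Eq. (4)) and 'shape' (Eq. (5))."""
    e_color = np.linalg.norm(mn['color'] - im['color'])  # Eq. (8)
    e_gabor = np.linalg.norm(mn['gabor'] - im['gabor'])  # Eq. (9)
    e_edge = abs(mn['edge'] - im['edge'])                # Eq. (10)
    e_shape = abs(mn['shape'] - im['shape'])             # Eq. (11)
    return ALPHA * e_color + BETA * e_gabor + GAMMA * e_shape + DELTA * e_edge

def model_energy(model_nodes, image_nodes):
    """Total energy of Eq. (7): sum over all nodes of the model graph."""
    return sum(node_energy(m, i) for m, i in zip(model_nodes, image_nodes))

def normalized_color_energy(cm, ci):
    """Direction-only colour term of Eq. (12), bounded by [0, 2]."""
    return np.linalg.norm(cm / np.linalg.norm(cm) - ci / np.linalg.norm(ci))

nm = {'color': np.ones(4), 'gabor': np.zeros(12), 'edge': 2.0, 'shape': 0.5}
ni = {'color': 1.1 * np.ones(4), 'gabor': np.zeros(12), 'edge': 1.5, 'shape': 0.5}
print(model_energy([nm], [ni]))
```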
4. Updating a model through prediction and energy minimization

Our tracking method operates by comparing the model at a given frame, M_t, to the image at the next frame, I_{t+1}, in order to find the cells specifying the state of minimal energy for that model in that image. The new model, M_{t+1}, is then generated by treating the feature values of the selected cells as node entities and organizing them in the form of a graph. This new model, M_{t+1}, represents the target object as it appears in the next time frame. This process of tracking can be decomposed into two steps: (1) estimating a possible area of the target at the current time frame by a Kalman filter, and (2) updating the model of the target through energy minimization. In order to reduce the scope of the image where the model is to be compared, we utilize a Kalman filter [8-10]. In other words, we design a Kalman filter whose state parameters induce an expected area for the target in the next time frame. This is equivalent to encoding the tracking history into the state parameters of a Kalman filter and estimating the motion parameters of the target.
For this purpose, we take a linear state model as in Eq. (13):

s(t) = \Phi(\Delta t)\, s(t - \Delta t) + w(t - \Delta t),   (13)
where s(t) denotes the system state at time frame t, \Phi(\Delta t) denotes the state transition matrix over a frame interval \Delta t, and w(t) denotes an estimation error. We express the system state as an eight-dimensional vector containing information on the position and size of the target. Eq. (14) shows such a system state and its estimation error, respectively:
s(t) = [\Delta x(t), \Delta y(t), xs(t), ys(t), \Delta x'(t), \Delta y'(t), xs'(t), ys'(t)]^T,

w(t) = [w_{\Delta x,t}, w_{\Delta y,t}, w_{xs,t}, w_{ys,t}, w_{\Delta x',t}, w_{\Delta y',t}, w_{xs',t}, w_{ys',t}]^T.   (14)
The positional change of the target, \Delta x and \Delta y, is represented by the displacement of its center coordinates per unit frame interval, and its size, xs and ys, by the horizontal and vertical lengths of its enclosing rectangle. The prime symbol denotes a derivative with respect to t. We also assume that the trajectory of the target varies with a constant acceleration and that the size of the target varies linearly. Under this assumption, the state transition matrix has the following form:
\Phi(\Delta t) =
\begin{pmatrix}
1 & 0 & 0 & 0 & \Delta t & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & \Delta t & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & \Delta t & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & \Delta t \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{pmatrix}.   (15)
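Eq. (15) is straightforward to build in code; the sketch below constructs Phi(dt) and applies it to a toy state laid out as in Eq. (14).

```python
import numpy as np

def make_phi(dt):
    """8x8 transition matrix of Eq. (15): the first four components
    (dx, dy, xs, ys) advance by dt times their derivatives, and the
    derivatives themselves stay constant."""
    phi = np.eye(8)
    phi[:4, 4:] = dt * np.eye(4)
    return phi

s = np.array([2.0, 1.0, 30.0, 50.0, 0.1, 0.0, 0.5, 0.5])  # toy state, Eq. (14)
print(make_phi(1.0) @ s)  # one-frame prediction
```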
A Kalman "lter algorithm estimates system states based on a set of measurements. We assume a linear relationship between system states and a set of measurements. m(t)"H(t)s(t)#v(t),
(16)
where m(t) denotes a set of measurements, H(t) an observation matrix, and v(t) measurement errors. We measure the positional change and the size of the target at each time
frame to obtain the values of m(t). m(t) and H(t) are formed as follows:

m(t) = [\Delta x, \Delta y, xs, ys]^T,

H(t) =
\begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0
\end{pmatrix}.   (17)
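Given Phi and H, one predict/update cycle of the recursive filter can be sketched as below; the process and measurement covariances Q and R are tuning assumptions that the paper does not specify.

```python
import numpy as np

H = np.hstack([np.eye(4), np.zeros((4, 4))])  # observation matrix of Eq. (17)

def kalman_step(s, P, m, phi, Q, R):
    """One predict/update cycle for the linear models of Eqs. (13) and (16)."""
    s_pred = phi @ s                       # state prediction
    P_pred = phi @ P @ phi.T + Q           # covariance prediction
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)  # Kalman gain
    s_new = s_pred + K @ (m - H @ s_pred)  # correction by the measurement m(t)
    P_new = (np.eye(8) - K @ H) @ P_pred
    return s_new, P_new

phi = np.eye(8)
phi[:4, 4:] = np.eye(4)                    # Phi(dt) with dt = 1 frame
s, P = np.zeros(8), np.eye(8)
m = np.array([2.0, 1.0, 30.0, 50.0])       # measured (dx, dy, xs, ys)
s, P = kalman_step(s, P, m, phi, 0.01 * np.eye(8), np.eye(4))
print(s[:4])
```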
Now that we have de"ned a system model and measurement model, we can apply a recursive Kalman "lter algorithm to obtain the estimates of motion parameters such as expected positional and size changes of a target [5]. The resulting estimates are used to set a searching area of a target at the next time frame. That is, we determine the center coordinate (x , y ) of a searching A A area by utilizing the estimated values of *x and *y, and we set the size of a searching area to be 1.5 times the estimated size. x "xNPCT#*x, y "yNPCT#*y, A A A A w "xsH1.5, w "ysH1.5, (18) V W where xNPCT and yNPCT denote center coordinates of a A A searching area at a previous frame, and *x, *y, xs, ys denote the estimates entities. Having speci"ed a searching area for a target, we are now interested in "nding cells whose feature values agree well with those in a model graph M . This is equivalent to "nding cells for which the R energy function EH is minimized. The process of minKMBCJ imization makes M continue to deform itself until it R reaches a stable state which corresponds to a new model M . We accomplish this process of model updating in R> two steps: (1) "nding the initial location where M is to be R positioned, and (2) computing M from M and image R> R features. To determine initial cells over which a model M is R anchored, we treat M as a template with feature values R
and attempt to match it against image features. The matching proceeds by sequentially moving the template from top to bottom and left to right within the predefined search area. Recall that each node of M_t contains four different types of features. Among them, we employ three: the color, texture, and edge features. The shape feature is excluded, since the shape of M_t is not allowed to change during this matching process. To determine the best match, we use the sum of distances between model features and image features as in Eq. (19):

d_i = \|F_M(C_i) - F_I(C_i)\|, \qquad S = \frac{1}{1 + \sum_{i \in M} d_i},   (19)

where F_M(C_i) denotes the feature values of the ith node of the model, and F_I(C_i) denotes the feature values of the corresponding image cell. The measure S has a value between 0 and 1 and denotes the degree to which the current location contains feature values similar to those of the model. Fig. 5(a) illustrates how this process of template matching operates. The rectangle of size w_x × w_y denotes the search area determined by the Kalman filter. In this example, the best match is found as in Fig. 5(b). Having identified the best location of the model M_t in the subsequent image frame I_{t+1}, it remains to build M_{t+1} by determining which cells of I_{t+1} are part of the new model. We do this by incrementally altering M_t until its energy function reaches the minimum. The state of minimal energy then reveals the new model M_{t+1}. There are many algorithms which solve the minimization problem; for example, dynamic programming [11], variational calculus [12], and a greedy algorithm [13]. Each has its own advantages and disadvantages. We apply a greedy algorithm to our problem of model updating.
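Before the greedy refinement, the template-matching scan just described can be sketched as follows; feats_at, a callback returning the image features of the cells under a candidate position, is a hypothetical helper.

```python
import numpy as np

def similarity(model_feats, image_feats):
    """Score S of Eq. (19); the feature vectors stack the colour, texture
    and edge features of every node (the shape feature is excluded)."""
    d = [np.linalg.norm(m - i) for m, i in zip(model_feats, image_feats)]
    return 1.0 / (1.0 + sum(d))

def best_match(model_feats, feats_at, positions):
    """Scan candidate positions inside the predicted search window and
    keep the one maximising S."""
    scores = {p: similarity(model_feats, feats_at(p)) for p in positions}
    return max(scores, key=scores.get)

# toy usage: 3 nodes with 2-dimensional features, 2 candidate positions
mf = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
fake = {(0, 0): [f + 0.5 for f in mf], (5, 0): [f.copy() for f in mf]}
print(best_match(mf, lambda p: fake[p], list(fake)))  # -> (5, 0)
```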
Fig. 5. Example of template matching.
Fig. 6. Pseudo-code for energy minimization.
A greedy algorithm works iteratively as illustrated in Fig. 6. During each iteration, the algorithm sequentially determines the cells of I_{t+1} over which the nodes of the model are to be superimposed. We compute the energy function for the current location of a node N_i and for each of its neighboring cells. The location having the smallest value is chosen as the new location of node N_i. The node N_{i-1} has already moved to its new position during the current iteration; its location, along with those of the other adjacent nodes, is used to compute the shape energy term for the proposed location of N_i. Having found the best cells of I_{t+1} on which the energy function is minimized, the new model M_{t+1} is constructed by associating the feature values of the found cells with the nodes of a graph. We assign feature values of the cells as node entities and link the nodes in a predefined format.
The new model is then used to track the target in the subsequent frame. The new model is isomorphic to the old one in terms of the number of nodes and the linkage among nodes, but the node entities of the new model can differ from those of the old one. A greedy algorithm is quite simple and runs much faster than the other algorithms, but it does not guarantee a global minimum. One means of dealing with this problem is to set the initial location of the model close to an ideal position, and our template matching step serves this purpose: it finds the translation of the current model giving the minimum distance as in Eq. (19), and the resulting location becomes the initial position from which the energy minimization starts to operate. The template matching also has the effect of reducing the running time of the minimization process.
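The greedy pass of Fig. 6 can be sketched as below. The energy and neighbourhood callbacks are assumptions about the interface; the point of the sketch is the sequential node-by-node update, in which nodes moved earlier in a pass already contribute their new positions to the shape term of later nodes.

```python
def greedy_pass(nodes, energy_of, neighbours_of):
    """One iteration of the greedy minimisation: each node in turn moves
    to the candidate cell (its own or a neighbour) of lowest energy."""
    moved = 0
    for idx, cell in enumerate(nodes):
        candidates = [cell] + list(neighbours_of(cell))
        best = min(candidates, key=lambda c: energy_of(idx, c, nodes))
        if best != cell:
            nodes[idx] = best   # later nodes see this updated position
            moved += 1
    return moved                # iterate until no node moves

# toy usage on a 1-D "image": every node is drawn toward position 3
nodes = [0, 1, 6]
while greedy_pass(nodes, lambda i, c, ns: abs(c - 3), lambda c: (c - 1, c + 1)):
    pass
print(nodes)  # -> [3, 3, 3]
```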
5. Experimental results

We evaluated the suggested method under realistic tracking conditions. This section presents some experimental results which illustrate the operational characteristics of the proposed method. A camera was set up in a laboratory with fluorescent lights on its ceiling; we turned on some of the lights in the middle of the tracking process in order to evaluate the effects of illumination changes. We set the frame interval to 1/10 s, since the current implementation of the method requires such a processing time per frame for a video image (360×240 pixels with 32 bits per pixel) on an IBM PC. Recall that the initial model of a target object can be made manually by a user or automatically by some identification module. In the current implementation, we detect the entrance of a person into the field of view of the camera and use the detected person as the target. The detection process is performed by comparing two successive frames and identifying the differing areas of one
image against the other [5]. It is presumed that there is no moving object in the field of view when the system begins to operate. Fig. 7 shows the images from which an entering object has been detected. Figs. 8 and 9 illustrate the process of generating an initial model. Figs. 7(a) and (b) are compared at the pixel level to detect changes, and the detected changes are binarized to yield Fig. 8(a). We can notice that several regions are detected. A labeling process is applied to identify connected regions, and small regions are then filtered out [14]. Fig. 8(b) shows the one remaining region, which corresponds to the target object of Fig. 8(c). To represent the target object as a labeled graph of boundary and internal nodes, the target object is partitioned into cells of size 5×5. We then determine boundary and internal cells as in Fig. 8(d). Boundary cells are those which lie along the boundary of the target object; we excluded those cells whose pixels only partially belong to the target object, which is why some of the boundary pixels are discarded in Fig. 8(d). To determine internal cells, the pixels of the target object are segmented by the fuzzy c-means clustering algorithm [15]. Hue (H) and saturation (S) of the HSI color space are used as input features for the clustering [19,20]. Fig. 9(a) shows the segmented areas. Three prominent clusters are detected, and their hue and saturation values are used to segment the target object into homogeneous segments [17,18]. We discarded small segments and selected the three largest ones. The center cells of the selected segments become the internal cells; Fig. 9(b) shows the selected internal cells. The boundary and internal cells correspond to the nodes of our model graph. We establish the linkage among the nodes and evaluate the features of each node. Fig. 9(c) shows the topological structure of our model graph. The upper body of a moving person was determined as the target. We can notice that the model contains just three internal nodes, while it contains a relatively large number of boundary nodes. This implies that in this experiment
Fig. 7. Images for an initial model.
Fig. 8. Illustrative example of generating an initial model.
Fig. 9. Illustrative example of generating an initial model.
boundary nodes play a more important role than internal nodes. The moving object has a small number of internal nodes because it is composed of a small number of sub-regions. However, if a moving object is very complex and is divided into many sub-regions, it may have many internal nodes, and consequently the internal nodes would become more important for tracking and matching the target. The person was asked to move around in arbitrary directions, which means that the size and shape of the target can vary during the tracking process. Another person was asked to move in and cross over the target person in the middle of the tracking process. Fig. 10 illustrates how our tracking process operates. The dotted white rectangle denotes the window predicted by the Kalman filter, over which the target is to be searched for. The solid white rectangle denotes the enclosing rectangle of the target object extracted through energy minimization. The nodes of the updated model are also pictured as small rectangles.
Fig. 10. Example of tracking process.
Fig. 11. Average plot of model energy over the course of time.
Fig. 10(c) illustrates a situation where another object moves across the target, and Fig. 10(b) shows a situation where some lights are turned on. We can notice that the target is still tracked successfully under such conditions. When another object moves across the target, approaches based on difference images may cause the target to be confused or mixed up with the other object. Our approach, however, could distinguish the target robustly from the other object because the active model contains target-specific features that are distinguishable from those of the other object. In particular, the shape feature prevents the model from being mixed up with the other object because it restricts unexpected structural variations of the model. Fig. 10(b) also shows that our model is robust to variations of brightness. Fig. 10(f) illustrates a situation where the target has changed its shape substantially by altering its moving direction; we can confirm that our active model adapts itself reliably to such variations of the target. Fig. 11 plots the time course of energy minimization. The horizontal axis represents the number of iterations, and the vertical axis represents the average value of the model energy, taken over a sequence of 100 frames. We can notice that four or five iterations are enough to reach a saturated value. We attribute such fast convergence to prediction and template matching: the template matching places the old model in a good position from which the minimization begins to operate, and it also has the effect of preventing the minimization process from falling into a local minimum.
6. Summary

In this paper, we propose a model-based tracking algorithm which can extract the trajectory information of
a target object by detecting and tracking a moving object in a sequence of images. The algorithm constructs a model from the detected moving object and matches the model against successive image frames to track the target object. We use an active model which characterizes regional and structural features of a target object such as shape, texture, color, and edgeness. Our active model can adapt itself dynamically to an image sequence, so that it can track a non-rigid moving object. Such an adaptation is made under the framework of energy minimization. We designed the energy function so that it embodies structural attributes of a target as well as its spectral attributes. We applied a Kalman filter to predict motion information; the predicted motion information was used very efficiently to reduce the search space in the matching process. Our tracking algorithm has two main modules: a prediction module and an updating module. The prediction module estimates the motion parameters of a target, and the estimates are used to limit the possible areas of the target in successive frames. The updating module accounts for the inter-frame change of a target: it first seeks the best match for the old model through template matching, and the best match is then incrementally transformed until the state of minimal energy is reached. The state of minimal energy, in turn, reveals the updated model for the next frame. Our active model has some similarity to a snake model [6]: both models incrementally evolve to reflect inter-frame changes under the framework of energy minimization. However, a snake model does not involve target-specific information; it operates on generic evidence of image features. Our model, on the other hand, explicitly incorporates target-specific information in its energy function. Furthermore, our active model can characterize regional attributes as well as boundary attributes of a target.
References

[1] O. Lee, Y. Wang, Motion-compensated prediction using nodal-based deformable block matching, J. Visual Commun. Image Represent. 6 (1) (1995) 26-34.
[2] A.L. Gilbert, M.K. Giles, R.B. Flachs, R.B. Rogers, Y.U. Hsun, A real-time video tracking system, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-2 (1980) 47-56.
[3] T. Uno, M. Ejiri, T. Tokunaga, A method of real-time recognition of moving objects and its application, Pattern Recognition 8 (1976) 201-208.
[4] D.P. Huttenlocher, J.J. Noh, W.J. Rucklidge, Tracking non-rigid objects in complex scenes, Fourth International Conference on Computer Vision, 1993, pp. 93-101.
[5] D.-S. Jang, G.-Y. Kim, H.-I. Choi, Model-based tracking of moving object, Pattern Recognition 30 (6) (1997) 999-1008.
[6] M. Kass, A. Witkin, D. Terzopoulos, Snakes: active contour models, Int. J. Comput. Vision (1988) 321-331.
[7] W.Y. Ma, B.S. Manjunath, Texture features and learning similarity, IEEE International Conference on Computer Vision and Pattern Recognition, San Francisco, CA, June 1996.
[8] S. Haykin, Adaptive Filter Theory, Prentice-Hall, Englewood Cliffs, NJ, 1986.
[9] G. Minkler, J. Minkler, Theory and Application of Kalman Filtering, Magellan, 1994.
[10] A. Azarbayejani, T. Starner, B. Horowitz, A. Pentland, Visually controlled graphics, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-15 (1993) 602-605.
[11] A.A. Amini, S. Tehrani, T.E. Weymouth, Using dynamic programming for minimizing the energy of active contours in the presence of hard constraints, Proceedings of the Second International Conference on Computer Vision, 1988, pp. 95-99.
[12] M. Kass, A. Witkin, D. Terzopoulos, Snakes: active contour models, Proceedings of the First International Conference on Computer Vision, 1987, pp. 259-269.
[13] D.J. Williams, M. Shah, A fast algorithm for active contours, Third International Conference on Computer Vision, 1990, pp. 592-595.
[14] H. Takahashi, F. Tomita, Fast region labeling with boundary tracking, IEEE ICIP'89, 1989, pp. 369-373.
[15] J.C. Bezdek, R. Ehrlich, W. Full, FCM: the fuzzy c-means clustering algorithm, Comput. Geosci. 10 (2-3) (1984) 191-203.
[16] Y. Dai, Y. Nakano, Extraction of facial images from complex background using color information and SGLD matrices, International Workshop on Automatic Face and Gesture Recognition, Zurich, 1995, pp. 238-242.
[17] F. Perez, C. Koch, Toward color image segmentation in analog VLSI: algorithm and hardware, Int. J. Comput. Vision 12 (1) (1994) 17-42.
[18] Y.I. Ohta, T. Kanade, T. Sakai, Color information for region segmentation, Comput. Vision Graphics Image Process. 13 (1980) 222-241.
[19] C.C. Yang, J.J. Rodriguez, Efficient luminance and saturation processing techniques for bypassing color coordinate transformation, IEEE Systems, Man, Cybernet., 1995, pp. 667-672.
[20] K. Sobottka, I. Pitas, Segmentation and tracking of faces in color images, International Conference on Automatic Face and Gesture Recognition, October 1996, pp. 236-241.
About the Author: HYUNG-IL CHOI received the B.S. degree in electronic engineering from Yonsei University, Korea, in 1979, and the M.S. and Ph.D. degrees in computer engineering from the University of Michigan in 1982 and 1987, respectively. Dr. Choi is presently a professor and director of the Computer Vision Laboratory. He has published more than 20 papers in journals and conference proceedings. Dr. Choi's areas of research interest include computer vision, fuzzy and neural networks, pattern recognition, and knowledge-based systems. He is a member of the IEEE.

About the Author: DAE-SIK JANG received the B.S., M.S. and Ph.D. degrees in computer science from Soongsil University, Korea, in 1994, 1996 and 1999, respectively. He is currently working as a researcher for the Institute of Industrial Technology, Soongsil University. Dr. Jang's research interests are motion understanding, fuzzy systems, neural networks and pattern recognition.
Pattern Recognition 33 (2000) 1147-1160
Extraction of strokes in handwritten characters Eric L'Homer* CMLA-DIAM, ENS de Cachan, 94235 Cachan Cedex, France Received 26 August 1998; received in revised form 6 April 1999; accepted 6 April 1999
Abstract

Among the many handwritten character recognition algorithms that have been proposed in the past few years, few use models which are able to simulate handwriting. This can be explained by the fact that simulation models require the estimation of strokes starting from static images of letters, while crossing and overlapping strokes make this estimation difficult. The approach we suggest is to deal efficiently with crossing areas and overlaps using parametric representations of the line and thickness of a stroke: a probabilistic model of strokes is described to extract the non-overlapping strokes of the image. A Bayesian approach using a statistical study and a model of stroke crossings is described, which optimizes the reconstruction of crossings and makes it possible to characterize images of letters by robust graphs of curves. 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Handwritten character recognition; Thinning algorithms; Graphs of strokes; Stroke crossing; Stroke path detection
1. Introduction

An algorithm for reconstructing lines in cursive handwriting using a parametric model is presented. Thinning algorithms are used by handwritten character recognition techniques to represent letters as strokes, loops or crossings. Many thinning algorithms have been proposed in the past few years, but these algorithms are not yet really suited to images of handwritten letters. Two techniques are usually used by thinning algorithms. Parallel thinning algorithms [1-3] are fast and easy to code, but the skeletons of closed loops and single strokes give the same result (a single line), and overlapping strokes often generate short segments which distort the representation of a letter. Fig. 1 shows the skeleton of an image of the word "assure", which summarizes some of these difficulties. Some algorithms have been proposed to reduce the number of short segments which appear at crossings [4-6], or to estimate closed loops [7], but these algorithms do not estimate overlapping stroke paths.
* Corresponding author. E-mail address: [email protected] (E. L'Homer).
Vector thinning algorithms link the boundary points of a letter in order to find the underlying strokes [8,9]. These algorithms give plane lines as sequences of points, but cannot deal with crossings. Bodies of letters are then split into two parts: one part for regular strokes and one part for components whose boundaries are not linked. This dual description of letters into lines and "crossing areas" is better than the result given by parallel thinning algorithms, but is not satisfying enough to represent and recognize letters. Doermann and Rosenfeld [10,11] address this problem of inferring strokes to reconstruct crossing areas. Their process first detects stroke-like and non-stroke-like regions of a letter. The local configuration of strokes (a local measure of the confidence that a given pair of segments are portions of the same stroke) and the local compatibility of the reconstruction with the image are used to interpret a non-stroke region as a high-curvature point, corner, crossing, etc. A cubic-spline approximation is used to reconstruct strokes. Although this work seems to be the most useful, the proposed approach does not use a global model of crossing areas, and would then need to be extended to all simple cases. The approach we suggest is to deal efficiently with crossing areas and overlaps using parametric representations of the lines and thickness of strokes, a global model of
Fig. 1. Skeleton obtained from a classical thinning algorithm: the closed loop is described as a single stroke (a), and the paths of overlapping strokes are not well described (b, c, d).

Fig. 2. Relations between the stroke path and the tangents of the stroke boundaries.
the crossing of strokes, and statistical studies to estimate the model parameters. Several remarks have convinced us to seek such results, in the perspective of using strokes to recognize handwritten words. First, crossings are less stable than strokes in letters and particularly in words, where overlaps between strokes of different letters create crossings which do not appear in single letters. Moreover, strokes that are cut by a crossing are less stable than whole strokes. Finally, the process we have tested to decide whether the trace of a stroke is, in fact, the path of a closed loop seems to show that the only way to decide is to try to link the stroke with two other single strokes. This forces us to treat crossing areas if we want to extract stable structural information on handwritten letters and words. We first define a model of strokes and present an algorithm to extract non-overlapping strokes, which allows us to split letters into two parts. A second algorithm that optimizes the reconstruction of crossings is then proposed. Finally, results of our algorithms on a NIST database are presented.
2. Extraction of simple strokes

We present here a simple model of strokes which uses relations between boundaries and strokes to estimate lines. We consider in this section images of strokes without noise.

2.1. Definitions and properties

Let a stroke be a family of discs \{D_t, t \in [0,1]\}, defined by the stroke path l: [0,1] \to \mathbb{R}^2 and the radius of the discs r: [0,1] \to \mathbb{R}^{+*}. We assume that the functions l and r verify the two following properties: l and r are C^2 and r'(t) < v(t) for t \in [0,1], with v = \|dl/dt\|. The trace of a stroke on \mathbb{R}^2 is the set G = \{M \in \mathbb{R}^2, \exists t \in [0,1], M \in D_t\}, and the stroke boundary is B = \bar{G} - \mathring{G}. With these conditions on l and r, the two envelopes (E_1, E_2) of a stroke are always defined. The envelopes of a family of discs D_t are the points of the circumferences \{C_{t,u}, t \in [0,1], u \in [-\pi,\pi]\} which verify that \partial C_{t,u}/\partial t is parallel to \partial C_{t,u}/\partial u; that is, with l'(t) = (x'_t, y'_t)^T:

E_{1,2}(t) = l(t) - \frac{r r'}{v^2}\, l'(t) \mp \frac{r}{v}\sqrt{1 - \frac{r'^2}{v^2}} \begin{pmatrix} -y'_t \\ x'_t \end{pmatrix}.   (1)

Fig. 2 shows an example of a stroke with its envelopes. Let E be the set of points of the envelopes; then B is strictly included in E \cup C_0 \cup C_1, where C_0 and C_1 are the circumferences of the first and last discs of the stroke. So, to study the relations between B and the stroke path, we first describe the relations between E and the stroke path, and then we study the cases where E differs from B. Given two points E_{1,t} and E_{2,t}, one on each stroke envelope, two relations may be used to compute the point l_t of the path which fits E_{1,t} and E_{2,t} (Fig. 3):

Relation 1. (a) If the tangents t_1 and t_2 are defined at E_{1,t} and E_{2,t}, then the two straight lines which are perpendicular to t_i and contain E_{i,t}, for i = 1, 2, cross at F_t, and |E_{1,t} F_t| = |E_{2,t} F_t|. (b) l_t = F_t.

Relation 2. Let H_t be the center of [E_{1,t} E_{2,t}]; then [H_t l_t] is perpendicular to [E_{1,t} E_{2,t}], and |H_t l_t| = r_t r'_t / v_t.
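A numerical illustration of the envelopes of Eq. (1) follows; derivatives are approximated by central differences, which is an implementation convenience rather than part of the model.

```python
import numpy as np

def envelopes(l, r, ts, h=1e-4):
    """Envelope points E1, E2 of the disc family (l(t), r(t)), Eq. (1).
    l maps t to an (x, y) array and r maps t to a radius."""
    out = []
    for t in ts:
        dl = (l(t + h) - l(t - h)) / (2 * h)       # l'(t)
        dr = (r(t + h) - r(t - h)) / (2 * h)       # r'(t)
        v = np.hypot(*dl)
        perp = np.array([-dl[1], dl[0]])           # l'(t) rotated by 90 degrees
        root = np.sqrt(max(1.0 - (dr / v) ** 2, 0.0))
        base = l(t) - (r(t) * dr / v ** 2) * dl
        out.append((base - (r(t) / v) * root * perp,
                    base + (r(t) / v) * root * perp))
    return out

# toy stroke: straight path with slowly growing thickness
path = lambda t: np.array([10.0 * t, 0.0])
rad = lambda t: 1.0 + 0.2 * t
e1, e2 = envelopes(path, rad, [0.5])[0]
print(e1, e2)
```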
Estimating the path link given the envelopes is the same as matching the first envelope with the second; one point E_{1,t} is fitted to the point of E_2 at the same parameter value t using Relation 1(a). However, B and E do not always coincide, especially when the curvature of the stroke path is high: in this case, one of the envelopes may be strictly included in G. In such cases we cannot fit the two boundaries of the stroke to estimate the path link. We introduce the following definition to describe this case:
Fig. 3. The envelopes of the stroke are hidden between B1 and B4, and l.s.i. between B2 and B3.
Definition 1. The envelopes of a stroke are locally strictly included in G (l.s.i.) at the parameter value t if \exists \delta > 0 such that \forall \epsilon, |\epsilon| < \delta: d_i(t,\epsilon) < 0 for i = 1 or 2, with d_i(t,\epsilon) = \|E_{i,t} - l(t+\epsilon)\| - r(t+\epsilon).

Even if the envelopes are nowhere l.s.i., this is not sufficient to force their inclusion in the boundary of G, since the envelopes may be hidden by another part of the stroke trace, or by another stroke. This definition, however, provides a simple and local way to decide whether the two envelopes of a segment of stroke are "visible", i.e. whether B is included, and then to decide whether it is possible to fit both boundaries. Let \rho be the curvature of the stroke path. It is easy to show, with a Taylor expansion, that the envelopes are l.s.i. if

1 - \frac{\rho r}{\sqrt{1 - r'^2/v^2}} - \frac{r'^2}{v^2} + \frac{2 v' r r'}{v^3} - \frac{r \ddot{r}}{v^2} < 0.   (2)

In practice, we use an arc-length parameterization of the path, so v(t) = 1, and we assume \ddot{r} = 0. The relation becomes

1 - \rho r (1 - r'^2)^{-1/2} - r'^2 < 0.   (3)
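Under these simplifications, relation (3) reduces to a one-line test. The sketch below follows our reading of the reconstructed inequality (so its exact form should be checked against the original paper); for a constant radius it degenerates to the classical condition rho*r > 1.

```python
import numpy as np

def is_lsi(rho, r, dr):
    """Test of relation (3) for an arc-length parameterised path
    (v = 1, no second derivative of r): True when the envelope is
    locally strictly included in the trace."""
    return 1.0 - rho * r / np.sqrt(1.0 - dr ** 2) - dr ** 2 < 0.0

print(is_lsi(rho=0.5, r=1.0, dr=0.0))  # False: envelope visible
print(is_lsi(rho=1.5, r=1.0, dr=0.0))  # True: envelope swallowed by the trace
```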
2.2. First algorithm

Our main hypothesis is that handwritten character images are a union of noisy stroke traces. The goal of this first algorithm is to divide the trace L of a letter into two parts A and B. The first set A may be described by segments of strokes in which both envelopes are included in their boundaries, and B = L - A. Given a black-and-white letter image, our first step is to smooth the local stroke boundaries using classical masks, and to link the edge points of the image, which
Fig. 4. Estimation of the stroke path for a "visible" stroke.
correspond to discrete and noisy sequences of points of the stroke boundaries. Let A1 and B1 be two points of the same boundary, at a distance from each other equal to h, the half average thickness of the letter's strokes. The algorithm tests the following hypothesis H_v: (A1, B1) is the end of a segment with visible envelopes. If the test is positive, the algorithm estimates the corresponding edge points A2 and B2 of the second stroke boundary. First, we estimate the two tangents t_1 and t_2 at the points A1 and B1, using their neighboring edge points. Let FA (resp. FB) be the first edge point that crosses the half-line which contains A1 (resp. B1) and is perpendicular to t_1 (resp. t_2) (see Fig. 4(I)). These two points are used to obtain, from a neighborhood S of FA and FB, a set of positions for A2 and B2. For each couple of points (A2, B2) of S, which defines a segment of stroke T = [A1, B1; A2, B2], the algorithm tests whether T is l.s.i., by testing whether relation (3) is verified. We estimate \rho, r and r' using the centers (C1, C2) of both segments of edge points (Fig. 4(II)). We arbitrate among the possible couples of points which verify condition (3) by choosing the one which best verifies Relation 1(a). A cost function is defined which depends on the angles between the tangents at the four edge points; a Gaussian noise is introduced to deal with the error in the estimation of these angles, and the cost function is then defined as the quadratic distance between the expectation of the four angles and their estimates. This process is used to link, step by step, the two boundaries of each stroke with visible envelopes: given the two ends (B1, B2) of the current small segment of stroke, let D1 be an edge point of the right neighborhood of B2, at a distance h from B2, and SD be the set of left neighboring
edge points of B2, at a distance smaller than h from B2 (Fig. 4(IV)). The algorithm follows the process described above to fit in SD the best corresponding end D2.

2.2.1. End of stroke

Strokes end in three cases:
1. When r' is too high on a small segment, and thus is not compatible with the definition of a stroke. This happens when the stroke ends at a crossing, like a T-junction for example.
2. When the test is negative for all the points of S. This corresponds to a segment of stroke whose curvature is high.
3. When the boundaries meet each other: the stroke stops naturally at its end.

Finally we obtain, for each letter, one part described by sequences of linked edge points, which form simple strokes, that is, strokes with visible envelopes, and another part which corresponds to the traces of stroke crossings and to small segments of strokes with high curvature. Fig. 6(I) shows results of this first algorithm for images of the training set of the NIST form-based handprint recognition system.
3. Models for stroke crossings

3.1. Motivation and hypotheses

The results obtained by this first algorithm are similar to those of vectorial thinning algorithms: letters are described as graphs of strokes, and the parts of letters where strokes cross are not described, apart from the connecting strokes.
The main disadvantages of such results are the description of closed loops as single strokes, and the variation of the graph of curves for letters of the same class; therefore the study and the parameter estimation of the deformation of such graphs to simulate handwritten letters are difficult and inefficient. Let us take the following examples to clarify both problems: Fig. 5 shows four examples of the letter "a" and the corresponding thinning results obtained with a classical parallel thinning algorithm (the AFP3 algorithm described by Guo and Hall [2]). As usual, the lines obtained are often bad estimations of the overlapping stroke paths. Fig. 6(I) shows the corresponding results obtained by our first algorithm, which can be described as graphs of strokes: the graph nodes are strokes and "crossing areas", and a graph structure connects nodes that join each other (Fig. 6(II)). For these four examples of the same class of letters, we obtain four different graph structures, whereas a characterization of these examples using stable graphs of curves is possible, if we are able to estimate stroke paths that cross each other, and if we segment paths to obtain smooth lines (Fig. 6(III)). In such descriptions, called graphs of curves, which are close to those of on-line handwriting characters, crossings are "implicit" and do not increase the variation of graph structures by segmenting smooth strokes: graph structures may be reduced to the number of curves. Our goal is then to deal with crossings and overlapping strokes, to decrease the variation of graph structures. Our approach is first to use a parametric model of strokes in order to merge strokes that are segmented by a crossing, and then to determine, among the different configurations of the crossing, the one which best matches the observations.
Fig. 5. Examples of results with classical thinning algorithms (the AFP3 algorithm described by Guo and Hall [2]).
Fig. 6. Graphs of strokes without crossing estimation (I and II), and graphs of curves with crossing identification (III).
Three main hypotheses are made to obtain an efficient algorithm of crossing reconstruction:

h1. The curvature of strokes has a low value at a crossing.
h2. A branch of a stroke may merge with at most two other branches.
h3. The algorithm deals with crossings with fewer than five branches.

A branch of a crossing Cr is a stroke which ends in Cr. The first hypothesis is used to merge two branches A, B of a crossing into a stroke S only if S can be described as one smooth stroke; S is then called the link between A and B. Fig. 7 shows a crossing with four branches on a letter "k", and three different ways to reconstruct the stroke paths in this crossing, which correspond to three different descriptions of (A, B): as two strokes (Fig. 7(a)), as one stroke whose curvature is high at its center (Fig. 7(b)), and as a loop (Fig. 7(c)). The observations on this image are not sufficient to choose among these three solutions but, if we describe the three solutions using graphs of curves, we obtain three similar graphs with three curves (Fig. 7(II)). Thus, in general, links that form smooth strokes are the only ones that are useful and easy to estimate. The second hypothesis is made to deal with closed loops, and the last one is adopted because, in the NIST database we use, the number of crossings with more than four branches is very low and, conversely, the number of configurations of such crossings is too high to test each of them.
3.2. Parametric model of strokes

A crossing Cr is a set of overlapping strokes. The trace of a crossing is described by the branches of the crossing, the parts of strokes that do not overlap, which are connected to the "crossing area", the center of the crossing, which is not described as a stroke. The first algorithm does not use a parametric model to define strokes, which are defined as sequences of edge points. To reconstruct smooth strokes whose edges are hidden by a crossing, we introduce the following parametric models for the path and the thickness of a stroke D(l, r):

l(t): \quad x(t) = f(t) = \sum_i a_i t^i, \qquad y(t) = g(t) = \sum_i b_i t^i,   (4)

and, if L is the length of the line of the stroke and \bar{p} the average thickness on the letter,

r(t) = c_0 + c_1 t + c_2 \exp(-t L/\bar{p}) + c_3 \exp((t-1) L/\bar{p}) = \sum_q c_q h_q(t),   (5)

with 0 \le t \le 1. We denote such a stroke S(\theta), with \theta = (a, b, c). The observations on the trace of a stroke are defined by two sequences of edge points. Let (X_l)_n and (X_r)_n be the sequences of these points.
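The parametric model of Eqs. (4)-(5) can be evaluated as below; the cubic degree of the path polynomials and the sample values of L and the average thickness are assumptions made for the example.

```python
import numpy as np

def stroke(theta, ts, L=20.0, p=2.0):
    """Evaluate S(theta): polynomial path (a, b) of Eq. (4) and thickness
    expanded on the basis (1, t, exp(-tL/p), exp((t-1)L/p)) of Eq. (5)."""
    a, b, c = theta
    ts = np.asarray(ts, float)
    x = np.polyval(a[::-1], ts)  # f(t) = sum_i a_i t^i
    y = np.polyval(b[::-1], ts)  # g(t) = sum_i b_i t^i
    basis = np.stack([np.ones_like(ts), ts,
                      np.exp(-ts * L / p), np.exp((ts - 1.0) * L / p)])
    return x, y, c @ basis       # r(t) = sum_q c_q h_q(t)

theta = (np.array([0.0, 10.0, 0.0, 0.0]),   # x(t) = 10 t
         np.array([0.0, 0.0, 4.0, 0.0]),    # y(t) = 4 t^2
         np.array([1.2, 0.0, 0.3, 0.3]))    # slightly thicker near both ends
print(stroke(theta, [0.0, 0.5, 1.0])[2])
```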
Fig. 7. Three different ways to reconstruct lines in a crossing, which correspond to three similar graphs of curves.
Noise on the edge points depends on different factors such as pen or pencil type, paper and ink quality, and scanning accuracy. We use a Gaussian, additive and independent noise on each edge point to model such variations. Let \{k_1, \ldots, k_n\} be the abscissae of the sequences of edge points. We have supposed that stroke paths are smooth, so we can assume that the envelope is included in the stroke edge. Then, let t and n be the unit tangent and normal vectors of the stroke path, and (\nu_r, \nu_l, \mu_r, \mu_l) four n-dimensional Gaussian vectors; with k = k_1, \ldots, k_n:

X_r(k) = E_r(k) + \nu_{r,k}\, n(k) + \mu_{r,k}\, t(k),
X_l(k) = E_l(k) + \nu_{l,k}\, n(k) + \mu_{l,k}\, t(k),   (6)

with (E_r, E_l) the two sequences of points of the envelopes (Fig. 8):

E_r(k) = l(k) - \frac{r'(k) r(k)}{v(k)}\, t(k) - r(k)\, n(k) \sqrt{1 - \frac{r'(k)^2}{v(k)^2}},
E_l(k) = l(k) - \frac{r'(k) r(k)}{v(k)}\, t(k) + r(k)\, n(k) \sqrt{1 - \frac{r'(k)^2}{v(k)^2}}.   (7)
we assume that c(i, j)"1 corresponds to a link between branches i and j of a crossing with m branches, and c(i, j)"0 corresponds to an absence of such link between both branches, then 3 is the set of crossing conK xgurations: according to the main hypothesis, 3 K describes the di!erent reconstructions of crossings with m branches. For example, the matrix c (8) corres ponds to a crossing with three branches A, B, C, parameterized by h } , the parameters of the link S(h ), and ! ! by h , the parameters of the stroke S(h ) whose trace is A, and the matrix c corresponds to a three-branch cross ing parametrized by (h } , h } ), the parameters of the ! two links S(h } ) and S(h } ) between the "rst branch and ! the two others: this con"guration corresponds to a closed loop.
0 0 0
c " 0 0 1 , 0 1 0 (7)
3.3. Topology of crossings Let 3 be the set of m;m binary (0 or 1) symmetric K matrix c that verify: ∀j, c(i, j))2 and c(i, i)"0 ∀ . If G G
0 1 1
c " 1 0 0 . 1 0 0
(8)
Fig. 9 shows the set \Gamma_3. Let C_m be the group of cyclic permutations on sets with m elements. \Gamma_m / C_m is the set of types of crossings, i.e. the set of configurations modulo cyclic permutations on the branches.
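The configuration set of Section 3.3 is small enough to enumerate directly, as the following sketch does; for m = 3 it recovers the 8 configurations of Fig. 9.

```python
import itertools
import numpy as np

def configurations(m):
    """All m x m symmetric 0/1 matrices with zero diagonal and row sums
    at most 2 (each branch belongs to at most two links)."""
    pairs = list(itertools.combinations(range(m), 2))
    out = []
    for bits in itertools.product([0, 1], repeat=len(pairs)):
        c = np.zeros((m, m), int)
        for (i, j), bit in zip(pairs, bits):
            c[i, j] = c[j, i] = bit
        if (c.sum(axis=0) <= 2).all():
            out.append(c)
    return out

print(len(configurations(3)))  # -> 8, matching Fig. 9
```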
Fig. 8. (I) Sequences of points: the observations on a 4-branch crossing. (II) Affine interpolation of the abscissae of the edge points for the merging of two branches.
Fig. 9. The 8 con"gurations for a crossing with 3 branches.
We assume that configurations of the same type have the same probability of appearance, as we choose not to take account of the orientation of strokes in our a priori knowledge of the configuration distribution. Fig. 10 shows the set of types of \Gamma_4. Finally, a crossing Cr with m branches, whose configuration is c \in \Gamma_m, is parameterized by \theta_c = \{\theta_{kl} \mid c(k,l) \ne 0\} \cup \{\theta_k \mid \sum_l c(k,l) = 0\}, the parameters of the links S(\theta_{kl}) and of the branches S(\theta_k) with no links on Cr. Then the likelihood of the observations (X_i)_{i \le 2r} is defined as follows: define for each point X_i the point E_i on the trajectories of S(\theta) which minimizes the difference between the thickness r at this point and \|E_i - X_i\|. Let R_i = \|E_i - X_i\| - r_i. We assume that (R) is a sufficient statistic for (X), and that R \sim N(0, \sigma^2 \cdot Id), where \sigma^2, the variance of the noise on the edges of strokes, is estimated on the image of the letter.
Fig. 10. The 12 types of configurations for a crossing with 4 branches.
4. Estimation of crossing configurations

In this section, we detail the principles of our second algorithm, the algorithm of crossing estimation. Let Cr be a stroke crossing with m branches. We assume, under the hypothesis of the existence of a link S(\theta_{A-B}) between two branches A and B, that \theta_{A-B} is independent of the other links and of the crossing branches. Thus, in a first step, the second algorithm makes a least-squares estimation of all the possible links between two branches of Cr. In a second step, the algorithm computes for each configuration
an estimation of the maximum a posteriori loglikelihood of the observations, and finally chooses the one which maximizes it.

4.1. Two-branch merging

Let (A, B) be two branches of a crossing, and H_{A-B} the hypothesis that (A, B) merge into one stroke S(\theta_{A-B}) whose path curvature remains at a low value. Given the sequences of edge points (Ar, Al) and (Br, Bl) which describe the two branches, let k = \{k_1, \ldots, k_{n_1}, k_{n_1+1}, \ldots, k_{n_1+n_2}\} be the abscissae of these sequences under H_{A-B}. (Ar, Al) and (Br, Bl) are sequences of noisy observations of points of the two envelopes of S_{A-B}. Let (C_r, C_l) be the sequences (A_r, A_l) \cup (B_r, B_l): C_r(k_i) = A_r(k_i) 1_{i \le n_1} + B_l(k_i) 1_{i > n_1}. The estimator \hat\theta_{A-B} is computed with the parametric model of strokes, using estimates of the abscissae and tangents, and a simplification of formula (7). First, we use the main hypothesis h1 to make an affine interpolation (\hat{k}) of the abscissae (Fig. 8(II)). Unit tangent and normal vectors are then locally estimated using this affine interpolation of the path; we denote these estimators t_i and n_i at abscissa i. Furthermore, we assume that r'/v \ll 1, so as to neglect the term (r'(k_i) r(k_i)/v(k_i))\, t(k_i), and we simplify
r(k_i)\, n(k_i) \sqrt{1 - r'(k_i)^2/v(k_i)^2} into r(k_i)\, n_i. Finally, we obtain

C_r(k_i) = l(k_i) - r(k_i)\, n_i + \nu_{r,i}\, n_i + \mu_{r,i}\, t_i,
C_l(k_i) = l(k_i) + r(k_i)\, n_i + \nu_{l,i}\, n_i + \mu_{l,i}\, t_i.   (9)

Then a least-squares linear estimation \hat\theta_{A-B} of the parameters of S_{A-B} is obtained from the linear system (9). Fig. 12 shows some examples of strokes estimated using this process.

Remark. The efficiency of the affine estimation of the abscissae depends on the main hypothesis h1: without it, such linear estimations would not be acceptable.

If one uses this algorithm to merge a branch A with another branch B whose thickness is higher, for instance when B is the trace of a closed loop, then the result is often wrong because one of the edges of S_{A-B} is hidden in the trace B (Fig. 11). So in such a case, we introduce two sub-hypotheses H_{A-B}^r and H_{A-B}^l:

1. H_{A-B}^r: only the right edge of S_{A-B} is visible in the trace of the branch B.
2. H_{A-B}^l: only the left edge of S_{A-B} is visible in the trace of the branch B.
Fig. 11. Examples of the merging of a simple stroke with a closed loop (a-c), and the final estimation of the crossing (d). (a) and (b) correspond to false hypotheses, and (c) is the true one.
Under these two hypotheses, we assume that the thickness of S_{A-B} is constant, in order to estimate the hidden edge and to obtain least-squares estimations \hat\theta_{A-B}^r and \hat\theta_{A-B}^l (Fig. 11). We obtain, for each hypothesis of branch merging, a quick and linear estimation of the corresponding strokes. Using the same process to estimate the parameters \theta_k of each branch of the crossing, an estimation \hat\theta_c of the parameters of each configuration c is obtained (Fig. 12). The last step of the algorithm is to identify, within the set \Gamma_m, the configuration which best fits the observations.

4.2. Crossing identification

Let (X_i)_{i \le 2r} be the observations on a crossing Cr with m branches, and let n = 2r be the dimension of the observation on the crossing. Let \Theta be the sample space of \hat\theta(X) = \{\hat\theta_{kl}, k, l \in \{1, \ldots, q\}\}, where q is the cardinal of \Gamma_m. We use a Bayesian approach to identify the crossing configuration. We define a decision function d which selects an estimate \hat{c} = d(X, \hat\theta(X)) of the configuration of Cr. To construct d, we introduce a loss function C: \Gamma \times \Gamma \to [0,1] which measures the cost introduced by the identification of a configuration c by \hat{c}, and we take
the function which minimizes the expected risk R:

R(d, X, \hat\theta(X)) = \sum_{c \in \Gamma} P(c \mid X, \hat\theta(X)) \cdot C(c, \hat{c})   (10)

= \sum_{c \in \Gamma} \frac{p(X \mid \hat\theta(X), c)\, P_m(c)}{p(X, \hat\theta(X))} \cdot C(c, \hat{c}),   (11)

R(d, X, \hat\theta) = \sum_{c \in \Gamma} \frac{p(X \mid \hat\theta(X), c)\, p(\hat\theta(X) \mid c)\, P_m(c)}{p(X, \hat\theta(X))} \cdot C(c, \hat{c}).   (12)

The function C we use is such that C(c, \hat{c}) = 0 if c = \hat{c} and C(c, \hat{c}) = 1 if c \ne \hat{c}, so the choice of \hat{c} becomes the maximum a posteriori likelihood (MAP) estimator:

\hat{c} = \arg\max_c L(X, \hat\theta, c)   (13)

= \arg\max_c \log p(X \mid \hat\theta, c) + (\log p(\hat\theta \mid c) + \log P_m(c)),   (14)

which can be split into two terms: one for the loglikelihood of the observations, and one for the a priori log-density of the parameter estimators.

4.2.1. A priori distribution of the parameters

In the context described at the beginning of this article, no information about the class of the image of the letter being processed is available. So we introduce a sufficient statistic of the parameters \theta which does not depend on
Fig. 12. The six different simple links between branches for a crossing with 4 branches.
Fig. 13. A false link between two branches on a crossing with 4 branches.
Fig. 14. Examples of estimations of stroke paths for images of letters of a NIST database.
the orientation of the stroke but depends only on the intrinsic shape of the stroke, that is, on a function of the curvature. For each stroke S_{kl}, which merges one branch k with another branch l, there are two options: either this stroke is a real one, or this stroke does not exist; and we take a statistic, a function of the curvature, whose densities are different enough to discriminate between the two hypotheses. For example, functions like the mean of the curvature are not suitable because, even if the link does not exist, the mean of the curvature of the line may have a low value (Fig. 13). Let L be the length of a stroke. The statistic we use is

\Delta\rho(\theta) = \int_0^L |\rho'(s)|\, ds.   (15)

Thus \Delta\rho(\theta) takes the curvature variations into account. As we assume that the density of \Delta\rho depends only on whether the stroke exists or not, an empirical estimation of the density of \Delta\rho for both hypotheses has been done on 200 crossings of the database images. A parametric estimation of the law of these two empirical densities was computed using a gamma law \gamma(a_1, b_1) with a_1 = 0.68, b_1 = 2.15 when the connection really exists, and a law \gamma(a_0, b_0) with a_0 = 1.51, b_0 = 0.01 when the link is missing. Let g(\cdot \mid 0) be this density when the link does not exist, and g(\cdot \mid 1) when the link exists. Eventually, the a priori density of \Delta\rho(\hat\theta) is
p(\Delta\rho(\hat{h}) \mid c) = \prod_{k \ne l} g(\Delta\rho(\hat{h}_{kl}) \mid c(k, l)).    (16)

The a priori distribution $P_m(c)$ of the configuration is estimated on a test set of letters of the database, under the hypothesis that the a priori probability of a configuration depends only on the type of the configuration.

4.2.2. Loglikelihood estimation

To compute the loglikelihood $\log p(X \mid c, \hat{h})$, we have to estimate the series $(R_i)$ of errors on the thickness at the edge points for each configuration $c$. When $X_i$ lies on a branch of the crossing that is described with only one stroke $S$, the least-squares estimation of the parameters of $S$ directly yields the error $R_i$ of the thickness at the edge point $X_i$. When two strokes overlap on a branch, or when $X_i$ is an edge point of the central area of the crossing, the algorithm chooses for $X_i$ the minimum of the errors over all the strokes of $c$. In practice, however, the series $R$ is not computed for each configuration and subconfiguration. A minimal difference $R_i^{(k,l)}$ is computed for each edge point $X_i$ and for each stroke $S_{\hat{h}_{kl}}$, and for each configuration we take $R_i = \min(R_i^{(k,l)} \mid c(k, l) = 1)$. A problem appears when one follows this process: the dimension of the parameters which describe a crossing depends on the type of configuration, i.e. on the number of strokes of the configuration. So the law of
$R = \sum_{i=1}^{2r} R_i^2$, which is a sufficient statistic for the loglikelihood, depends on the configuration: a bias appears that gives an advantage to the configurations whose number of parameters is higher. So we use a loglikelihood that is penalized by a penalty term $\mathrm{pen}(\dim(X), \dim(c))$ to take this bias into account; it is described in the appendix. Finally, the algorithm computes, for each configuration and possible subconfiguration $c$, an estimation of the maximum of the penalized a posteriori loglikelihood:

L^C(X, \hat{h}, c) = \log p(X \mid \hat{h}, c) - \mathrm{pen}(\dim(X), \dim(c)) + \log p(\Delta\rho(\hat{h}) \mid c) + \log P_m(c)    (17)
                  = \log p(R \mid \hat{h}, c) - \mathrm{pen}(\dim(X), \dim(c)) + \log P_m(c) + \sum_{k \ne l} \log g(\Delta\rho(\hat{h}_{kl}) \mid c(k, l)).    (18)

The estimator of the configuration becomes:

\hat{c} = \arg\max_{c \in \mathcal{C}_m} L^C(X, \hat{h}, c),    (19)

and the crossing is described by the strokes corresponding to $\hat{c}$.
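The resulting decision rule (Eqs. (15)-(19)) is compact enough to sketch in code. The fragment below is an illustration only, not the author's implementation: the configuration encoding, the penalty callable, and the shape-scale reading of the gamma parameters $\gamma(a, b)$ are assumptions made here.

```python
import numpy as np
from scipy.stats import gamma

# Empirical gamma laws of the curvature statistic (Eq. (15)), with the
# parameters reported in the text; a shape-scale parameterization is assumed.
G_LINK    = gamma(a=0.68, scale=2.15)   # g(. | 1): the link really exists
G_NO_LINK = gamma(a=1.51, scale=0.01)   # g(. | 0): the link is missing

def penalized_map_configuration(configs, residuals, delta_rho, prior, pen):
    """Select the configuration maximizing the penalized loglikelihood (Eqs. (17)-(19)).

    configs   : list of configurations; c['links'][(k, l)] is 1 if stroke S_kl exists
    residuals : residuals[c['id']] -> thickness errors R_i under configuration c
    delta_rho : delta_rho[(k, l)] -> curvature statistic of the stroke merging k and l
    prior     : prior[c['id']] -> a priori probability P_m(c) of the configuration type
    pen       : pen(dim_X, dim_c) -> penalty compensating the dimension bias
    """
    best_c, best_score = None, -np.inf
    for c in configs:
        R = residuals[c['id']]
        # Gaussian loglikelihood of the thickness errors (statistic sum of R_i^2),
        # penalized by the dimension of the configuration (Eq. (17)).
        score = -np.sum(R ** 2) - pen(len(R), c['dim']) + np.log(prior[c['id']])
        # A priori logdensity of the curvature statistics (Eqs. (16) and (18)).
        for (k, l), linked in c['links'].items():
            law = G_LINK if linked else G_NO_LINK
            score += law.logpdf(delta_rho[(k, l)])
        if score > best_score:
            best_c, best_score = c, score
    return best_c
```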
5. Results

The estimation of the a priori distribution of the crossing types has been done on 520 images of letters (20 letters for each class) of the NIST database. The following tables summarize the distribution of the number of crossings per letter, the number of branches per crossing, and the distribution of crossing types for crossings with 3 and 4 branches.
Crossings per letter          0      1      2      3+
Number of observed letters    162    291    62     5
%                             31%    56%    12%    1%
Number of branches            3        4        5
Number of observed crossings  354      75       2
%                             82.1%    17.4%    0.5%

Types (3 branches)            T3.1    T3.2     T3.3     others
Number of strokes             3       2        2        *
Number of configurations      1       3        3        *
Number of observed crossings  6       97       245      6
%                             1.7%    27.4%    69.2%    1.7%
Types (4 branches)            T4.1    T4.2    T4.3    T4.4     T4.5    T4.6    T4.10    others
Number of strokes             4       3       3       2        2       3       3        *
Number of configurations      1       4       2       1        2       8       4+4      *
Number of observed crossings  0       7       1       56       2       5       2        2
%                             0%      9.3%    1.3%    74.6%    2.7%    6.7%    2.7%     2.7%

The last column ('others') corresponds to crossings where one branch is the mark of three or more strokes, which is not described by our model. Fig. 14 shows 24 examples of letters for which the crossing identification is correct. The last four letters of this figure show examples of wrong results, which happen in about 10% of the letters. The image (a) shows a letter with a 4-branch crossing and a 5-branch crossing that are not reconstructed. The letters (b) and (c) show the main factor of errors: when the curvature of one branch of a closed loop varies strongly, the corresponding link between the branch and the closed loop may not be reconstructed. In order to illustrate the importance of a priori knowledge for estimating strokes, the last example (d) shows a crossing identification that does not correspond to the true line of the letter, but would be acceptable without any knowledge of the class of the letter. Such a result indicates that our method will not be useful for images of words without an a priori knowledge of the graph of curves to be estimated conditionally on the class of letters.
6. Conclusion

We have described an algorithm for reconstructing the stroke paths of off-line handwritten characters, which deals with the case of stroke crossings. A simple parametric model of a stroke has been introduced. The method is based on the properties of the edges of this model, and on a statistical study of the stroke crossings of handwritten characters. With this method, handwritten characters are described by simple graphs of curves that are similar to the on-line description of characters. Therefore, unlike with the description obtained by classical thinning algorithms, an adaptation of on-line handwriting recognition methods becomes possible, and an efficient model for the law of graphs of curves can be used.

Appendix A

In this section, we first analyze the bias of the loglikelihood estimator in a simple case; then we generalize the results to the case of stroke crossings.

Let $X$ be a random vector on $\mathbb{R}^{2n}$ which verifies, under the hypothesis $H_0$:

X(i) = \sum_{k=0}^{d-1} a_k \left(\frac{i}{2n}\right)^k + m_i, \quad i = 1, \ldots, 2n,    (20)

and, under the hypothesis $H_1$:

X(i) = \sum_{k=0}^{d-1} b_k \left(\frac{i}{n}\right)^k \mathbf{1}_{\{i \le n\}} + \sum_{k=0}^{d-1} b_{k+d} \left(\frac{i-n}{n}\right)^k \mathbf{1}_{\{i > n\}} + m_i, \quad i = 1, \ldots, 2n,    (21)

with $m \sim N(0, I_{2n})$. Given an observation $x$ of this random vector, we would like to identify the true hypothesis and estimate the corresponding unknown parameters $A = (a_0, \ldots, a_{d-1})^\top$ or $B = (b_0, \ldots, b_{2d-1})^\top$. We can recognize, with simplifications, the two hypotheses between two branches of a crossing: the two branches are linked and form one stroke ($H_0$), or they are the mark of two different strokes ($H_1$). With the notations $T_{2n} = ((i/2n)^k)_{i,k}$, $T_n = ((i/n)^k)_{i,k}$ and $\tilde{T}_{2n} = \mathrm{diag}(T_n, T_n)$, the two models can be rewritten as:

H_0 : X = T_{2n} A + m,    (22)
H_1 : X = \tilde{T}_{2n} B + m.    (23)
For each hypothesis, the estimator of the maximum likelihood for $A$ and $B$ corresponds to the least-squares estimator:

\hat{A} = (T_{2n}^\top T_{2n})^{-1} T_{2n}^\top X,    (24)
\hat{B} = (\tilde{T}_{2n}^\top \tilde{T}_{2n})^{-1} \tilde{T}_{2n}^\top X.    (25)
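For concreteness, the two design matrices and the estimators (24)-(25) can be computed in a few lines of numpy. This is a sketch under our reading of the notation: in particular, the block-diagonal form of $\tilde{T}_{2n}$ is assumed.

```python
import numpy as np

def design_matrices(n, d):
    """Design matrices of the one-stroke (H0) and two-stroke (H1) models.

    T  : (2n x d)  matrix with entries (i/2n)^k, a single global polynomial
    Tt : (2n x 2d) block-diagonal matrix, one degree-(d-1) polynomial per half
    """
    i = np.arange(1, 2 * n + 1)
    T = np.vander(i / (2.0 * n), N=d, increasing=True)
    Tn = np.vander(np.arange(1, n + 1) / float(n), N=d, increasing=True)
    Tt = np.block([[Tn, np.zeros((n, d))],
                   [np.zeros((n, d)), Tn]])
    return T, Tt

def ls_estimates(X, T, Tt):
    """Least-squares estimators of Eqs. (24) and (25)."""
    A_hat = np.linalg.lstsq(T, X, rcond=None)[0]
    B_hat = np.linalg.lstsq(Tt, X, rcond=None)[0]
    return A_hat, B_hat
```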
Thus the estimators of the maximum of the loglikelihood for $H_0$ and for $H_1$ are:

H_0 : L_0(X) = -(X - T_{2n}\hat{A})^\top (X - T_{2n}\hat{A}) + \mathrm{cst}(n),    (26)
H_1 : L_1(X) = -(X - \tilde{T}_{2n}\hat{B})^\top (X - \tilde{T}_{2n}\hat{B}) + \mathrm{cst}(n).    (27)
Under $H_0$ we have:

L_0(X)\big|_{H_0} = \mathrm{cst}(n) - (T_{2n}A + m - T_{2n}\hat{A})^\top (T_{2n}A + m - T_{2n}\hat{A})    (28)
                 = \mathrm{cst}(n) - m^\top \Delta_{2n} m,    (29)
with * "(I !¹ (¹H ¹ )\¹H ). Setting * " L L L L L L L I !¹ (¹H ¹ )\¹H , we have, for the statistics of the L L L L L loglikelihood estimator of the hypothesis H under H : ¸ (X) "cst(n)! (¹ A#m)H* (¹ A#m). (30) & L L L Under H , both statistics become: ¸ (X) "cst(n)! (¹ B#m)H* (¹ B#m) (31) & L L L ¸ (X) "cst(n)!mH* m (32) L & Thus the di!erence ""¸ (X)!¸ (X) which is used to identify the hypothesis, is, under H : " "¸ (X) !¸ (X) & & & "! (mH(* !* )m!2(AH¹H * ¹ A (33) L L L L L #AV10H¹H * m)) (34) L L and under H : " "¸ (X) !¸ (X) & & & "! (mH(* !* )m!2(BH¹H * ¹ B (35) L L L L L #BH¹H * m)) (36) L L In both cases, the di!erence is the sum of two terms; the law of the "rst term is !s(d), and the law of the second is a Gaussian N(!2D(n), 4D(n)). Under H , D(n)" AH¹H * ¹ A, and under H , D(n)"BH¹H * ¹ B, L L L L L L but in both cases, D(n) is equal to the quadratic distance between the series of points of the true curve and the ones estimated on the false hypothesis: under H , ¹ A is the L vector of the true points of the curve (without any noise), and ¹ B is the vector of the estimation of these points L under the false hypothesis. Therfore, D(n) "(¹ A!¹ BK )H(¹ A!¹ BK ) & L L L L "AH¹ H * ¹ A. L L L The calculus is the same for H . Under H , D(n)&10\ for the curves observed on the stroke crossings, which is not surprising because the number of parameters of the false models is higher than the number of parameters of the true model. So we can neglect this term and, under this hypothesis, if we compare both loglikelihood, there is a systematic bias on account of H , whose law is ! s(d). Under H (2 curves), the quadratic distance D(n) be tween the true curves and the one estimated with H is a measure of the error between both models. Under H , the bias on account of H is the sum of two terms, whose law are s(d) and N(D(n), D(n)). Thus the choice of a penalty term depends closely on the quadratic distance between both models: if under H D(n) has a low value, which shows that the observa tions are well described by the model with one curve, we have to choose this hypothesis, to favour the model with less parameters. So we turn the hypothesis H into H , "
1159
which is the model H with D(n)"n.D , where D is
the average distance for which under H , H cannot be chosen. Thus let LC (X) be the penalized loglikelihood, G i"0, 1: ¸C (X)"¸ (X)!pen(d, n, D ),
¸C (X)"¸ (X)!pen(2d, n, D ),
where C is such that:
(37) (38)
P(¸C (X)!¸C (X)(0"H ) "P(¸C (X)!¸C (X)(0"H ). (39) The error rates are the same under both hypothesis. Giving n and D , such function pen( ) can be easily
estimated using simulation of the loglikelihood bias. The same process is used to compensate for the loglikelihood bias of the stroke crossing observations, as this simple case corresponds to the model of one edge with two branches. So we just have to make an empirical choice of a `limita crossing for which two branches, which are not linked, can be described as one stroke, to determine D .
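Because $\mathrm{pen}(\cdot)$ is defined implicitly by the equal-error condition (39), it can be estimated by simulation, as the text indicates. Below is a minimal Monte Carlo sketch under stated assumptions: it reuses design_matrices from the previous fragment, generates the $H_1$ truth at the prescribed quadratic distance $n \cdot D_0$, and returns a threshold playing the role of $\mathrm{pen}(2d, n, D_0) - \mathrm{pen}(d, n, D_0)$.

```python
import numpy as np

rng = np.random.default_rng(0)

def loglik_gap(X, T, Tt):
    """L1(X) - L0(X), the difference of the maximized loglikelihoods (Eqs. (26)-(27))."""
    r0 = X - T @ np.linalg.lstsq(T, X, rcond=None)[0]
    r1 = X - Tt @ np.linalg.lstsq(Tt, X, rcond=None)[0]
    return float(r0 @ r0 - r1 @ r1)

def calibrate_threshold(n, d, D0, trials=2000):
    """Estimate t = pen(2d, n, D0) - pen(d, n, D0) equalizing both error rates (Eq. (39))."""
    T, Tt = design_matrices(n, d)            # from the previous sketch
    A = rng.normal(size=d)                   # an arbitrary one-stroke truth
    v = Tt @ rng.normal(size=2 * d)          # a genuine two-stroke component...
    v -= T @ np.linalg.lstsq(T, v, rcond=None)[0]   # ...orthogonal to the H0 model
    v *= np.sqrt(n * D0) / np.linalg.norm(v)        # quadratic distance D(n) = n * D0
    gaps0 = np.array([loglik_gap(T @ A + rng.normal(size=2 * n), T, Tt)
                      for _ in range(trials)])
    gaps1 = np.array([loglik_gap(T @ A + v + rng.normal(size=2 * n), T, Tt)
                      for _ in range(trials)])
    # Decide H1 whenever the gap exceeds t; sweep t to equalize the two error rates.
    ts = np.linspace(gaps1.min(), gaps0.max(), 512)
    err_h0 = np.array([(gaps0 > t).mean() for t in ts])   # wrongly choosing H1 under H0
    err_h1 = np.array([(gaps1 < t).mean() for t in ts])   # wrongly choosing H0 under H1
    return float(ts[np.argmin(np.abs(err_h0 - err_h1))])
```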
About the Author: ERIC L'HOMER received a Ph.D. degree in applied mathematics from the University of Orsay, France, in 1998. Since 1993, he has been with the CMLA, École Normale Supérieure de Cachan. He is currently a research scientist at Paris 13 University. His research interests include pattern recognition, stochastic neural networks and Gaussian mixture distributions.
Pattern Recognition 33 (2000) 1161–1177
On internal representations in face recognition systems Maxim A. Grudin* Miros Inc., 572 Washington Street, Suite 18, Wellesley, MA 02482, USA Received 28 September 1998; received in revised form 4 March 1999; accepted 4 March 1999
Abstract This survey compares internal representations of the recent as well as more traditional face recognition techniques to classify them into several broad categories. The categories assessed include template matching and feature measurements, analysis of global and local facial features, and incorporation of interpersonal and intrapersonal variations of human faces. Analysis of the face recognition systems within those broad categories makes it possible to identify strong and weak sides of each group of methods. The paper argues that a fruitful direction for future research may lie in weighing information about facial features together with localized image features in order to provide a better mechanism for feature selection. 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved. Keywords: Face recognition; Computer vision; Neural networks; Elastic graphs; Multiresolution techniques; Eigenfaces; Wavelets; Template matching; Principal component analysis
1. Introduction

Face recognition may seem an easy task for humans, and yet computerized face recognition systems still cannot achieve completely reliable performance. The difficulties arise due to large variations in facial appearance, head size and orientation, and changes in environmental conditions. Such difficulties make face recognition one of the fundamental problems in pattern analysis. Although computerized recognition of human faces was initiated more than 20 years ago, in the last decade there has been an explosion of scientific interest in this area. However, there is still no widely accepted benchmark for testing the developed systems. Therefore, comparison of different face recognition systems is no easy task. Comparison on the basis of their recognition performance is often misleading, since most of the systems are tested on different facial databases. Other factors that impact the performance are the accuracy of the face location stage and the number of actual face recognition techniques used in each system.
* Corresponding author. Tel.: +781-235-0330x241; fax: +781-235-0720.
E-mail address: [email protected] (M.A. Grudin)
Over the last 10 years, there have been numerous reviews of face recognition techniques, with Samal and Iyengar [1], Valentin et al. [2], Chellappa et al. [3], and a recent issue of the IEEE Transactions on Pattern Analysis and Machine Intelligence [4] among the most prominent. This paper aims to update these previous surveys by reviewing many of the recent developments in this field. In addition, the present paper attempts to establish a set of underlying recognition principles used in the design of each reviewed technique. Thus, this paper differs from previous reviews by helping identify the strengths and weaknesses of each class of techniques as well as outlining a set of general principles that may find applications in future designs. The remainder of this paper is organized as follows: Section 2 addresses the differences between the face recognition techniques that use feature measurements and those using template matching. Section 3 describes recognition of faces using comparison of whole faces rather than local features. It examines techniques of the principal component analysis, neural networks, and flexible templates. Face recognition methods that use localized features are presented in Section 4. Section 5 discusses methods that attempt to improve the recognition performance using intrapersonal variations of localized
features. The discussion and conclusions are presented in Section 6.
2. A comparison between feature- and template-based models

The two traditional classes of techniques applied to the recognition of frontal views of faces are measurements of facial features and template matching. The first technique is based on extraction of the relative positions and other parameters of distinctive features (Fig. 1). Typical geometrical features include (from Brunelli and Poggio [5]):

• eyebrow thickness and vertical position at the eye center position;
• a description of the eyebrows' arches;
• nose vertical position and width;
• mouth vertical position, width, height, upper and lower lips;
• radii describing the chin shape;
• face width at nose position;
• face width halfway between nose tip and eyes.

The measured features must be normalized in order to be independent of the position, scale, and rotation of the face. A set of the above measurements is stored as a feature vector. Once obtained from the input image, the feature vector is compared with an existing database of feature vectors to find the best match (a minimal sketch of this step is given below). Early attempts at feature-based face recognition included the works of Bledsoe [6], Goldstein et al. [7], and Kaya and Kobayashi [8]. Some of the more recent investigations can be found in Refs. [9–12].
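To make the matching step concrete, here is a minimal nearest-neighbour matcher over normalized geometric measurements. The gallery layout and the Euclidean metric are illustrative choices, not a prescription from the works cited above.

```python
import numpy as np

def match_feature_vector(query, gallery):
    """Return the identity whose stored feature vector is closest to the query.

    query   : 1-D array of normalized measurements (eyebrow thickness, nose width, ...)
    gallery : dict mapping identity -> stored feature vector of the same length
    """
    names = list(gallery)
    vectors = np.stack([gallery[name] for name in names])
    distances = np.linalg.norm(vectors - query, axis=1)   # Euclidean comparison
    return names[int(np.argmin(distances))]
```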
In the simplest version of template matching, each person is represented as a database entry whose fields contain a two-dimensional array of pixels, extracted from a digital image of their frontal view. The image must be normalized in a similar fashion to that used in feature-based matching. Recognition is performed by comparison of the unclassified image with all the database images, using correlation as a typical matching function. Basic studies of template-based matching were performed by Baron [13]. A comparison of these two classes of face recognition techniques was performed by Brunelli and Poggio [5]. The feature-based strategy showed a higher recognition speed and smaller memory requirements. However, it was concluded that the template-based technique is superior in recognition ratio. An increase in the number of measurements may improve the recognition performance of the feature-based approach only slightly, because it is very difficult to improve the quality of the measurements. Also, the performance of feature-based matching deteriorates vastly with partial face occlusions and any image degradations, such as camera misfocus. For the above reasons, our review will concentrate on the template-based techniques.
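A bare-bones version of such a matcher might look as follows; normalized cross-correlation stands in for the 'typical matching function' mentioned above, and the alignment of the images is assumed to have been done already.

```python
import numpy as np

def correlation_score(image, template):
    """Normalized cross-correlation between an aligned face image and a template."""
    a = image.astype(float).ravel();  a -= a.mean()
    b = template.astype(float).ravel();  b -= b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def match_template(image, gallery):
    """Classify the image by the highest correlation over all database templates."""
    return max(gallery, key=lambda name: correlation_score(image, gallery[name]))
```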
3. Extraction and analysis of global facial features

Of the two most common directions in face recognition (analysis of global and of local facial features), the analysis of global features presumes a somewhat simpler problem formulation, since it avoids the question of selecting the size of localized features. Instead, images of the whole faces are aligned, typically in order to maximize the correlation between different facial images. In most cases, the alignment is performed with respect to the eye region, which is undisputedly the most discriminating area ([5], and many others). The faces are scaled so that the eyes in all faces correspond to the same physical locations.

3.1. Compact face representation using eigenvectors
Fig. 1. Geometrical measurements used for feature-based recognition (from Brunelli and Poggio [5]).
One of the most well-known transformations applied to facial images in order to extract global features is the Principal Component Analysis [14]. In this approach, a set of faces is represented by a small number of global eigenvectors, which encode the major variations in the input set. Originally, it was applied to faces by Sirovich and Kirby [15], who performed approximate reconstruction of faces in the ensemble using a weighted combination of eigenvectors (eigenpictures), obtained from that ensemble. The weights that characterize the expansion of a given image in terms of eigenpictures are seen as global facial features. In an extension of that work, Kirby and Sirovich [16] included the inherent symmetry of faces in the eigenpictures (Fig. 2).
The latter method produces slight improvements in the reconstruction of faces. Turk and Pentland [17] used eigenfaces for face detection and identification. Fig. 3 shows an average face and the first 15 eigenfaces. Given the eigenfaces, each face is represented as a vector of weights. The weights are obtained by projecting the image onto the eigenface components by a single inner product operation. The identification of the test image is done by locating the database entry whose weights are closest (in Euclidean distance) to the weights of the image. Especially large differences between the image weights and the database entries typically indicate absence of a face in the input image. The authors reported 96% correct classification over lighting variations, 85% over orientation variations and 64% over size variations. The authors conclude that the robust performance of their system under different lighting conditions is caused by a significant correlation between images with differences in illumination (see also Ref. [3]).
Fig. 2. First nine eigenpictures, in order from left to right and top to bottom (from Kirby and Sirovich [16]).
Fig. 3. The average face (top left corner) and eigenfaces (courtesy of A. Pentland).
However, Zhang et al. [18] show that the performance of an eigenface-based technique deteriorates when lighting variations cannot be characterized as 'very small'. Pentland et al. [19] extended the capabilities of their earlier system in several directions. Different structural configurations were considered, some of which included a search for facial features. The system showed a 95% recognition rate on the FERET database, which contains 7562 images of approximately 3000 individuals. Using the system, the database can be interactively searched for images of certain types of people. To achieve orientation invariance, several entries with different head orientations are stored for each individual (see also Ref. [20]).

3.2. Properties of individual eigenvectors

O'Toole et al. [21] studied the relationships between the values of facial eigenvectors and the characteristics of the faces, such as gender and race. It was shown that the information in the weights of the second eigenvector yielded correct race predictions for 88.6% of the faces. Fig. 4 shows (from left to right) the first and the second eigenvectors, u_1 and u_2; the sum of the first and the second eigenvectors, u_1 + u_2, resulting in a male image; and the result of subtracting the second eigenvector from the first one, u_1 - u_2, resulting in a female image. However, the most important information for face discrimination is found in the eigenvectors with smaller eigenvalues (Fig. 5). The eigenfaces with small eigenvalues contain information about higher frequencies, which also contains most of the image noise. Those eigenvectors are usually removed in favor of the eigenvectors with the largest eigenvalues. The eigenvectors with large eigenvalues contain information about lower image frequencies, which carry less discriminative details of a face. The authors [22,23] conclude that the strategy of minimizing the least-squares error is not the best one for the purposes of recognition. However, they do not provide a clear solution for overcoming the limitation of the approach based on eigenfaces, which we believe arises due to the processing of whole faces rather than their constituent parts. In another study, Blackwell et al. [24] claimed that whole-image preprocessing, such as PCA, cannot solve the problems associated with learning large, complicated data sets. A more recent eigenface-based technique is described in Ref. [25]. The authors consider the class-conditional density as the most important object representation to be learned. Their maximum-likelihood mechanism for face location uses two types of density estimates, a multivariate Gaussian for unimodal distributions and a Mixture-of-Gaussians model for multimodal distributions. Knowledge of those densities makes it possible to use a Bayesian framework for face recognition.
Fig. 4. Gender prediction using the second eigenvector (from O'Toole et al. [22]).
Fig. 5. Discrimination power as a function of the range of eigenvector (adopted from O'Toole et al. [22]).
The posterior similarity measure is computed for the two classes corresponding to the intrapersonal and interpersonal variations. Some other approaches that take into account intrapersonal variations are described later in this section and in Section 5.
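A minimal sketch of the eigenface pipeline discussed in this section, assuming aligned and flattened grayscale faces; it follows the projection-and-nearest-neighbour scheme described above rather than any particular published implementation.

```python
import numpy as np

def train_eigenfaces(faces, n_components):
    """faces: (num_images, num_pixels) matrix of aligned, flattened face images."""
    mean = faces.mean(axis=0)
    centered = faces - mean
    # SVD of the centered data; rows of Vt are the eigenfaces.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    eigenfaces = Vt[:n_components]
    weights = centered @ eigenfaces.T        # one weight vector per training face
    return mean, eigenfaces, weights

def recognize(face, mean, eigenfaces, weights, labels):
    """Identify a face by the nearest weight vector; a large distance hints at a non-face."""
    w = (face - mean) @ eigenfaces.T         # project into the face space
    dists = np.linalg.norm(weights - w, axis=1)
    i = int(np.argmin(dists))
    return labels[i], float(dists[i])
```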
Fig. 6. A three-layer backpropagation network.
3.3. Neural networks for face recognition

A number of research and commercial face recognition systems use neural networks. The variety of neural-network techniques used to recognize faces is enormous, which makes it impossible to describe them all in a single survey. In this section, we will consider face recognition using multi-layer perceptrons (MLPs), which have been used by perhaps the largest number of researchers. Examples of applications of other neural architectures to face recognition are covered in Sections 4.2 and 5, and in other surveys, for example Ref. [2]. Originally formulated by Werbos [26], MLPs contain several fully interconnected layers of nonlinear neurons (Fig. 6). The connections between neurons carry weights, whose values determine the pattern space of the training patterns. The connection weights are adjusted by the backpropagation rule, which minimizes the error of the association.
The aim of face processing using the MLP is to develop a compact internal representation of faces, which is equivalent to feature extraction. Therefore, the number L of hidden neurons is smaller than in either the input or output layers. That causes the network to encode inputs in a smaller-dimensional subspace that retains most of the important information. Oja [27] showed that linear hidden neural units span the same space as the same number of principal components with the highest eigenvalues. The major difference between representations using the hidden neurons and those using the principal components is that in the former case the variance is evenly distributed across the hidden units [28]. The simplest way of using an MLP for face recognition is to feed the facial image into the input layer without applying any preprocessing. Trying to retain only the most significant information, Kosugi [29] decreased the resolution of the facial image to 12 × 12 pixels, which were fed into the MLP.
Luebbers et al. [30] decomposed the image into a series of binary images using isodensity regions of the facial image [31]. Vincent et al. [32] trained several MLPs to locate facial features in the images (one network per feature). Five neural networks were used for each of the eye regions and two for the corners of the mouth. Systems that directly associate the pixel information with a high-level syntactic description of an object are extremely sensitive to changes in the image. More sophisticated techniques extract features before feeding them into the network. Goudail et al. [33] used local autocorrelations for regions of 3 × 3 to 11 × 11 pixels. The system proposed by Augusteijn and Skufca [34] used second-order statistical information about textural regions to classify the facial features. The hidden units of the MLP contain information which can be used to classify input images according to their typicality, sex and identity [35]. In another implementation, Golomb et al. [36] use a cascade of two MLPs for gender classification (Fig. 7). Both MLPs consist of three neuronal layers. The compression network is trained to reconstruct faces using a compact representation in the hidden layer. Once the weights in the compression network have reached equilibrium, the values of the hidden neurons are fed into the second MLP, which is trained to associate a person with the person's gender. The method produced 91% correct performance on 10 new faces. In summary, the MLP approach has a representation similar to the approach based on eigenfaces. Lanitis et al. [37] point out that an important difference between these two approaches arises from the fact that the internal representation of the MLP is created during a training stage, which is specific for each particular application.
In the application to face recognition, researchers tend to associate different appearances of a single face with the person's identity. Considering faces as points in the decision space, the network learns to reduce the distance between different appearances of the same person while increasing the distance between faces of different people.

3.4. Recognition using flexible models

Another implementation that uses eigenvectors for face recognition was developed by Lanitis et al. [38]. This approach is based on the use of flexible models, related to those proposed by Yuille et al. [39]. Flexible models consider a facial image as a projection of a 3-D visual object that belongs to a certain class. These models are allowed to translate, rotate and deform to fit the best representation of their shape present in the image. The approach of flexible appearance models [38] consists of two phases: modeling, in which flexible models of facial appearance are generated; and identification, in which these models are applied to classifying images (Fig. 8). Overall, three models are used, describing shape variations, localized intensity profiles, and shape-free gray-level intensities (Section 3.5). Distributions of those parameters are learned during the modeling stage. At the recognition stage, shape parameters and gray-level information are used to compare the face to all the database entries. All the models used in the system have the same mathematical form. They are generated by performing principal component analysis on the training samples. The authors use discriminant analysis [40] to separate intrapersonal and interpersonal variation. Some of the significant modes of shape variation account only for intrapersonal variation. Fig. 9(a) illustrates the effect of the four most significant modes of the intrapersonal variation.
Fig. 7. Architecture of SexNET (adopted from Golomb et al. [36]).
Fig. 8. Block diagram of the face identification system (adopted from Lanitis et al. [38]).
Fig. 9. The effect of: (a) the main modes of the intrapersonal shape variation; (b) the main discriminant modes of the interpersonal shape variation (from Lanitis et al. [38]).
Notice that the first three modes just change the 3-D orientation of the model, highlighting the importance of estimating the correct head orientation. Fig. 9(b) shows the effect of varying the main four discriminant modes of interpersonal variation. Only six discriminant variables were needed to explain 95% of the interpersonal variation. Using those modes of variation, a new image is assigned to the class that minimizes the Mahalanobis distance D_M between the centroid of that class and the calculated appearance parameters. In the practical realization, an active shape model [41] automatically locates a face in a new image. Once the model is fitted, both the discriminant shape variables and the gray-level parameters are measured.
The obtained set of appearance parameters is used to identify the person. A peak recognition performance of 95.5% was achieved on images from the test set of the Manchester Face Database, which contains images with variation in appearance, expression, 3-D head orientation, and scale.

3.5. Shape-free facial models

In their work on deformable facial templates, Lanitis et al. [38] distort facial images in order to achieve the best correspondence between the person's facial features and those of an abstract shape-free face (Fig. 10).
The shape-free representation was originally proposed by Craw and Cameron [42]. They use the assumption that, because the principal component analysis is a linear space model, the faces themselves should form a linear space; that is, the sum or average of two faces should itself be a face. Consequently, Craw [43] performs PCA of faces that are preprocessed by distorting them to the shape-free form. More recently, the shape-free representation has been used by other researchers (for example, Ref. [25]). The novelty of the approach by Lanitis et al. [38] is the utilization of the valuable shape information, which is used as another recognition cue (Section 3.4). The shape-free form as proposed by Craw contains the distribution of gray-scale values. As an alternative, Grudin [44] proposes a shape-free model that contains a distribution of intrapersonal variations, which are related to the high-level facial features. Indeed, corresponding facial features of different people should exhibit similar intrapersonal variations. Humans use such information to identify a person, his/her emotions, and other characteristics. In the computerized recognition of human faces, estimated intrapersonal variations of facial features can be used to select the salient facial features from a single image of a person (Section 5.2.2).
Craw and Cameron [42]. They use the assumption that because the model of the principal component analysis is a linear space model, the faces themselves should form a linear space, that is the sum or average of two faces should itself be a face. Consequently, Craw [43] performs PCA of the faces that are preprocessed by distorting them to the shape-free form. More recently, the shapefree representation is used by other researchers (for example, Ref. [25]). The novelty of the approach by Lanitis et al. [38] is the utilization of the valuable shape information, which is used as another recognition cue (Section 3.4). The shape-free form as proposed by Craw contains the distribution of gray-scale values. As an alternative, Grudin [44] proposes a shape-free model that contains a distribution of intrapersonal variations, which are related to the high-level facial features. Indeed, corresponding facial features of di!erent people should exhibit similar intrapersonal variations. Humans use such information to identify a person, his/her emotions, and other characteristics. In the computerized recognition of human faces, estimated intrapersonal variations of facial features can be used to select the salient facial features from a single image of a person (Section 5.2.2).
4. Analysis of localized features Many approaches that use whole face processing also integrate information about local features. In order to improve recognition performance, Moghaddam et al. [25] consider using eigenfeatures in addition to eigenfaces. The eigenfeatures of the eyes, the nose, and the mouth outperformed eigenfaces for a small number of principal components. In the system developed by Lanitis et al. [38], the local image pro"les provided better recognition cues than the shape parameters. We would like to distinguish between localized image features and localized facial features. Whereas facial features are composed of image features, they also use a priori knowledge about faces. This section describes two approaches that rely on analysis of localized image
One of the biggest problems in utilizing localized features for the face recognition task is to select a subset of features that can reliably discriminate faces in a large number of environments. In order to represent an image using a small number of local decorrelated features, Penev and Atick [45] proposed a technique called Local Feature Analysis (LFA). The LFA produces a low-dimensional representation of visual objects that resembles the representation of the PCA. By enforcing the localization criterion, it becomes impossible to achieve perfect decorrelation between localized components; nonetheless, the reconstruction error of the LFA representation approaches that of the PCA representation. As a result, local features are defined at each point in the image. However, there is still a significant residual correlation between such localized features. In order to reduce the redundancy of the LFA representation, each localized population of image features is represented by a single feature, while the rest of the features are suppressed. The resulting representation is sparse in the sense that the reconstruction of the most essential information in the image can be performed using a few features, which are distributed over the image. Fig. 11 shows the locations of the local features, the value of the topographic kernel and the residual correlation for each of the localized features. The LFA has been applied to face location and recognition. Fig. 12(a) illustrates the utilization of the LFA to locate a face on a light uniform background. The local features used to locate the face are illustrated as dark dots. It is difficult to predict the method's performance in a cluttered scene, since this approach relies on the features located on the face outlines. In order to locate faces in cluttered backgrounds, some other studies use inner facial features (for example, Ref. [46]). Fig. 12(b) illustrates the 25 most prominent localized features that are used to recognize the facial image. Notice that many points are located near the head contour, and are therefore prone to noise due to in-depth head rotations or changes in expression. A change in the head orientation may also affect the relative positions of the localized features that are located within the facial outlines. The authors propose to compensate for this by recovering the 3-D head structure from a set of eigensurfaces. If such a 3-D structure can be recovered, it presents a significant advantage over recognizing faces in 2-D, since the shape is independent of the image formation process.
Fig. 11. (a) Locations of localized features; (b) representation kernel (top row) and residual correlations (bottom row) of the LFA (from Penev and Atick [45]).
Fig. 12. (a) Face location using the LFA; (b) estimation of the most distinguishing features (from Penev and Atick [45]).
Another efficient approach to dealing with changes in expressions and head rotations is described below.

4.2. Elastic graph matching

A problem that is central to the recognition of faces, as well as of all other 3-D objects, is the preservation of the visual topology during the changes that arise due to different object projections or shape deformations. One of the effective solutions to this problem is the Dynamic Link Architecture (DLA) [47]. As designed, the DLA can be applied to the recognition of most visual objects. But this approach has been applied to recognize faces, and therefore it is described here in detail. In a generic implementation, the DLA can be viewed as a regular grid, whose nodes contain a multiresolution description in terms of localized spatial frequencies.
modi"ed Gabor-based wavelets [48]. Those detectors describe the gray-level distribution locally with high precision and more globally with lower precision. The grid nodes are connected with elastic links. Those connections group features into higher-order arrangements, which code for visual objects. The elasticity makes it possible to accommodate object distortions and changes in the viewing projection. A new face is enrolled by manually positioning the grid over the face area (Fig. 13(a)). More than one graph may be stored for one person in order to accommodate di!erent facial appearances. When a test image is presented, it is transformed into a grid of vectors, called an image domain. The image recognition is performed by matching all stored prototypic graphs to the image domain, the goal being minimization of the cost function between individual pairs of nodes. If the prototypic graph(s) of one person matches signi"cantly better than all the other graphs, the face in the image is considered as recognized.
Fig. 13. (a) Initial position and (b) deformation of the elastic graph (images courtesy of Rolf Würtz).
The graph matching is performed in two stages. Initially, the face is located in the image using the non-distorted grid. Once the grid is positioned over the face, its structure is deformed in such a way that each node achieves a minimum of the cost function within a certain vicinity around that node. The final cost of matching is a weighted combination of the cost of matching each node in the grid and of the amount of the grid deformation. The system thus generalizes over moderate changes in size and orientation (Fig. 13(b)). The generality of the DLA has its downsides. Lades et al. [47] claimed that a face could be located using only the lowest frequency band of the prototypic graph. Yet the system is designed to compare all frequency bands each time a match is performed, even during the face location stage. In addition, bundling several different frequency bands into a single vector makes it more difficult to select salient features, since many features allow good differentiation only within a certain range of frequencies. To a large extent, these downsides of the DLA can be eliminated by changing the architecture so that it accommodates intrapersonal variations such as different appearances and lighting conditions [4]. Some of the systems that address those problems are presented in the next section.
5. Accommodation of the intra-class variations of the localized features

One of the difficulties in designing reliable face recognition systems is that different people often look more similar to each other when captured under the same conditions than the same face captured under very different conditions [4].
The problem of dealing with intrapersonal variations is addressed in many recent face recognition techniques. This section presents several techniques that use intrapersonal variations during the feature selection process. We start with a description of a mathematical model of a neural system, followed by a group of methods related to the DLA.

5.1. Dynamically Stable Associative Learning (Dystal)

One of the interesting systems that accommodates intrapersonal variations of local features is based on neurophysiological research. It is a computational model of the mechanisms identified in the marine snail and the rabbit hippocampus [49]. In that research, a network called Dynamically Stable Associative Learning (Dystal) investigates interactions between two inputs, namely the Conditioned Stimulus (CS) and the UnConditioned Stimulus (UCS). A single-layer Dystal network consists of a group of elements referred to as output units, equal in number to the number of components in the UCS vector. Each output unit receives input from a receptive field (a subset of CS inputs) and one (scaled) component of the UCS input vector (Fig. 14). The UCS can be a classification signal, or it can have the same size as the CS input for the purpose of pattern completion. All the patterns learned by Dystal are stored in a set of patches; however, each patch individually stores only a single association between a CS input and its associated UCS component. A patch is composed of: (1) a patch vector, the running average of CS input patterns that share similar UCS values; (2) the running average of similar UCS values; and (3) a weight that reflects the frequency of utilization of the patch. Unlike in other networks, the weight is not used in the computation of the output of a neuronal element.
The weight is used for patch merging and patch deletion. All patches whose weight has decayed to less than the patch retention threshold are removed. Each output neural element computes the similarity function as the correlation factor between the CS pattern and each patch vector. The number and content of the patches are determined during training and are a function of the content of the training sets and of the global network parameters. During the training stage, the most similar patch is modified in order to increase its resemblance to the current CS pattern. As a result, each patch contains the average of all the CS inputs that are similar to that particular patch. As the training progresses, the effect that a novel CS input makes decreases. Dystal has been applied to recognize hand-written postal index digits and hand-printed Japanese Kanji characters [24]. Prior to recognition, the digits were segmented, scaled and rotated to roughly the same orientation. The network was trained by presenting each pattern in the training set once. Dystal correctly classified 98% of previously unseen hand-written digits. When similarly trained to classify Kanji characters, it was able to learn 40 people's handprinting of 160 different characters to 99.8% accuracy. Such an approach might replace optical character recognition by optical word recognition, which is much faster due to the reduced complexity of the segmentation problem. When applied to face recognition, Dystal was able to correctly classify 100% of a small set of faces, which were exposed to variances in expression [50]. The faces need to be scaled and adjusted in their position prior to training and recognition.
Fig. 14. Schematic representation of a single output neuron and its associated patches. Adopted from Alkon et al. [49].
Dystal was trained on four presentations of each face and was later tested on five other presentations. It was able to associate new appearances of the stored faces according to their most salient features. Significant changes in facial expressions did not affect the reconstruction of the original; e.g., changes in the mouth expression were ignored. This illustrates the ability of the network to concentrate on the stable features.
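The patch mechanism can be paraphrased in a few lines; the correlation similarity follows the text, while the creation threshold and the averaging rate below are illustrative stand-ins for the global network parameters.

```python
import numpy as np

def correlate(a, b):
    """Correlation similarity between a CS input and a patch vector."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def present(patches, cs, ucs, create_thr=0.8, rate=0.1):
    """One Dystal-style training step for a single output unit.

    patches : list of dicts with 'vector' (running CS average), 'ucs', 'weight'
    cs, ucs : conditioned input (receptive-field pixels) and its UCS component
    """
    if patches:
        best = max(patches, key=lambda p: correlate(p['vector'], cs))
        if correlate(best['vector'], cs) >= create_thr:
            # Fold the new CS input into the running average; usage weight grows.
            best['vector'] += rate * (cs - best['vector'])
            best['ucs'] += rate * (ucs - best['ucs'])
            best['weight'] += 1.0
            return best
    # No sufficiently similar patch: store a new association.
    patches.append({'vector': cs.astype(float).copy(), 'ucs': float(ucs), 'weight': 1.0})
    return patches[-1]
```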
5.2. Graph-based techniques

Among the methods that analyze intrapersonal variations, perhaps the largest share belongs to methods inspired by the DLA (Refs. [51–55] and others). This occurred because the generic method of the DLA conveniently allows analysis of the image data on multiple resolutions and elegantly accommodates invariance to object deformations. It is possible to outline two major directions of the development in this area. The first group of approaches is based on associating a small number of feature vectors with high-level facial features. The second group exploits the multiscale nature of image processing to reduce the redundancy of the image representation.

5.2.1. The Topological Face Graph

One of the directions in the further research on attributed graphs is based on associating graph nodes with high-level facial features [51,52]. That is, the same node corresponds to the same facial feature in different faces. As in the DLA, each node in such a graph contains image information on multiple resolutions. Different face orientations are encoded by graphs with different topology (Fig. 15). All vectors in the Face Bunch Graph (FBG) referring to the same facial feature (called a fiducial point) are bundled together in a bunch (Wiskott et al. [51]). Each fiducial point is represented by several alternatives in order to account for many possible variations in the appearance of that feature. Fig. 16 shows a sketch of an FBG. Each of the nine nodes is labeled with a bunch of six vectors, which together can potentially represent 6^9 = 10,077,696 different faces. When the FBG is matched to a face, a single vector (indicated in gray) that best encodes the appearance of the corresponding feature is selected from each bunch. The resulting image graph can be efficiently compared to large galleries without the need for a repeated image search. In Krüger's approach, the feature vectors are also associated with facial features. However, only a single appearance is stored for each feature. The objective of that work is to evaluate the typical discrimination abilities of different features. As a result, it is possible to design a similarity function that assigns different features certain weights, proportional to each feature's discrimination ability.
Such a similarity function would have the form:

Sim_{tot}(I, J) = \sum_{k=1}^{n} b_k \, Sim_{jet}(I_k, J_k),    (1)

where $Sim_{jet}$ is the normalized dot product of the feature vectors (jets) in the nodes $I_k$ and $J_k$, and $b_k$ is the significance weight of node $k$. The approach was used for face recognition and pose estimation. The results, illustrated in Fig. 17, indicate that the weights for the discrimination problem (Fig. 17(a)) differ from the weights for the location problem (Fig. 17(b)). The eyes are more important for the discrimination of frontal and half-profile views than the mouth and chin. The nodes corresponding to the top of the head are very insignificant for pose estimation. The tip of the nose is the most significant feature, followed by the lips for the frontal and half-profile views and the chin for the profile view. The eyes were shown to be insignificant for that task.
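Eq. (1) translates directly into code; in this minimal sketch the jets of corresponding nodes and the learned weights b_k are assumed to be given.

```python
import numpy as np

def weighted_graph_similarity(jets_i, jets_j, b):
    """Sim_tot(I, J) = sum_k b_k * Sim_jet(I_k, J_k)  (Eq. (1))."""
    def sim_jet(u, v):
        # Normalized dot product of two jets.
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
    return sum(bk * sim_jet(u, v) for bk, u, v in zip(b, jets_i, jets_j))
```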
Fig. 15. Flexible graphs for different facial orientations and sizes (from Krüger [52]).
Fig. 16. Sketch of a face bunch graph (from Wiskott et al. [51]).
5.2.2. Multiresolution analysis of facial images

Another major extension of the DLA is based on the idea that information on different image resolutions should be treated independently. Contributions in this direction were made by Würtz [53] and Grudin [54]. They use a pyramidal architecture of the attributed graphs, which corresponds to the multiresolution nature of the image data. The sampling of each hierarchical grid is proportional to the size of the receptive fields in the nodes of that grid, in line with the requirements of redundancy reduction [56]. The first of these approaches is motivated by (1) the necessity to reduce the redundancy of the DLA structure; (2) the need to remove the background information; and (3) the need to use only the nodes that have correspondences in the input image. In that approach, each level of the graph is manually clipped so that the nodes whose receptive fields contain background information are removed. The nodes over the hair region are also removed in order to reduce the dependency on changes in the person's appearance. In this multi-layered graph architecture, the nodes are linked using hierarchical and spatial links. The hierarchical links exist only between a parent and its children. The spatial links exist between the children of the same parent, forming a square if all four nodes are present. The matching is performed in a top-down manner. That is, the coarser level of the graph hierarchy is matched first, followed by matching the finer levels. Information from the coarse resolutions contributes to the search on finer resolutions by setting the initial positions of the children relative to their parent. This reduces the computational complexity of matching the graph to the image and yet avoids the local minima of the cost function. Würtz uses the assumption that only a subset of the graph nodes have a good correspondence in a new image of the same face. Some nodes do not have any correspondence at all, because the features they encode are changed or do not exist in the new image. Therefore, during the matching procedure, only the nodes with a high degree of correspondence are retained (Fig. 18).
Fig. 17. (a) A weight matrix for the comparison of frontal with half-profile views; (b) (left to right) the learned weight matrices for the pose estimation task for frontal (left), half-profile (center), and profile (right) views (from Krüger [52]).
Such a feature selection approach can improve the system performance if the matching score of the correct attributed graph increases faster than those of the other graphs. Although Würtz's approach does not deal with intrapersonal variations in a direct manner, a closely related technique by Grudin [44] uses hierarchical graphs to estimate the distinguishing features from a single facial image. He uses the fact that humans can pick the most distinguishing facial features from a single image of a person. Humans use a priori knowledge to direct their attention to the areas that differ strongly between faces of different people and yet preserve certain predictable properties between different appearances of the same face. This approach also uses a graph scheme, which differs from the previous approach in the implementation of the hierarchical structure and in the process of graph matching. Here, spatial links are used to connect direct neighbors at each level. This significantly improved the stability of the graph matching approach under considerable distortions [44]. In addition, the nodes on each level that have a higher initial correspondence are matched before the others, thus providing anchors for the whole grid. Most of the previously implemented graph matching techniques relied on a random or centrifugal sequence of matching the grid nodes. Similarly to the previous approach, the nodes outside of the facial boundaries are manually removed during the enrollment stage. However, the nodes over the hair region are preserved, since they are expected to be removed automatically during the feature selection stage. The feature selection stage uses a Bayesian rule to estimate the discrimination confidence of individual localized features. The result of the feature selection is a sparse attributed graph. Each level in the graph contains features that are estimated to provide more discriminative information about the person's identity. The retained set of features is unique for each face.
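The confidence computation can be illustrated as follows, under the simplifying assumption (ours, not necessarily Grudin's) that the intrapersonal and interpersonal matching costs at a node are Gaussian with statistics estimated during training.

```python
import numpy as np
from scipy.stats import norm

def node_confidence(mu_intra, sd_intra, mu_inter, sd_inter, grid=512):
    """Discrimination confidence of one node from its matching-cost statistics.

    Returns the average probability of deciding the correct hypothesis under
    a Bayesian rule with equal priors; higher values mean a more salient node.
    """
    lo = min(mu_intra - 4 * sd_intra, mu_inter - 4 * sd_inter)
    hi = max(mu_intra + 4 * sd_intra, mu_inter + 4 * sd_inter)
    x = np.linspace(lo, hi, grid)
    p_same = norm.pdf(x, mu_intra, sd_intra)    # cost density, same person
    p_diff = norm.pdf(x, mu_inter, sd_inter)    # cost density, different people
    posterior = p_same / (p_same + p_diff + 1e-12)   # Bayes rule, equal priors
    return 0.5 * float(np.trapz(posterior * p_same + (1 - posterior) * p_diff, x))

def sparsify(nodes, keep_ratio=0.5):
    """Keep only the most discriminative nodes; the rest leave the graph."""
    ranked = sorted(nodes, key=lambda nd: nd['confidence'], reverse=True)
    return ranked[:max(1, int(keep_ratio * len(ranked)))]
```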
Fig. 18. (a) The original mapping of the high-resolution grid (left) and the distorted mapping over another appearance of the same face; (b) the retained good-matching nodes, shown over the original face and over the different face appearance (from Würtz [53]).
Ideally, a salient (or discriminative) feature should exhibit large interpersonal and low intrapersonal variations, with the value of the feature being very specific for the particular face. It is possible to estimate the typical intrapersonal variations of the high-level facial features from a limited number of images. In practice, the distributions of the intrapersonal variations are computed for each person in the training set, mapped into the shape-free form, and averaged. The average shape-free distribution is subsequently used to estimate the intrapersonal variations of new faces.
To do that, the shape-free distribution is distorted to match the shape characteristics of a particular face. The experiment is performed in seven steps:

1. Obtain a single model graph from a person's image. Manually track a pre-defined set of facial features.
2. Compute the interpersonal distribution of image responses for each facial location.
3. For each person in the training set, match the person's model graph to other images of that person. Compute the mean and the standard deviation of the intrapersonal matching cost for each graph node.
4. Remap the distributions obtained in step 3 to a new representation, where the positions of the facial features obtained in step 1 correspond to pre-defined (shape-free) physical locations. Average the shape-free distributions obtained from the persons in the training set.
5. Remap the average shape-free distribution of intrapersonal variation so that the abstract facial features correspond to the features of each particular person. The resulting new distribution contains intrapersonal variations that are related to the facial features of that person.
6. Use the Bayesian rule to compute the discrimination confidence for each node. Remove the nodes with lower confidence.
7. Match the sparsified model graphs to images in the test set.

In the hierarchical recognition scheme, the same facial region might exhibit different intrapersonal variation on different resolutions. Therefore, the distribution of intrapersonal variation on each processing scale is stored
in a separate shape-free form. In Fig. 19, the lighter regions correspond to larger intrapersonal variations. The dark points in each image indicate the positions of the eye centers and the mouth corners. As illustrated in Fig. 19, some facial features exhibit significantly larger variations than others. On all scales, the hair region exhibits large variations. The nose is also shown to be less reliable, due to significant differences in its appearance under side-to-side head rotations. This occurs because the nose is the farthest feature from the spinal cord, which is the center of such rotations. At the same time, because the nose is the closest feature to the camera, its projections under different rotations are seen to contain more variations. The eyes and the mouth are seen as more stable features. However, their stability depends on the image resolution; for example, the iris movement at the high resolution contributes to higher intrapersonal variations of the eye region (Fig. 19(c)). Fig. 20 illustrates the discrimination confidence of facial regions as computed by the Bayesian rule. The light regions correspond to more salient features. For illustrative purposes, the background areas are automatically filled in white. As a result of the feature selection stage, features with low discriminative power are removed, thus making the graph structure sparse. When applied to the test set of the database, the sparse graphs recognized 85% of the facial images, compared with 78% for the non-sparse (complete) graphs. Fig. 21 illustrates examples of the graph matching on images taken in different conditions. Fig. 21(a) shows a complete graph being matched to a facial image. Fig. 21(b) shows the sparse graph matched to that image. In this and in the next image, the remaining nodes in the sparsified grid are illustrated as white squares. Fig. 21(c) illustrates the same sparse graph matched to another image of the same person.
Fig. 19. The shape-free distributions of the intrapersonal variation at three resolutions (coarse to fine). Distribution (a) is obtained from low-resolution images, while the rightmost is obtained from images processed at high resolution (from Grudin [44]). Large intrapersonal variations correspond to the lighter regions. The dark points indicate positions of the eyes and the mouth corners.
Fig. 20. Estimation of the discrimination power of feature descriptors (from Grudin [44]). The light regions correspond to more discriminative features.
same person. Fig. 21(d) shows the complete graph being matched to a difficult image of that person. It is tempting to increase the sparsification ratio in order to improve the recognition performance while using fewer features. However, large values of the sparsification ratio increase the number of loose nodes, which do not have any adjacent neighbors and hence are not attached to the rest of the grid by the spatial links. If not removed, those loose nodes are not restricted in their movements over the image and introduce additional uncertainty into the matching score. Therefore, sparsification of the grid will improve the recognition performance only within a certain range of the sparsification factor. Although the performance of this approach was worse than the performance of the approach by Lanitis et al. [37] when applied to the same test set of the Manchester face database, it was better than the performance of the best single recognition technique used in the system developed by Lanitis et al. [37]. However, although both techniques used the same face database, the performance
cannot be compared directly due to different specifications of the test and training sets. Among the advantages of this technique is the generation of a face model from a single facial image. The major disadvantage is the necessity to update the shape-free distribution of intrapersonal variation when a new set of environmental constraints, such as different lighting conditions, is introduced.

6. Discussion and conclusions

The complexity of the face recognition task makes it impossible for any single currently available approach to achieve 100% accuracy. Future successful face recognition systems will consist of multiple techniques, each being used to analyze a certain facial cue or a combination of cues. In such an implementation, the choice of the most appropriate technique will depend on the image context. Although it is impossible to select a single best face recognition method, we can outline some guidelines that
Fig. 21. Fitting of non-sparse (a, d) and sparse (b, c) grids to different images of the same person (from Grudin [44]).
are essential to achieve a high recognition ratio. The following discusses the set of principles that this paper has considered:

1. As was discussed in Section 2, the approaches based on feature measurements provide less reliability than the template-matching techniques. As a result, most of the recent research in face recognition has been performed using template-based techniques.

2. In most cases, whole-face processing techniques achieve inferior performance compared to the methods that use localized features. One of the problems of processing localized features is selecting the scale on which the features exhibit the most discriminating characteristics. Future face recognition systems will have to address this problem on a regular basis.

3. Face recognition systems will become more sophisticated in terms of integrating the intrapersonal variations and reducing illumination dependency. At
present, most of the systems are rather sensitive to such variations, and these are the areas that might significantly improve the recognition accuracy.

4. Many existing techniques use localized image features, while some others use localized facial features. Although both might sometimes correspond to the same image region, there are certain differences between the two. If a system relies on a pre-defined set of localized facial features (e.g. nose or mouth), it neglects information about features that are specific to a particular person, such as birthmarks. On the other hand, if systems that use localized image features do not incorporate information about the corresponding facial features, their performance will quickly degrade with changes in the image. Future designs will establish correspondence between the image features and the facial features in order to use a priori knowledge about facial features to select the most discriminating image features. In addition, future techniques are likely to consider how appearances of facial features
change under different transformations of facial images, such as in-depth head orientations and expressions.

5. Another cue that will be used more often in future face recognition systems is shape information. Integration of such information will make it possible to compensate for changes in facial expression. The shape characteristics may find greater usage in face recognition; in addition, they are likely to improve the image compression algorithms that are used in areas such as telecommunications.
Acknowledgements

The author wishes to acknowledge the research grant from the School of Engineering, Liverpool John Moores University. He would like to thank his supervisors Dr. David Harvey, Prof. Paulo Lisboa, and Dr. Mike (Showers) Shaw, and the members of the Coherent and Electro-Optic Research Group (CEORG) at Liverpool JMU for their constant support. He is grateful to Prof. Chris Taylor (University of Manchester), and Drs. Ben Dawson and James Kottas for their useful comments. The author would also like to thank Drs. Atick, Brunelli, Edwards, Krüger, O'Toole, Poggio, Pentland, Sirovich, Taylor, Wiskott, and Würtz for help with obtaining high-quality illustrations. Figs. 7, 8 and 10 are reprinted from Image and Vision Computing, Volume 13, A. Lanitis, C.J. Taylor, and T.F. Cootes, Automatic Face Identification System Using Flexible Appearance Models, 743-756, Copyright 1997, with permission of Elsevier Science.
References

[1] A. Samal, P.A. Iyengar, Automatic recognition and analysis of human faces and facial expressions: a survey, Pattern Recognition 25 (1992) 65-77.
[2] D. Valentin, H. Abdi, A.J. O'Toole, G.W. Cottrell, Connectionist models of face processing: a survey, Pattern Recognition 27 (1994) 1209-1230.
[3] R. Chellappa, C.L. Wilson, S. Sirohey, Human and machine recognition of faces: a survey, Proc. IEEE 83 (1995) 705-740.
[4] J. Daugman, Face and gesture recognition: overview, IEEE Trans. Pattern Anal. Mach. Intell. 19 (1997) 675-676.
[5] R. Brunelli, T. Poggio, Face recognition: features versus templates, IEEE Trans. Pattern Anal. Mach. Intell. 15 (1993) 1042-1052.
[6] W.W. Bledsoe, The model method in facial recognition, Panoramic Research Inc., Tech. Rep. PRI:15, Palo Alto, CA, 1964.
[7] A.J. Goldstein, L.D. Harmon, A.B. Lesk, Identification of human faces, Proc. IEEE 59 (1971) 748.
[8] Y. Kaya, K. Kobayashi, A basic study on human face recognition, in: S. Watanabe (Ed.), Frontiers of Pattern Recognition, Academic Press, New York, 1972, pp. 265-289.
[9] T. Poggio, F. Girosi, Networks for approximation and learning, Proc. IEEE 78 (1990) 1481-1497.
[10] I. Craw, H. Ellis, J.R. Lishman, Automatic extraction of face features, Pattern Recognition Lett. 5 (1987) 183-187.
[11] M. Bichsel, Strategies of robust object recognition for identification of human faces, Ph.D. Thesis, Eidgenössische Technische Hochschule, Zurich, 1991.
[12] X. Jia, M.S. Nixon, Extending the feature vector for automatic face recognition, IEEE Trans. Pattern Anal. Mach. Intell. 17 (1995) 1167-1176.
[13] R.J. Baron, Mechanisms of human facial recognition, Int. J. Man Mach. Stud. 15 (1981) 137-178.
[14] K. Karhunen, Über lineare Methoden in der Wahrscheinlichkeitsrechnung, Ann. Acad. Sci. Fennicae Ser. A1, Math. Phys. 37 (1946).
[15] L. Sirovich, M. Kirby, Low-dimensional procedure for the characterisation of human face, J. Opt. Soc. Amer. 4 (1987) 519-524.
[16] M. Kirby, L. Sirovich, Application of the Karhunen-Loeve procedure for the characterization of human faces, IEEE Trans. Pattern Anal. Mach. Intell. 12 (1990) 103-108.
[17] M.A. Turk, A.P. Pentland, Face recognition using eigenfaces, Proceedings of the International Conference on Pattern Recognition, 1991, pp. 586-591.
[18] J. Zhang, Y. Yan, M. Lades, Face recognition: eigenface, elastic matching and neural nets, Proceedings of the IEEE 85 (1997) 1423-1435.
[19] A.P. Pentland, B. Moghaddam, T. Starner, M.A. Turk, View-based and modular eigenspaces for face recognition, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1994, pp. 84-91.
[20] D.J. Beymer, Face Recognition Under Varying Pose, A.I. Memo No. 1461, MIT, Cambridge, MA, 1993.
[21] A.J. O'Toole, K.A. Deffenbacher, J. Barlett, Classifying faces by race and sex using an autoassociative memory trained for recognition, Proceedings of the 13th Annual Conference of the Cognitive Science Society, Erlbaum, Hillsdale, NJ, 1991.
[22] A.J. O'Toole, H. Abdi, K.A. Deffenbacher, D. Valentin, A low dimensional representation of faces in the higher dimensions of space, J. Opt. Soc. Amer. A 10 (1993) 405-411.
[23] A.J. O'Toole, H. Abdi, K.A. Deffenbacher, D. Valentin, A perceptual learning theory of the information in faces, in: T. Valentine (Ed.), Cognitive and Computational Aspects of Face Recognition, Routledge, London, 1995, pp. 159-182.
[24] K.T. Blackwell, T.P. Vogl, S.D. Hyman, G.S. Barbour, D.L. Alkon, A new approach to hand-written character recognition, Pattern Recognition 25 (1992) 655-666.
[25] B. Moghaddam, A. Pentland, Probabilistic visual learning for object representation, IEEE Trans. Pattern Anal. Mach. Intell. 19 (1997) 696-710.
[26] P. Werbos, Beyond regression: new tools for prediction and analysis in the behavioral sciences, Ph.D. Thesis, Applied Mathematics, Harvard University, 1974.
[27] E. Oja, A simplified neuron model as a principal component analyzer, J. Math. Biol. 15 (1982) 267-273.
[28] G.W. Cottrell, P. Munro, Principal component analysis of images via backpropagation, Proc. Soc. Photo-Optical Instrum. Engng. (1988) 1070-1076.
[29] M. Kosugi, Robust identification of human face using mosaic pattern and BNP, Proceedings of the International Conference on Neural Networks for Signal Processing, 1992, pp. 209-305.
[30] P.G. Luebbers, O.A. Uwechue, A.S. Pandya, A neural network based facial recognition system, Proc. SPIE 2243 (1994) 595-606.
[31] O. Nakamura, S. Mathur, T. Minami, Identification of human faces based on isodensity maps, Pattern Recognition 24 (1991) 263-272.
[32] J.M. Vincent, J.B. Waite, D.J. Myers, Automatic location of visual features by a system of multilayered perceptrons, IEE Proc.-F 139 (1992) 405-412.
[33] F. Goudail, E. Lange, T. Iwamoto, K. Kyuma, N. Otsu, Fast face recognition method using high order autocorrelations, Proceedings of the International Joint Conference on Neural Networks, 1993, pp. 1297-1300.
[34] M.F. Augusteijn, T.L. Skufca, Identification of human faces through texture-based feature recognition and neural network technology, Proceedings of the IEEE Conference on Neural Networks, 1993, pp. 392-398.
[35] G.W. Cottrell, M.K. Fleming, Face recognition using unsupervised feature extraction, Proceedings of the International Conference on Neural Networks, Paris, 1990, pp. 322-325.
[36] B.A. Golomb, D.T. Lawrence, T.J. Sejnowski, Sexnet: a neural network identifies sex from human faces, in: D.S. Touretzky, R. Lipmann (Eds.), Advances in Neural Information Processing Systems, vol. 3, Kaufmann, San Mateo, 1991, pp. 572-577.
[37] A. Lanitis, C.J. Taylor, T.F. Cootes, Automatic interpretation and coding of face images using flexible templates, IEEE Trans. Pattern Anal. Machine Intell. 19 (1997) 743-756.
[38] A. Lanitis, C.J. Taylor, T.F. Cootes, Automatic face identification system using flexible appearance models, Image and Vision Comput. 13 (1995) 393-401.
[39] A. Yuille, D. Cohen, P. Hallinan, Feature extraction from faces using deformable templates, Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1989, pp. 104-109.
[40] B.F.J. Manly, Multivariate Statistical Methods, a Primer, Chapman & Hall, London, 1986.
[41] T.F. Cootes, C.J. Taylor, A. Lanitis, Active shape models: evaluation of the multiresolution method for improving
image search, Proceedings of the 5th British Machine Vision Conference, BMVA Press, 1994, pp. 327-336.
[42] I. Craw, P. Cameron, Parameterizing images for recognition and reconstruction, Proceedings of the BMVC 91, Glasgow, Scotland, 1991, pp. 367-370.
[43] I. Craw, A manifold model of face and object recognition, in: T. Valentine (Ed.), Cognitive and Computational Aspects of Face Recognition: Explorations in Face Space, Routledge, London, 1995, pp. 183-203.
[44] M. Grudin, A compact multi-level model for the recognition of facial images, Ph.D. Thesis, Liverpool John Moores University, UK, 1997.
[45] P. Penev, J.J. Atick, Local feature analysis: a general statistical theory for object representation, Network: Comput. Neural Systems 7 (1996) 477-500.
[46] K.K. Sung, T. Poggio, Example-based learning for view-based human face detection, IEEE Trans. Pattern Anal. Machine Intell. 20 (1998).
[47] M. Lades, J.C. Vorbruggen, J. Buhmann, J. Lange, C.v.d. Malsburg, R.P. Wurtz, W. Konen, Distortion invariant object recognition in the dynamic link architecture, IEEE Trans. Comput. 42 (1993) 300-311.
[48] J. Daugman, Uncertainty relation for resolution in space, spatial frequency, and orientation optimized by 2D visual cortical filters, J. Opt. Soc. Amer. (A) 2 (1985) 1160-1169.
[49] D.L. Alkon, Memory storage and neural systems, Sci. Am. 261 (1989) 42-50.
[50] D.L. Alkon, K.T. Blackwell, G.S. Barbour, S.A. Werness, T.P. Vogl, Biological plausibility of synaptic associative memory models, Neural Networks 7 (1994) 1005-1017.
[51] L. Wiskott, J.M. Fellows, N. Krüger, C.v.d. Malsburg, Face recognition by elastic bunch graph matching, IEEE Trans. Pattern Anal. Mach. Intell. 19 (1997) 775-779.
[52] N. Krüger, An algorithm for the learning of weights in discrimination functions using a priori constraints, IEEE Trans. Pattern Anal. Mach. Intell. 19 (1997) 764-768.
[53] R.P. Würtz, Object recognition robust under translations, deformations, and changes in background, IEEE Trans. Pattern Anal. Mach. Intell. 19 (1997) 769-774.
[54] M.A. Grudin, P.J.G. Lisboa, D.M. Harvey, Compact multi-level representation of human faces for recognition, Proceedings of IEE Conference on IPA-97, 1997, pp. 111-115.
[55] P. Kalocsai, H. Neven, J. Steffens, Statistical analysis of Gabor-filter representation, Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, Nara, Japan, 1998, pp. 360-365.
[56] D.J. Field, Relations between the statistics of natural images and the response properties of cortical cells, J. Opt. Soc. Am. (A) 4 (1987) 2379-2394.
About the Author: DR. MAXIM GRUDIN received his Dipl. Eng. in Electrical Engineering from Vinnitsa State Technical University (Ukraine) in 1994. He received his Ph.D. degree from Liverpool John Moores University (United Kingdom) in 1997. His Ph.D. thesis is entitled "A Compact Multi-Level Model for the Recognition of Facial Images". At present, Dr. Grudin is a scientist at Miros, Inc., a Massachusetts-based company that develops security solutions based on face recognition technology. He has authored and co-authored ten papers and two patent applications.
Pattern Recognition 33 (2000) 1179-1198
Integral opponent-colors features for computing visual target distinctness

Xose R. Fdez-Vidal, Rosa Rodriguez-Sanchez, J.A. Garcia*, J. Fdez-Valdivia

Departamento de Física Aplicada, Facultad de Física, Universidad de Santiago de Compostela, 15706 Santiago de Compostela, Spain
Departamento de Ciencias de la Computación e I.A., E.T.S. de Ingeniería Informática, Universidad de Granada, 18071 Granada, Spain
Departamento de Ciencias de la Computación e I.A., E.T.S. de Ingeniería Informática, Universidad de Granada, Avda Andalucia 38, 18071 Granada, Spain

Received 24 December 1997; received in revised form 19 October 1998; accepted 6 April 1999
Abstract

This paper presents a computational target distinctness model to predict human visual search performance for a set of color images. Visual distinctness metrics can be used to compare and rank target detectability, and to quantify background or scene complexity. The subjective ranking induced by the psychophysically determined visual target distinctness for 64 human observers is adopted as the reference order. The rankings produced by the proposed measure and several computational models are compared with the subjective rank order. 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Visual target distinctness; CIELAB space; LogGabor functions; Activated filters; Fixation points; Integral features; Quick pooling; Maximum-output rule
1. Introduction

Visual distinctness metrics have been proposed to compare and rank target detectability, and to quantify background or scene complexity. If they are good predictors of target saliency for humans performing visual search and detection tasks, they may be used to compute the visual distinctness of image subregions (target areas) from digital imagery. Recent studies show that simple image metrics do not give good predictive results when applied to highly resolved targets in complex background scenes. Although the root mean square error (RMSE) has a good physical and theoretical basis, it is often found to correlate very poorly with subjective ratings. A demonstration of this fact can be found in the following example, in which the relative distinctness of targets and their immediate surrounds is computed with different measures.
* Corresponding author. Fax: +34-958-243317. E-mail address: [email protected] (J.A. Garcia).
Fig. 3 shows six complex natural scenes containing a single target, and six other pictures of the same scenes with no target. The targets are of different degrees of visibility. The visual distinctness of the targets within the image, as given by photointerpreters, is shown in Table 1. This subjective rank order was based on the psychophysically determined visual target distinctness for 64 human subjects, as described in Section 5. The comparative results of the RMSE are also presented in Table 1 (see Section 5 for further details). By simply viewing the data, it can be determined that the RMSE correlates poorly with the subjective rating by human observers. This result may be due to several facts. On the one hand, targets which are similar either to their local background or to many details in other parts of the scene are harder to detect than targets which are highly dissimilar to these structures. Also, the visual distinctness of a target decreases with increasing variability of the scene [1]. On the other hand, the human visual system does not process the image in a point-by-point manner but rather in a selective way, according to decisions made on a cognitive level, by choosing specific data on which to make
judgments and weighting them more heavily than the rest of the image [2,3]. Our previous attempts to tune visual distinctness metrics to the properties of the human visual system provided responses to several interesting questions:

- Firstly, Fdez-Vidal et al. [4] analyzed what the qualitative differences in performance of distinctness metrics are, in the case where the images to be processed are compared in a selective rather than in a point-by-point way. A main conclusion of this work was that a phase-congruency model of feature detection [5] induces an error measure in the corresponding perceptual domain that improves the correlation between subjective rating and a pixel-by-pixel error metric (i.e., RMSE), and consequently better captures the response of the human visual system.

- Secondly, the analysis of how conjunctions of features [6] can be incorporated in a metric for image discriminability which corresponds to the human observer's evaluation led Fdez-Vidal et al. [7] to measure the discriminability between two images based on the distance (a β-norm) between their statistical structure, computed over those pixels which form "fixation points" of the reference image (i.e., points of maximum phase congruency for the gray-level representation of the reference).

- Thirdly, the analysis of how physical objects and scenes should be coded to exploit the image correlation structure [8] led Martinez-Baena et al. [9,10] to suggest three further key points for producing a perceptual measure of visual distinctness: (i) a data-driven multisensor design that provides a means to compare the information content of a pair of images at multiple spatial locations, resolutions and orientations; (ii) a method for selecting strongly responding units in the multisensor organization, which increases the signal-to-noise ratio and makes the matching of features between two images more successful; and (iii) a set of filters modeling the activated sensor responses that resembles the receptive fields of cells in the striate cortex and exploits the basic properties of 2D spatial-frequency tuning and spatial selectivity.

Other relevant computational models of early human vision typically process an input image through various spatial and temporal bandpass filters and analyze first-order statistical properties of the filtered images to compute a target distinctness metric [11-14]. This paper presents a new computational model for quantifying the visual distinctness of a target in its surround. This computational target distinctness model (it has been termed the "Integral Opponent-color Feature" error, or IOF error) extends our early attempts to produce a perceptual measure for perceiving image
discriminability. To this aim, in addition to the analysis of the color attribute, the new model incorporates the main points analyzed in our previous works: (i) a preattentive stage incorporates the processing described in Refs. [4,9,10] in order to perform the decomposition of the color space and the fixation point detection; (ii) an integration stage is deployed to analyze and integrate the separable representations at fixation points, as suggested in Ref. [7]. The IOF error predicts the target distinctness by the difference between the signal from the target-and-background scene and the signal from the background-with-no-target scene (i.e., one image has the target and the other does not). The IOF model requires a non-target image which is identical to the image with the target everywhere except within the target region. But this is not a serious constraint, because different utilities may be used to create a synthetic non-target image from the target image [15]. The IOF model may be described in terms of three different stages: a preattentive stage, an integration stage, and a decision stage (see Fig. 1).

- Firstly, an early preattentive stage performs (see Fig. 2): (i) an RGB to CIE Lab transformation; (ii) a representation of each opponent-color component by using 2D logGabor filters; (iii) the selection of activated filters by lateral inhibition (an activated analyzer from the filter bank is sensitive to some meaningful structure in the target image); and (iv) the detection of stimulus locations that indicate potentially interesting image regions, or "fixation points".

- Secondly, an integration stage is deployed to analyze and integrate the separable representations at fixation points (see Fig. 2). The resultant integral opponent-colors features are used to measure visual target distinctness in the search experiments. For each activated filter, the visual distinctness between the target and non-target responses is computed by nonlinear pooling of the differences in integral features over fixation points.

- Thirdly, a decision stage produces the system output (see Fig. 2): (i) first, the opponent-colors channel output is computed as the maximum of the outputs over activated filters on the channel; (ii) the visual target distinctness is finally given as the maximum of the outputs over the opponent-colors channels (i.e., luminance, red-green and yellow-blue).

The rest of the paper is organized as follows. Section 2 describes the preattentive stage. Section 3 presents the integration stage. Section 4 describes the use of the maximum-output rule to produce the system output. The results of the experiments are shown in Section 5. Finally, the main conclusions of the paper are summarized in Section 6.
Fig. 1. General diagram showing how the data flows through the three stages of the IOF model.
2. Preattentive stage

In this stage, the RGB values are first transformed into coordinate values in a CIELAB space. Also in this first stage, each opponency channel is analyzed by a filter bank created from logGabor functions. Spatial information in the system is analyzed by multiple units, but only the responses of strongly responding units will be used to measure visual target distinctness. For each of them, fixation points are computed as local energy peaks on the filtered response.

2.1. RGB to opponent-color encoding transform

The psychophysical basis of this transform is that somewhere between the eye and the brain, signals from the cone receptors in the eye get coded into light-dark, red-green, and yellow-blue signals [16]. Hence, any measure defined in the RGB space is not appropriate to quantify the perceptual error between images. Thus it is important to use color spaces which are closely related to human perceptual characteristics and suitable for defining appropriate measures of perceptual errors. Among these, perceptually uniform color spaces are the most appropriate for defining measures of perceptual error [17]. The CIE standardized two uniform color spaces for practical applications: the CIE 1976 L*u*v* (CIELUV) space and the CIE 1976 L*a*b* (CIELAB) space. Both CIELAB and CIELUV spaces employ an opponent-color encoding and use white-point normalizations that partly explain color constancy. In both spaces, the Euclidean distances provide a color-difference formula for evaluating color differences in perceptually relevant units. The IOF error is defined in the CIELAB color space, since the selection of the uniform color-opponent system is not critical to the success of the overall approach. In fact, the results shown in Section 5 do not change when a CIELUV space is used in the definition of the IOF error.
The RGB to CIELAB transform is executed in two steps: (i) RGB to CIE XYZ color coordinates; and (ii) CIE XYZ color coordinates to CIELAB coordinates.

(i) This operation converts RGB images to the (X, Y, Z) CIE standard color coordinate system, so that standard data on the sensitivity of the visual system to stimuli in the CIE standard color coordinate system can be used in this model. The images used in this paper were digitized using a KODAK PCD Film Scanner 2000. This process is described in Ref. [18], where the gamma correction and the RGB to CIE XYZ transformation are provided, calibrated on the same screen that was used for the experiments with human observers. This transformation is
$$\begin{pmatrix} X \\ Y \\ Z \end{pmatrix} = \begin{pmatrix} 0.040 & 0.084 & -0.041 \\ 0.021 & 0.107 & -0.038 \\ -0.005 & 0.011 & 0.024 \end{pmatrix} \begin{pmatrix} R \\ G \\ B \end{pmatrix}. \quad (1)$$
(ii) This second operation transforms the input from the absolute CIE XYZ color coordinate system to the internal luminance/color-opponent coordinate system of the neural receptive fields. Here, we used the CIELAB transform [19], defined by

$$L^* = \begin{cases} 116\,(Y/Y_n)^{1/3} - 16 & \text{if } Y/Y_n > 0.008856, \\ 903.3\,(Y/Y_n) & \text{if } Y/Y_n \le 0.008856, \end{cases} \quad (2)$$

$$a^* = 500\,\bigl[f(X/X_n) - f(Y/Y_n)\bigr], \quad (3)$$

$$b^* = 200\,\bigl[f(Y/Y_n) - f(Z/Z_n)\bigr], \quad (4)$$
Fig. 2. A schematic drawing showing the main features of the IOF model.
where

$$f(t) = \begin{cases} t^{1/3} & \text{if } t > 0.008856, \\ 7.787\,t + \dfrac{16}{116} & \text{if } t \le 0.008856. \end{cases} \quad (5)$$
2.2. 2D bank of logGabors design

The filter set used in the decomposition of the luminance/color-opponent components consists of logGabor filters of different spatial frequencies and orientations [20]. LogGabor functions, by definition, have no DC component. The transfer function of the logGabor has an extended tail at the high-frequency end. Having extended tails, logGabors should be able to encode natural images more efficiently than ordinary Gabor functions, which over-represent the low-frequency components and under-represent the high-frequency components in any encoding process. Another argument in support of logGabor functions is their consistency with measurements on the mammalian visual system, which indicate cell responses that are symmetric on the log frequency scale [21]. Each of the logGabor filters can be represented as a Gaussian in the spatial frequency domain around some central frequency $(r_0, \theta_0)$. These filters have an interesting property: they are polar separable. In this way, logGabor filters can be written as a product of spatial frequency (radial) and orientation (angular) dependent terms:
$$G(r, \theta) = G(r)\,G(\theta) = \exp\!\left(-\frac{\bigl(\log(r/r_0)\bigr)^2}{2\bigl(\log(\sigma_r/r_0)\bigr)^2}\right)\exp\!\left(-\frac{(\theta - \theta_0)^2}{2\sigma_\theta^2}\right), \quad (6)$$

where $\theta_0$ is the orientation angle of the filter, $r_0$ is the central radial frequency, and $\sigma_\theta$ and $\sigma_r$ are the angular and radial sigmas of the Gaussian, respectively. The convolution of a logGabor function (whose real and imaginary parts are in quadrature) with a real image results in a complex image. Its norm is called energy and its argument is called phase. The local energy of an image analyzed by a logGabor filter (hereafter, the filtered response) can be expressed as [22]:
$$E(x, y) = \sqrt{O_{even}^2(x, y) + O_{odd}^2(x, y)}, \quad (7)$$

where $O_{even}(x, y)$ is the image convolved with the even-symmetric logGabor filter and $O_{odd}(x, y)$ is the image convolved with the odd-symmetric logGabor filter at point $(x, y)$. The bank of filters should be designed so that it tiles the frequency plane uniformly (the transfer function must be a perfect bandpass function). The length-to-width ratio of the 2D wavelets controls their directional selectivity. This ratio can be varied in conjunction with the
number of filter orientations used in order to achieve a coverage of the 2D spectrum. Furthermore, the degree of blurring introduced by the filters increases with their orientation selectivity, and so the orientation selectivity of the filters must be carefully chosen to minimize the blurring. Hence we consider a filter bank with the following features:

(i) The spatial frequency plane is divided into 6 different orientations.
(ii) The radial axis is divided into 5 equal octave bands; i.e., in a band of width 1 octave, spatial frequency increases by a factor of 2. The highest filter (for each direction) is positioned near the Nyquist frequency in order to avoid ringing and noise. The wavelengths of the five filters in each direction are set at 3, 6, 12, 24, and 48 pixels, respectively.
(iii) The radial bandwidth $\sigma_r$ is chosen as 2 octaves.
(iv) The angular bandwidth $\sigma_\theta$ is chosen as 25°.

Under this construction, six different angles and five different resolutions are used, and the resultant filter bank can be denoted $\{G_i\}_{i=1,\dots,30}$. Given an image, let $\{E_i\}_{i=1,\dots,30}$ be the corresponding local energy maps, where $E_i$ denotes the local energy of the image analyzed by the logGabor filter $G_i$, for each $i = 1, \dots, 30$ (see Eq. (7)). A sketch of this construction is given below.
2.3. Activated filters from the bank

Let $\{E_i^L\}_i$, $\{E_i^a\}_i$, and $\{E_i^b\}_i$ be the local energy maps of the target image on the luminance/colors opponency channels $(L, a, b)$. For each of the three channels, the significance of the filtered responses is analyzed by classifying filters into two classes: the activated filters and the non-activated ones. Firstly, let us consider the selection of activated filters on the luminance opponency channel $L(x, y)$ of the target image. The same process will be applied on the colors opponency channels. Each filter $G_i$ should be described by a vector of features that can successfully characterize it. Here a feature vector for $G_i$, with $i = 1, \dots, 30$, is defined as follows:

$$W_i = (w_i^1, w_i^2) = \left(\frac{m_i}{\max_{j=1,\dots,30}\{m_j\}},\ \frac{\sigma_i}{\max_{j=1,\dots,30}\{\sigma_j\}}\right), \quad (8)$$

with

$$m_i = \frac{1}{N^2} \sum_{x,y} E_i^L(x, y), \quad (9)$$

$$\sigma_i^2 = \frac{1}{N^2 - 1} \sum_{x,y} \bigl(E_i^L(x, y) - m_i\bigr)^2, \quad (10)$$
with $N \times N$ being the image size, and where $E_i^L(x, y)$ is the local energy map of the luminance opponency channel analyzed by the logGabor filter $G_i$. Of course, other features intended to capture relevant characteristics of the filtered response are conceivable. To evaluate these frequency-domain features according to their ability to discriminate activated filters, a method of successive selection and deletion based on the Wilks criterion may be used [23]. Finally, we have found that the vector $W_i$, as given in Eq. (8), provides an effective feature for discriminating the set of filters on training sets. A dissimilarity measure for these vectors is defined as

$$\mathrm{dist}(W_i, W_j) = \max_{l}\bigl\{|w_i^l - w_j^l|\bigr\}. \quad (11)$$
Cluster analysis is then used to group filters together, since unsupervised learning may exploit the statistical regularities of the filtered responses. Once the activated filters have been obtained on the luminance channel, the same selection process is applied on the two colors opponency channels of the target image (see Fig. 2). Let $Active^L$, $Active^a$, and $Active^b$ be the activated filters on the luminance, red-green, and yellow-blue opponency channels of the target image. To illustrate the selection of activated filters, Fig. 5 shows the three opponency channels for a target image and the respective activated filters from the logGabor bank. Also, in Fig. 6, the activated filters on the three channels are given for six target images.

2.4. Fixation points on each filter response

The selection of the locations to examine involves a rapid assimilation of information from the entire target image and allocation of attention to interesting locations. The IOF error is computed at locations that are likely to attract human fixation because they are seen as characteristic features. The locations chosen to compute the error metric correspond to points of maximal energy of the most activated filters for each channel. Developing further the concept of specialized detectors for the two major types of image features, lines and edges, Refs. [22,5] proposed a local-energy model of feature detection. This model postulates that features are perceived at points in an image where the Fourier components are maximally in phase, and it successfully explains a number of psychophysical effects in human feature perception [5]. It is interesting to note that this model predicts the conditions under which Mach bands appear, and the contrast necessary to see them. To detect the points of phase congruency, an energy function is defined. The energy of an image may be extracted by using the standard method of squaring the outputs of two filters that are in quadrature phase [24] (90° out of phase). Features, both lines and edges, are
then signaled by peaks in local energy functions. In fact, energy is locally maximum where the harmonic components of the stimulus come into phase (see Ref. [22] for a proof). The local-energy model is implemented here as suggested in Ref. [8]. Given the luminance opponency channel $L(x, y)$ for a target image, the local energy map $E_i^L$ for the activated filter $G_i$, as given in Eq. (7), provides a representation in the space spanned by the two functions $O_{even}(x, y)$ and $O_{odd}(x, y)$, where $O_{even}(x, y)$ is the image convolved with the even-symmetric logGabor filter and $O_{odd}(x, y)$ is the image convolved with the odd-symmetric logGabor filter at point $(x, y)$. Hence, the detection of local energy peaks on $E_i^L$ acts as a detector of significant features from the viewpoint of the filter $G_i$. The same process is also applied on each activated filter of the red-green and yellow-blue opponency channels, $a(x, y)$ and $b(x, y)$ (see Fig. 2). To illustrate the results of the selection of fixation points, Fig. 7 shows, for a target image, the subset of activated filters on the opponency channels, the corresponding energy maps, and the fixation points obtained as described above. A peak-picking sketch is given below.
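A minimal sketch of fixation-point detection as local maxima of an energy map; the neighborhood size and the relative threshold are free parameters of this sketch, not values specified by the paper.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def fixation_points(E, window=5, rel_thresh=0.3):
    """Return (row, col) coordinates of local energy peaks of a filtered
    response E; window and rel_thresh are assumptions of this sketch."""
    is_peak = (E == maximum_filter(E, size=window)) & (E > rel_thresh * E.max())
    return np.argwhere(is_peak)
```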
3. Integration stage

In the IOF model, for each activated filter, the visual distinctness between the target and the non-target filtered responses is measured as the distance between their integral features computed over local energy peaks of the target response. An integral feature [6] refers to the result of recombining the separable representations (i.e., response energy, phase, contrast, entropy and standard deviation) at stimulus locations. Since one image has the target and the other does not, the visual distinctness results solely from the target and will be measured by integration over the attentional points.

3.1. Integral opponent-colors features

Given an opponency channel, for each activated filter $G_i$ the target scene is represented at each fixation point by five separable features:

(i) the phase value at the fixation point,

$$T_1^i(x, y) = \arctan\!\left(\frac{O_{even}(x, y)}{O_{odd}(x, y)}\right), \quad (12)$$

where $O_{even}(x, y)$ is the opponency channel of the target image convolved with the even-symmetric logGabor filter of $G_i$ and $O_{odd}(x, y)$ is the channel of the target image convolved with the odd-symmetric logGabor filter of $G_i$ at point $(x, y)$;
(ii) a normalized measure of local energy at the fixation point, $T_2^i(x, y)$, as given by

$$T_2^i(x, y) = \frac{E_i(x, y)}{\sum_{j:\, G_j \in Active} E_j(x, y)}, \quad (13)$$

where $E_i(x, y)$ denotes the local energy at $(x, y)$ for filter $G_i$, and $Active$ is the set of activated filters of the opponency channel (the normalized local energy so defined incorporates lateral interactions among activated filters to account for between-filter masking);

(iii) the local standard deviation of the normalized local energy at the fixation point, $T_3^i(x, y)$, defined as

$$T_3^i(x, y) = \sqrt{\frac{1}{\mathrm{Card}[W(x, y)]} \sum_{(p,q) \in W(x,y)} \bigl(T_2^i(p, q) - \mu\bigr)^2}, \quad (14)$$

where

$$\mu = \frac{1}{\mathrm{Card}[W(x, y)]} \sum_{(p,q) \in W(x,y)} T_2^i(p, q),$$

with $T_2^i(p, q)$ as given in Eq. (13). The neighborhood $W(x, y)$ is defined as the set of pixels contained in a disk of radius $r$ centered at $(x, y)$, with $r = d[(\hat{x}, \hat{y}); (x, y)]$, where $(\hat{x}, \hat{y})$ is the nearest local minimum to $(x, y)$ on the energy map $E_i$, and $d[\,\cdot\,,\,\cdot\,]$ is the Euclidean distance. Since the nearest local minimum to $(x, y)$ on the local energy map marks the beginning of another potential structure, our selection of the neighborhood $W(x, y)$ avoids interference with such a structure while calculating the local variation;

(iv) the local contrast of the normalized local energy at the fixation point, $T_4^i(x, y)$, defined as

$$T_4^i(x, y) = \frac{T_3^i(x, y)}{\mu}, \quad (15)$$

with $\mu$ as defined above;

(v) the local entropy of the normalized local energy within $W(x, y)$, $T_5^i(x, y)$.

For each activated filter $G_i$ on an opponency channel, the filtered response of the non-target image is also represented using these separable features. Let $N_l^i(x, y)$, with $l = 1, \dots, 5$, be the respective five features (phase, local energy, standard deviation, contrast, and entropy) computed on the non-target filtered response at point $(x, y)$.

Once we have defined the separable features for measuring visual distinctness, we need to specify how the differences in each separable feature are pooled into an overall difference at fixation points. We take $D[T^i(x, y), N^i(x, y)]$, defining a normalized distance measure in the integral features $T^i(x, y) = (T_l^i(x, y))_{1 \le l \le 5}$ and $N^i(x, y) = (N_l^i(x, y))_{1 \le l \le 5}$, as given by the equation:

$$D[T^i(x, y), N^i(x, y)] = \sum_{l=1}^{5} \frac{d\bigl(T_l^i(x, y), N_l^i(x, y)\bigr)}{\max_{j:\, G_j \in Active}\bigl\{d\bigl(T_l^j(p, q), N_l^j(p, q)\bigr);\ (p, q) \in FP(j)\bigr\}}, \quad (16)$$

where $FP(j)$ denotes the fixation points for the activated filter $G_j$, and $Active$ denotes the set of activated filters of the opponency channel under analysis; and where

$$d\bigl(T_1^i(x, y), N_1^i(x, y)\bigr) = \arctan\!\left(\frac{\sin\bigl(T_1^i(x, y) - N_1^i(x, y)\bigr)}{\cos\bigl(T_1^i(x, y) - N_1^i(x, y)\bigr)}\right), \quad (17)$$

and where, for each $l = 2, 3, 4, 5$,

$$d\bigl(T_l^i(x, y), N_l^i(x, y)\bigr) = \bigl(T_l^i(x, y) - N_l^i(x, y)\bigr)^2. \quad (18)$$
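A sketch of the per-fixation pooled difference of Eqs. (16)-(18), with features indexed 0-4 (phase first); the normalizing maxima `d_max` are assumed to be precomputed over all activated filters and their fixation points.

```python
import numpy as np

def feature_distances(T, N):
    """Per-feature distances: Eq. (17) (wrapped phase difference) for the
    phase, Eq. (18) (squared difference) for the remaining four features."""
    T, N = np.asarray(T, float), np.asarray(N, float)
    d = np.empty(5)
    d[0] = abs(np.arctan2(np.sin(T[0] - N[0]), np.cos(T[0] - N[0])))
    d[1:] = (T[1:] - N[1:]) ** 2
    return d

def pooled_difference(T, N, d_max):
    """Eq. (16): sum of per-feature distances, each normalized by its
    maximum over activated filters and fixation points (d_max, length 5)."""
    return float(np.sum(feature_distances(T, N) / d_max))
```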
Therefore, the differences from distinct separable representations are assumed to pool nonlinearly. This implies that the overall discriminability is most heavily contributed to by the most discriminating features, but that the less discriminating features contribute somewhat.

3.2. Target distinctness on each activated filter

Given a luminance/colors opponency channel, the magnitude of the target distinctness for each channel activated filter is computed by using Quick pooling (see Fig. 2). It is the most common model of integration over spatial extent, and is essentially the square root of the sum of the squares, except that the exponent is not restricted to the value of 2. Quick pooling can be viewed as a metric in a multidimensional space, and it is sometimes known as the Minkowski metric. Ref. [25] shows that Minkowski metrics can be used as a combination rule for small impairments like those usually encountered in digitally coded images. In fact, Minkowski metrics have already been employed in many fields of human perception research [26,27]. For each activated filter $G_i$, the visual distinctness between the target and the non-target filtered responses is measured by the nonlinear pooling of the differences in their integral features over fixation points:

$$TD(i) = \left(\frac{1}{\mathrm{Card}[FP(i)]} \sum_{(x,y) \in FP(i)} \bigl|D[T^i(x, y), N^i(x, y)]\bigr|^{\beta}\right)^{1/\beta}, \quad (19)$$
where $FP(i)$ denotes the set of fixation points on the target filtered response, and $\mathrm{Card}[\,\cdot\,]$ is the number of points. The default value of the exponent $\beta$ in the IOF error is 3. Ref. [28] discussed at some length several interpretations of the Quick pooling formula and the selection of the pooling exponent.
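In code, the Quick pooling of Eq. (19) is a one-liner over the pooled per-fixation differences; the function name is illustrative.

```python
import numpy as np

def target_distinctness(pooled_diffs, beta=3.0):
    """Eq. (19): Minkowski (Quick) pooling over the fixation points of one
    activated filter; pooled_diffs holds D[T, N] at each fixation point."""
    D = np.abs(np.asarray(pooled_diffs, dtype=float))
    return float(np.mean(D ** beta) ** (1.0 / beta))
```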
Recalling that $Active^L$, $Active^a$, and $Active^b$ are the activated filters on the luminance, red-green, and yellow-blue opponency channels of the target image, we have that, for each filter $G_i$ in $Active^L$, $TD^L(i)$ denotes the target distinctness between the target and non-target filtered responses, as given in Eq. (19). In a similar way, $TD^a(i)$ and $TD^b(i)$ denote the target distinctness for activated filters in $Active^a$ and $Active^b$, respectively.
4. Decision stage

There are a number of assumptions of the IOF model so far described which are in agreement with spatial-frequency channel models that are quite successful for the detection of visual patterns [28]:
(i) Scenes get coded into light-dark, red-green, and yellow-blue channels.
(ii) Spatial information on each channel is analyzed by multiple filters, each of which is sensitive to patterns whose spatial frequencies are in a restricted range.
(iii) The IOF model bases its responses on only those filters sensitive to structures in the target scene.
(iv) The error measures are not simply computed globally over the entire image support, but semi-locally at local energy peaks of the filtered responses.
(v) The output of an individual filter can be represented as a single number by the Quick pooling of the differences in the target scene's and non-target scene's integral features over fixation points of the target filtered response.

In the IOF model, a further assumption produces the final decision stage: the system output is based on the maximum of the target distinctness over activated filters of the luminance/colors opponency channels. With the maximum-output rule, the filter outputs affect the decision variable (the maximum) only on those trials on which one of them happens to produce the maximum output. Hence, this decision rule makes the IOF model not very susceptible to the effects of either extra noisy filters or extra signal-carrying filters. Ref. [28] shows that a system which is attempting to achieve the best performance possible might well use the maximum-output rule in those experiments. The maximum-output rule produces the system output as follows:

$$\mathrm{IOFerror} = \max\{TD^L, TD^a, TD^b\}, \quad (20)$$

where $TD^L$, $TD^a$, and $TD^b$ are the maximum outputs over activated filters of the luminance, red-green, and yellow-blue channels, respectively:

$$TD^L = \max_{j:\, G_j \in Active^L}\{TD^L(j)\}, \quad (21)$$

$$TD^a = \max_{j:\, G_j \in Active^a}\{TD^a(j)\}, \quad (22)$$

$$TD^b = \max_{j:\, G_j \in Active^b}\{TD^b(j)\}, \quad (23)$$

with $TD^L(j)$, $TD^a(j)$, and $TD^b(j)$ computed as given in Eq. (19), and $Active^L$, $Active^a$, and $Active^b$ denoting the activated filter sets on the luminance, red-green, and yellow-blue opponency channels of the target image, respectively.
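The decision stage is small enough to state directly in code; a minimal sketch of Eqs. (20)-(23), assuming the per-filter distinctness values of Eq. (19) have already been computed for each channel:

```python
def iof_error(td_L, td_a, td_b):
    """Eqs. (20)-(23): maximum over activated filters within each opponent
    channel (Eqs. (21)-(23)), then maximum over the three channels (Eq. (20)).
    Each argument is the sequence of TD(i) values for one channel."""
    return max(max(td_L), max(td_a), max(td_b))
```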
5. Applications

5.1. Distinctness of targets and their immediate surrounds

In this first application, the relative distinctness of targets and their immediate surrounds was computed using different target distinctness measures. The digital color images were 12 images, as shown in Fig. 3: (i) six complex natural scenes containing a single target, and (ii) six images of the same rural backgrounds with no target (the empty image is everywhere equal to the target image, except at the location of the target, where the target has been replaced with the local background). Military vehicles of different degrees of visibility serve as search targets. All the images are 256×256 pixels in size. The data set can be used to validate digital metrics that compute the visual distinctness of targets in complex scenes [18]. The reference rank order was a psychophysical distinctness measure developed at the TNO Human Factors Research Institute [29], which is experimentally defined as the minimal distance between target and eye fixation at which the target is no longer distinguishable from its surroundings. This measure has been shown to be independent of viewing distance, consistent among observers, and meaningful in the sense that it correlates with search time. A subject can use different criteria to determine whether the target is visible or not. The criterion used here is whether the spatial structure at the target location really originates from the target (it can be discriminated as being the target). This criterion yields a visual lobe for the identification of the target. A total of 64 human observers participated in the experiment [18]. Table 1 lists, in column 2, the subjective ranking of the visual lobes for identification. The identification lobes range between 0.14° and 1.11°.
Fig. 3. The six image pairs used in the first experiment. Each pair shows a scene with the target and the corresponding scene without the target.
This experiment was performed to investigate:

(i) The relation between the IOF error and the visual target distinctness measured by human observers. The visual distinctness of the target in each scene is computed with the IOF error by calculating the difference between the target scene and the non-target scene.

(ii) The comparative results of the IOF error and the root mean square error (RMSE). The quantitative measure RMSE was defined as follows [19]:

$$\mathrm{RMSE}(\mathrm{TargetImage}, \mathrm{EmptyImage}) = \sqrt{\frac{1}{N^2} \sum_{x,y} \bigl(\Delta s(x, y)\bigr)^2}, \quad (24)$$

where $(\Delta s(x, y))^2 = (\Delta L(x, y))^2 + (\Delta a(x, y))^2 + (\Delta b(x, y))^2$ denotes the CIE 1976 L*a*b* color-difference formula applied on the luminance/colors opponency attributes of the target and non-target scenes at location $(x, y)$, and with $N \times N$ being the image size (a sketch of this computation follows the list).

(iii) How much better (if at all) the color analysis is compared to a grayscale analysis. To this aim, the rank order induced by the IOF error was compared to those induced by the computational measures $d$ [9] and $d_I$ [7], where a grayscale analysis was used. Both $d$ and $d_I$ were shown [1,10]: (1) to predict human observer performance in search and detection tasks on complex natural imagery, and (2) to correlate with visual target distinctness estimated by human observers. The target distinctness in each scene is computed with $d$ and $d_I$ by calculating the difference between the grayscale target image and the grayscale non-target scene.
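Eq. (24) amounts to a root mean square over per-pixel CIELAB color differences; a minimal sketch, assuming both images have already been converted to L*a*b* arrays of shape (N, N, 3):

```python
import numpy as np

def rmse_lab(target_lab, empty_lab):
    """Eq. (24): root mean square of the CIE 1976 L*a*b* color difference
    between the target scene and the empty (no-target) scene."""
    delta_sq = np.sum((target_lab - empty_lab) ** 2, axis=-1)  # dL^2+da^2+db^2
    return float(np.sqrt(delta_sq.mean()))
```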
Table 1
Experimental comparison of the six image pairs in Fig. 3. Column 1: image pairs A-F. Column 2: visual lobes and the subjective ranking (see text for details about the meaning of visual lobe). Remaining columns: value and induced rank for the RMSE, $d$, $d_I$, and the proposed IOF error

Distinctness of targets and their immediate surrounds

Image pair   LOBE   Subjective ranking   RMSE value   RMSE rank   d value   d rank   d_I value   d_I rank   IOFerror value   IOFerror rank
A            1.11   1                    1.05         2            8.95     3        4.98        1          0.023            1
B            1.07   2                    0.91         3           12.99     2        3.30        2          0.014            3
C            0.71   3                    1.39         1           31.24     1        2.69        3          0.014            2
D            0.25   4                    0.55         5            3.12     4        1.86        5          0.012            4
E            0.25   5                    0.79         4            1.09     6        1.71        6          0.011            5
F            0.14   6                    0.41         6            1.51     5        1.97        4          0.008            6

The comparative results of the IOF error and those of both quantitative and qualitative measures are presented in Table 1. The target distinctness values and the resulting rank order computed by the RMSE are listed in column 3. The RMSE performs poorly, which is to be expected. Most rank orders computed by this metric are significantly out of order relative to the reference order induced by the psychophysical distinctness measure in column 2.

In column 4, Table 1 shows that the $d$ introduced in Ref. [9], which performs a grayscale analysis, induces a rank order with a significant order reversal: one target (from C) is ordered incorrectly. On the contrary, two targets (from B and D) are ordered correctly. The other targets have been attributed rank orders which do not differ significantly from the reference rank order based on the psychophysical measure. These results show that $d$ may correlate with visual target distinctness by human observers, but it is sensitive to variations in the size of the target.

The target distinctness metric and the resulting rank order computed by the $d_I$ proposed in Ref. [7], which also performs a grayscale analysis, are listed in the column with the header $d_I$ in Table 1. This model produces a ranking which contains three insignificant order reversals, since they are in a single cluster of target distinctness (see the corresponding visual lobes). It correctly ranks targets from A, B and C as the three most distinct targets. Summarizing, for the data set in this study, the $d_I$ appears
to compute a visual target distinctness rank ordering that correlates with human observer performance.

The target distinctness metric and the resulting rank order computed by the IOF model, which performs a color analysis, are listed in the column with the header IOFerror in Table 1. It correctly ranks targets from D, E and F as the three least distinct targets. The IOF error correctly ranks the target from A, which represents the most visible target. It permutes the rank order of targets from B and C. Since the permuted image pair is in a single cluster of target distinctness, this permutation is not significant, and therefore the IOF model shows the best overall performance of all models and metrics tested in this experiment. To illustrate the robustness of the IOF model against selection of the pooling exponent $\beta$ in Eq. (19), the IOF error was computed using $\beta$ values in the range [3, 4.5]. The results are shown in Table 3. Different selections of the pooling exponent (in [3, 4.5]) lead to the same rankings as the default value $\beta = 3$.

5.2. Distinctness of targets in noisy environments

This second experiment was performed to investigate the relation, in the presence of noise, between the IOF error and the psychophysical target distinctness. To this aim, three corrupted image pairs were used. The three pairs, G, H, and I, as shown in Fig. 4, were generated by adding Gaussian noise to the original image pairs, with the variance of the Gaussian set to 0, 120, and 240, respectively. The three image pairs were shown to 64 human subjects who ranked them based on the visual target distinctness. Table 2 lists, in column 2, the subjective ranking of the visual lobes for identification. The identification lobes
Fig. 4. The three image pairs used in the second experiment, corrupted by adding Gaussian noise. Each pair shows a scene with the target and the corresponding scene without the target.
Table 2
Experimental comparison of the three image pairs in Fig. 4. Column 1: image pairs G-I. Column 2: visual lobes and the subjective ranking (see text for details about the meaning of visual lobe). Remaining columns: value and induced rank for the RMSE and the proposed IOF error

Distinctness of targets in noisy environments

Image pair   LOBE   Subjective ranking   RMSE value   RMSE rank   IOFerror value   IOFerror rank
G            2.65   1                     6.46        3           0.0103           1
H            0.96   2                    18.04        2           0.0090           2
I            0.09   3                    20.02        1           0.0071           3
range between 0.09° and 2.65° (Table 2). Although noise was added to the original images, the visual target distinctness did not change significantly: the target from image G still represents the most visible target, whereas the target from image I represents the least visible target. The same table illustrates that the subjective ranking still correlates well with the IOF error, but for the RMSE it does not. The RMSE correctly ranks the target from image H, but it induces a permutation of the rank order of targets from G and I. This permutation is significant because these images belong to different clusters. On the contrary, the IOF model induces a visual target distinctness rank ordering identical to the one resulting from human observer performance.
The conclusion that can be drawn from this second experiment is that, in the presence of noise, the IOF model induces a target distinctness ranking that agrees with human visual perception. Again, to illustrate the robustness of the IOF error against selection of the pooling exponent $\beta$, the experiment was repeated using $\beta$ values in the range [3.5, 5] (see Table 4 for further details). A sketch of the noise-corruption step follows.
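A minimal sketch of the corruption used to generate pairs G, H, and I (Gaussian noise of variance 0, 120, and 240); clipping to the 8-bit range and the use of a seeded generator are assumptions of this sketch.

```python
import numpy as np

def add_gaussian_noise(image, variance, seed=0):
    """Corrupt an image with zero-mean Gaussian noise of a given variance,
    as in the second experiment; 8-bit clipping is an assumption."""
    rng = np.random.default_rng(seed)
    noisy = image + rng.normal(0.0, np.sqrt(variance), size=image.shape)
    return np.clip(noisy, 0, 255)
```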
6. Conclusions

The main conclusion drawn in this pilot study is that the Integral Opponent-Colors Features model induces a visual target distinctness rank ordering that agrees with human visual perception for a set of digital color images of complex natural scenes. Hence, a model to predict human performance in target detection may incorporate the following basic characteristics: (i) scenes are coded into luminance/colors opponency channels; (ii) spatial information on each opponency channel is analyzed by multiple filters, each of which is sensitive to patterns whose spatial frequencies are within a restricted range; (iii) the model bases its responses on only those filters sensitive to structures in the target scene; (iv) the output of an individual filter can be represented as a single number by nonlinear pooling of the differences between the target scene's and non-target scene's integral features over fixation points that are likely to attract human attention; (v) the target distinctness can be based on the maximum of the outputs over activated filters on each luminance/colors opponency channel. Further research is planned to analyze whether a computational model that correlates with human visual
Table 3
Robustness of the IOF error against selection of the pooling exponent β in the first experiment (distinctness of targets and their immediate surrounds). For each β, the IOF error value and the induced rank are given

Image pair   Subjective ranking   β = 3.0          β = 3.5          β = 4.0          β = 4.5
                                  value / rank     value / rank     value / rank     value / rank
A            1                    0.023 / 1        0.019 / 1        0.017 / 1        0.0159 / 1
B            2                    0.014 / 3        0.012 / 3        0.011 / 3        0.0094 / 3
C            3                    0.014 / 2        0.012 / 2        0.011 / 2        0.0093 / 2
D            4                    0.012 / 4        0.011 / 4        0.010 / 4        0.0092 / 4
E            5                    0.011 / 5        0.009 / 5        0.008 / 5        0.0079 / 5
F            6                    0.008 / 6        0.006 / 6        0.006 / 6        0.0055 / 6
Table 4
Robustness of the IOF error against selection of the pooling exponent β in the second experiment (distinctness of targets in noisy environments). For each β, the IOF error value and the induced rank are given

Image pair   Subjective ranking   β = 3.5          β = 4.0          β = 4.5          β = 5.0
                                  value / rank     value / rank     value / rank     value / rank
G            1                    0.0149 / 1       0.0127 / 1       0.0113 / 1       0.0103 / 1
H            2                    0.0146 / 2       0.0119 / 2       0.0102 / 2       0.0090 / 2
I            3                    0.0118 / 3       0.0095 / 3       0.0081 / 3       0.0071 / 3
search performance should utilize filtered images [9,10], integral features [7], or visual patterns in the image representation. In this new study we will investigate the relation between the visual target distinctness in complex natural scenes measured by human observers and three computational visual distinctness metrics which are, respectively, based on (1) filtered representations, (2) integral-feature representations, or (3) visual patterns in the complex scene. The computational target distinctness model based on visual patterns will be an error metric applied to the images after transformation to a new perceptual domain in which the images are organized in accord with a constraint of invariance in integral features across frequency bands [30]. In this perceptual domain, a visual pattern in the scene is represented by the firing of a subset of activated filters. A constraint of invariance across frequency bands binds together, in a mutually coherent way, all the Gabor-like filters actively responding to different aspects of a perceived pattern. The resultant target distinctness model will have perceptual access to visual patterns in the scene but not to filtered images.
7. Summary

This paper presents a new computational model for quantifying the visual distinctness of a target in its surround. Visual distinctness metrics can be used to compare and rank target detectability, and to quantify background or scene complexity. The subjective ranking induced by the psychophysically determined visual target distinctness for 64 human observers is adopted as the reference order. The rankings produced by the proposed measure and several computational models are compared with the subjective rank order. The computational target distinctness model (it has been termed the 'Integral Opponent-Colors Feature' error,
or IOF error) extends our early attempts to produce a perceptual measure of image discriminability. To this aim, in addition to the analysis of the color attribute, the new model incorporates the main points analyzed in our previous works: (i) a preattentive stage incorporates the processing described in Refs. [4,9,10] in order to perform the decomposition of the color space and the fixation point detection; (ii) an integration stage is deployed to analyze and integrate the separable representations at fixation points, as suggested in Ref. [7]. The IOF error predicts the target distinctness from the difference between the target-and-background scene and the background-with-no-target scene (one image has the target and the other does not). The IOF model may be described by means of three different stages: (i) An early preattentive stage where the RGB values are transformed into coordinate values in an opponency space. The psychophysical basis is that somewhere between the eye and the brain, signals from the cone receptors in the eye get coded into light-dark, red-green, and yellow-blue signals [16]. Also in this first stage, each opponency channel is analyzed by a bank of filters created by log Gabor functions. Spatial information in the system is analyzed by multiple units, each of which is sensitive to patterns whose spatial frequencies are in a restricted range. But only the responses of the activated analyzers from the bank will be used to measure visual target distinctness. This can be implemented as a process of lateral inhibition among filters. For each activated filter, stimulus locations to which the focus of attention should be shifted, or 'fixation points', are computed as local energy peaks on the filtered response. (ii) An integration stage which is deployed to analyze and integrate the separable representations at fixation points. For each luminance/colors opponency
Fig. 5. The three opponent-colors channels for a target image and the respective activated filters from the log Gabor bank.
Fig. 6. The activated filters on the three opponency channels for the six target images in the first experiment.
channel, the magnitude of the target distinctness on each activated filter is computed by the Quick pooling of the differences between the target scene's and non-target scene's integral features over fixation points. An integral feature refers to the result of recombining the separable representations (response energy, phase, contrast, entropy and standard deviation) at fixation points. Since one image has the target and the other does not, the visual distinctness results solely from the target and will be measured by integration over fixation points. The Quick pooling can be viewed as a metric in a multidimensional
Fig. 7. (a-c) For a target image, the activated filters on the opponency channels, and the respective local energy maps and fixation points.
space, and it is sometimes known as the Minkowski metric. Ref. [25] shows that Minkowski metrics can be used as a combination rule for small impairments like those usually encountered in digitally coded images. In fact, Minkowski metrics have already been employed in many fields of human perception research [26,27].
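Since the pooling rule is central to the metric, a small sketch may help fix ideas. The Python fragment below applies Minkowski (Quick) pooling with exponent b to integral-feature differences over fixation points; the array layout and the function name are our own illustration under stated assumptions, not the authors' implementation.

```python
import numpy as np

def quick_pooling(target_feats, background_feats, b=4.0):
    """Quick (Minkowski) pooling over fixation points for one activated
    filter. Each row holds the integral features (response energy, phase,
    contrast, entropy, standard deviation) at one fixation point."""
    diffs = np.abs(np.asarray(target_feats) - np.asarray(background_feats))
    # Minkowski metric with exponent b, pooled over all fixation points
    # and feature dimensions.
    return float((diffs ** b).sum() ** (1.0 / b))
```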
(iii) A decision stage which produces the system output based on the maximum of the target distinctness over activated filters on each luminance/colors opponency channel. With the maximum-output rule, the filter outputs affect the decision variable (the maximum) only on those trials on which one of them happens to produce the maximum output. Hence,
this decision rule makes the IOF model not very susceptible to the effects of either extra noisy filters or extra signal-carrying filters. Ref. [28] shows that a system attempting to achieve the best possible performance might well use the maximum-output rule.
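Continuing the sketch above, the decision stage then reduces the pooled outputs to one number per opponency channel via the maximum-output rule; again the data layout is an assumption made for illustration.

```python
def iof_error(channels, b=4.0):
    """Decision stage (maximum-output rule). `channels` maps an opponency
    channel name to a list of (target_feats, background_feats) pairs, one
    pair per activated filter; quick_pooling() is the function sketched
    above."""
    return {name: max(quick_pooling(t, bg, b) for t, bg in filters)
            for name, filters in channels.items()}
```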
The main conclusion drawn in this pilot study is that the Integral Opponent-Colors Features model induces a visual target distinctness rank ordering that agrees with human visual perception for a set of digital color images of complex natural scenes. Hence, a model to predict human performance in target detection may incorporate
the following basic characteristics: (i) scenes are coded into luminance/colors opponency channels; (ii) spatial information on each opponency channel is analyzed by multiple filters, each of which is sensitive to patterns whose spatial frequencies are in a restricted range; (iii) the model bases its responses on only those filters sensitive to structures in the target scene; (iv) the output of an individual filter can be represented as a single number by nonlinear pooling of the differences between the target scene's and non-target scene's integral features over local energy peaks on the filtered response; (v) the target distinctness is based on the maximum of the outputs over activated filters on each luminance/colors opponency channel.

Acknowledgements

This paper was prepared while Xose R. Fdez-Vidal was on leave from the Department of Applied Physics at Santiago University, visiting the Department of Computer Science and A.I. at Granada University. The authors would like to thank Lex Toet for providing the images in Figs. 3 and 4 as well as the subjective rankings. Thanks are also due to an anonymous referee for many helpful comments. This research was sponsored by the Spanish Board for Science and Technology (CICYT) under grant TIC97-1150.

References

[1] A. Toet, Computing visual target distinctness, TNO-Report TM-97-A039, TNO Human Factors Research Institute, 1997, 74 pp.
[2] W.R. Uttal, On Seeing Forms, Lawrence Erlbaum Associates, Hillsdale, NJ, 1988.
[3] B.A. Wandell, Foundations of Vision, Sinauer Associates, Sunderland, MA, 1995.
[4] X.R. Fdez-Vidal, J.A. Garcia, J. Fdez-Valdivia, Using models of feature perception in distortion measure guidance, Pattern Recognition Lett. 19 (1998) 77–88.
[5] M.C. Morrone, D.C. Burr, Feature detection in human vision: a phase-dependent energy model, Proc. R. Soc. Lond. B 235 (1988) 221–245.
[6] A.M. Treisman, G. Gelade, A feature-integration theory of attention, Cognitive Psychol. 12 (1980) 97–136.
[7] X.R. Fdez-Vidal, J.A. Garcia, J. Fdez-Valdivia, R. Rodriguez-Sanchez, The role of integral features for perceiving image discriminability, Pattern Recognition Lett. 18 (1997) 733–740.
[8] J. Fdez-Valdivia, J.A. Garcia, J. Martinez-Baena, X.R. Fdez-Vidal, The selection of natural scales in 2D images using adaptive Gabor filtering, IEEE Trans. Pattern Anal. Mach. Intell. 20 (5) (1998) 458–469.
[9] J. Martinez-Baena, J. Fdez-Valdivia, J.A. Garcia, X.R. Fdez-Vidal, A new distortion measure based on a data-driven multisensor organization, Pattern Recognition 31 (8) (1998) 1099–1116.
[10] J. Martinez-Baena, A. Toet, X.R. Fdez-Vidal, A. Garrido, R. Rodriguez-Sanchez, Computational visual distinctness metric, Optical Eng. 37 (7) (1998) 1995–2005.
[11] A.J. Ahumada, A.M. Rohaly, A.B. Watson, Image discrimination models predict object detection in natural backgrounds, Suppl. Invest. Ophthalmol. Visual Sci. 36 (4) (1995) S439.
[12] T.J. Doll, S.W. McWhorther, D.E. Schmieder, Target and background characterization based on a simulation of human pattern perception, in: Proceedings SPIE Conference on Characterization, Propagation, and Simulation of Sources and Backgrounds III, Vol. 1967, 1993, pp. 432–454.
[13] G. Gerhart, T. Meitzler, E. Sohn, G. Witus, G. Lindquist, J.R. Freeling, Early vision model for target detection, in: Proceedings SPIE Conference on Infrared Imaging Systems: Design, Analysis, Modeling, and Testing VI, Vol. 2470, 1995, pp. 12–23.
[14] R. Hecker, Camaeleon - Camouflage assessment by evaluation of local energy, spatial frequency and orientation, in: Proceedings SPIE Conference on Characterization, Propagation, and Simulation of Sources and Backgrounds II, Vol. 1687, 1992, pp. 342–349.
[15] G. Witus, M. Cohen, T. Cook, M. Elliot, J.R. Freeling, P. Gottschalk, G. Lindquist, TARDEC Visual Model Version 2.1.1 Analyst's Manual, Report OMI-552, OptiMetrics Inc., Ann Arbor, MI, 1995.
[16] E. Hering, in: L.M. Hurvich, D. Jameson (Translators), Outlines of a Theory of the Light Sense, Harvard University Press, Cambridge, MA, 1964.
[17] G. Sharma, H.J. Trussell, Digital color imaging, IEEE Trans. Image Processing 6 (7) (1997) 901–932.
[18] A. Toet, P. Bijl, F.L. Kooi, J.M. Valeton, Image data set for testing search and detection models, TNO-Report TM-97-A036, TNO Human Factors Research Institute, 1997, 35 pp.
[19] G. Wyszecki, W.S. Stiles, Color Science: Concepts and Methods, Quantitative Data and Formulae, 2nd Edition, Wiley, New York, 1982.
[20] D.J. Field, Relations between the statistics of natural images and the response properties of cortical cells, J. Optical Soc. Am. A 4 (12) (1987) 2379–2394.
[21] P. Kovesi, Image features from phase congruency, Technical Report 95/4, Department of Computer Science, The University of Western Australia, 1995.
[22] M.C. Morrone, R.A. Owens, Feature detection from local energy, Pattern Recognition Lett. 6 (1987) 303–313.
[23] C.R. Rao, Linear Statistical Inference and Its Applications, Wiley, New York, 1973.
[24] D. Gabor, Theory of communication, J. Inst. Electr. Eng. 93 (1946) 429–457.
[25] H. Ridder, Minkowski-metrics as a combination rule for digital image coding impairments, in: SPIE Vol. 1666, Human Vision, Visual Processing, and Digital Display III, 1992, pp. 16–26.
[26] P.E. Green, F.J. Carmone, S.M. Smith, Multidimensional Scaling: Concepts and Applications, Allyn and Bacon, Boston, 1989.
[27] L.A. Olzak, J.P. Thomas, Seeing spatial patterns, in: Handbook of Perception and Human Performance, Vol. 1, Wiley, New York, 1986.
[28] N. Graham, Visual Pattern Analyzers, Oxford Psychology Series, No. 16, Oxford University Press, Oxford, 1989.
[29] A. Toet, F.L. Kooi, P. Bijl, J.M. Valeton, Visual conspicuity determines human target acquisition performance, Optical Eng. 37 (7) (1998) 1969–1975.
[30] R. Rodriguez-Sanchez, J.A. Garcia, J. Fdez-Valdivia, X.R. Fdez-Vidal, The RGFF Representational Model: a scheme that learns functions for extracting visible patterns, DECSAI Technical Report 98-03-10, Department of Computer Science and Artificial Intelligence, University of Granada, Spain, 1998.
About the Author: XOSE R. FDEZ-VIDAL was born in Lugo, Spain. He received the M.S. and Ph.D. degrees, both in Physics, from the University of Santiago de Compostela in 1991 and 1996, respectively. Since 1992 he has been with the Applied Physics Department at Santiago de Compostela University, where he is now an Assistant Professor. His current interests include image processing, multiresolution methods, wavelets, and distortion measures. Dr. Fdez-Vidal is a member of the IAPR Association.

About the Author: ROSA RODRIGUEZ-SANCHEZ was born in Granada, Spain. She received the M.S. and Ph.D. degrees, both in Computer Science, from the University of Granada in 1996 and 1999, respectively. Currently she is with the Informatics Department at the University of Jaen, where she is now an Assistant Professor. Her current interests include image processing and visual perception.

About the Author: J.A. GARCIA was born in Almeria, Spain. He received the M.S. and Ph.D. degrees, both in Mathematics, from the University of Granada in 1987 and 1992, respectively. Since 1988 he has been with the Computer Science Department (DECSAI) at Granada University, where he is now a Lecturer. His current interests include computer vision and visual perception. Dr. Garcia is a member of the IAPR Association and the IEEE Computer Society.

About the Author: J. FDEZ-VALDIVIA was born in Granada, Spain. He received the M.S. and Ph.D. degrees, both in Mathematics, from the University of Granada in 1986 and 1991, respectively. Since 1988 he has been with the Computer Science Department (DECSAI) at Granada University, where he is now a Lecturer. His current interests include image processing, visual perception and biomedical applications. Dr. Fdez-Valdivia is a member of the IAPR Association, the IEEE Computer Society, and the ACM Computer Society.
Pattern Recognition 33 (2000) 1199–1217
Learning-based constitutive parameters estimation in an image sensing system with multiple mirrors

W.S. Kim, H.S. Cho*

FA Research Institute, Production Engineering Center, Samsung Electronics Co. Ltd., 416, Mae-Tan-3Dong, Paldal-Gu, Suwon City, KyungKi-Do 441-742, South Korea
Department of Mechanical Engineering, Korea Advanced Institute of Science and Technology, 373-1, Kusung-dong, Yusong-gu, Taejon 305-701, South Korea

Received 5 February 1999; accepted 23 April 1999
Abstract

A sensing system sometimes requires a complicated optical unit consisting of multiple mirrors, in which case it is important to estimate the constitutive parameters of the optical unit accurately in order to enhance its sensing capability. However, the parameters generally include uncertainties, since the optical unit cannot avoid the fixing and aligning errors and the manufacturing tolerances of its components. Accordingly, one should construct an accurate projective model of the complicated sensing system and build up an estimation method for the entangled parameters. However, it is not easy to estimate complicated constitutive parameters from an accurate model of an optical unit with multiple mirrors; moreover, they are sometimes changed during operation by unexpected disturbances or intermittent adjustments such as computer-controlled zoom, auto focus, and mirror relocation. Under these operational circumstances, it is not easy to take apart the components of the assembled system and measure them directly. Therefore, an indirect and adaptive estimation method, taking all the components into simultaneous consideration without disassembling the sensing system, is needed for calibrating the uncertain and changeable constitutive parameters. In this paper, we propose not only a generalized projective model for an optical sensing system consisting of n mirrors and a camera with a collecting lens, but also a learning-based process using the model to recursively estimate the uncertain constitutive parameters of the optical sensing system. We also show its feasibility through a series of calibrations of an optical system. 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Image sensing system; Multiple mirror; Learning algorithm; Estimation; Uncertainty
1. Introduction

An optical unit with multiple mirrors is necessary in order to swing the ray of a light, which is usually a laser [1,2], to guide the light path [3], to eliminate an occlusion effect [4], and to obtain a stereo image by using a single camera [5,6]. It is often incorporated into a position sensor or a camera with a collecting lens looking off-axis
* Corresponding author. Tel.: +42-869-3213; fax: +42-869-3210.
E-mail addresses: [email protected] (W.S. Kim), [email protected] (H.S. Cho).
for a light spot. The measurement precision of such an optical system, consisting of multiple mirrors and a camera with a collecting lens, depends on accurate estimation of the positions and orientations of the mirrors fixed at the designed locations and a correct calibration of the camera. However, it is not easy to accurately estimate mirror positions and orientations when an optical unit has already been assembled, because the dimensional uncertainties caused by manufacturing tolerances and the distortions and aberrations of a lens cannot be avoided. This results in uncertainties in the constitutive parameters associated with the position and orientation of a mirror, and the intrinsic and extrinsic parameters of a camera. The uncertainties in turn incur measurement errors when an object is
measured by using such an optical sensing system. Hence, it is necessary to accurately calibrate the constitutive parameters of the sensing system. Generally, the constitutive parameters can be measured directly, one by one, by using precise instruments. They are, however, subject to change during operation by unexpected disturbance or intermittent adjustments such as computer-controlled zoom, auto-focus, and mirror relocation. This makes it difficult to accurately measure them by using precise instruments. To this end, an indirect and adaptive estimation method, taking all the components into simultaneous consideration without disassembling a sensing system, is needed for measuring the uncertain and changeable constitutive parameters.

To date, the estimation problems of sensing systems have been extensively studied [7-9]. They have mainly considered the calibration problem of a camera. Camera calibration computes the transformation from a camera space to any given world space. The calibration methods for a camera are classified into six categories [9,10], including techniques involving full-scale nonlinear optimization; techniques involving computation of the perspective transformation matrix and the use of linear equation solving; the two-plane method; the two-stage method; and adaptive self-calibration. In addition, a few works have addressed calibrating the constitutive parameters of an optical unit with mirrors [11,12]. Zhuang [11] constructed a model of a mirror center offset and took its sensitivity into account. Tamura [12] proposed a method for correcting errors of two-axis mirror scanner parameters by using a coarse-fine parameter search. Since these works have mainly dealt with a sensing system with a single mirror or two mirrors, it is necessary to generalize the method to a sensing system with multiple mirrors.

In this paper, we propose not only a generalized projective model for an optical sensing system consisting of n mirrors and a camera with a collecting lens, but also a learning-based process using the model to recursively estimate the uncertain constitutive parameters of the optical sensing system. This paper is organized as follows: Section 2 describes the generalized projective model; Section 3 derives a learning-based recursive algorithm for correctly estimating the uncertain constitutive parameters of the optical sensing system; finally, Section 4 shows the feasibility of the algorithm through the calibration results of the optical sensing system consisting of four mirrors, namely a pair of plane mirrors and a pair of conic mirrors.
2. The image sensing system model

In order to accurately execute measuring tasks by using an image sensing system consisting of a camera and an optical unit attached to the front of the camera, it is necessary to correctly estimate the system constitutive parameters, as outlined in the Introduction. Parametric estimation, as one of the methods for attaining this objective, is generally used [13]. This method requires a model for representing an image sensing system. In this section, therefore, we derive a general image transformation model for the image sensing system with an optical unit consisting of n mirrors.

Fig. 1 illustrates the geometry of the optical system consisting of n mirrors and a camera with a collecting lens. Let us set a world frame $\{W\}$ at the object and a sensor frame $\{S\}$ centered at the optical center $O_S$. The frame $\{C\}$, parallel to the frame $\{S\}$, is an image coordinate system centered at the point $O_C$, which is the intersection point of the image plane and the z-axis through the optical center. The mirror frames are sequentially denoted as $\{1\}, \{2\}, \ldots, \{n\}$. The effective focal length $b$ is defined as the distance from the optical center to the image plane. In addition, patterns $\{M\}$ and $\{P\}$ are used to estimate the constitutive parameters of the optical unit and the camera.

Let us assume that $X_i$ and $S_i$ are a point vector and a direction cosine defined relative to the mirror frame $\{i\}$, respectively. Let us also assume that the ray of light starting from a point $X_M$ with a direction cosine $S_M$ on an object M is projected onto a point $X^u_{MC}$ in the image plane of the camera with the incident direction cosine $S_{Mn}$ through an optical unit consisting of n mirrors. However, a lens generally has distortions such as radial and tangential ones [14]. Accordingly, the ray of light starting from the point $X_M$ on the object M with a direction cosine $S_M$ is actually projected onto a point $X^d_{MC}$ in the front image plane, which differs from the ideally projected point $X^u_{MC}$ in the same plane. This relation can be modeled as

$$X^u_{MC} = X^d_{MC} + D_M, \qquad (1)$$

where the superscript $u$ denotes the ideally projected point without considering lens distortions, the subscript $C$ refers to the image frame $\{C\}$, the superscript $d$ denotes the actually projected point under consideration of lens distortions, $X^d_{MC} = (X^d_{mc}, Y^d_{mc}, b, 1)^t$ is the distorted point under lens distortion, $b$ denotes the effective focal length, and $D_M$ is the radial lens distortion. Only the radial lens distortion is considered, since the tangential distortion is found to be negligible [9]. The radial lens distortion $D_M$ is calculated by [9]

$$D_M = (X^d_{mc} k_1 r^2,\; Y^d_{mc} k_1 r^2,\; 0,\; 0)^t, \qquad (2)$$

where $r = \sqrt{(X^d_{mc})^2 + (Y^d_{mc})^2}$ is the radial distance from the image center to the distorted point $X^d_{MC}$ and $k_1$ is the distortion factor of the lens. In addition, the actual projected point $X^d_{MC}$ is generally detected on a digitized image
Fig. 1. A reflection model of an optical system having n multiple mirrors and an optical sensor.
plane. Accordingly, the point $X^d_{MC}$ can be converted into pixels as [9]

$$X^f_{MC} = (S_x\, d_x'^{-1} X^d_{mc} + C_x,\; d_y^{-1} Y^d_{mc} + C_y)^t, \qquad (3)$$

$$d_x' = d_x\, \frac{N_{cx}}{N_{fx}}, \qquad (4)$$

where $X^f_{MC} = (X^f_{mc}, Y^f_{mc})^t$ is the pixel point on the image plane corresponding to the point $X^d_{MC}$, the superscript $f$ denotes the frame grabber, $(C_x, C_y)$ is the central pixel of the digitized image plane, $d_x$ and $d_y$ are the mean distances between adjacent sensing elements on the CCD cell in the X and Y directions, respectively, $N_{cx}$ and $N_{fx}$ are the number of sensible elements and the number of pixels in a scanned line, and $S_x$ is the horizontal image scale factor [9].
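A short numerical sketch of Eqs. (1)-(4) may be useful. The distortion factor below is the value estimated later in Table 3 of the experiments; the element pitches d_x', d_y are placeholders, since the actual CCD constants are not quoted in the text.

```python
K1 = -0.0031          # radial distortion factor k1 (estimated value, Table 3)
SX = 1.0              # horizontal scale factor S_x
CX, CY = 256, 240     # image center (pixels)
DXP, DY = 0.01, 0.01  # assumed element pitches d_x', d_y (mm/pixel)

def undistort(xd, yd, k1=K1):
    """Eqs. (1)-(2): ideal projection X^u = X^d + D_M with radial D_M."""
    r2 = xd ** 2 + yd ** 2
    return xd + xd * k1 * r2, yd + yd * k1 * r2

def to_pixels(xd, yd):
    """Eqs. (3)-(4): distorted image-plane point (mm) to pixel coordinates."""
    return SX * xd / DXP + CX, yd / DY + CY
```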
Hence, the projective relation between the starting point $X_M$ of the ray of light and its actually projected point $X^d_{MC}$ can be formulated by the consecutive reflective relation [4]. The forward image transformation from an object to the camera can then be represented as follows [15,17]:

$$^S(S_{MC}) = {}^S({}^C F_n)\, {}^S({}^n F_{n-1}) \cdots {}^S({}^1 F_M)\, {}^S(S_M), \qquad (5)$$

$$^S(X^d_{MC}) = {}^S({}^C T_M) \begin{bmatrix} {}^S_W R & {}^S_W t \\ 0 & 1 \end{bmatrix} {}^W X_M - {}^S(D_M), \qquad (6)$$

where the left superscript $S$ represents the description of a vector relative to the sensor frame $\{S\}$, the subscript $M, 0$ is used to denote the position and the departure direction cosine of the ray of light when the ray starts
from the object M, and the subscript $M, C$ denotes the position at which the ray impinges on the image plane. The vectors appearing above are defined as

$$^S(S_M) = ({}^S s_{mx},\, {}^S s_{my},\, {}^S s_{mz})^t, \qquad {}^S(S_{MC}) = ({}^S s_{mxc},\, {}^S s_{myc},\, {}^S s_{mzc})^t,$$
$$^W(X_M) = ({}^W x_m,\, {}^W y_m,\, {}^W z_m,\, 1)^t, \qquad {}^S(X^d_{MC}) = ({}^S X^d_{mc},\, {}^S Y^d_{mc},\, b,\, 1)^t,$$
$$^S(D_M) = ({}^S X^d_{mc} k_1 r^2,\, {}^S Y^d_{mc} k_1 r^2,\, 0,\, 0)^t,$$

where $r = \sqrt{({}^S X^d_{mc})^2 + ({}^S Y^d_{mc})^2}$, ${}^S({}^C T_M) = {}^S({}^C T_n)\,{}^S({}^n T_{n-1}) \cdots {}^S({}^1 T_M)$, and the superscript $t$ denotes the transpose of a vector. In the above transformation, the rotation matrix ${}^S_W R$ and the translation vector ${}^S_W t$ are defined as

$${}^S_W R = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix}, \qquad {}^S_W t = (t_x,\, t_y,\, t_z)^t,$$

and ${}^{i+1}T_i$ and ${}^{i+1}F_i$ are the homogeneous matrix [16] and the reflection matrix [17], respectively. Representing these matrices with their components yields

$${}^{i+1}F_i = \begin{bmatrix} 1 - 2(n^{i+1}_x)^2 & -2 n^{i+1}_x n^{i+1}_y & -2 n^{i+1}_x n^{i+1}_z \\ -2 n^{i+1}_x n^{i+1}_y & 1 - 2(n^{i+1}_y)^2 & -2 n^{i+1}_y n^{i+1}_z \\ -2 n^{i+1}_x n^{i+1}_z & -2 n^{i+1}_y n^{i+1}_z & 1 - 2(n^{i+1}_z)^2 \end{bmatrix}, \qquad (7)$$

$${}^{i+1}T_i = \begin{bmatrix}
1 - \dfrac{n^{i+1}_x}{o_{i+1}} s^i_x & -\dfrac{n^{i+1}_y}{o_{i+1}} s^i_x & -\dfrac{n^{i+1}_z}{o_{i+1}} s^i_x & \dfrac{m_{i+1}}{o_{i+1}} s^i_x \\
-\dfrac{n^{i+1}_x}{o_{i+1}} s^i_y & 1 - \dfrac{n^{i+1}_y}{o_{i+1}} s^i_y & -\dfrac{n^{i+1}_z}{o_{i+1}} s^i_y & \dfrac{m_{i+1}}{o_{i+1}} s^i_y \\
-\dfrac{n^{i+1}_x}{o_{i+1}} s^i_z & -\dfrac{n^{i+1}_y}{o_{i+1}} s^i_z & 1 - \dfrac{n^{i+1}_z}{o_{i+1}} s^i_z & \dfrac{m_{i+1}}{o_{i+1}} s^i_z \\
0 & 0 & 0 & 1
\end{bmatrix}, \qquad (8)$$

where $N_{i+1} = (n^{i+1}_x,\, n^{i+1}_y,\, n^{i+1}_z)^t$ is the normal vector of the $(i+1)$th mirror, $o_{i+1} = S_i \cdot N_{i+1}$ is the normal component of the incident direction cosine relative to the mirror $\{i+1\}$, $m_{i+1} = P^{i+1}_{iORG} \cdot N_{i+1}$ is the normal distance between the mirror $\{i\}$ and the mirror $\{i+1\}$, and $P^{i+1}_{iORG} = (p^{i+1}_{xi},\, p^{i+1}_{yi},\, p^{i+1}_{zi})^t$ is the position of the mirror $\{i+1\}$ relative to the mirror $\{i\}$. The normal vector $N_i$ and the position vector ${}^{i+1}P_{iORG}$ are also described with respect to the sensor frame $\{S\}$ as follows:

$${}^S N_i = \frac{{}^S R_{zyx}(\phi_i, \theta_i, \psi_i)\, n_i}{\|{}^S R_{zyx}(\phi_i, \theta_i, \psi_i)\, n_i\|}, \qquad (9)$$

$${}^S(P^{i+1}_{iORG}) = TRANS(t^{i+1}_{xi},\, t^{i+1}_{yi},\, t^{i+1}_{zi})\, P^{i+1}_{iORG}, \qquad (10)$$

where ORG denotes the origin of a frame. In the above equations, ${}^S(P^{i+1}_{iORG})$ is the position vector of the mirror $\{i+1\}$ described relative to the sensor frame $\{S\}$, $n_i$ is the local normal vector of the $i$th mirror relative to its frame $\{i\}$, and ${}^S t_i = TRANS({}^S t_{ix},\, {}^S t_{iy},\, {}^S t_{iz})$ and ${}^S R_{zyx}(\phi_i, \theta_i, \psi_i)$ are the translation vector of the mirror $\{i\}$ and the rotation matrix about the ZYX-Euler angles $\phi_i$, $\theta_i$, $\psi_i$ [16] relative to the sensor frame $\{S\}$, respectively.
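Eq. (7) is a Householder-type reflection, which is easy to verify numerically. Below is a minimal sketch (our own illustration, using the designed normal of a 45-degree plane mirror of the kind listed later in Table 1):

```python
import numpy as np

def reflection_matrix(n):
    """Eq. (7): reflection matrix F = I - 2 n n^T for a unit mirror normal n."""
    n = np.asarray(n, dtype=float)
    n = n / np.linalg.norm(n)            # guard: normalize the normal
    return np.eye(3) - 2.0 * np.outer(n, n)

def reflect_ray(s, n):
    """Reflect a direction cosine s off a mirror with normal n, as chained
    in the forward transformation of Eq. (5)."""
    return reflection_matrix(n) @ np.asarray(s, dtype=float)

# A ray travelling along -z hits a 45-degree plane mirror:
s_out = reflect_ray([0.0, 0.0, -1.0], [0.0, -0.707, 0.707])
# s_out is approximately [0, -1, 0]; the ray is folded by 90 degrees.
```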
The initial and final direction cosines are given by

$${}^S(S_M) = \frac{O_s - {}^S(X_M)}{\|O_s - {}^S(X_M)\|}, \qquad (11)$$

$${}^S(S_{Mn}) = \frac{{}^S(X^u_{MC}) - O_s}{\|{}^S(X^u_{MC}) - O_s\|}, \qquad (12)$$

where $O_s$ is the virtual optical center that represents the optical center $O_S$ obtained equivalently when the ray paths reflected from the mirrors are stretched relative to the object M.

In order to reversely estimate the three-dimensional coordinates in the object space corresponding to a detected point on the image plane, the inverse transformation is required. The inverse transformation is given by

$${}^W X_M = \begin{bmatrix} {}^W_S R & {}^W_S t \\ 0 & 1 \end{bmatrix} {}^S({}^M T_C)\, {}^S(X^u_{MC}), \qquad (13)$$

where

$${}^S(X^u_{MC}) = \begin{bmatrix} (X^f_{mc} - C_x)\dfrac{d_x'}{S_x} + (X^f_{mc} - C_x)\dfrac{d_x'}{S_x} k_1 r^2 \\ (Y^f_{mc} - C_y)\, d_y + (Y^f_{mc} - C_y)\, d_y k_1 r^2 \\ b \\ 1 \end{bmatrix}.$$

Having obtained the projective model, the next task is to determine the constitutive parameters contained in the model. They include the mirror position ${}^S(P^{i+1}_{iORG})$ and the mirror orientation $(\phi_i, \theta_i, \psi_i)$ or the mirror normal ${}^S N_i$; the intrinsic parameters, namely the distortion factor $k_1$, the scale factor $S_x$ and the effective focal length $b$; and the extrinsic parameters, namely the rotation matrix ${}^S_W R$ and the translation vector ${}^S_W t$. Most of these parameters are uncertain, owing not only to alignment errors and manufacturing tolerances of the optical system components, but also to their unexpected or intermittent changes, as described in the Introduction. Therefore, they should be accurately calibrated in order to minimize the measurement error caused by these uncertainties.
3. Learning algorithm for the parameter estimation

3.1. Delta learning algorithm

When the ray of light reflected from an object is projected onto the camera through an optical unit consisting of multiple mirrors, the location of the ray detected at the camera does not coincide with the one calculated from the sensing-system model, owing to the uncertainties of the constitutive parameters. Therefore, it is necessary to devise a parameter modification procedure for estimating the correct constitutive parameters. However, it is not easy to solve the parameter modification problem in this case, because the system is inherently nonlinear and its image transformation is complex. In this study, the solution of this nonlinear and complex problem is attempted by using a technique that minimizes a criterion function, or energy function, based on the gradient or steepest-descent procedure [18]. For this purpose we use the delta learning rule [19], which is one of the supervised learning rules.

The basic procedure of the learning technique is quite simple. Starting from an estimated parameter vector $\hat{\Theta}$ initially corresponding to the designed parameter vector $\Theta_0$, the gradient $\nabla E(\hat{\Theta})$ of the current error function $E$ is computed. The next value of $\hat{\Theta}$ is obtained by moving in the direction of the negative gradient along the multidimensional error surface. The algorithm is summarized as

$$\hat{\Theta}(k+1) = \hat{\Theta}(k) - \eta\,\nabla E(\hat{\Theta}(k)) + \alpha\,\Delta\hat{\Theta}(k-1), \qquad (14)$$

where $\eta$, $\alpha$, $k$, and $\nabla E(k)$ are a positive learning constant, a user-selected positive momentum constant to accelerate convergence of the learning algorithm [20], the learning step, and the gradient of the current error criterion function in the $k$th training step, respectively.

Let us assume that the estimated parameter vector $\hat{\Theta}(k)$ of an optical system consists of the $p$ constitutive parameters at the learning step $k$. Then it is defined as

$$\hat{\Theta}(k) = (\hat{\theta}_1,\, \hat{\theta}_2,\, \ldots,\, \hat{\theta}_p)^t_k, \qquad (15)$$
where the superscript $t$ and the subscript $k$ denote the transpose of a vector and the learning step, respectively.

When the ray of light starting from a feature point $X_M(k)$ on a calibration pattern is projected onto a position $X_d(k)$ in the image plane of the camera through the optical system, the constitutive parameter uncertainties of the optical system directly influence the calculation of the position $X_d(k)$. Accordingly, the true $X_d(k)$ differs from the point $\hat{X}_d(k)$ calculated from the projective model described in Section 2. This difference needs to be minimized by an adaptive modification procedure whose mathematical model is given by

$$\hat{X}_d(k) = f(\hat{\Theta}_i(k),\, X_i(k)), \qquad (16)$$

where $\hat{\Theta}_i(k)$ is the estimated parameter of the $i$th mirror and $X_i(k)$ denotes a point on an object in the learning step $k$.

Let us define the error $E(k)$ in the $k$th training step as the squared difference between the desired position $X_d(k)$ and the calculated position $\hat{X}_d(k)$. The error function to be minimized is then represented by

$$E(k) = \tfrac{1}{2}\, e^t(k)\, e(k), \qquad (17)$$

where $e(k) = X_d(k) - \hat{X}_d(k)$, $X_d(k) = (x_d, y_d, z_d, 1)^t_k$, and $\hat{X}_d(k) = (\hat{x}_d, \hat{y}_d, \hat{z}_d, 1)^t_k$. The gradient of the error function required in the learning algorithm (14) is therefore given as

$$\nabla E(k) = -\big(X_d(k) - f(\hat{\Theta}(k), X_i(k))\big)\,\nabla f(\hat{\Theta}(k), X_i(k)), \qquad (18)$$

where

$$\nabla f(\hat{\Theta}(k), X_i(k)) = \left[ \frac{\partial f}{\partial \hat{\theta}_1},\, \frac{\partial f}{\partial \hat{\theta}_2},\, \ldots,\, \frac{\partial f}{\partial \hat{\theta}_p} \right]_k$$

and its dimension is $4 \times p$. Finally, the complete delta learning rule results in

$$\hat{\Theta}(k+1) = \hat{\Theta}(k) + \eta\,\big(X_d(k) - \hat{X}_d(k)\big) \left[ \frac{\partial f}{\partial \hat{\theta}_1},\, \frac{\partial f}{\partial \hat{\theta}_2},\, \ldots,\, \frac{\partial f}{\partial \hat{\theta}_p} \right]_k + \alpha\,\Delta\hat{\Theta}(k-1), \qquad (19)$$

where $\eta = \mathrm{diag}(\eta_1, \eta_2, \ldots, \eta_p)_k$, in which diag denotes the diagonal matrix, and the dimensions are as follows: $\hat{\Theta}$: $p \times 1$; $X_d - \hat{X}_d$: $1 \times 4$; $[\partial f/\partial \hat{\theta}_i]$: $1 \times 4$; $\eta$: $p \times p$; and $\alpha$: a scalar.

In the above learning algorithm, the choice of the learning constants $\eta$ depends strongly on the class of the learning problem and the model function. Generally speaking, for a large value the learning speed can be drastically increased, but the learning may not be exact, with a tendency to overshoot, or it may never stabilize at any minimum. Therefore, the learning constants to be determined need to take into account the ray position sensitivity $\partial f/\partial \hat{\theta}_i$, and are defined as

$$\eta_i(k) = K_i\, \frac{\hat{\theta}_i(k)}{\mathrm{mean}_k(\partial f/\partial \hat{\theta}_i) + \mathrm{std}_k(\partial f/\partial \hat{\theta}_i)}, \qquad (20)$$

where $K_i$ is a proportional constant and mean and std represent the mean value and the standard deviation, respectively. In addition, the user-selected positive momentum constant $\alpha$ is typically chosen between 0.1 and 0.8 [20].
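For concreteness, the rule (14)-(19) can be prototyped in a few lines. The sketch below is our own illustration with a generic projective model f; it replaces the analytic derivatives of Appendices A-C by a forward-difference Jacobian, and the constants eta and alpha are simple scalars rather than the adaptive values of Eq. (20).

```python
import numpy as np

def delta_learning_step(theta, x_obj, x_detected, model_f,
                        eta=0.01, alpha=0.5, prev_delta=None):
    """One delta-rule step with momentum (Eqs. (14)-(19)).
    theta: parameter estimate (p,); x_obj: model input; x_detected:
    position detected at the camera (4,); model_f(theta, x_obj) -> (4,)."""
    if prev_delta is None:
        prev_delta = np.zeros_like(theta)
    f0 = model_f(theta, x_obj)
    e = x_detected - f0                           # e(k) of Eq. (17)

    # Forward-difference Jacobian df/dtheta (4 x p); the paper instead
    # uses the analytic derivatives given in the appendices.
    p, h = theta.size, 1e-6
    jac = np.empty((4, p))
    for j in range(p):
        t = theta.copy()
        t[j] += h
        jac[:, j] = (model_f(t, x_obj) - f0) / h

    delta = eta * (e @ jac) + alpha * prev_delta  # update terms of Eq. (19)
    return theta + delta, delta
```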
the "rst is to estimate the camera uncertain parameters such as the intrinsic and the extrinsic parameters and the second is to estimate the constitutive parameters of an optical unit such as the mirror positions and orientations. 3.2.1. First stage In this stage, the intrinsic and the extrinsic parameters of a camera to be solved are S , k , b,1 R(t, , h), and V 5 1 t"(t , t , t ). The estimation procedure here is divided 5 V W X into two substeps as proposed in Tsais method [9]: the "rst is for # "(t , t , t , t, , h) and the second is for V W X #) "(t , b, k ). This procedure is usually executed X through the consideration of relationship between object points in a calibration pattern and their projected points on an image plane. As shown in Fig. 1, the pattern P is located in the "eld of view. Then, the ray of a light started from a point X with a direction cosine S in a pattern P is directly N . projected onto a point XS on the image plane of a cam.! era without being projected via the optical unit with multiple mirrors. In this case, the forward image transformation can be derived similarly as that of the case of the pattern M, as described in Section 2. The transformation is written by 1(S )"1(!F )1(S ), .! . . 1R 1t 5 5X !1(D ), 1(XB )"1(!T ) 5 .! N 0 . . 1
(21)
(22)
where 1(XB )"(QXB , Q> , b, 1), the points 5X on .! NA NA . the pattern P relative to the world frame +=, is de"nes as 5X "(Ux ,Uy ,Uz ,1), and 1(D )"(QXB k r, . N N N . NA Q>B k r, 0, 0). The perspective transformation !T beNA N tween the image plane and the pattern P is de"ned as
0
(23)
As described in Section 2, the inverse transformation of Eq. (22) is given by
5R 5t 1 1(NT )1(XS ), 5X " 1 ! .! N 0 1 where
(24)
1 (XD !C ) dV #(XD !C ) dV k r .! V S .! V S V V D D 1(XS )" (>.!!CW)dW#(>.!!CW)dWkr . .! b 1
1(X) B )"f (1(X) B ), 5X , #) (k)), .A .! .
(25)
where X) B "X) B , Y) B , b, 1), 5X "(5x , 5y , .! .! .! . . . 5z ) and #) (k)"(tK , tK , SK , tK , K , hK ) , and f (1(X) B ), 5X) , . V W V I .! . #) (k))"BK CK . BK and CK are given by B) "(>K B Ux >K B Uy >K B Uz >K B !XK B Ux !XK B Uy NA N NA N NA N NA NA N NA N !XK B Uz ), NA N C) "(tK \SK r( tK \SK r( tK \SK r( tK \SK tK tK \r( tK \r( W V W V W V W V V W W tK \r( ). W In the above equation, the projected point X) B "(XK B , >K B , b, 1) can be rewritten as .! NA NA XK B "d SK \(XD !C ) and >B "d (>D !C ) where NA V V NA V NA W NA W (C , C ) is the image center which can be estimated [7]. V W In addition, Eq. (25) has a unique solution for the number of object points much larger than "ve [9]. In order to estimate the unknowns parameters # "(t , t , S , t, , V W V h), the basic concept of the learning algorithm (14) should be modi"ed with respect to the above equation, in which case the modi"ed algorithm can be given as
#) (k#1)"#) (k)#g(XB (k)!X) B (k)) .A .A
1
Step 1. The mapping relation (22) can be simply rewritten according to the radial constraint(RAC) [9] between the points 5X and their projected points 1(XB ) on the . .! image plane.
*f *f , , *tK *tK V W
*f *f *f *f , , , #a*#) (k!1), *SK *tK * K *hK V I
Qx Qx 1 0 ! N b N Qz Qz N N Qy Qy !T " 0 1 ! N b N . . Qz Qz N N 0 0 1 b 0 0
The unknown parameters # "(t , t , t , t, , h) and V W X # "(t , b, k ) are then estimated by relations (22) and X (24).
(26)
where g"diag(g , g ,2, g ), and the derivative of the estimated position 1(XK B )"f(1(X) B ), 5X , #) (k)) with reNA .! . spect to the parameters can be easily derived (see Appendix A). This learning procedure is the forward identi"cation and its concept is shown in Fig. 2(a). Step 2. The unknown parameters to be determined here are # "(t , b, k ). When a point 5X on an object X . is captured by using the optical system, the projected point 1(XB ) on the image plane is detected easily. If there .! are no uncertainties in the parameters # "(t , b, k ), the X computed point 5X) will correspond to the object point . 5X) . Normally, they do not coincide, which is a cue . identifying the parameters. Provided that the projected points are reversely projected by using the inverse transformation model, the points 5X) in an object can be . obtained. The point 5X) is computed as . 5X) "f (5R, 5t) ,HK (k), 1(XS (k))), . 1 1 .!
(27)
where

$$f\big({}^W_S R,\, {}^W_S\hat{t},\, \hat{\Theta}_2(k),\, {}^S(X^u_{PC}(k))\big) = \begin{bmatrix} {}^W_S R & {}^W_S\hat{t} \\ 0 & 1 \end{bmatrix} {}^S({}^P T_C) \begin{bmatrix} (X^f_{pc} - C_x)\dfrac{d_x'}{S_x} + (X^f_{pc} - C_x)\dfrac{d_x'}{S_x}\hat{k}_1 r^2 \\ (Y^f_{pc} - C_y)\, d_y + (Y^f_{pc} - C_y)\, d_y\hat{k}_1 r^2 \\ \hat{b} \\ 1 \end{bmatrix}. \qquad (28)$$

In the above, the rotation matrix ${}^W_S R$ and the components $t_x$, $t_y$ of the translation vector have already been computed by the procedure of Step 1. Consequently, the delta learning rule (14) can be modified as follows:

$$\hat{\Theta}_2(k+1) = \hat{\Theta}_2(k) + \eta\,\big({}^W X_P(k) - {}^W\hat{X}_P(k)\big)\left[ \frac{\partial f}{\partial \hat{t}_z},\, \frac{\partial f}{\partial \hat{b}},\, \frac{\partial f}{\partial \hat{k}_1} \right]_k + \alpha\,\Delta\hat{\Theta}_2(k-1), \qquad (29)$$

where $\eta = \mathrm{diag}(\eta_1, \eta_2, \eta_3)$ and the derivative of the estimated position ${}^W\hat{X}_P = f({}^W_S R, {}^W_S\hat{t}, \hat{\Theta}_2(k), {}^S(X^u_{PC}(k)))$ with respect to the unknown parameters $\Theta_2 = (t_z, b, k_1)$ is simply obtained (see Appendix B). This learning procedure uses backward identification, as shown in Fig. 2(b).

Fig. 2. The learning architectures.

3.2.2. Second stage

This is the stage to determine the constitutive parameters of the optical unit with n mirrors. The system model (13) can be rewritten as

$${}^W\hat{X}_M = f\big({}^W_S R,\, {}^W_S t,\, \hat{\Theta}(k),\, {}^S(X^u_{MC}(k))\big). \qquad (30)$$

In the above,

$$f\big({}^W_S R,\, {}^W_S t,\, \hat{\Theta}(k),\, {}^S(X^u_{MC}(k))\big) = \begin{bmatrix} {}^W_S R & {}^W_S t \\ 0 & 1 \end{bmatrix} {}^S({}^M T_C) \begin{bmatrix} (X^f_{mc} - C_x)\dfrac{d_x'}{S_x} + (X^f_{mc} - C_x)\dfrac{d_x'}{S_x} k_1 r^2 \\ (Y^f_{mc} - C_y)\, d_y + (Y^f_{mc} - C_y)\, d_y k_1 r^2 \\ b \\ 1 \end{bmatrix}, \qquad (31)$$

and the estimated parameter vector $\hat{\Theta}(k) = (\hat{\theta}_1,\, \hat{\theta}_2,\, \ldots,\, \hat{\theta}_p)^t_k$ consists of the $i$th mirror parameters as follows:

$$\hat{\theta}_i(k) = (\hat{P}_{iORG},\, \hat{N}_i)_k, \qquad (32)$$

where $\hat{P}_{iORG} = (\hat{p}^i_x,\, \hat{p}^i_y,\, \hat{p}^i_z)^t$ and $\hat{N}_i = (\hat{n}^i_x,\, \hat{n}^i_y,\, \hat{n}^i_z)^t$. The algorithm requires the derivative of the estimated position ${}^W\hat{X}_M = f({}^W_S R, {}^W_S t, \hat{\Theta}(k), {}^S(X^u_{MC}(k)))$ with respect to the parameters. The derivative of the position function obtained from the model has the six components shown below (see Appendix C):
*f " *hK
*f *f *f *f *f *f , , , , , , *p( G *p( G *p( G *n( G G *n( G G *n( G G V W X V W X I
i"1,2,2,n,
(33)
where n denotes the number of mirrors. Finally, the complete delta learning rule yields #) (k#1)"#) (k)#g(5X (k)!5X) (k)) + +
;
*f *p( G V *f *p( L V
*f , *p( G W *f , *p( L W
*f , *p( G X *f , *p( L X
#a*#) (k!1),
*f *n( J G V *f , *n( L W ,
*f *n( J G W *f , *n( L W ,
*f *n( J G X I *f , *n( L W I ,
(34)
where g"diag(g , g ,2, g ), and the dimensions are L given by #) : 6n;1, 5X !5X) : 1;4, [*f/*p( G] : + + V 1;4, [*f/*n( G] : 1;4, g : 6n;6n, and a: scalar. This learnV ing procedure is the backward identi"cation, as shown in Fig. 2(b). This completes the estimation of the constitutive parameters of the optical unit.
4. Evaluation of the proposed estimation algorithm
4.1. System configuration and features

As shown in Fig. 3(a), the omni-directional image sensing system for assembly (OISSA) [4] is an example of a system with an optical unit consisting of multiple mirrors. It is developed to obtain the image of the omni-directional shape of an object without self-occlusion [4,15]. It consists of three components: a camera, a tool for parts handling, and an optical unit. The optical unit is made up of four mirrors: a pair of plane mirrors, and a pair of an inside-conic and an outside-conic mirror. The conic mirror corresponds to a configuration concept composed of infinitely many plane mirror patches, as shown in Fig. 3(b). The inside-conic mirror, having a reflective surface inside, is used in order to obtain a 2π image of the object. The 2π image is collected on the inside-conic mirror by using an outside-conic mirror placed co-axially along the center axis of the inside-conic mirror. The collected 2π image is again projected onto the image plane of the camera off-axis by using two plane mirrors. According to this principle, a 2π planar image can be obtained immediately without self-occlusion, as shown in Fig. 3(c).

4.2. System implementation and modeling

Fig. 4 shows a prototype of the sensing system. The sensing system uses a camera with a small aperture size of about 1.8 mm in diameter in order to reduce the blurring effect on the input image caused by the spherical aberration of the conic mirrors [14]. However, this yields a lack of brightness in the input image. To this end, the system is provided with an illuminator consisting of an LED array of ring type, as well as four halogen lamps at intervals of 90°. The mirrors of the system are specially manufactured of aluminum in order to avoid refracting errors. Fig. 5 illustrates the geometry of the optical system consisting of four mirrors and a camera with a collecting lens. Let us set an object frame $\{W\}$ in the object space and a sensor frame $\{S\}$ centered at the optical center $O_S$. The frame $\{C\}$, parallel to the sensor frame $\{S\}$, is an image coordinate system centered at the point $O_C$, which is the intersection point of the image plane and the z-axis through the optical center. The mirror frames are denoted by $\{M_1\}$, $\{M_2\}$, $\{M_3\}$, $\{M_4\}$ at the intersection points between the optical axis and the mirrors, one by one. The effective focal length $b$ is defined as the distance from the image plane to the optical center. Let us also assume that $X_i$ and $S_i$ are a point vector and a direction cosine vector defined relative to each mirror
Fig. 3. The schematic diagram of the proposed sensing system: (a) the configuration, (b) the enlarged configuration of the inside-conic mirror, (c) a typical image obtained for a pair of a peg and hole.
Fig. 4. The prototype of the constructed optical sensor system when used for robotic assembly.
frame $\{M_i\}$. In addition, a calibration pattern is placed in a region visible through the optical unit from the camera. Provided that the ray of light reflecting from a point ${}^W X_M$ on the calibration pattern M with a direction cosine $S_M$ is projected onto a point $X^u_{MC}$ on the image plane of the camera with the incident direction cosine $S_{M4}$ through the optical unit consisting of four mirrors, the forward and inverse transformations of the system can be derived from Eqs. (6) and (13), respectively [4]. When the ray of light reflecting from a point $X_P$ on a pattern P with a direction cosine $S_P$ is directly projected onto a point $X^u_{PC}$ on the image plane without being projected via the optical unit, the forward and inverse image transformations can be obtained from Eqs. (22) and (24), respectively. In this system, the parameters to be calibrated are the intrinsic parameters $\Theta = (b, S_x, k_1)$ and the extrinsic parameters, namely the rotation ${}^S_W R(\psi, \phi, \theta)$ and the translation ${}^S_W t = (t_x, t_y, t_z)$. The constitutive parameters to be searched number 26 variables, since the optical unit consists of four mirrors with six parameters each, namely three position and three orientation components per mirror, and the conic mirrors have two additional vertex-angle parameters.
Fig. 5. The coordinate system of the sensing system.
Fig. 6. The calibration pattern and the distorted pattern of the image sensing system due to variation of the measurement height.
These parameters are searched by using the two-stage approach based on the delta learning algorithm, as described in Section 3.

4.2.1. First stage

In this stage, the camera parameters to be determined are $S_x$, $k_1$, $b$, ${}^S_W R(\psi, \phi, \theta)$, and ${}^S_W t = (t_x, t_y, t_z)$. This procedure needs the calibration pattern P, as shown in Fig. 5. When the points ${}^W X_P$ on the pattern P are directly projected onto the camera without intervention of the optical unit, the relation between the calibration point ${}^W X_P$ on the pattern P and its projected point ${}^S(X^d_{PC})$ at the camera is needed. It can be represented similarly to Eqs. (22) and (24), as derived in Section 3. The unknown parameters $\Theta_1 = (t_x, t_y, S_x, \psi, \phi, \theta)$ are recursively estimated by using the learning algorithm (26), in a similar way to the case of an optical system with n mirrors.
Next, the unknown parameters $\Theta_2 = (t_z, b, k_1)$ are also recursively estimated by using the algorithm (29), in which case the number of mirrors is four.

4.2.2. Second stage

This is the stage to determine the constitutive parameters of the optical unit. The constitutive parameters can be recursively estimated by applying the learning algorithm (34) with respect to the image sensing system model (30). In the model, the estimated parameter vector $\hat{\Theta}(k)$ is given by $\hat{\Theta}(k) = (\hat{\theta}_1,\, \hat{\theta}_2,\, \ldots,\, \hat{\theta}_p)^t_k$ and its components are represented by the $i$th mirror parameters

$$\hat{\theta}_i(k) = (\hat{P}_{iORG},\, \hat{N}_i)_k, \qquad i = 1, 2, 3, 4, \qquad (35)$$

where $\hat{P}_{iORG} = (\hat{p}^i_x,\, \hat{p}^i_y,\, \hat{p}^i_z)^t$ and $\hat{N}_i = (\hat{n}^i_x,\, \hat{n}^i_y,\, \hat{n}^i_z)^t$.
Fig. 7. Convergence characteristics of the learning scheme with change of the momentum constant α and the learning constant η.
Fig. 8. Convergence characteristics of the learning method with an increasing number of calibration points.
Fig. 9. The convergence behaviour with an increasing number of learning steps.
Table 1
Designed configuration parameters

Camera parameters:
  Translation (mm): t = [-50.0, -84.0, 340.23]
  Rotation:
    R = [ 0.5000   0.8659  -0.0151 ]
        [-0.8660   0.5000  -0.0087 ]
        [ 0.0000   0.0175   0.9998 ]
  Effective focal length: b = 57.987 mm
  Scale factor: S_x = 1.0
  Distortion: k_1 = 0.0
  Image center: (256, 240) pixel

Mirror parameters (position unit, mm):
  Plane mirror {M4}:         P_ORG = [0.000, 0.000, 76.987],    N = [0.000, -0.707, 0.707]
  Plane mirror {M3}:         P_ORG = [0.000, -74.000, 76.987],  N = [0.000, 0.707, 0.707]
  Outside conic mirror {M2}: P_ORG = [0.000, -74.000, 116.507], N = [0.000, 0.000, 1.000]
  Inside conic mirror {M1}:  P_ORG = [0.000, -74.000, 58.587],  N = [0.000, 0.000, 1.000]
  Conic mirror vertex angles: α_1 = 75.9°, α_2 = 90.0°
Table 2
Initial guess of the configuration parameters

Camera parameters:
  Translation (mm): t = [-30.0, -60.0, 400.0]
  Rotation:
    R = [ 0.8659   0.5002   0.0064 ]
        [-0.4999   0.8657  -0.0238 ]
        [-0.0175   0.0174   0.9997 ]
  Effective focal length: b = 52.0 mm
  Scale factor: S_x = 1.5
  Distortion: k_1 = 0.0
  Image center: (256, 240) pixel

Mirror parameters (position unit, mm):
  Plane mirror {M4}:         P_ORG = [0.000, 0.000, 76.987],    N = [0.000, -0.707, -0.707]
  Plane mirror {M3}:         P_ORG = [0.000, -73.700, 75.500],  N = [0.000, 0.707, 0.707]
  Outside conic mirror {M2}: P_ORG = [0.000, -73.700, 115.500], N = [0.000, 0.000, 1.000]
  Inside conic mirror {M1}:  P_ORG = [0.000, -73.700, 57.500],  N = [0.000, 0.000, 1.000]
  Conic mirror vertex angles: α_1 = 75.9°, α_2 = 90.0°
Table 3
Estimated results of the configuration parameters

Camera parameters:
  Translation (mm): t = [-49.8954, -84.1123, 339.2857]
  Rotation:
    R = [ 0.5003   0.8654  -0.0228 ]
        [-0.8658   0.5004  -0.0057 ]
        [ 0.0065   0.0226   0.9996 ]
  Effective focal length: b = 57.0678 mm
  Scale factor: S_x = 0.9999
  Distortion: k_1 = -0.0031
  Image center: (256, 240) pixel

Mirror parameters (position unit, mm):
  Plane mirror {M4}:         P_ORG = [0.000, 0.005, 77.090],     N = [0.000, -0.707, -0.707]
  Plane mirror {M3}:         P_ORG = [0.000, -74.005, 76.982],   N = [0.000, 0.710, 0.697]
  Outside conic mirror {M2}: P_ORG = [-0.125, -73.980, 116.663], N = [0.000, -0.000, 0.999]
  Inside conic mirror {M1}:  P_ORG = [0.112, -74.1809, 58.1305], N = [-0.000, 0.000, 1.000]
  Conic mirror vertex angles: α_1 = 75.901°, α_2 = 89.999°
The vertex angle of the conic mirror is determined by the estimated normal vector of the mirror surface. As shown in Fig. 5, a cross-sectional diagram of a conic mirror is given, with an azimuth angle $u_l$ and the principal axis denoted by $P_{axis}$. Let us assume that the ray of light reflecting from a point ${}^W X_P$ on the object M strikes a point $X_i$ on the conic mirror $\{M_i\}$. Then the vertex axis and the vertex angle of the conic mirror are computed by

$$P_{axis} = -\frac{1}{L}\sum_{l=1}^{L} N_i^l, \qquad \alpha_i = \cos^{-1}\!\left(\frac{P_{axis} \cdot X_i}{\|P_{axis}\|\,\|X_i\|}\right), \qquad i = 1, 2. \qquad (36)$$

In the following section, we will evaluate the performance of the algorithm by carrying out a series of experiments on the implemented image sensing system.

4.3. Experiments for parameter estimation

Fig. 10. Comparison between the calibration points on the grid pattern and the points calculated from the model with the estimated parameters before and after the learning: (a) the estimated point data obtained with the designed configuration parameters before the learning, and (b) the compensated point data obtained with the estimated configuration parameters after the learning.

The learning procedure requires training data to estimate the uncertain parameters of the sensing system. As shown in Fig. 6(a), the grid pattern is made of 5 mm × 5 mm meshes, the calibration pattern usually devised for calibrating an image sensing system. The calibration pattern is placed at height intervals of 2 mm from $Z_0 - 12$ mm to $Z_0 + 12$ mm, where $Z_0 = 238$ mm is the reference height showing no distortion in the sensing system [4,15,21]. The distance is measured from the optical center, and the calibration pattern image is captured at every height of the 2 mm interval by using the image sensing system. After obtaining the intersection points on the grid pattern image through image processing techniques such as edge detection, thinning, and 8-connectivity investigation [22,23], we construct the training data consisting of 208 pairs of $(X^d_{MC}, {}^W X_M)$ and $(X^d_{PC}, {}^W X_P)$, respectively; 16 intersection points are selected from the calibration pattern and its image at each height of the 2 mm interval. The points detected at the camera include the features of the image sensing system such as distortion and the parameter uncertainties, as shown in Figs. 6(b) and (c). These features are estimated by the learning procedure for parameter modification, as described in Section 3.

Fig. 7 shows the convergence characteristics of the learning scheme with change of the learning constant η and the momentum constant α in each stage. Figs. 7(a), (c) and (e) show the convergence characteristics of each stage when the momentum constant α is increased with respect to a learning constant η. The results show that case (c) converges faster and more stably than the other cases. Figs. 7(b), (d) and (f) show the convergence characteristics with increasing learning constant η while the momentum constant α is kept constant at 0.8. The results show that as the constants η and α become larger, the energy function converges within a few learning steps. Fig. 8 shows the convergence characteristics at 10 learning iterations with an increasing number of calibration points. The result shows that the errors in each stage decrease continuously as the number of calibration points increases. Fig. 9 shows the relation between the learning step and the convergence. It shows that the errors also decrease with an increasing number of learning steps. In conclusion, the results obtained here imply that the larger the number of learning iterations, the better the convergence characteristics, and the larger the amount of learning data, the faster the convergence rate.
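The bookkeeping of the training set described above is easy to reproduce: 13 heights ($Z_0 - 12$ mm to $Z_0 + 12$ mm in 2 mm steps) times 16 intersection points give the 208 pairs. In the sketch below, detect_points and grid_points are hypothetical helpers standing in for the image-processing chain and the known grid geometry.

```python
import numpy as np

Z0 = 238.0                                             # reference height (mm)
heights = np.arange(Z0 - 12.0, Z0 + 12.0 + 1e-9, 2.0)  # 13 measurement planes

def build_training_set(detect_points, grid_points):
    """Pair each detected pixel point with its known world point:
    13 heights x 16 intersections = 208 training pairs."""
    pairs = [(p_img, p_world)
             for z in heights
             for p_img, p_world in zip(detect_points(z), grid_points(z))]
    assert len(pairs) == 13 * 16
    return pairs
```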
Fig. 11. Error analysis between the calibration points on the grid pattern and the points calculated from the model before and after the learning: (a) and (b) are the position errors before and after the learning, respectively, and (c) and (d) are the average errors with variation of measurement height and azimuth angle, respectively.
Table 1 shows the system design parameters, while Table 2 shows the initial guess for the parameter estimation. Table 3 shows the resultant parameters learned over 200 iterations by using 50 randomly selected training data among the 208 data. Though the initial guess parameters are quite far from the designed values, the algorithm converges rapidly to the target parameters. The algorithm, starting at the initial parameters $\Theta_0$ in Table 2, continues to learn the training data until the squared errors of Eq. (17), between the ray positions $\hat{X}^u_{MC}$ and $\hat{X}^u_{PC}$ calculated from the model and the ray positions $X^u_{MC}$ and $X^u_{PC}$ detected at the camera, are within an allowable range $E_m$. The allowable range $E_m$ is calculated by

$$E_m = \frac{N(\sigma_a^2 + m_{av}^2)}{2}, \qquad (37)$$

where $N$, $\sigma_a$, and $m_{av}$ are the number of data, the allowable standard deviation of the position errors, and the mean of the position errors, respectively.
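A minimal sketch of the stopping test, assuming the variance-plus-squared-mean reading of Eq. (37) and scalar per-point position errors:

```python
import numpy as np

def allowable_range(n_data, sigma_a, m_av):
    """Eq. (37): E_m = N (sigma_a^2 + m_av^2) / 2."""
    return n_data * (sigma_a ** 2 + m_av ** 2) / 2.0

def converged(errors, sigma_a):
    """Stop when the accumulated squared error of Eq. (17) over the
    training set falls within the allowable range E_m."""
    errors = np.asarray(errors, dtype=float)
    e_total = 0.5 * float(np.sum(errors ** 2))
    return e_total <= allowable_range(errors.size, sigma_a,
                                      float(errors.mean()))
```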
Fig. 10 presents the comparison between the 208 designated points and the points calculated from the model with regard to the constitutive parameters before and after the learning. In this learning, all the constitutive parameters are learned with the constant momentum constant $\alpha = 0.8$ and the stepwise learning constants $\eta_i$ kept the same as in the cases of Figs. 8 and 9. It shows that the misalignment between the points is compensated sufficiently by the learning procedure. Fig. 11 shows these results in detail. As shown in Figs. 11(a) and (b), the error analysis shows that before learning, the mean and the standard deviation of the errors were $e = (0.15, 0.00)$ mm and $\sigma = (0.8632, 0.9077)$ mm, respectively. After the learning, they are reduced to $e = (0.01, 0.03)$ mm and $\sigma = (0.0943, 0.0843)$ mm. Fig. 11(c) shows the magnitude of the errors with change of the measurement distance before and after learning. Before learning, as the measurement distance increases, the estimated error grows larger; after learning, the estimated error does not depend on the measurement distance. Fig. 11(d) shows the influence on the errors of a change in azimuth angle. Before learning,
Fig. 12. Comparison of the performance before and after learning through the inversely projected rectangular shape of an object of 30 × 30 mm placed at Z = Z_0 (= 238) mm: (a) a captured image of the rectangular object, (b) an inversely projected image of (a), and (c) the error result before and after learning.
as the azimuth angle increases, the estimated error gets larger; after learning, change in the azimuth angle does not show much influence.

Figs. 12 and 13 illustrate the experimental results for evaluating the compatibility of the learned parameters. Let us assume that an object with a rectangular shape of 30 mm × 30 mm is located at the reference vertical distance $Z_0 = 238$ mm, showing no distortion (see Ref. [21]). Fig. 12(a) is a typical image captured from the omni-directional image system. The captured image is, again, projected inversely onto the object space by using the projection model of the omni-directional sensing system. During the procedure, the constitutive parameters before and after learning are utilized, respectively. Fig. 12(b) is a comparison between the original rectangular shape and the inversely projected shape with respect to the constitutive parameters before and after learning. Fig. 12(c) is the result showing the deviation of the matched shape along the azimuth angle between the original rectangular object and the inversely projected object shape. Fig. 13(a) is a captured image when the rectangular-shaped object, located at $Z = Z_0 - 20$ mm, lower than the previous case, is projected onto the image plane through the omni-directional image system. The distortion in the image is caused by the feature of the omni-directional image sensing system (see Ref. [21]). The results are found to be similar to those of the previous case. All the results show that after learning the parameters, the error is reduced by about 73%. This improvement is lower than the 89.9% of the typical learning pattern consisting of point data, as shown in Fig. 11(b). It is caused by image processing errors such as variation of illumination and line extraction error.
5. Conclusions

We proposed a projective model of an optical sensing system consisting of n mirrors and a camera with a collecting lens. Using the model, we presented a learning-based process to recursively estimate the uncertain constitutive parameters related to the optical sensing system. The learning algorithm utilizes the delta training rule based on supervised learning. To
Fig. 13. Comparison of calibration performance before and after learning through the inversely projected rectangular shape of 30 × 30 mm placed at Z = Z_0 − 20 mm; (a) a captured image of the rectangular shape placed at Z = Z_0 − 20 mm, (b) an inverse projection image of (a), and (c) the error analysis result.
evaluate the proposed methodology, the concept was implemented by constructing an image sensing system consisting of four mirrors. Owing to this learning procedure, the average position error of the detected points is reduced by more than 89% compared with the result obtained without learning. The learning converges to a small error of less than one pixel within 50 iterations with respect to 50 calibration points. The result obtained herein implies that the proposed estimation method based on the projective model is practical in the sense that it does not require any additional measuring devices for calibration.
Appendix A

The gradient of Eq. (30) with respect to the parameters becomes

∂f/∂Θ̂_k = (∂f/∂t̂_x, ∂f/∂t̂_y, ∂f/∂Ŝ_x, ∂f/∂φ̂, ∂f/∂ψ̂, ∂f/∂θ̂),   (A.1)

where each component is expressed in terms of the vector B and of the estimated translation, scale and rotation parameters of the projection model. [The componentwise expressions of (A.1) are omitted.]

Appendix B

Taking into account the image transformation model (17) and the matrix T of (4), the derivatives of (35) with respect to the constitutive parameters are written as

∂f/∂Θ̂_i = (∂f/∂t̂_z, ∂f/∂b̂, ∂f/∂k̂),   (B.1)

together with the derivatives ∂T_i/∂p̂_{xi}, ∂T_i/∂p̂_{yi} and ∂T_i/∂p̂_{zi} of the mirror transformation matrices, given in (B.2) and (B.3). [The componentwise expressions are omitted.]

Appendix C

The derivatives ∂T_q/∂n̂_{xi}, ∂T_q/∂n̂_{yi} and ∂T_q/∂n̂_{zi} of the transformation matrices with respect to the mirror normal components are given for the two cases q = i and q ≠ i in (C.1). [The componentwise expressions are omitted.]
References
[1] Y.K. Ryu, H.S. Cho, New optical sensing system for obtaining the 3D shape of specular objects, Opt. Engng. 35 (1996) 1483–1495.
[2] M. Rioux, Laser range finder based upon synchronized scanners, Appl. Opt. 23 (1984) 3837–3844.
[3] I.S. Jung, H.S. Cho, An active omni-directional range sensor for mobile robot navigation, IFAC Conf. on Control of Industrial Systems, 1997.
[4] W.S. Kim, H.S. Cho, A novel omnidirectional image sensing system for assembling parts with arbitrary cross-section shapes, IEEE/ASME Trans. Mechatronics 3 (1998) 275–292.
[5] M. Inaba, T. Hara, H. Inoue, A stereo viewer based on a single camera with view-control mechanisms, Proc. of Int. Conf. on Intelligent Robots and Systems, 1993, pp. 1857–1864.
[6] J.Y. Kim, H.S. Cho, S. Kim, Measurement of parts deformation and misalignments by using a visual sensing system, IEEE Int. Symp. on Computational Intelligence in Robotics and Automation, 1997, pp. 362–367.
[7] R.K. Lenz, R.Y. Tsai, Technique for calibration of the scale factor and image center for high accuracy, IEEE Trans. Pattern Anal. Machine Intell. 10 (1988) 713–720.
[8] M.H. Han, S.R. Rhee, Camera calibration for three-dimensional measurement, Pattern Recognition 25 (1992) 155–164.
[9] R.Y. Tsai, A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses, IEEE Trans. Robotics Automation RA-3 (1987) 323–344.
[10] O. Faugeras, Three Dimensional Computer Vision, MIT Press, Cambridge, MA, 1993.
[11] H. Zhuang, Modeling gimbal axis misalignments and mirror center offset in a single-beam laser tracking measurement system, Int. J. Robotics Res. 14 (1995) 211–224.
[12] S. Tamura, E.K. Kim, Y. Sato, Error correction in laser scanner three-dimensional measurement by two-axis model and coarse-fine parameter search, Pattern Recognition 27 (1994) 331–338.
[13] R.O. Duda, P.E. Hart, Pattern Classification and Scene Analysis, Wiley, New York, 1973, pp. 44–129.
[14] E. Hecht, Optics, Addison-Wesley, Reading, MA, 1987.
[15] W.S. Kim, A new omni-directional image sensing system for assembly of parts with complicated cross-sectional shapes (OISSA), Ph.D. Dissertation, Korea Advanced Institute of Science and Technology, Korea, 1997.
[16] J.J. Craig, Introduction to Robotics, Addison-Wesley, Reading, MA, 1986, pp. 15–53.
[17] R. Kingslake, Applied Optics and Optical Engineering: Optical Components, Vol. 3, Academic Press, New York, 1965, pp. 269–308.
[18] J.S. Arora, Introduction to Optimum Design, McGraw-Hill, New York, 1989.
[19] J.M. Zurada, Introduction to Artificial Neural Systems, West Publishing Company, New York, 1992.
[20] R.A. Jacobs, Increased rates of convergence through learning rate adaptation, Neural Networks 1 (1988) 295–307.
[21] W.S. Kim, H.S. Cho, S. Kim, Distortion analysis in an omni-directional image sensing system for assembly, Proc. IEEE Int. Symp. on Assembly and Task Planning, CA, USA, August 7–9, 1997.
[22] I. Pitas, Digital Image Processing Algorithms, Prentice-Hall, New York, 1993, pp. 223–252.
[23] D.H. Ballard, C.M. Brown, Computer Vision, Prentice-Hall, Englewood Cliffs, NJ, 1982.
Pattern Recognition 33 (2000) 1219–1237
The fuzzy c+2-means: solving the ambiguity rejection in clustering

Michel Ménard*, Christophe Demko, Pierre Loonis

Laboratoire d'Informatique et d'Imagerie Industrielle, Université de La Rochelle, Avenue Marillac, 17042 La Rochelle Cedex 1, France

Received 18 December 1997; received in revised form 7 July 1998; accepted 16 March 1999
Abstract

In this paper we deal with the clustering problem, whose goal consists of computing a partition of a family of patterns into disjoint classes. The method that we propose is formulated as a constrained minimization problem, whose solution depends on a fuzzy objective function in which reject options are introduced. Two types of rejection have been included: the ambiguity rejection, which concerns patterns lying near the class boundaries, and the distance rejection, dealing with patterns that are far away from all the classes. To compute these rejections, we propose an extension of the fuzzy c-means (FcM) algorithm of Bezdek (Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981). This algorithm is called the fuzzy c+2-means (Fc+2M). These measures make it possible to manage uncertainty due both to imprecise and to incomplete definition of the classes. The advantages of our method are: (1) the degrees of membership of a pattern x_k to the reject classes are learned in the iterative clustering problem; (2) it is not necessary to compute other characteristics to determine the reject and ambiguity degrees; (3) the partial ambiguity rejections introduce a discounting process between the classical FcM membership functions in order to avoid the memberships being spread across the classes; (4) the membership functions are more immune to noise and correspond more closely to the notion of compatibility. Preliminary computational experiences with the developed algorithm are encouraging and compare favorably with results from other methods such as the FcM, FPcM and F(c+1)M (fuzzy c+1-means: clustering with solely distance rejection) algorithms on the same data sets. The differences in performance can be attributed to the fact that ambiguous patterns count less in the computation of the centers. © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Fuzzy clustering; Fuzzy c-means; Ambiguity rejection; Distance rejection
1. Introduction

In various information processing tasks, as for example image understanding applications, signal or image processing, or the diagnosis of a static or dynamic process, a pattern recognition approach is often used. The p observed parameters are used to build up the pattern vector. The system analysis is directly linked to the pattern
* Corresponding author. Tel.: +5-46-458296; fax: +5-46-458242.
E-mail addresses: [email protected] (M. Ménard), [email protected] (C. Demko), [email protected] (P. Loonis).
classes to be discriminated in the p-dimensional representation space. The pattern recognition process generally includes clustering, pattern classification and decision. We deal in this paper with the clustering problem, which refers to a broad spectrum of methods that subdivide a family of unlabeled objects into subsets, or clusters, which are pairwise disjoint, all non-empty, and reproduce the original data set via union. This problem can be defined as follows:

• let X = (x_k)_{k∈[1,n]} be the family of objects, where x_k = (x_{k1}, x_{k2}, …, x_{kp}) is a pattern described by p features (i.e. x_k ∈ R^p);
• let Ω = (ω_i)_{i∈[1,c]} be a family of classes.
Objects belonging to a same class share common properties that distinguish them from objects belonging to other classes. Clustering techniques in the pattern recognition area then consist of searching for a function f such that

f : X → Ω,  x_k ↦ f(x_k),

where f(x_k) denotes the class associated with x_k. The more an object x_k belongs to a class, the closer it is to that class in the representation space (i.e. f is usually a function of distance). In the literature, most clustering algorithms can be classified into two types:

• Hard or crisp. In this case, the algorithm assigns each feature vector to a single cluster and ignores the possibility that this feature vector may also belong to other clusters. Such algorithms are exclusive and the cluster labels are hard and mutually exclusive.
• Fuzzy. Fuzzy clustering algorithms consider each cluster as a fuzzy set, while a membership function measures the degree to which each feature vector belongs to a cluster. So, each feature vector may be assigned to multiple clusters with some degree of sharing measured by the membership function.

In the first case, classes may be described by a family of functions F = (f_i)_{i∈[1,c]}, where x_k is classified with a hard cluster label, such that

f_i : X → {0, 1},  x_k ↦ 1 if x_k ∈ ω_i, 0 otherwise,

verifying ∀x_k ∈ X, Σ_{i=1}^c f_i(x_k) = 1 (i.e. the mutual exclusivity property). Clustering is widely performed by an iterative algorithm known as the crisp c-means algorithm [1,2], which makes it possible to find a locally optimal family of c centers clustering the family X = (x_k)_{k∈[1,n]}.
In this paper we deal with the second case. Fuzzy set theory may be used to compute a family of membership functions U = (μ_i)_{i∈[1,c]} such that

μ_i : X → [0, 1],  x_k ↦ μ_i(x_k),

verifying ∀x_k ∈ X, Σ_{i=1}^c μ_i(x_k) = 1. This constraint will be discussed later. Fuzzy clustering algorithms generate a fuzzy partition providing a measure of the membership degree of each pattern to a given cluster. These methods are less prone to local minima than the crisp clustering algorithms, since they make soft decisions in each iteration through the use of membership functions. Many membership functions have been defined for this purpose [3–5]. The first fuzzy clustering algorithm was developed
in 1969 by Ruspini [6]. Following this, a class of fuzzy ISODATA clustering algorithms was developed, which includes the fuzzy c-means [7] (FcM). The classical FcM problem involves finding a fuzzy partition of the family X. It is sufficient therefore to find a family of membership functions which minimize the criterion

J_m(U, v) = Σ_{i=1}^c Σ_{k=1}^n μ_{ik}^m d_{ik}²,   (1)

where m > 1 is a fuzzifier exponent, μ_{ik} = μ_i(x_k) and d_{ik}² = ||x_k − v_i||²_G, G being a norm. The FcM algorithm was developed to solve this minimization problem. It consists of choosing a random initial partition U and iterating the two following equations:

v_i^{t+1} = Σ_{k=1}^n (μ_{ik}^t)^m x_k / Σ_{k=1}^n (μ_{ik}^t)^m,

μ_{ik}^{t+1} = 1 / Σ_{j=1}^c (d_{ik}^{t+1}/d_{jk}^{t+1})^{2/(m−1)}.   (2)

Given a finite set of objects X, the fuzzy clustering of X into c clusters is a process which assigns a membership degree for each of the objects to every cluster. This algorithm converges under some conditions [7] to a local minimum of Eq. (1). During the past years, it has been the object of several extensions and utilizations. More particularly, we quote the algorithm of Gustafson and Kessel [8], which makes it possible to take the covariance matrix of each class ω_i into account. They argue that the use of fuzzy covariances is a natural approach to fuzzy clustering. For example, in the case of the well-known Fisher's IRIS data (cf. Section 4.5), experimental results indicate that more accurate clustering may be obtained using fuzzy covariances. (In the present paper, we use this norm in the case of Fisher's IRIS.) The iteration then becomes

v_i^{t+1} = Σ_{k=1}^n (μ_{ik}^t)^m x_k / Σ_{k=1}^n (μ_{ik}^t)^m,

Σ_i^{t+1} = Σ_{k=1}^n (μ_{ik}^t)^m (x_k − v_i^{t+1})(x_k − v_i^{t+1})^T / Σ_{k=1}^n (μ_{ik}^t)^m,

G_i^{t+1} = ρ_i (det Σ_i^{t+1})^{1/p} (Σ_i^{t+1})^{−1},   (3)

μ_{ik}^{t+1} = 1 / Σ_{j=1}^c (d_{ik}^{t+1}/d_{jk}^{t+1})^{2/(m−1)},

where Σ_i^{t+1} is the fuzzy covariance matrix for the cluster ω_i and p is the feature space dimension. Typically ρ_i = 1, i = 1, 2, …, c.
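As a concrete illustration of Eqs. (1)–(2), here is a minimal FcM sketch in Python/NumPy, assuming the Euclidean norm (the Gustafson–Kessel variant would additionally update the fuzzy covariance matrices of Eq. (3)). It is a sketch under those assumptions, not a reference implementation:

```python
import numpy as np

def fcm(X, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Minimal fuzzy c-means sketch of Eqs. (1)-(2); Euclidean norm assumed."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0)                                   # each column sums to 1
    for _ in range(n_iter):
        W = U ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)       # center update of Eq. (2)
        d2 = ((X[None] - V[:, None]) ** 2).sum(-1) + 1e-12   # squared distances (c, n)
        # membership update: mu_ik = 1 / sum_j (d_ik/d_jk)^(2/(m-1))
        U_new = 1.0 / ((d2[:, None, :] / d2[None, :, :]) ** (1.0 / (m - 1))).sum(axis=1)
        if np.abs(U_new - U).max() < tol:
            return U_new, V
        U = U_new
    return U, V
```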
Then, the distance function d_{ik} = ||x_k − v_i||_G in the objective function (1) is defined as

d_{ik}² = ||x_k − v_i||²_{G_i} = (x_k − v_i)^T G_i (x_k − v_i).

However, since the memberships are generated by a probabilistic constraint originally due to Ruspini [6],

∀x_k ∈ X,  Σ_{i=1}^c μ_i(x_k) = 1,

i.e. they sum to 1 over each column of U, the FcM algorithm suffers from several drawbacks:
• The memberships are relative numbers. The membership of a point in a class depends on the memberships of the point in all the other classes. So, the cluster center estimates are poor with respect to a possibilistic approach. This can be a serious problem in situations where one wishes to generate membership functions from training data; μ_{ik} cannot be interpreted as the typicality of x_k for the ith cluster.
• The FcM algorithm is very sensitive to the presence of noise. The membership of noise points can be significantly high.
• The FcM algorithm cannot distinguish between "equally highly likely" and "equally highly unlikely" [9].

To overcome this problem, Krishnapuram and Keller [9] proposed a new clustering model named the possibilistic c-means (PcM), where the constraint is relaxed. In this case, the value μ_{ik} should be interpreted as the typicality of x_k relative to cluster i. But PcM is very sensitive to good initializations, and it sometimes generates coincident clusters. Moreover, the values in U can be very sensitive to the choice of the additional parameters needed by PcM. We reformulate the fuzzy clustering problem by including reject options to decrease these drawbacks. We propose a model, and a companion algorithm to optimize it, which we will call the fuzzy c+2-means (Fc+2M) because it requires two additional clusters. This paper is organized as follows. In Section 2, to avoid the memberships being spread across the classes and to make it possible to distinguish between "equally likely" and "unknown", we define partial ambiguity rejections which introduce a discounting process between the classical FcM membership functions. We modify the objective function used in the FcM algorithm and we derive the membership and prototype update equations from the conditions for minimization of our criterion function. In Section 3, to improve the performance of our algorithm in the presence of noise, we define an amorphous noise cluster. This class allows us to reject an individual x_k when it is very far from the centers of all the classes. In this way, our membership functions are more immune to noise and correspond more closely to the notion of compatibility. The proof of this new theorem is presented.
We have chosen to propose an extension of the fuzzy c-means (FcM) because:

• the crisp k-means algorithm provides an iterative clustering of the search space and does not require any initial knowledge about the structure of the data set;
• the use of fuzzy sets makes it possible to manage uncertainty on measures, lack of information, …, all characteristics which bring in ambiguity notions;
• most fuzzy clustering algorithms are derived from the FcM algorithm, which minimizes the sum of squared distances from the prototypes weighted by the corresponding memberships [8,10–12]. These algorithms have been used very successfully in many applications in which the final goal is to make a crisp decision such as pattern classification. Moreover, we may interpret the memberships as probabilities or degrees of sharing.

The advantages of our method are:

1. Contrary to Dubuisson, these rejects are introduced in the clustering or learning stage and not in the decision processing [13,14].
2. The membership degree to the ambiguity reject class of a pattern x_k is explicit and, above all, this value is learned in the iterative clustering problem.
3. No further characteristic is necessary to compute the reject and ambiguity degrees.
4. The location of the cluster centers may be modified (with respect to FcM) because the ambiguous and rejected patterns are taken less into account.

Section 4 illustrates our approach on various examples in order to show the interest of clustering conditioned by reject measures. We first present two examples to provide insight into our approach. We then present, in the third example, the results obtained with FcM [15] (algorithm without reject option), FPcM [16] (algorithm with typicality), F(c+1)M [17–19] (algorithm with solely distance rejection), and F(c+2)M on the diamonds data set. Another example shows the behavior of F(c+2)M with a 2-D data set concerning three classes. A more realistic example deals with a twofold comparison on a classical real data set. On the one hand, we compare the results obtained by different unsupervised clustering algorithms with or without reject option (FcM, FPcM, F(c+1)M, F(c+2)M). On the other hand, we compare the behavior of the membership functions according to the two parameters, ρ and ρ_a, which make it possible to control the power of the distance and ambiguity rejections of our algorithm.
2. Clustering with ambiguity rejection

To reduce an excessive error rate due to noise and other uncertain factors inherent in any real-world system,
clustering with an ambiguity rejection is a solution. In most papers, the rules proposed in order to reject a pattern lying near the class boundaries are based on threshold values in the decision processing and not on the clustering or learning stage. In order to specify this decision making, it is common to characterize a pattern x_k with an ambiguity concept. The ambiguity rejection has been introduced by Chow [20,21] in the decision processing. The goal is to measure the proximity of x_k to the decision boundaries. In the case of Bayes' rule, the problem is to define a new class ω_0, called the reject class, associated with a constant reject cost Cr = C(0/j) ∀j ∈ [1, c]:

e(x_k) = i  if P(ω_i/x_k) = max_{j∈[1,c]} (P(ω_j/x_k)) > 1 − Cr,

e(x_k) = 0  otherwise,   (4)

where the (P(ω_j/x_k))_{j∈[1,c]} are the posterior probabilities. The Cr term, which controls the reject rate, must be chosen between 0 and (c−1)/c. This rule divides the representation space into (c+1) classes. Dubuisson [13] argues that this rejection is an ambiguity reject, since the corresponding area in the representation space is always located between the classes. The main drawback of these definitions, and of their use, is that each of these rejects consists of prior fixed thresholds, without taking the "shape" of the learning classes into account. An extension of this simple rejection is the class-selective rejection introduced by Ha [22]. In this section, we construct an appropriate objective function in order to take ambiguity rejection explicitly into account in fuzzy clustering. Let Ω = {ω_1, …, ω_c} be the set of classes ω_i. Since there are 2^c subsets in a set of c elements, we obtain 2^c − 1 regions, excluding the empty set, in a c-class problem. We attempt to partition the pattern or feature space into regions; each of these regions corresponds to a subset of classes. Thus, in a c-class problem there are c single clusters ({ω_i})_{i∈[1,c]}, and to exercise the ambiguity rejection option, we associate an ambiguity class with each subset of classes A ∈ 2^Ω \ {∅, ({ω_i})_{i∈[1,c]}}. In Fig. 1 there are 2³ − 1 regions corresponding to the subsets {ω_1}, {ω_2}, {ω_3}, ω_4 = {ω_1, ω_2}, ω_5 = {ω_1, ω_3}, ω_6 = {ω_2, ω_3} and ω_7 = {ω_1, ω_2, ω_3}.

Definition 1. Let Ω = {ω_1, …, ω_c} be the set of classes ω_i. We associate an ambiguity class with each subset of classes A ∈ 2^Ω \ {∅, ({ω_i})_{i∈[1,c]}}. We may consider that there is a collection of Σ_{t=2}^c C(c, t) disjoint ambiguity classes in a c-class problem.
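Chow's rule of Eq. (4) reduces to a one-line decision once the posterior probabilities are available. A minimal sketch for illustration (the function name is ours):

```python
import numpy as np

def chow_decision(posteriors, Cr):
    """Chow's rule (Eq. (4)): return the winning class index (1-based),
    or 0 when the pattern is ambiguity-rejected.  `posteriors` is the
    vector (P(w_j|x_k))_j; Cr must lie between 0 and (c-1)/c."""
    j = int(np.argmax(posteriors))
    return j + 1 if posteriors[j] > 1.0 - Cr else 0

# e.g. chow_decision([0.45, 0.40, 0.15], Cr=0.5) -> 0 (ambiguity-rejected)
```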
Fig. 1. (a) Illustrates the partition of the pattern space X into three regions; each of these regions corresponds to a single class. (b) Illustrates the partition of the pattern space when the ambiguity rejection is exercised: an ambiguity class is associated with each subset of classes A ∈ 2^Ω \ {∅, ({ω_i})_{i∈[1,c]}}.
Another way to characterize the ambiguity classes is to express the distance d_{Ak} of a pattern x_k to the ambiguity class A. Let a pattern x_k be located at the same distance from the set of t single classes that constitute A = {(ω_q)_{q∈[1,t]}}. If d_{ik} = d_{jk} ∀i, j ∈ [1, t] and d_{ik} < d_{jk} ∀i ∈ [1, t], ∀ω_j ∉ A (i.e. x_k is the closest to these single classes in the representation space), then d_{Ak} ≤ d_{ik} ∀i ∈ [1, t]. The rate α = d_{Ak}/d_{ik} may characterize the potential ambiguity chosen by the user. The choice of α will be discussed later.

Definition 2. Let Ω = {ω_1, …, ω_c} be the set of classes ω_i. The membership function μ_{(c+1)k} of a pattern x_k to the global ambiguity class (denoted ω_{c+1}) is obtained by an m-ary aggregation operator F^(m) on the membership functions of this pattern according to each ambiguity class A ∈ 2^Ω \ {∅, ({ω_i})_{i∈[1,c]}} (with m = Σ_{t=2}^c C(c, t)). We will drop the superscript (m) when there is no fear of ambiguity.

Definition 3. In order to parametrize the proposed algorithm, we define a global ambiguity reject rate ρ_a ∈ [0, 1] based on the whole pattern set X = (x_k)_{k∈[1,n]}.
2.1. Ambiguity concept: boundary regions and regularization in clustering problems

We deal in this section with the problem of defining boundary regions between clusters for regularization purposes in clustering problems. We distinguish two kinds of boundary regions: the first is crisp and is defined by rough set theory; the second is obtained by fuzzifying this concept. One uses equivalence relations, the other uses real-valued functions. Based on such relationships, we extend the boundary region concept to the fuzzy boundary region concept in fuzzy partitions. We use Pawlak's theory of rough sets [23] to specify the ambiguity concept and to define the objective function. Rough set theory is concerned with ambiguities in knowledge attributable to the granularity of knowledge, that is, to indiscernibility and approximations.
Table 1
Example of an information system for the concept "belonging to ω_1"

x_k | B | Decision: ω_I
1 | Yes | {ω_1}
2 | No | {ω_2}
3 | No | {ω_3}
4 | Yes | ω_4
5 | Yes | ω_5
6 | No | ω_6
7 | Yes | ω_7
An information system is a pair A = (U, A), where U is a non-empty, finite set called the universe, and A a non-empty, finite set of attributes, i.e. a : U → V_a ∀a ∈ A, where V_a is called the value set of a. By V we denote the set ∪{V_a : a ∈ A}. With every subset of attributes B ⊆ A, an equivalence relation, denoted IND_A(B) and called the B-indiscernibility relation, is associated; it is defined by IND_A(B) = {(u, u′) ∈ U² : a(u) = a(u′) ∀a ∈ B}. Objects u, u′ satisfying the relation IND(B) are indiscernible by the attributes of B. If A = (U, A) is an information system, B ⊆ A is a set of attributes and X ⊆ U is a set of objects, then the sets {u ∈ U : [u]_B ⊆ X} and {u ∈ U : [u]_B ∩ X ≠ ∅} are called the B-lower and the B-upper approximations of X in A, and they are denoted by B̲X and B̄X, respectively. The set BN_B(X) = B̄X − B̲X will be called the B-boundary of X. The set B̲X is the set of all elements of U which can be classified with certainty as elements of X, given the knowledge represented by the attributes from B, whereas the upper approximation B̄X of X is the set of all elements that possibly belong to X. The set BN_B(X) is the set of elements which one can classify neither to X nor to −X with knowledge B. Information about the patterns of interest is often available in the form of data tables, known also as information systems. Suppose we are given the patterns x_k with k ∈ [1, …, 7], as shown in Table 1. Let Ω = {ω_1, ω_2, ω_3} be a set of classes. Each pattern x_k will first be assumed to possess a cluster label I ⊆ {1, 2, 3} indicating with certainty its membership to one subset in 2^Ω. Thus, there are four regions corresponding to the ambiguity clusters ω_4, ω_5, ω_6, ω_7 and three regions corresponding to the single clusters {ω_1}, {ω_2} and {ω_3}. Let the crisp decision be presented
in the third column. For x_k ∈ ω_I, the attribute B is "yes" if 1 ∈ I and "no" otherwise. In Table 1, the patterns of the set {x_1, x_4, x_5, x_7} are indiscernible with respect to the attribute B. Hence, the attribute B generates two elementary sets, {x_1, x_4, x_5, x_7} and {x_2, x_3, x_6}. x_1 displays information which enables us to classify it with certainty as belonging to class ω_1. Thus, the lower approximation of the set of patterns belonging to class ω_1 is the set {x_1} = {x_k ∈ {ω_1}}, and the upper approximation of this set is the set {x_1, x_4, x_5, x_7} = {x_k ∈ ω_I : ω_I ∈ 2^Ω and 1 ∈ I}. The boundary region of the concept "belonging to ω_1" is the set {x_4, x_5, x_7} = {x_k ∈ ω_I : ω_I ∈ 2^Ω \ {∅, {ω_1}} and 1 ∈ I}. This example offers a simple definition of the boundary region of a cluster:

Definition 4. Let Ω = (ω_i)_{i∈[1,c]} be a family of classes. Let X = (x_k)_{k∈[1,n]} be the family of objects, where x_k = (x_{k1}, x_{k2}, …, x_{kp}) is a pattern described by p features (i.e. x_k ∈ R^p). We define the boundary region of the cluster ω_i, with i ∈ [1, c], as

F_{ω_i} = {x_k ∈ A : A ∈ 2^Ω \ {∅, {ω_i}} and ω_i ∈ A}.

In fuzzy clustering, each feature vector x_k may be assigned to multiple clusters with some degree of sharing measured by the membership function. So, we can define the size of the fuzzy boundary region of a cluster ω_i as

Σ_{A∈2^Ω\{∅,{ω_i}}, ω_i∈A} Σ_{k=1}^n μ_{Ak}^m d_{Ak}²,

where d_{Ak} is the distance from the pattern x_k to the ambiguous cluster center v_A (d_{Ak} = d(x_k, v_A)), m is a weighting exponent and μ_{Ak} is the grade of membership of the feature point x_k in the ambiguous cluster A. The choice of d_{Ak} will be discussed later. If, for higher values of d_{Ak}, μ_{Ak} takes higher values, then the size of the fuzzy boundary region for the ith cluster must be high. Because a boundary region is shared by two clusters, the whole fuzzy boundary region in a fuzzy partition is defined as

J_m^a(U, v) = Σ_{A∈2^Ω\{∅,({ω_j})_{j∈[1,c]}}} Σ_{k=1}^n μ_{Ak}^m d_{Ak}².
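The crisp approximations of Table 1 can be computed mechanically from the B-elementary sets. Below is a small illustrative sketch; the helper names are ours, and the comments tie the output back to the worked example above:

```python
def elementary_sets(universe, attrs):
    """Partition the universe into B-elementary sets (B = the given attributes)."""
    blocks = {}
    for u in universe:
        blocks.setdefault(tuple(a(u) for a in attrs), set()).add(u)
    return list(blocks.values())

def approximations(blocks, X):
    """B-lower/B-upper approximations and the B-boundary BN_B(X) of a concept X."""
    lower = set().union(*[b for b in blocks if b <= X])
    upper = set().union(*[b for b in blocks if b & X])
    return lower, upper, upper - lower

# Table 1: the attribute B answers "does omega_1 appear in the decision of x_k?".
decision = {1: {1}, 2: {2}, 3: {3}, 4: {1, 2}, 5: {1, 3}, 6: {2, 3}, 7: {1, 2, 3}}
blocks = elementary_sets(range(1, 8), [lambda k: 1 in decision[k]])
# blocks == [{1, 4, 5, 7}, {2, 3, 6}]; the upper approximation of the concept
# "omega_1 appears in the decision" is {1, 4, 5, 7}, and setting aside the
# certain element x_1 leaves the boundary {4, 5, 7} of the worked example.
lower, upper, boundary = approximations(blocks, {k for k in decision if 1 in decision[k]})
```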
From the regularization viewpoint, the clustering problem is said to be "ill-posed" if it fails to satisfy one or more of the following criteria: the solution exists, is unique and depends continuously on the data. Additional prior assumptions have to be imposed on the solution to convert an ill-posed problem into a well-posed one. In Markov random field vision modeling, the smoothness assumption can be incorporated into the energy function whereby the cost of a solution is defined. For instance, Pavlidis [24] proposes an energy function which incorporates the boundary L_i(x) of the regions (O_i)_{i∈[1,c]}:

E = Σ_{i=1}^c ( Σ_{x∈O_i} (u_i − g(x))² + λ L_i(x) ),

where g(x) is the height of the surface at x, c is the number of regions and u_i is the label assigned to the region O_i. The concept of regularization can be extended to the fuzzy clustering of overlapping clusters. So, we define the objective function as

J_m(U, v) = J_m(U, v) + J_m^a(U, v)   (5)

= Σ_{i=1}^c Σ_{k=1}^n μ_{ik}^m ||x_k − v_i||² + Σ_{A∈2^Ω\{∅,({ω_j})_{j∈[1,c]}}} Σ_{k=1}^n μ_{Ak}^m d_{Ak}².   (6)

In E, (u_i, g(x)) corresponds to (v_i, x_k) in J_m(U, v) with μ_{ik} = 1 or 0 (hard clustering). The discontinuities Σ_{i=1}^c λ L_i(x) correspond to J_m^a(U, v) = Σ_{A∈2^Ω\{∅,({ω_j})_{j∈[1,c]}}} Σ_{k=1}^n μ_{Ak}^m d_{Ak}².
2.2. Discontinuities

In the fuzzy clustering problem, ambiguous regions located between clusters can be viewed as discontinuities. To detect these discontinuities, we define the distance from the pattern x_k to the ambiguous class A ∈ 2^Ω \ {∅, ({ω_j})_{j∈[1,c]}} by

d_{Ak} = α (Σ_{ω_i∈A} d_{ik})² / (|A|² (Π_{ω_i∈A} d_{ik})^{1/|A|}),   (7)

where the term α consists of suitable positive numbers and is adjusted in order to detect discontinuities, and d_{ik} is the classical distance used in the fuzzy c-means algorithm. d_{Ak} plays the role of the derivative magnitude used in Markov random field vision modeling to detect discontinuities. By analogy with the general string model [25], the term α^{−1} relates to the discounting factor with which the discounting process avoids spreading the membership across the classes.

2.3. Extended fuzzy c-means algorithm

This new algorithm is based upon the minimization of the fuzzy least-squared functional criterion

J_m(U, v) = J_m(U, v) + J_m^a(U, v).   (8)

Thus, the objective function is

J_m(U, v) = Σ_{A∈2^Ω\{∅}} Σ_{k=1}^n μ_{Ak}^m d_{Ak}²,   (9)

with the following constraints:

∀k ∈ [1, n]:  Σ_{i=1}^c μ_{ik} + F_{A∈2^Ω\{∅,({ω_j})_{j∈[1,c]}}}(μ_{Ak}) = Σ_{A∈2^Ω\{∅}} μ_{Ak} = 1  (F = Σ),

(1/n) Σ_{k=1}^n μ_{(c+1)k} = ρ_a,   (10)

0 ≤ μ_{Ak} ≤ 1  ∀A ∈ 2^Ω \ {∅}.

The partitions generated satisfy Σ_{i=1}^c μ_{ik} ≤ 1 ∀k when only the single clusters are taken into account. This objective function can be specified as a sum of distances between the patterns and the corresponding prototypes of the single or ambiguity clusters. The first constraint confines the memberships to the hyperplane defined by Σ_{A∈2^Ω\{∅}} μ_{Ak} = 1. In order to parametrize the proposed algorithm, we define the constraint Σ_{k=1}^n μ_{(c+1)k}/n = ρ_a, where ρ_a ∈ [0, 1] is the global ambiguity reject rate based on the whole pattern set. The membership function

μ_{(c+1)k} = F_{A∈2^Ω\{∅,({ω_j})_{j∈[1,c]}}}(μ_{Ak}),

due to the ambiguity class ω_{c+1}, is the aggregation of the membership functions among the collection of ambiguity classes (ω_A)_{A∈2^Ω\{∅,({ω_j})_{j∈[1,c]}}}; it means that a pattern is ambiguous if and only if it is partially ambiguous (i.e. with respect to t > 1 single classes). The aggregation operator may be chosen from any number of operators such as triangular norms and conorms, among the three classical behaviors: conjunction, disjunction and compromise [26]. We may consider the ambiguity rejection as a class-selective rejection scheme [22]. The reject options are desirable in applications where it is more costly to make a wrong decision than to withhold making a decision. A class-selective rejection is thus an extension of simple rejection: when a pattern cannot be reliably assigned to one of the c classes, we do not reject the pattern from all classes but only from those classes that are most unlikely to have issued the pattern. For instance, for a pattern lying on the separation plane between classes 1 and 2, while being very far away from the center of the third class, we should reject only the third class and declare that the pattern belongs to the group composed of the first and second classes (i.e. x_k belongs to the ambiguity class ω_4). So, for the ambiguity class, we use a disjunction operator: an addition operator. Because the membership measures are normalized, we have μ_{(c+1)k} = Σ_{A∈2^Ω\{∅,({ω_j})_{j∈[1,c]}}} μ_{Ak} ≤ 1.
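For illustration, the ambiguity distances of Eq. (7) can be computed for every subset A with |A| ≥ 2 as follows. The exponents reflect our reading of the reconstructed formula above, so treat them as an assumption; the function name is ours:

```python
import itertools
import numpy as np

def ambiguity_distances(d, alpha=1.0):
    """Distances d_Ak from every pattern to each ambiguity class A (|A| >= 2),
    following our reading of Eq. (7):
        d_Ak = alpha * (sum_{i in A} d_ik)^2 / (|A|^2 * (prod_{i in A} d_ik)^(1/|A|)).
    `d` has shape (c, n); d[i, k] is the FcM distance of pattern k to class i."""
    c = d.shape[0]
    out = {}
    for t in range(2, c + 1):
        for A in itertools.combinations(range(c), t):
            dA = d[list(A)]               # (t, n) distances to the classes of A
            # with equal distances and alpha = 1, d_Ak equals the common
            # distance: the totally ambiguous point of the 1-D study below
            out[A] = alpha * dA.sum(0) ** 2 / (t ** 2 * dA.prod(0) ** (1.0 / t))
    return out
```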
2.4. Theorem of the extended fuzzy c-means algorithm

The minimization of criterion (9) with constraints (10) gives the following solutions: ∀k ∈ [1, n], ∀A ∈ 2^Ω \ {∅},

μ*_{Ak} = 1 / Σ_{B∈2^Ω\{∅}} (d_{Ak}/d_{Bk})^{2/(m−1)},

v*_i = Σ_{k=1}^n μ_{ik}^m x_k / Σ_{k=1}^n μ_{ik}^m,

where ∀A = {ω_i}, i ∈ [1, c], d_{Ak} = ||x_k − v_i||_G is the classical distance used in the fuzzy c-means algorithm, and ∀A ∈ 2^Ω \ {∅, ({ω_j})_{j∈[1,c]}},

d_{Ak} = α (Σ_{ω_i∈A} d_{ik})² / (|A|² (Π_{ω_i∈A} d_{ik})^{1/|A|}),   (11)

where the term α is a suitable positive number; it plays the role of a scaling factor. The choice of α will be discussed later. Letting d_{(c+1)k} be the global ambiguity class distance, we define

(1/d_{(c+1)k})^{2/(m−1)} = Σ_{A∈2^Ω\{∅,({ω_j})_{j∈[1,c]}}} (1/d_{Ak})^{2/(m−1)}.
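The membership update of the theorem treats every label (single class or ambiguity subset) uniformly. A minimal sketch, assuming strictly positive distances (the dictionary-based bookkeeping is ours):

```python
import numpy as np

def extended_memberships(dists, m=2.0):
    """Membership update of the extended FcM (Section 2.4):
    mu_Ak = 1 / sum_B (d_Ak / d_Bk)^(2/(m-1)) over all non-empty subsets B.
    `dists` maps each class label (single class or ambiguity subset) to an
    array of n distances; distances are assumed strictly positive."""
    labels = list(dists)
    D = np.stack([dists[a] for a in labels])            # (n_labels, n)
    ratio = (D[:, None, :] / D[None, :, :]) ** (2.0 / (m - 1))
    U = 1.0 / ratio.sum(axis=1)
    return dict(zip(labels, U))
```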
2.5. Sketch of the proof

By adding the constraint ∀k ∈ [1, n], Σ_{A∈2^Ω\{∅}} μ_{Ak} = 1 to criterion (9) with a family of Lagrange multipliers (λ_k)_{k∈[1,n]}, one obtains

J̄_m(U, v, λ) = Σ_{A∈2^Ω\{∅}} Σ_{k=1}^n μ_{Ak}^m d_{Ak}² + Σ_{k=1}^n λ_k (1 − Σ_{A∈2^Ω\{∅}} μ_{Ak}).

2.5.1. μ*_{Ak}, ∀A ∈ 2^Ω \ {∅}, calculus

By deriving J̄_m(U, v, λ) with respect to μ_{Ak}, one obtains

∂J̄_m(U, v, λ)/∂μ_{Ak} = m μ_{Ak}^{m−1} d_{Ak}² − λ_k.

Setting ∂J̄_m(U, v, λ)/∂μ_{Ak} = 0 implies that the membership degree of the pattern x_k to the ambiguity class A is

μ*_{Ak} = (λ*_k / (m d_{Ak}²))^{1/(m−1)}.   (12)

By summing over A:

Σ_{A∈2^Ω\{∅}} μ*_{Ak} = λ̃_k Σ_{A∈2^Ω\{∅}} (1/d_{Ak})^{2/(m−1)},

where λ̃_k = (λ*_k/m)^{1/(m−1)}. With the first constraint, we obtain

λ̃_k Σ_{A∈2^Ω\{∅}} (1/d_{Ak})^{2/(m−1)} = 1,

which brings the expression of μ*_{Ak}, ∀A ∈ 2^Ω \ {∅}, ∀k ∈ [1, n]:

μ*_{Ak} = 1 / Σ_{B∈2^Ω\{∅}} (d_{Ak}/d_{Bk})^{2/(m−1)}.

To determine the parameter α such that (the second constraint)

Σ_{k=1}^n μ*_{(c+1)k} = Σ_{k=1}^n Σ_{A∈2^Ω\{∅,({ω_j})_{j∈[1,c]}}} μ*_{Ak} = ρ_a n,

we choose

d_{Ak} = α (Σ_{ω_i∈A} d_{ik})² / (|A|² (Π_{ω_i∈A} d_{ik})^{1/|A|}),   (13)

which verifies Definition 1. We obtain (in what follows, we confine ourselves to α_A = α, a constant)

Σ_{k=1}^n [ Σ_A (1/d*_{Ak})^{2/(m−1)} ] / [ α^{2/(m−1)} Σ_{i=1}^c (1/d_{ik})^{2/(m−1)} + Σ_A (1/d*_{Ak})^{2/(m−1)} ] = ρ_a n,

with d*_{Ak} = (1/α) d_{Ak}, where the sums over A run over 2^Ω \ {∅, ({ω_j})_{j∈[1,c]}}. This previous relation can be expressed as

Σ_{k=1}^n D_k / (α^{2/(m−1)} E_k(ρ) + D_k) = ρ_a n,   (14)

where E_k(ρ) = Σ_{A∈{({ω_j})_{j∈[1,c]}}} (1/d_{Ak})^{2/(m−1)} and D_k = Σ_{A∈2^Ω\{∅,({ω_j})_{j∈[1,c]}}} (1/d*_{Ak})^{2/(m−1)}. The system defined by formula (14) can be solved by the Newton–Raphson method for α > 0. At this point, a remark is indispensable: the special case ρ_a = 0 (no rejection at all) implies α → +∞ (μ_{(c+1)k} = 0), and the proposed extension of the fuzzy c-means algorithm reduces to the standard algorithm.
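Eq. (14) is a scalar root-finding problem in α: its left-hand side decreases monotonically from n (α → 0) to 0 (α → ∞). The paper uses Newton–Raphson; the safeguarded bisection below is an equivalent illustrative sketch (names and tolerances are ours):

```python
import numpy as np

def solve_alpha(E, D, rho_a, m=2.0, lo=1e-6, hi=1e6, tol=1e-10):
    """Solve sum_k D_k / (alpha^(2/(m-1)) * E_k + D_k) = rho_a * n for alpha > 0.
    E[k] and D[k] are the per-pattern sums of (1/d)^(2/(m-1)) over the single
    classes and the ambiguity classes, respectively (Eq. (14))."""
    n = len(E)
    f = lambda a: (D / (a ** (2.0 / (m - 1)) * E + D)).sum() - rho_a * n
    # f is decreasing in alpha: alpha -> infinity means no ambiguity rejection
    for _ in range(200):
        mid = np.sqrt(lo * hi)        # bisection in log space
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
        if hi / lo < 1 + tol:
            break
    return np.sqrt(lo * hi)
```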
3. Ambiguity and distance rejections simultaneously

We propose a parallel approach for the management of both types of reject, qualified as distance and ambiguity rejection. They are considered at the same step and are explicitly treated as two additional classes. In fuzzy pattern recognition, an individual x_k located too far from a class gets a weak membership value for this class. In order to reduce the misclassification risk, a reject option is used to avoid classifying this individual. In most papers, the rules proposed in order to reject these individuals are based on threshold values in the decision processing (for example, a threshold value for each class) and not in the learning step. So, in a clustering algorithm based on the fuzzy c-means, where the membership values of x_k are normalized, the reject class cannot exist. Indeed, if x_k is too far from the centers, then it is located at about the same distance from all the classes, so μ_{ik} ≈ 1/c ∀i ∈ [1, c] (cf. Fig. 2).

Fig. 2. x_k cannot be rejected in distance in a clustering algorithm where the membership values are normalized.

To overcome this problem, Dave [27,28] proposed to include a noise cluster. In this approach, all points are considered to be equidistant from this cluster. However, using a single value for the distance of the noise cluster from all points is too restrictive and performs poorly when the cluster sizes vary widely in the data set. The method that Demko [17–19] proposes attenuates this problem by introducing an amorphous cluster called the class of distance rejection. This class allows us to reject an individual x_k when it is too far from the centers of all the classes. In this paper, we show how both kinds of rejection can be taken into account simultaneously. First, let us recapitulate the definitions given by Demko [18]:

Definition 5. Let Ω = {ω_1, …, ω_c} be the set of classes ω_i. We can consider that there is a collection of c reject classes, denoted ω̄_i, one for each single class ω_i. The behavior of each reject class ω̄_i must be defined in a dual way for each class ω_i (i.e. there exists a monotone decreasing function f_i such that f_i(μ_{ik}) = μ̄_{ik}).

Definition 6. Let X = (x_k)_{k∈[1,n]} be the family of objects, where x_k = (x_{k1}, x_{k2}, …, x_{kp}) is a pattern described by p features (i.e. x_k ∈ R^p). Another way to characterize the reject classes is to express the distance d̄_{ik} of an individual x_k to a class ω̄_i as d̄_{ik} = T_i/d_{ik}, where T_i is a threshold allowing the size of the reject class to be adjusted. The main idea is that the distance to the reject class is inversely proportional to the distance to the class.

Definition 7. Let Ω = {ω_1, …, ω_c} be the set of classes ω_i. The membership function μ_{∅k} of a pattern x_k to the global reject class, denoted ω_∅, is obtained by an m-ary aggregation operator G^(m) (with m = c) on the membership functions of this individual to the collection (ω̄_i)_{i∈[1,c]}. We will drop the superscript (m) when there is no fear of ambiguity.

Definition 8. In order to parametrize the proposed algorithm, we define a global reject rate ρ ∈ [0, 1] based on the whole individual set X = (x_k)_{k∈[1,n]}.

This new algorithm, called the fuzzy c+2-means, is then based upon the minimization of the fuzzy least-squared functional criterion

J_m(U, v) = J_m(U, v) + Σ_{i=1}^c Σ_{k=1}^n μ̄_{ik}^r d̄_{ik}²,   (15)

with the following constraints:

∀k ∈ [1, n]:  Σ_{i=1}^c μ_{ik} + G_{i=1}^c(μ̄_{ik}) + F_{A∈2^Ω\{∅,({ω_i})_{i∈[1,c]}}}(μ_{Ak}) = Σ_{A∈2^{Ω∪{ω_∅}}\{∅}} μ_{Ak} = 1  (F = Σ),

(1/n) Σ_{k=1}^n μ_{∅k} = ρ,   (16)

(1/n) Σ_{k=1}^n μ_{(c+1)k} = ρ_a,

0 ≤ μ_{Ak} ≤ 1  ∀A ∈ 2^Ω \ {∅},  0 ≤ μ̄_{ik} ≤ 1,

where J_m(U, v) is the objective function described in the previous section. This objective function can be specified as a sum of distances between the patterns and the corresponding prototypes of the single and reject clusters. The membership function μ_{∅k} = G_{i=1}^c(μ̄_{ik}), due to the reject class ω_∅, is the aggregation of the membership functions among the collection of reject classes (ω̄_i)_{i∈[1,c]}; it means that a pattern is rejected if and only if it is globally rejected by all the reject classes. The membership function μ_{(c+1)k} = F_{A∈2^Ω\{∅,({ω_j})_{j∈[1,c]}}}(μ_{Ak}), due to the ambiguity class ω_{c+1}, is the aggregation of the membership functions among the collection of partial ambiguity classes (ω_A)_{A∈2^Ω\{∅,({ω_j})_{j∈[1,c]}}}; it means that a pattern is ambiguous if and only if it is partially ambiguous (i.e. with respect to t > 1 single classes). In this study, for the amorphous global reject class, we use the algebraic product (a conjunction operator), which is, in our sense, the most adapted operator to remove the
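A small sketch of Definitions 6 and 7 under the algebraic-product (geometric-mean) reading used below; the 1/c exponent is our reading of the reproduction, so treat it as an assumption:

```python
import numpy as np

def reject_distance(d, T):
    """Distance to the amorphous reject class: dbar_ik = T_i / d_ik per class
    (Definition 6), aggregated over the c classes by a geometric mean,
    d_reject,k = (prod_i dbar_ik)^(1/c)."""
    dbar = T[:, None] / d                      # (c, n) per-class reject distances
    return np.exp(np.log(dbar).mean(axis=0))   # geometric mean over the classes
```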
major drawback of the classical fuzzy c-means algorithm: its noise sensitivity [16]. Since we consider the ambiguity rejection as a class-selective rejection scheme, we use a disjunction operator for the ambiguity class: the addition operator. In order to parametrize the proposed algorithm, we define the constraint Σ_{k=1}^n μ_{∅k}/n = ρ, where ρ ∈ [0, 1] is the global rejection rate based on the whole individual set X = (x_k)_{k∈[1,n]}, and the constraint Σ_{k=1}^n μ_{(c+1)k}/n = ρ_a, where ρ_a ∈ [0, 1] is the global ambiguity reject rate based on the whole pattern set. The other constraints suggest a probabilistic interpretation for the membership functions.
3.1. Theorem of the fuzzy c+2-means

The minimization of criterion (15) with constraints (16) gives the following solutions (the r parameter has disappeared: it is equal to the product mc for a suitable solution of the problem): ∀k ∈ [1, n], ∀A ∈ 2^{Ω∪{ω_∅}} \ {∅},

μ*_{Ak} = 1 / Σ_{B∈2^{Ω∪{ω_∅}}\{∅}} (d_{Ak}/d_{Bk})^{2/(m−1)},

and, ∀i ∈ [1, c],

v*_i = Σ_{k=1}^n μ_{ik}^m x_k / Σ_{k=1}^n μ_{ik}^m,

where ∀A = {ω_i}, i ∈ [1, c], d_{Ak} = ||x_k − v_i||_G is the classical distance used in the fuzzy c-means algorithm, and ∀A ∈ 2^Ω \ {∅, ({ω_j})_{j∈[1,c]}},

d_{Ak} = α (Σ_{ω_i∈A} d_{ik})² / (|A|² (Π_{ω_i∈A} d_{ik})^{1/|A|}),   (17)

where the term α is a suitable positive number; it plays the role of a scaling factor. The distance to the global reject class is defined by

d_{∅k} = T(ρ) / (Π_{i=1}^c d_{ik})^{1/c},   (18)

where T(ρ) is the threshold which controls a chosen global reject rate ρ [17]. Letting d_{(c+1)k} be the global ambiguity class distance, we define

(1/d_{(c+1)k})^{2/(m−1)} = Σ_{A∈2^Ω\{∅,({ω_j})_{j∈[1,c]}}} (1/d_{Ak})^{2/(m−1)}.   (19)

3.2. Sketch of the proof

By adding the constraint ∀k ∈ [1, n], Σ_{A∈2^{Ω∪{ω_∅}}\{∅}} μ_{Ak} = 1 to criterion (15) with a family of Lagrange multipliers (λ_k)_{k∈[1,n]}, one obtains

J̄_m(U, v, λ) = Σ_{A∈2^Ω\{∅}} Σ_{k=1}^n μ_{Ak}^m d_{Ak}² + Σ_{i=1}^c Σ_{k=1}^n μ̄_{ik}^r d̄_{ik}² + Σ_{k=1}^n λ_k (1 − Σ_{A∈2^{Ω∪{ω_∅}}\{∅}} μ_{Ak}).

3.2.1. μ*_{∅k} calculus

By deriving J̄_m(U, v, λ) with respect to μ̄_{ik}, one obtains, setting d_{∅k} = (Π_{i=1}^c d̄_{ik})^{1/c} and μ_{∅k} = Π_{i=1}^c μ̄*_{ik} (the proof is detailed by Demko [17,18]),

μ_{∅k} = λ̃_k (1/d_{∅k})^{2/(m−1)},   (20)

where λ̃_k = (λ*_k/m)^{1/(m−1)}. By deriving J̄_m(U, v, λ) with respect to μ_{Ak}, one obtains (cf. Section 2.5)

μ*_{Ak} = λ̃_k (1/d_{Ak})^{2/(m−1)}.

With the first constraint, we obtain

λ̃_k Σ_{A∈2^{Ω∪{ω_∅}}\{∅}} (1/d_{Ak})^{2/(m−1)} = 1,

which brings the expression of μ*_{Ak}, ∀A ∈ 2^{Ω∪{ω_∅}} \ {∅}, ∀k ∈ [1, n]:

μ*_{Ak} = 1 / Σ_{B∈2^{Ω∪{ω_∅}}\{∅}} (d_{Ak}/d_{Bk})^{2/(m−1)}.

3.2.2. Ambiguity rejection

To determine the parameter α such that (third constraint)

Σ_{k=1}^n μ*_{(c+1)k} = Σ_{k=1}^n Σ_{A∈2^Ω\{∅,({ω_j})_{j∈[1,c]}}} μ*_{Ak} = ρ_a n,

we choose

d_{Ak} = α (Σ_{ω_i∈A} d_{ik})² / (|A|² (Π_{ω_i∈A} d_{ik})^{1/|A|}),   (21)

which verifies Definition 1. We obtain (in what follows, we confine ourselves to α_A = α, a constant)

Σ_{k=1}^n [ Σ_A (1/d*_{Ak})^{2/(m−1)} ] / [ α^{2/(m−1)} Σ_{A∈{({ω_j})_{j∈[1,c]},∅}} (1/d_{Ak})^{2/(m−1)} + Σ_A (1/d*_{Ak})^{2/(m−1)} ] = ρ_a n,

with d*_{Ak} = (1/α) d_{Ak}, where the sums over A run over 2^Ω \ {∅, ({ω_j})_{j∈[1,c]}}.
This previous relation can be expressed as

Σ_{k=1}^n D_k / (α^{2/(m−1)} E_k(ρ) + D_k) = ρ_a n,   (22)

where E_k(ρ) = Σ_{A∈{({ω_j})_{j∈[1,c]},∅}} (1/d_{Ak})^{2/(m−1)} and D_k = Σ_{A∈2^Ω\{∅,({ω_j})_{j∈[1,c]}}} (1/d*_{Ak})^{2/(m−1)}.

The last step consists in computing the values of (d_{∅k})_{k∈[1,n]} such that Σ_{k=1}^n μ_{∅k} = ρn (second constraint):

Σ_{k=1}^n μ_{∅k} = ρn,

Σ_{k=1}^n 1 / Σ_{A∈2^{Ω∪{ω_∅}}\{∅}} (d_{∅k}/d_{Ak})^{2/(m−1)} = ρn,

Σ_{k=1}^n 1 / (1 + Σ_{A∈2^Ω\{∅}} (d_{∅k}/d_{Ak})^{2/(m−1)}) = ρn.

Using [18] d_{∅k} = (Π_{i=1}^c d̄_{ik})^{1/c} and d̄_{ik} = T_i/d_{ik}, we obtain

Σ_{k=1}^n 1 / (1 + Σ_{A∈2^Ω\{∅}} [ (Π_{i=1}^c T_i/d_{ik})^{1/c} / d_{Ak} ]^{2/(m−1)}) = ρn.   (23)

Setting (Π_{i=1}^c T_i)^{1/c} = T, (23) becomes

Σ_{k=1}^n 1 / (1 + Σ_{A∈2^Ω\{∅}} [ T / (d_{Ak} (Π_{i=1}^c d_{ik})^{1/c}) ]^{2/(m−1)}) = ρn.   (24)

This relation can be expressed as

Σ_{k=1}^n 1 / (1 + T^{2/(m−1)} E_k(ρ_a)) = ρn,   (25)

where

E_k(ρ_a) = Σ_{A∈2^Ω\{∅}} [ 1 / (d_{Ak} (Π_{i=1}^c d_{ik})^{1/c}) ]^{2/(m−1)}.

The system defined by formulas (22) and (25) can be solved by the Newton–Raphson method for α > 0 and T > 0. To apply this method, we initialize ρ_a by resolving the equation given by the fuzzy c+1-means [17]; this value is near the solution. At this point, a remark is indispensable: the special case ρ_a = 0 and ρ = 0 (no rejection at all) implies α → +∞ and T → +∞ (μ_{(c+1)k} = 0 and μ_{∅k} = 0), and the proposed extension of the fuzzy c-means reduces to the standard algorithm.

4. Numerical examples

Fig. 3. Two normalized Gaussian classes with α = 1 and ρ = 0.10.
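Putting the pieces together, one sweep of a plausible F(c+2)M loop looks as follows. This reuses the sketches given earlier (fcm, ambiguity_distances, extended_memberships, reject_distance) and keeps α and T as fixed knobs, whereas the paper retunes them at each sweep via Newton–Raphson so that the reject rates match ρ and ρ_a; it is a sketch of the structure, not the authors' code:

```python
import numpy as np

def fc2m_sweep(X, V, m=2.0, alpha=1.0, T=1.0):
    """One F(c+2)M sweep under fixed alpha and T: build all class distances,
    update the memberships, then update the centers from the single-class
    memberships only (ambiguous and rejected mass is discounted)."""
    c = V.shape[0]
    d = np.sqrt(((X[None] - V[:, None]) ** 2).sum(-1)) + 1e-12   # (c, n)
    dists = {(i,): d[i] for i in range(c)}                       # single classes
    dists.update(ambiguity_distances(d, alpha))                  # subsets |A| >= 2
    dists["reject"] = reject_distance(d, np.full(c, T))          # amorphous class
    U = extended_memberships(dists, m)
    W = np.stack([U[(i,)] for i in range(c)]) ** m               # single-class weights
    V_new = (W @ X) / W.sum(axis=1, keepdims=True)
    return U, V_new
```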
4.1. 1-D study

We first present a simple example to provide insight into the ambiguity reject approach. Here we discuss the shape of the membership functions in the 1-D case. Fig. 3 presents two normalized Gaussian classes with a global reject rate of 10% and α = 1. The membership functions μ_{1k} and μ_{2k} are as specified by the theoretical study. We notice a particular point which is totally ambiguous, characterized by the equality μ_{1k} = μ_{2k} = μ_{{1,2}k}. This can be explained as follows: from the equation

d_{{1,2}k} = α (d_{1k} + d_{2k})² / (4 (Π_{ω∈A} d_{ωk})^{1/|A|}),

we can notice that a totally ambiguous pattern verifies d_{1k} = d_{2k}, so (Π_{ω∈A} d_{ωk})^{1/|A|} = d_{1k}. Then (α = 1)

d_{{1,2}k} = (2 d_{1k})² / (4 d_{1k}) = d_{1k},

which means that the point is located at the same distance from the two classes and from the ambiguity class {ω_1, ω_2}, which is well represented in Fig. 3. Fig. 4 presents the case α = 0.5 with the same reject rate as previously. As shown before, a point located at the same distance from the two classes is defined by μ_{ik} ≤ μ_{{1,2}k} and μ_{1k} = μ_{2k}. We notice too that the reject class ω_∅ is not very sensitive to modifications of α, which means that the definition of ambiguity does not affect the shape of the distance reject class. Fig. 5 illustrates the behavior of the membership function of the ambiguity reject class under variations of α for a fixed ρ value. The ambiguity rejection introduces
Fig. 4. Two normalized Gaussian classes with α = 0.5 and ρ = 0.10.
Fig. 6. Example of a data set with three clusters in which x is ambiguous.
Fig. 5. Evolution of the ambiguity reject membership μ(x) as a function of α ∈ [0.5, 1], with ρ = 0.4.
Fig. 7. Behavior of the membership functions μ_{Ax} ∀A ∈ 2^Ω \ {∅} for ρ_a ∈ [0.01, 0.19]. We notice that the μ_{Ax} are equal ∀A ∈ 2^Ω \ {∅} with α = 1 (ρ_a close to 0.08).
a discounting process between the classical FcM membership functions. The normalization of the membership functions leads to odd effects with respect to the shape of the membership functions, but this is not a serious problem if a hard partition, such as in pattern classification, is required.

4.2. 2-D study: 3 classes, 4 ambiguity reject classes, 1 distance reject class

Fig. 6 shows a situation containing three clusters. The clusters are numbered in the order in which they would be encountered in a top-to-bottom, left-to-right scan of the figure. Due to the ambiguous pattern x, the cluster center estimates obtained by FcM are poor: for example, μ_{1x} ≠ 1. Fig. 7 illustrates the membership of x to the single and ambiguity classes as a function of ρ_a. x verifies d_{1x} = d_{2x} = d_{3x}. According to the relation

d_{Ax} = α (Σ_{ω_i∈A} d_{ix})² / (|A|² (Π_{ω_i∈A} d_{ix})^{1/|A|}),   (26)

we may notice that this point is totally ambiguous: the μ_{Ax}, respectively the d_{Ax}, are equal ∀A ∈ 2^Ω \ {∅} with α = 1 and Ω = {ω_1, ω_2, ω_3}. This point, defined as ambiguous, counts less in the computation of the centers than in the classical FcM algorithm. So, when ρ_a increases (i.e. α decreases), the degree of membership μ_{(c+1)x} tends to 1.

4.3. The diamonds data case

In this example, we use the two-dimensional data set presented by Pal et al. [16] (see Figs. 8 and 9), whose coordinates are given in Table 2. It is separated into two data sets, X_11 and X_12 = X_11 ∪ {x_12}, where x_12 is a noisy point located far away from the two clusters.
Fig. 8. Data set X_11.

Table 2
Membership functions for 2 classes (m = 2, ρ = 0)

Data pt | x | y | FcM on X_11: μ_{1k} | μ_{2k} | FcM on X_12: μ_{1k} | μ_{2k}
1 | −5.00 | 0.00 | 0.95 | 0.05 | 0.93 | 0.07
2 | −3.34 | 1.67 | 0.94 | 0.06 | 0.97 | 0.03
3 | −3.34 | 0.00 | 1.00 | 0.00 | 0.99 | 0.01
4 | −3.34 | −1.67 | 0.94 | 0.06 | 0.90 | 0.10
5 | −1.67 | 0.00 | 0.91 | 0.06 | 0.92 | 0.08
6 | 0.00 | 0.00 | 0.50 | 0.50 | 0.50 | 0.50
7 | 1.67 | 0.00 | 0.09 | 0.91 | 0.08 | 0.92
8 | 3.34 | 1.67 | 0.06 | 0.94 | 0.03 | 0.97
9 | 3.34 | 0.00 | 0.00 | 1.00 | 0.01 | 0.99
10 | 3.34 | −1.67 | 0.06 | 0.94 | 0.10 | 0.90
11 | 5.00 | 0.00 | 0.05 | 0.95 | 0.06 | 0.94
12 | 0.00 | 10.00 | — | — | 0.50 | 0.50
Fig. 9. Data set X_12.
There are two diamond-shaped clusters composed of five points each, plus one middle point. To show the interest of the algorithm, we have applied the same tests performed by Pal et al. [16] and Demko [17], using the fuzzy c+2-means algorithm on both data sets X_11 and X_12.

4.3.1. Limit of FcM

Tables 2 and 3 show the fuzzy c-means problem illustrated by the membership values of x_6 and x_12. Though these two patterns are differently far away from the two clusters, they are characterized by equal membership values. In particular, Table 3 shows the set of points sorted by their membership values with ρ = 0 (i.e. the FcM case). Concerning the points x_6 and x_12, the result is as we expected, because these two points are far away from the two classes. So the fact that x_6 and x_12 are differently
Table 3
Sorted membership functions for 2 classes (m = 2), FcM algorithm; points are sorted by increasing membership (− to +)

Class 1: 9, 8, 11, 7, 10, 6, 12, 4, 5, 1, 2, 3
Class 2: 3, 2, 1, 5, 4, 6, 12, 10, 7, 11, 8, 9
located with respect to classes 1 and 2 (see Fig. 9) is not taken into account by FcM. The point x_3 (respectively x_9) strongly belongs to cluster 1 (respectively cluster 2). x_6 and x_12 globally have the same behavior; i.e. their membership functions to the reject class increase when ρ increases. The essential difference between x_6 and x_12 is the magnitude of the derivative of the membership function to the reject class with respect to ρ, i.e. x_12 is rejected faster than x_6. So x_12 is rejected while x_6 is not: this point belongs uniformly to the three classes. This poses a problem for the decision making: x_6 does not belong to any class. This shows the interest of building a third class dealing with the ambiguity of the point. Table 4 presents the results obtained by Pal et al. [16] with the FPcM algorithm. Points are sorted by typicalities, so that x_12 is the least representative point according
Table 4
Sorted typicalities for 2 classes (m = 2, g = 2.0), FPcM algorithm; points are sorted by increasing typicality (− to +)

Class 1: 12, 11, 10, 8, 9, 7, 6, 4, 1, 5, 2, 3
Class 2: 12, 1, 4, 2, 3, 5, 6, 10, 11, 7, 8, 9

Fig. 11. Data set X_12, ρ = 0.25, point x_12.
Fig. 10. Data set X_12, ρ = 0.25, point x_6.
to both classes. This is a result similar to that of the F(c+1)M algorithm [17,18].

4.3.2. Introduction of the ambiguity class: F(c+2)M

Fig. 10 shows the different membership values of x_6 in the case ρ = 0.25 and α ∈ [0.5, 1]. Let us notice that for α = 1 we have μ_{1k} = μ_{2k} = μ_{{1,2}k}. For an important ambiguity rejection (weak α), the membership value of the ambiguity class is preponderant. Finally, let us notice that the membership of the reject class is not very sensitive to variations of α. Fig. 11 shows the right behavior of our algorithm on the point x_12 (ρ = 0.25). Moreover, the membership value according to the reject class is not very sensitive to variations of α. We can easily remark that the membership value of the point x_12 according to the ambiguity class is higher than those of both classes 1 and 2 (as Fig. 9 shows).
Fig. 12. Data set X_12, ρ = 0.25, point x.
Fig. 12 presents the membership of x according to the four classes as a function of α. Once again, it illustrates the properties of the chosen norm: μ_{∅x} is not sensitive to α, but the variations of the membership measures with α are shared between the single-class and the ambiguity memberships. Fig. 13 shows one of the main interests of our F(c+2)M algorithm with respect to the search for class centers. It deals with the influence of α on the localization of the class centers, represented by their distances from x_6, the mesh center. The points defined as ambiguous (α = 0.5) are taken into account more and more in the computation of the centers as α increases, which shows up as smaller and smaller distances. Figs. 14 and 15 present the behavior of the membership functions for x_6 and x_12 with a high reject rate. Once again we notice that μ_{1k} = μ_{2k} = μ_{{1,2}k} when α = 1, and that μ_{∅k} is not very sensitive to α. For x_6, the decrease of the membership is shared equivalently between the remaining memberships.
Fig. 13. Data set X_12, ρ = 0.25, distances from the centers.
Fig. 16. Data set X.
Our algorithm can distinguish between a moderately atypical member and an extremely atypical member. Moreover, the Fc+2M algorithm gives low memberships to the noise points (high memberships for the distance reject class) and the cluster center estimates are quite acceptable.

4.4. The treye data case

Here we work on a 2-D data set concerning three classes, which is drawn in Fig. 16. The study illustrates the right behavior of both the distance and the ambiguity rejections, which shows the interest of the Fc+2M with respect to the reliability of the clustering.
Fig. 14. Data set X_12, ρ = 0.4, point x_6.

Fig. 15. Data set X_12, ρ = 0.4, point x_12.
• points marked 1, 2, 3, 4, 5 belong to class 1;
• points marked 8, 9, 10, 11, 12 belong to class 2;
• points marked 13, 14, 15, 16 and 18 belong to class 3;
• points marked 6 and 7 are partially ambiguous; the point marked 17 is totally ambiguous.
Fig. 17 shows the behavior of the membership functions of the points x_6 and x_7 when ρ_a ∈ [0.01, 0.2]. Point x_6 gives what we expected: for weak values of ρ_a, x_6 mainly belongs to class 1 (x_7 mainly belongs to class 2). The pattern x_17 is characterized as strongly ambiguous for ρ_a > 0.14. In the FcM algorithm, the membership of a point in a class depends on the memberships of the point in all the other classes. So, for example, μ_{1x} ≠ 1 for weak values of ρ_a (i.e. the FcM case). That significantly affects the estimates of the cluster centers. As can be seen (Fig. 17c), the center estimates are poor in this case. However, for a high value of ρ_a, the memberships of x_6 and x_7 to the ambiguity classes increase. So, for ρ_a close to 0.2, the performance is quite acceptable: d_{v_1x} = d_{v_2x} = d_{v_3x} = 5, where v_1, v_2 and v_3 are the prototypes.
Fig. 17. Data set X. Membership values of the points x_6 and x_7, and distances from the cluster centers to x. For a high value of ρ_a, the memberships of x to the ambiguity classes increase. So, for ρ_a close to 0.2, the center estimates are quite acceptable (d_{v_1x} = d_{v_2x} = d_{v_3x} = 5), where v_1, v_2 and v_3 are the cluster centers. For ρ_a > 0.14, x_17 is characterized as strongly ambiguous.
4.5. The Fisher's IRIS

We have tested our method on the IRIS data set [29], which has been used extensively for evaluating the performance of pattern clustering algorithms. We compared it with the results obtained by Pal et al. [16] and Demko et al. [17]. The IRIS flowers data set is composed of 150 patterns divided into three physical classes representing different IRIS subspecies. Each class contains 50 patterns. One of the three clusters is clearly separated from the other two, while these two classes admit some overlap. In Tables 5–8, we present results in terms of numbers of resubstitution errors. The errors committed by the hardened label matrices are identified by notation such as E(H_MT(U)), where H stands for the hardening of the matrix argument by either maximum memberships (MM) or maximum typicalities (MT), and E stands for errors. Table 5 shows the results obtained by Pal et al. [16]. In the other tables, E(H_MM(U)) is the resubstitution error of the fuzzy c+2-means, R_D(H_MM(U)) is the number of patterns rejected in distance with a Euclidean norm or a Mahalanobis norm, and R_A(H_MM(U)) is the number of patterns rejected in ambiguity with a Euclidean norm or
Table 5
Resubstitution errors on the IRIS data using FPcM and FcM; g is a user-defined constant for the FPcM algorithm

m | g | E(H_MT(T)) FPcM | E(H_MM(U)) FPcM | E(H_MM(U)) PcM
1.5 | 1.5 | 17 | 17 | 17
1.5 | 3.0 | 15 | 17 | 17
2.0 | 2.0 | 15 | 16 | 16
2.0 | 5.0 | 15 | 16 | 16
3.0 | 3.0 | 14 | 14 | 15
5.0 | 2.0 | 12 | 14 | 15
5.0 | 5.0 | 12 | 15 | 15
a Mahalanobis norm. They are based on the comparison of the hardened versions of the U matrix with the correct crisp partition of IRIS. We explain these results in the following paragraphs. First, various general remarks can be made:

• The unexpected result (0 patterns rejected) in some cases shows the interest of taking the reject notions into account during the learning step rather than during the
Table 6
Resubstitution error and number of rejected patterns for m = 3, ρ = 0.08 and ρ_a ∈ [0.19, 0.27] (Euclidean norm)

ρ_a | E(H_MM(U)) | R_D(H_MM(U)) | R_A(H_MM(U))
0.15 | 11 | 0 | 0
0.16 | 11 | 0 | 0
0.17 | 11 | 0 | 0
0.18 | 11 | 0 | 0
0.19 | 11 | 0 | 0
0.20 | 11 | 0 | 0
0.21 | 11 | 0 | 0
0.22 | 11 | 0 | 0
0.23 | 11 | 0 | 0
0.24 | 11 | 0 | 1
Table 7
Resubstitution error and number of rejected patterns for m = 1.2, ρ = 0.02 and ρ_a ∈ [0.005, 0.04] (Mahalanobis norm)

ρ_a | E(H_MM(U)) | R_D(H_MM(U)) | R_A(H_MM(U))
0.005 | 3 | 3 | 0
0.01 | 3 | 3 | 0
0.015 | 2 | 3 | 2
0.02 | 1 | 3 | 3
0.025 | 1 | 3 | 3
0.03 | 1 | 3 | 4
0.035 | 1 | 3 | 4
0.04 | 2 | 3 | 5
Table 8
Behavior of R_A for given m = 1.1 and ρ_a = 0.025: it shows a weak variation according to ρ (Mahalanobis norm)

ρ | E(H_MM(U)) | R_D(H_MM(U)) | R_A(H_MM(U))
0.020 | 1 | 3 | 3
0.025 | 1 | 4 | 3
0.030 | 1 | 4 | 3
0.035 | 1 | 5 | 3
0.040 | 1 | 5 | 3
0.045 | 1 | 5 | 3
0.050 | 1 | 6 | 3
0.055 | 2 | 9 | 2
0.060 | 2 | 9 | 3
0.065 | 2 | 9 | 3
0.070 | 2 | 10 | 3
0.075 | 2 | 11 | 3
decision step. The patterns are globally rejected in distance or in ambiguity at a rate equal to ρ or ρ_a, but the hardened label remains the best. Indeed, the center locations are better because the centers are far away from each other.
• For particular values of ρ, α (which is globally taken into account through ρ_a) and m, 7 misclassified or non-classified points were obtained once, distributed as follows: E(H_MM(U)) = 1, R_D(H_MM(U)) = 3 and R_A(H_MM(U)) = 3. This shows the interest of using both the distance and the ambiguity rejections in the classification step. So, the misclassification rate is weak.

The final centers produced by the FcM and F(c+2)M algorithms are slightly different. In the second case, they are close to the centers of the three physical classes. We show in Table 6 the resubstitution error and the number of rejected patterns for m = 3, ρ = 0.08 and ρ_a ∈ [0.19, 0.27], when a Euclidean norm is used in the distance calculations. It is worth noting that the performance of the proposed algorithm increases significantly with respect to correct classification (compare with Table 5). The distance or ambiguity reject rate makes it possible to modify the cluster centers (with respect to FcM), because the ambiguous and rejected patterns are taken less into account. Table 7 shows, for the particular values m = 1.2 and ρ = 0.02, the resubstitution error and the number of rejected patterns using the Mahalanobis norm according to the Gustafson and Kessel method, similarly to [8]. In the light of the previous examples, the reject class, like the ambiguity reject classes, makes it possible to locate the centers sharply and to decrease E(H_MM(U)) compared with the classical FcM, F(c+1)M and FPcM algorithms. The results are largely better; for instance, the best classical FcM result with the same norm is 7 errors, where F(c+2)M gives E(H_MM) = 1. The cases ρ_a ∈ [0.02, 0.025] clearly show the good results obtained with the F(c+2)M, where the points misclassified by the classical FcM are now either correctly classified (better center locations, the most significant points being acutely taken into account), or distance rejected, or ambiguity rejected. Moreover, in this last case, our algorithm is able to specify which composite class these patterns belong to. Table 8 shows the right behavior of R_A(H_MM(U)) for ρ ∈ [0.02, 0.075]: it exhibits a weak variation with this parameter. The distance reject rate ρ seems to be independent of the ambiguity reject rate. In order to qualify the performance of the presented algorithm, Table 9 compares the results obtained by different unsupervised clustering algorithms, with or without reject option (FcM, FPcM, F(c+1)M, F(c+2)M). This table shows the error (misclassification) measure P_e, the ambiguity reject measure P_a and the distance reject measure P_d. The proposed F(c+2)M algorithm leads to a decrease in the misclassification measure. So, ambiguity rejection makes it possible to reach the best performance with respect to the misclassification measure. The higher ρ_a, the more the patterns are ambiguity rejected; therefore, the lower the misclassification and the higher the ambiguity probability.
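The quantities E(H_MM(U)), R_D and R_A reported in Tables 6–8 amount to hardening the membership matrix and counting. One plausible reading, sketched for illustration (the tie-breaking conventions are our assumptions):

```python
import numpy as np

def harden_with_rejects(U_single, u_reject, u_ambig, labels_true):
    """Maximum-membership hardening H_MM with reject options: a pattern is
    counted in R_D (R_A) when its reject (ambiguity) membership beats every
    single-class membership; the rest are assigned by argmax and compared
    with the integer array labels_true to give E(H_MM(U))."""
    best = U_single.max(axis=0)
    dist_rej = u_reject > np.maximum(best, u_ambig)
    ambi_rej = (u_ambig >= best) & ~dist_rej
    kept = ~(dist_rej | ambi_rej)
    errors = int((U_single.argmax(axis=0)[kept] != labels_true[kept]).sum())
    return errors, int(dist_rej.sum()), int(ambi_rej.sum())
```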
Table 9
The best performance of unsupervised clustering algorithms

        FcM    FPcM   F(c+1)M   F(c+2)M
P_e     4.66   4.66   2         0.66
P_a     —      —      —         2
P_d     —      —      2.66      2
Fig. 18. The maximum membership function on the main factorial plane of IRIS data among the four ambiguity reject classes with m = 1.5 and ρ = 0.08, ρ_a = 0.05. Mahalanobis norm.

Fig. 19. The membership function on the main factorial plane of IRIS data for the amorphous reject cluster with m = 1.5 and ρ = 0.08, ρ_a = 0.05. Mahalanobis norm.

Fig. 20. The maximum membership function on the main factorial plane of IRIS data among the three classes with m = 1.5 and ρ = 0.08, ρ_a = 0.05. Mahalanobis norm.
In Figs. 18 and 19 (respectively Fig. 20) are shown the maximum membership functions of the five reject classes (ambiguity and distance rejection) (respectively the three classes) in the main factorial plane. This example illustrates the partition of the pattern space according to the (3+4+1) classes. In the third case, the projection of this function in terms of iso-membership values appears to produce ellipsoidal classes, which is the expected result.
5. Conclusion

We started this paper by justifying the interest of finding a fuzzy partition and of introducing two types of reject classes in the clustering process: distance rejection, dealing with patterns that are far away from all the classes, and ambiguity rejection, dealing with patterns lying near the boundaries of classes. The method that we propose is formulated as a constrained minimization problem whose solution depends on a fuzzy objective function into which reject options are introduced. To prevent the memberships from being spread across the classes and to allow a distinction between "equally likely" and "unknown", we define partial ambiguity rejections, which introduce a discounting process between the classical FcM membership functions. The model is defined on the principle that wherever a discontinuity occurs, the interaction should diminish. To improve the performance of our algorithm in the presence of noise, we define an amorphous noise cluster. The advantage of our method is that the membership values obtained are absolute values and are learned in the iterative clustering process. Moreover, it is not necessary to compute other characteristics in order to know the reject degrees. Preliminary computational experience with the developed algorithm is encouraging, and it compares favorably with results from other methods, such as the FcM, FPcM and F(c+1)M algorithms, on the same data sets. The differences in performance can
be attributed to the fact that ambiguous patterns are taken into account to a lesser degree in the computation of the centers, so the final centers are close to the centers of the physical clusters. The algorithm uses two additional parameters: ρ, a chosen global reject rate which was introduced in the F(c+1)M algorithm, and ρ_a, a global ambiguity reject rate. The behavior is the following: the higher ρ_a, the more patterns are ambiguity rejected and, therefore, the lower the misclassification and the higher the ambiguity probability; the higher ρ, the more patterns are rejected in distance and, therefore, the lower the misclassification and the higher the distance reject measure. For ρ_a → 0 and ρ → 0, the F(c+2)M algorithm is equivalent to the FcM algorithm. Moreover, our algorithm is able to specify the composite classes to which the patterns belong. Extensive investigation of the properties of the proposed approach in image analysis may constitute a further step in our research. We are currently exploring this issue.
References

[1] J. MacQueen, Some methods of classification and analysis of multivariate observations, Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Berkeley, CA, 1967.
[2] M.R. Anderberg, Cluster Analysis for Applications, Academic Press, New York, 1973.
[3] S.K. Pal, D.D. Majumder, Fuzzy sets and decision making approaches in vowel and speaker recognition, IEEE Trans. Systems Man Cybernet. 7 (1977) 625–629.
[4] S.K. Pal, Fuzzy tools in the management of uncertainty in pattern recognition, image analysis, vision and expert systems, Int. J. Systems Sci. 22 (1991) 511–549.
[5] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1987.
[6] E. Ruspini, A new approach to clustering, Inform. Control 15 (1969) 22–32.
[7] J.C. Bezdek, R.J. Hathaway, M.J. Sabin, W.T. Tucker, Convergence theory for fuzzy c-means: counterexamples and repairs, IEEE Trans. Systems Man Cybernet. 17 (5) (1987) 873–877.
[8] D.E. Gustafson, W.C. Kessel, Fuzzy clustering with a fuzzy covariance matrix, Proceedings of IEEE CDC, San Diego, CA, January 10–12, 1979, pp. 761–766.
[9] R. Krishnapuram, A possibilistic approach to clustering, IEEE Trans. Fuzzy Systems 1 (2) (1993) 98–110.
[10] N.B. Karayiannis, M. Ravuri, An integrated approach to fuzzy learning vector quantization and fuzzy c-means clustering, in: C.H. Dagli et al. (Eds.), Vol. 4, ASME Press, New York, USA, March 1997, pp. 247–252.
[11] N.B. Karayiannis, Generalized fuzzy k-means algorithms and their application in image compression, SPIE Proceedings: Applications of Fuzzy Logic Technology II, Vol. 2493, Orlando, FL, April 1995, pp. 206–217.
[12] R.N. Dave, Generalized fuzzy c-shells clustering and detection of circular and elliptical boundaries, Pattern Recognition 25 (1992) 713–721.
[13] B. Dubuisson, Decision with reject option, European Signal Processing Conference, Barcelona, Spain, 1990.
[14] B. Dubuisson, M.H. Masson, A statistical decision rule with incomplete knowledge about classes, Pattern Recognition 26 (1993) 155–165.
[15] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.
[16] N.R. Pal, K. Pal, J.C. Bezdek, A mixed c-means clustering model, Proceedings of the Sixth IEEE International Conference on Fuzzy Systems, Barcelona, July 1997, pp. 11–21.
[17] C. Demko, M. Menard, P. Loonis, The fuzzy c+1-means: introduction of reject in clustering, IEEE Trans. Pattern Anal. Mach. Intell., under review.
[18] C. Demko, P. Loonis, M. Michel, Les c+1-moyennes floues: introduction du rejet en classification, Actes des sixièmes rencontres de la Société Francophone de Classification, Montpellier, France, September 1998.
[19] C. Demko, Les c+1-moyennes floues: élimination du bruit en classification, in: Actes des rencontres francophones sur la logique floue et ses applications, Rennes, France, November 1998.
[20] C.K. Chow, On optimum recognition error and reject tradeoff, IEEE Trans. Inform. Theory 16 (1970) 41–46.
[21] C.K. Chow, Recognition error and reject trade-off, in: Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval, Univ. of Nevada, Las Vegas, USA, April 1994, pp. 1–8.
[22] T.M. Ha, The optimum class-selective rejection rule, IEEE Trans. Pattern Anal. Mach. Intell. 19 (6) (1997) 608–614.
[23] Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning About Data, Kluwer Academic Publishers, Dordrecht, 1991.
[24] T. Pavlidis, A critical survey of image analysis methods, ICPR (1986) 502–511.
[25] A. Blake, A. Zisserman, Visual Reconstruction, MIT Press, Cambridge, MA, 1987.
[26] I. Bloch, H. Maitre, Fuzzy mathematical morphologies: a comparative study, Pattern Recognition 28 (9) (1995) 1341–1387.
[27] R. Dave, Characterization and detection of noise in clustering, Pattern Recognition Lett. 12 (1991) 657–664.
[28] R. Dave, Robust fuzzy clustering algorithms, IEEE International Conference on Fuzzy Systems, San Francisco, Vol. 2, 1993, pp. 1281–1286.
[29] R.A. Fisher, The statistical utilisation of multiple measurements, Ann. Eugen. 8 (1938) 376–386.
About the Author – MICHEL MÉNARD is currently an assistant professor at the University of La Rochelle, France. He holds a Ph.D. degree from the University of Poitiers, France (1993). His research interests are fuzzy pattern recognition, fuzzy sets and data fusion, with particular applications to medical imaging.
About the Author – CHRISTOPHE DEMKO is currently an assistant professor at the University of La Rochelle, France. He received an engineering degree and holds a Ph.D. degree from the University of Technology of Compiègne, France (1992). His research interests are fuzzy logic, multi-agent systems and pattern recognition.
About the Author – PIERRE LOONIS, born in 1968, is currently an assistant professor at the University of La Rochelle, where he received his Ph.D. degree in Pattern Recognition (1996). His main scientific interests include fuzzy pattern recognition, aggregation of multiple classification decisions, neural networks, genetic algorithms and real-world applications.
Pattern Recognition 33 (2000) 1239–1250
Self-organizing neural networks based on spatial isomorphism for active contour modeling☆

Y.V. Venkatesh*, N. Rishikesh

Computer Vision and Artificial Intelligence Laboratory, Department of Electrical Engineering, Indian Institute of Science, Bangalore-560 012, India

Received 9 September 1998; received in revised form 8 February 1999; accepted 8 February 1999
Abstract

The problem considered in this paper is how to localize and extract object boundaries (salient contours) in an image. To this end, we present a new active contour model, which is a neural network based on self-organization. The novelty of the model consists in exploiting the principles of spatial isomorphism and self-organization in order to create flexible contours that characterize shapes in images. The flexibility of the model is effectuated by a locally co-operative and globally competitive self-organizing scheme, which enables the model to cling to the nearest salient contour in the test image. To start this deformation process, the model requires a rough boundary as the initial contour. As reported here, the implemented model is semi-automatic, in the sense that a user-interface is needed for initializing the process. The model's utility and versatility are illustrated by applying it to the problems of boundary extraction, stereo vision, bio-medical image analysis and digital image libraries. Interestingly, the theoretical basis for the proposed model can be traced to the extensive literature on Gestalt perception, in which the principle of psycho-physical isomorphism plays a role. © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Active contours; Deformable templates; Deformation of patterns; Gestalt psychology; Spatial isomorphism; Neural networks; Self-organization; Snakes
1. Introduction

A goal of computational vision is to extract the shapes of two- and three-dimensional objects from images of physical scenes. To this end, most of the present literature deals with model-based techniques, which use a model of the object whose boundary representation is matched to the image in order to extract the boundaries of the object. The models used in such a process could be either rigid, as in the case of simple template-matching approaches,
☆ An earlier, brief version of this paper has appeared in the Proceedings of the IEEE International Conference on Neural Networks 1997 (ICNN-97) [1].
* Corresponding author. Tel.: +91-80-3092572; fax: +91-80-3341683.
E-mail addresses:
[email protected] (Y.V. Venkatesh),
[email protected] (N. Rishikesh).
or non-rigid, as in the case of deformable models. The latter, which deform themselves in the process of matching, have come to be known as active contour models (ACM). It is to be noted that the word active is used to represent the dynamical nature of the models in the process of matching. Because these active contour models are more flexible than the earlier rigid models, they have been effectively employed in resolving various problems in vision: stereo matching [2], motion tracking [2,3], detection of subjective contours [2], segmentation of biomedical images [3,4], face recognition [5,6], and so on. In this paper, we propose a new active contour model based on self-organization. This model differs completely from the other models in both the underlying theory and the implementation. We utilize a modification of the neural-network model proposed by Ganesh Murthy and Venkatesh [7] and utilized by Shanmukh et al. [8], who, for pattern classification, employ self-organizing networks
(SON) that are spatially isomorphic to patterns. While exploiting the simplicity and elegance of the above model, we modify the underlying theory to fit the problem of contour extraction. An analogy, which closely relates the model to the age-old theory of psycho-physiological isomorphism [9], is presented as a possible theoretical basis for its development.

The paper is organized as follows: Section 2 describes the existing ACMs and the various approaches used towards achieving the goal of modeling contours. Section 3 presents an analogy relating the model to the concept of psycho-physiological isomorphism. Section 4 is concerned with the application of the concept of spatial isomorphism (with respect to neural networks) to character and object recognition [7,8,10]. Section 5 presents the proposed active contour model, along with a description of the constraints imposed and the various methods of initialization. Section 6 discusses the implementation details of the approach, listing its distinct characteristics and advantages. Section 7 describes the applications of the model to contour extraction, stereo-image analysis, biomedical image interpretation, and image libraries. Section 8 concludes the paper.
2. Existing ACMs

The snake [2] is probably the first proposed ACM: a controlled continuity-spline under the influence of internal (spline), image and external (constraint) forces. The internal spline forces impose a smoothness constraint, while the image forces push the snake towards salient features (lines, edges, subjective contours, etc.). The external constraint forces, on the other hand, are responsible for placing the snake near the desired local minimum, and originate from the choice of the initial contour, which, in turn, is governed by higher-level image interpretation algorithms. The problem of contour modeling is then cast in the framework of energy minimization, with the energy functions consisting of terms corresponding to the internal, image and external forces. The internal spline energy involves first- and second-order terms, controlled by parameters which are themselves functions of the parameter representing the position of the snake. The image force, on the other hand, involves weighted values of line, edge and termination functionals. Finally, the external constraint force is used to select a local minimum of the chosen energy function.

In the attempt to overcome some of the shortcomings of the snake model, the ACMs proposed in the literature either modify the energy functionals used in the original snake model or propose new approaches, a few of which are discussed below.

Leymarie and Levine [3] employ the snake model for segmenting a noisy intensity image, and for tracking
a non-rigid object in a plane. They also propose an improved terminating criterion (for the optimization scheme in the snake model) on the basis of topological features of the graph of the intensity image. Amini et al. [11] discuss the problems associated with the original snake model, and present an algorithm for active contours based on dynamic programming. They formulate the optimization problem as a discrete multistage decision process, and solve it using a time-delayed discrete dynamic programming algorithm.

Cohen [12] proposes a "balloon" model as a way to generalize and solve some of the problems encountered with the original snake model. Cohen introduces an internal pressure force by regarding the curve or surface as a balloon which is inflated, and modifies the internal and external forces used in the snake model by adding the pressure force, so that the boundary is pushed out as if air were introduced inside. Cohen [4] generalizes the balloon model to a 3-D deformable surface (which is generated in 3-D images).

Lai and Chin [13] propose a global contour model, called the generalized active contour model, or g-snakes. Their active contour model is based on a shape matrix which, when combined with a Markov random field (used to model local deformations), yields a prior distribution that exerts influence over the global model while allowing for deformations. Moreover, they claim that their internal energy function, unlike that of the snake model (which constrains the solution to the class of controlled continuity-splines), is more general because it allows the incorporation of prior models to create a bias towards a particular type of contour. Lai and Chin [14] present a min–max criterion which automatically determines the optimal regularization at every location along the boundary.

Chiou and Hwang [15] suggest a neural-network-based stochastic active contour model in which a feedforward neural network is used to build a knowledge base of distinct features, so that the external energy function used in the snake model can be formulated systematically.

Staib and Duncan [16] consider parametrically deformable models for boundary finding, which is formulated as an optimization problem for estimating the maximum of the a posteriori probability function (MAP). They apply flexible constraints, in the form of a probabilistic deformable model, to the problem of segmenting natural 2-D objects whose diversity and irregularity of shape preclude their representation in terms of fixed features or form.

Malladi et al. [17] describe a new ACM based on a level-set approach for recovering the shapes of objects in two and three dimensions. According to them, parametric boundary representation schemes (similar to the snake model) encounter difficulties when the dynamic model embedded in a noisy data set expands/
shrinks along a normal field. They further report that their modeling technique avoids the Lagrangian geometric view (as in snakes), and instead capitalizes on a related initial-value partial differential equation.

Jain et al. [18] employ deformable templates (which are also, in a sense, ACMs) for object matching. Here, prior knowledge of an object shape is described by (i) a prototype template characterized by representative contours/edges; and (ii) a set of probabilistic deformation transformations on the template. A Bayesian scheme, which is based on this prior knowledge and the edge information in the input image, is used to find a match between the deformed template and the objects in the image.
3. Gestalt psychology and isomorphism

Psycho-physiological (or psycho-physical) isomorphism is the theory that patterns of perception and of cerebral excitation show a one-to-one topological correspondence, in which the spatial and temporal orders of items and events in the conscious and cerebral fields are the same, although the spatial and temporal intervals between items and events (while they may correspond in their orders) do not agree in their magnitudes [9]. This view has a considerable history and plays an important role in the Gestalt school of psychology (cf. Chapter VIII in Herrnstein and Boring [9]). A set of points is said to be isomorphic to another set of points if every point in one corresponds to a point in the other, and the topological relations or spatial orders of the points are the same in the two. The Gestalt psychologists believed that the distribution of electrical activity within the brain resembles the shape of the object seen. This apparent resemblance between perception and brain activity plays a prominent role in Gestalt theory [20].

The proposed approach can be brought into this framework of psycho-physiological isomorphism because it creates a network of neurons topologically equivalent to (or isomorphic to) the points in the image plane (see Section 5); that is, a one-to-one correspondence is made between the image points and the neurons.

The theory of isomorphism, apparently reasonable in principle, turns out to be wrong, as is evident from recent findings about the functioning mechanisms of mammalian brains, which clearly show that the visual world is not represented as an isomorphic picture within
If a system of points is marked on a flat rubber membrane, and the membrane is then stretched tightly over some irregular surface, then the points in the stretched membrane are isomorphic to the points in the flat membrane [19].
the brain [20]. Retinal signals (from the 130-million-odd receptors) pass through the (one million or so) retinal ganglion cells, which collate messages from the numerous photoreceptors and summarize them in a biologically relevant manner. Observe that, theoretically, there cannot exist a one-to-one mapping from the retinal receptors to the retinal ganglion cells, in view of the 130 : 1 compression factor. (It should be noted that this observation is made in a general context, because there may be a one-to-one correspondence between retinal receptors and ganglion cells for the receptors in the foveal region of the retina [21].) The neural signals from the ganglion cells then pass through the superior colliculus and the lateral geniculate nucleus on the way to the visual cortex. The current interpretation is that the theory of isomorphism cannot be valid, since there is no one-to-one mapping from the retinal receptors to the visual cortex.
4. Character and object recognition

In this section, we discuss the application of spatial isomorphism to character and object recognition [7,8,10]. The human vision system recognizes patterns in spite of scale changes, rotation and shift. Possibly, this is achieved by a conscious establishment of correspondence between significant features of the model and those of the retinal image. The result of classification then depends on the ease of correspondence of the given pattern with each of the model images (exemplars). The exemplar with which the correspondence is established most easily could then be considered as the class to which the test pattern belongs. In other words, human recognition is perhaps guided by the amount of mental deformation the exemplar has to undergo to match the given unknown pattern.

This idea of using a deformation strategy and a corresponding deformation measure for classification has been successfully exploited (with very good accuracy) by Ganesh Murthy and Venkatesh [7] and by Shanmukh et al. [8] for the recognition of 2-D objects and characters subject to rotation and scaling. Here, a binary template of each and every model is stored as a model image (exemplar). During the recognition phase, a network of neurons is created for each of the exemplars, with the neurons in each network arranged in exactly the same way as the pixels of the corresponding exemplar; that is, the network created is spatially isomorphic to the exemplar. Then, a locally co-operative weight-updating scheme is used to deform each of the networks so as to establish the correspondence between the test pattern and the exemplars. Once the mappings of the networks onto the test pattern are established, a deformation measure is used to find the network which has undergone the least
deformation to establish this mapping, and the test pattern is classified as the pattern corresponding to that network. The method uses a self-organization scheme similar to Kohonen's [22] algorithm, but is completely different from it in terms of architecture. Explicitly, the method does not employ the neural architecture with a lattice of neurons typical of Kohonen's network. (An illustrative sketch of this classification scheme is given below.)
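The precise deformation measure is defined in Refs. [7,8]; the sketch below is only a hedged illustration of the overall scheme, in which the summed displacement of the neurons stands in for that measure, and `fit` is a hypothetical callable representing the self-organizing mapping.

```python
import numpy as np

def classify_by_deformation(test_points, exemplars, fit):
    """Least-deformation classification (illustrative only).

    test_points: (Ne, 2) array of points of the unknown pattern.
    exemplars:   list of (Ni, 2) arrays, one network per model image.
    fit:         callable (exemplar, test_points) -> final weights;
                 a stand-in for the self-organizing scheme of [7,8].
    """
    def deformation(w0, w1):
        # total displacement of the neurons; only a stand-in for the
        # deformation measure actually used in [7,8]
        return np.linalg.norm(w1 - w0, axis=1).sum()

    scores = [deformation(ex, fit(ex, test_points)) for ex in exemplars]
    return int(np.argmin(scores))   # index of the least-deformed exemplar
```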
5. Proposed active contour model

In the course of exploiting the simplicity and efficiency of the above approach [7,8], we modify it so as to be applicable to the problem of contour extraction. The present model, in common with most of the present ACMs, requires an initial contour (see Section 5.2 below), starting from which it evolves. A neural network isomorphic to this initial contour is constructed and subjected to deformation in order to map onto the nearest salient contour in the image. The correspondence between the salient contour and the network is established by mapping the latter onto the former using the self-organization scheme [22,10]. The steps involved in such a mapping are as follows (a sketch of the whole procedure in code is given after the list):

1. Compute the edge map of the test image.
2. Set the initial contour from where the system has to start, using a suitable initialization scheme (see Section 5.2 below). Choose the region of interest according to the location of the initial contour. (The region of interest is a rectangle enclosing all the points of the initial contour.)
3. Obtain the edge points E = {(x_i, y_i): i = 1, 2, ..., N_e} within the region of interest, where N_e is the number of edge points within the region of interest.
4. Construct a network with N_c neurons, where N_c is the number of points on the initial contour. Each neuron in the network receives two inputs (I_1, I_2). The weights w^i = (w^i_1, w^i_2), i = 1, 2, ..., N_c, corresponding to these two inputs, are initialized to the co-ordinates of the points on the initial contour. In effect, construct a neural network isomorphic to the initial contour.
5. Repeat the following steps a certain number of times (N_iter):
   (i) Select a point p = (u, v) ∈ E randomly, and feed the (x, y) coordinates of the selected point p as inputs (I_1, I_2) to every neuron in the network.
   (ii) Determine the neuron whose weight vector is closest (w.r.t. the Euclidean distance measure) to the input vector, and declare it the winner neuron. If the distance between the winner neuron's weight vector (w^w) and the input vector is greater than a particular threshold T_wd, then go to (i).
   (iii) Update the weights of the neurons in the network using the following rule: for neuron i,

       w^i = w^i + η · e^{−|w − i|/σ} · (p − w^i),    (1)

   where η and σ are the standard learning-rate and neighborhood parameters, and w is the index of the winner neuron.
   (iv) Calculate the neighborhood parameter C_np of the contour as

       C_np = Max{Max(|w^i_x − w^{i+1}_x|, |w^i_y − w^{i+1}_y|): 1 ≤ i ≤ N_c − 1}.    (2)

   If C_np > T_np, the threshold value of the neighborhood constraint parameter, then restore the previous network weights, discarding the present update.
   (v) Vary η and σ according to the following rules:

       σ = σ_init · (σ_fin/σ_init)^{iter/N_iter},
       η = η_init · (η_fin/η_init)^{iter/N_iter},

   where σ_init and σ_fin are the initial and final values of σ; η_init and η_fin are those of η; and iter is the current iteration number.
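A minimal sketch of steps 3–5 in Python, assuming NumPy; the function and parameter names (`fit_contour`, `T_wd`, `T_np`, etc.) are illustrative rather than from the paper, and the exponential neighborhood kernel follows the reconstruction of Eq. (1) given above.

```python
import numpy as np

def fit_contour(init_contour, edge_points, n_iter=500,
                sigma_init=4.0, sigma_fin=0.2, eta_init=0.8, eta_fin=0.005,
                T_wd=4.0, T_np=4.0):
    """Deform a contour-isomorphic network onto nearby edge points.

    init_contour: (Nc, 2) array of initial contour coordinates (the weights).
    edge_points:  (Ne, 2) array of edge coordinates in the region of interest.
    """
    w = init_contour.astype(float).copy()   # one neuron per contour point
    idx = np.arange(len(w))
    for it in range(n_iter):
        # step 5(v): annealing schedules for sigma and eta
        sigma = sigma_init * (sigma_fin / sigma_init) ** (it / n_iter)
        eta = eta_init * (eta_fin / eta_init) ** (it / n_iter)
        p = edge_points[np.random.randint(len(edge_points))]   # step 5(i)
        d = np.linalg.norm(w - p, axis=1)
        win = int(np.argmin(d))             # step 5(ii): winner neuron
        if d[win] > T_wd:                   # winner-distance constraint
            continue
        # step 5(iii), Eq. (1): locally co-operative weight update
        w_new = w + eta * np.exp(-np.abs(idx - win) / sigma)[:, None] * (p - w)
        # step 5(iv), Eq. (2): neighborhood parameter of the updated contour
        C_np = np.max(np.abs(np.diff(w_new, axis=0)))
        if C_np <= T_np:                    # keep the update only if continuous
            w = w_new
    return w
```

The restore rule of step 5(iv) is implemented by computing the tentative weights first and committing them only if the neighborhood parameter of Eq. (2) stays below T_np.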
5.1. Constraints employed in the model

The proposed ACM entails bounds on (i) the winner-distance (WD); and (ii) the neighborhood parameter (NP), which implicitly impose the smoothness constraint. To contrast this with the results of the literature, recall that, in the snake model, the image, internal and external constraint forces are made explicit, and an energy function associated with these forces is minimized to obtain the final contour. The internal forces in the proposed model are implicitly imposed by the constraints mentioned above, and the image forces are taken care of by the input fed to the network. As far as the external constraint force is concerned, it is made implicit by virtue of the fact that the initial contour is provided by higher-level interpretation processes (see Section 5.2). Finding appropriate bounds for WD and NP is a critical step. We now describe the purpose of these constraints (bounds) and their effect on the model's performance.

5.1.1. Constraint on the winner-distance (WD)
The constraint on the winner-distance (WD) is useful in avoiding the influence of edge points which are within the region of interest but are not a part of the nearest salient contour of interest (spurious edge points). In the absence of such a constraint, the neurons "organize" themselves to spurious edges, thereby affecting the proper extraction of the desired contour.

The constraint places a threshold, T_wd, on the WD, controlling the updating or otherwise of the weights of
Fig. 1. Illustration of the WD constraint (clockwise from top-left): (a) initial contour overlaid on the image; (b) edge map of the image showing spurious internal edges; (c) final contour overlaid on the image.
the network: if the distance between the input vector and the winner neuron's weight vector is greater than T_wd, then the weights of the network are not updated. This constraint has, in fact, been made explicit in Step 5(ii) of the above algorithm. The lower the value of T_wd, the greater is the constraint on the updating. In other words, if this parameter is assigned a larger value, the neurons in the network tend to organize themselves with respect to spurious inner points which are at larger distances from the salient contour of interest. On the other hand, if it is too low, the weights will never be updated, in spite of the input point lying on the salient contour. The utility of this constraint is shown in Fig. 1, where the active contour model organizes itself to the ellipse in spite of the spurious edge points within the ellipse.

5.1.2. Constraint on neighborhood
The neighborhood parameter (NP) refers to the maximum of the distances in the x- and y-directions, taken over all the adjacent pairs of points on the contour. Constraining this parameter helps in maintaining the continuity of the contour in the course of its deformation. In the absence of a constraint on this parameter, many neurons tend to organize themselves towards a single point of the input image, leading to discontinuities in the
"nal contour. The threshold parameter on the NP, ¹ , ,. which is essentially the maximum permitted distance between neighboring neurons, is used in Step 5(iv) of the above algorithm (see Eq. (2)). The usefulness of this constraint is illustrated in Fig. 2, where a higher value of ¹ leads to a highly broken contour, while a lower value LN gives a continuous contour. 5.2. Initialization As mentioned earlier, the proposed method requires a rough boundary as the initial contour to start with the deformation process. This initialization can be achieved in a number of ways, depending on the application. We discuss some of them here. For static scenes, the generalized Hough transform technique [13,23] can be employed to initialize the contour, thereby exploiting the e$ciency and globality of Hough transform in the presence of noise and boundary gaps. On the other hand, in an active vision system with movable (and multiple) cameras, two or more images could be acquired, and subjected to optical #ow analysis or image di!erencing techniques, the results of which could be used to initialize the contour. For illustration purposes, in the results presented in the following
Fig. 2. Illustration of the NP's utility (clockwise from top-left): (a) input image; (b) a high value of T_np (=10.5) leads to a broken contour; (c) a low value of T_np (=2.0) leads to a continuous contour.
sections, a user-interface is employed for the initialization (of the contour), as has been done in the literature [2,13] for the same purpose.
6. Implementation and results

The proposed method was implemented in C++ with X11 for the graphical user-interface. The experiments were conducted on an HP9000/715 workstation. For an image of size 128×128, the program takes 5–6 s to arrive at the final contour. Typical values of the important parameters used in the above system are as follows (a hypothetical configuration collecting these ranges is sketched after the list):

• Number of iterations, N_iter = 300–600, depending upon the size and shape of the contour.
• Initial value of σ, σ_init = 3–5.
• Final value of σ, σ_fin = 0.1–0.3.
• Initial value of the learning rate parameter η, η_init = 0.7–0.9.
• Final value of η, η_fin = 0.001–0.01.
• Threshold parameters T_np and T_wd lie in the range 2–5.
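A compact restatement of the ranges above as a configuration table; the dictionary name and the (low, high) tuple layout are illustrative, not part of the paper.

```python
# Parameter ranges reported in Section 6 of the paper; the container and
# its layout are illustrative only.
ACM_PARAM_RANGES = {
    "N_iter":     (300, 600),     # iterations, depending on contour size/shape
    "sigma_init": (3.0, 5.0),     # initial neighborhood parameter
    "sigma_fin":  (0.1, 0.3),     # final neighborhood parameter
    "eta_init":   (0.7, 0.9),     # initial learning rate
    "eta_fin":    (0.001, 0.01),  # final learning rate
    "T_np":       (2.0, 5.0),     # neighborhood threshold
    "T_wd":       (2.0, 5.0),     # winner-distance threshold
}
```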
The selection of the above parameters depends on the application as well as on the images. A discussion of the necessity and the effect of the parameters T_np and T_wd on the model's performance was presented earlier in Section 5.1. Further, the size of the network depends on the nature of the initial contour. This is evident from the way the network is constructed (Section 5): if the initial contour consists of n pixel-points, the size of the network is n.

The parameter σ, which defines the neighborhood relation, describes the local co-operativeness of the weight update in the network. If this parameter is large, the influence of the winner neuron extends to a larger neighborhood, leading to undesirable effects (like many neurons organizing towards a single image point). If this parameter is too low, then only the winner neurons will effectively be updated, depriving the algorithm of the advantages of local co-operativeness and self-organization. The selection of σ_init and σ_fin should be such that, initially, a larger neighborhood is influenced by the winner and, finally, the influence is restricted to the winner neuron alone.

The parameter η, on the other hand, defines the amount of update forced on the weights of the neurons. It is reduced from a higher value to a lower value, with the idea of allowing a greater movement of weights towards edge points in the initial stages (when the weights are far
from them), and a smaller movement of weights towards the end (when the weights are nearer to the edge points).

Now, on the basis of the experimental findings, we summarize some distinct characteristics of the proposed method:

• The method is immune to noise present in the input image. The network can extract the nearest salient contour from a noisy image, as illustrated in Fig. 3, where the contour has been extracted successfully even though the percentage of noise is 20.
• The method can be used to extract salient, open contours from a given noisy image, as shown in Fig. 4.
• The method can extract contours even in the presence of kinks in the initial contour. Fig. 5a shows the initial contour pulled away from the actual salient contour. The final contour is shown in Fig. 5b, where the network has deformed and "snapped" itself appropriately to the actual contour.

6.1. Advantages

On the basis of the examples given above, we summarize the advantages of the proposed ACM approach:

1. It is robust with respect to noise in the given image.
2. There is no need to choose energy functions, since the problem is not cast in an optimization framework.
3. Every point in the contour is extracted, which is of considerable importance in stereo matching and motion tracking.
4. As applied to disparity estimation in stereo image analysis, the approach is believed to be novel. The solution to the correspondence problem is simpler.
5. It is possible to generalize the approach to (i) allow information other than the mere coordinates of edge points (e.g. directional information); and (ii) the classification of contours.

7. Applications of the model

The proposed model, as mentioned earlier, is applicable to localizing and extracting boundaries in the course of segmenting image data (Figs. 1 and 4). In what follows, we present some of the other applications of the model: stereo-vision, bio-medical image analysis and digital image libraries.

7.1. Stereo-vision

The proposed ACM provides a novel technique to efficiently extract and match, point-by-point (for
Fig. 3. Illustration of the robustness of the approach to noise (clockwise from top-left): (a) initial contour overlaid on the image; (b) edge map of the image; (c) final contour overlaid on the image.
Fig. 4. Illustration of the capability of the approach in the extraction of open contours (clockwise from top-left): (a) initial contour overlaid on the image; (b) edge map of the image; (c) final contour overlaid on the image.
Fig. 5. (a) Initial contour overlaid on the image, showing a part of it away from the salient contour. (b) Final contour, illustrating the ability of the model in "snapping" itself to the nearest salient contour.
disparity estimates), the corresponding contours from the left and right images of a stereo pair. We outline below the steps involved in solving this correspondence problem (a sketch in code follows the list). Fig. 6a shows a stereo pair of images which are to be matched for depth extraction.

1. Extract contours from both the left and right images of the stereo pair. (We call them the left and right contours, respectively.)
2. Form a neural network isomorphic to either the left or the right contour; that is, form a network with the weights of the neurons set to the co-ordinates of the contour points. Without loss of generality, we assume that the network is constructed isomorphic to the left contour.
3. Present each point from the right contour to each neuron in the network, and use the updating scheme described for contour extraction in Section 5. Dispense with the WD and NP constraints.
4. When the network converges, it is isomorphic to the right contour. The initial and final weights of a particular neuron will be the corresponding contour points in the left and right images, respectively. Also, assuming epipolar geometry, the difference between the X co-ordinates of the initial and final weights of any particular neuron gives the disparity at that point (which can be used to calculate the depth of the point).

The disparity map is shown in Fig. 6b, in which intensity is directly proportional to the disparity (and inversely proportional to the depth). In the example shown, the disparity map for the cube was obtained by (i) initializing contours for the three surfaces (see Fig. 6a) separately; (ii) calculating the disparity for each of them separately; and (iii) finally merging them together.

Fig. 6. Illustration of the ability of the approach in analyzing stereo images: (a) (top) stereo pair; (b) disparity map.
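A hedged sketch of steps 2–4, assuming the `fit_contour` sketch of Section 5 (or any equivalent self-organizing mapping) is passed in as `fit`; the function name and the use of infinite thresholds to disable the WD/NP gates are illustrative assumptions.

```python
import numpy as np

def stereo_disparity(left_contour, right_edges, fit, **kw):
    """Point-by-point disparity along a contour (illustrative sketch).

    left_contour: (Nc, 2) left-image contour points (the initial weights).
    right_edges:  (Ne, 2) edge points extracted from the right image.
    fit:          the self-organizing mapping, e.g. the fit_contour sketch
                  of Section 5, here run with the WD and NP constraints
                  dispensed with (step 3).
    """
    final = fit(left_contour, right_edges, T_wd=np.inf, T_np=np.inf, **kw)
    # Epipolar geometry assumed (step 4): the disparity at the i-th point is
    # the X-difference between that neuron's initial and final weights.
    return final[:, 0] - left_contour[:, 0]
```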
7.2. Bio-medical image interpretation

Imaging techniques like magnetic resonance imaging (MRI), X-ray computed tomography (CT) and positron emission tomography (PET) provide detailed information regarding the anatomical and physiological function of various parts of the body. The interpretation of the data has been hindered by the inability to relate such information to specific anatomical regions. This is a consequence of the interpretation difficulties that arise due to small variations in anatomy [24]. Because the earlier models for shape are rigid, it is not possible to accommodate these variations for better interpretation.
This can be achieved by employing active contour models, which deform themselves in the process of extracting the boundaries. Furthermore, medical applications like cardiac boundary tracking, tumor volume quantification, cell tracking, etc., require extracting exact shapes in two and three dimensions. These also have been challenging tasks because of the amount of noise inherent in medical images. We have already demonstrated that the proposed approach is noise-tolerant (Section 6). Now, we illustrate the extraction of implicit boundaries from bio-medical images in order to facilitate easy interpretation of anatomical parts. Fig. 7a shows an ultra-sound image of the head, overlaid with the initial contour. Fig. 7b illustrates the ability of the approach in extracting the contour information implicitly present in the image.

7.3. Object retrieval and image libraries

The proposed active contour model can be considered as a deformable template for application to the problem of locating and retrieving an object from a complex image. A solution to this problem is of significance to applications like image database retrieval, object recognition and image segmentation. The proposed approach can be employed in a fashion similar to the one reported by Jain et al. [18]. In this context, it is assumed that
Fig. 7. Illustration of the ability of the approach in analyzing bio-medical images: (a) (top) initial contour overlaid on the image; (b) final contour overlaid on the image.
a priori information is available in the form of an inexact model of the object, which needs to be matched with the object in the input image. Since the proposed contour model yields a continuous set of points as output after the deformation and matching, it is possible to use the model as a deformable template in object matching applications. The weights of the model are initialized with the co-ordinates of the binary template, and deformation is realized in much the same way as described in the algorithm of Section 5. The only difference lies in the search of the parameter space corresponding to scale, shift and rotation of the pattern, with the model initialized by the transformed versions of the binary template. However, the disadvantage of using such an approach (as with the one found in Jain et al. [18]) is the amount of time required in searching the entire parameter space. This can be reduced by the use of a coarse-to-fine matching strategy [18]. Further, the dimension of the parameter space of Jain et al. [18] is high because the deformations are also considered as parametric functions. If the proposed approach is applied to such a problem, the dimension of the parameter space reduces to four, corresponding to rotation, scale and shifts in the X- and Y-directions. This is
a consequence of the fact that the deformation is handled by the active contour model itself.
8. Conclusions

In order to localize salient contours in an image, a new active contour model (ACM), which is a neural network based on self-organization, has been presented. It turns out that the theoretical basis for the proposed model can be traced to the extensive literature on Gestalt perception, in which the principle of psycho-physical isomorphism plays an important role. The main contribution of the proposed model is the exploitation of the principles of spatial isomorphism and self-organization in order to create flexible contours characterizing shapes in images. The deformation in the contour model is effectuated by a locally co-operative and globally competitive self-organizing scheme, which enables the model to cling to the nearest salient contour in the test image. To start this deformation process, the model requires a rough boundary as the initial contour. Various methods for this initialization are discussed. As reported here, the model is semi-automatic, in the sense that a user-interface is needed for this initialization purpose. The effect of the important parameters on the model's performance and the difficulty in choosing them are elaborated. The utility and versatility of the model are illustrated by applying it to the problems of boundary extraction, stereo vision, bio-medical image analysis and digital image libraries.
Acknowledgements

This is a part of the collaborative research project between the Indian Institute of Science and the National University of Singapore on pattern recognition using neural networks. The work was supported in part by an academic research grant (CRP 96/0636) through the Department of Electrical Engineering, National University of Singapore, Singapore.
References

[1] Y.V. Venkatesh, N. Rishikesh, Modeling active contours using neural networks isomorphic to boundaries, Proceedings of the International Conference on Neural Networks (ICNN-97), Houston, TX, USA, Vol. III, 1997, pp. 1669–1672.
[2] M. Kass, A. Witkin, D. Terzopoulos, Snakes: active contour models, Proceedings of the First IEEE Conference on Computer Vision, 1987, pp. 259–268.
[3] F. Leymarie, M.D. Levine, Tracking deformable objects in the plane using an active contour model, IEEE Trans. Pattern Anal. Machine Intell. 15 (6) (1993) 617–634.
[4] L.D. Cohen, I. Cohen, Finite-element methods for active contour models and balloons for 2-D and 3-D images, IEEE Trans. Pattern Anal. Machine Intell. 15 (11) (1993).
[5] A.L. Yuille, P.W. Hallinan, D.S. Cohen, Feature extraction from faces using deformable templates, Int. J. Comput. Vision 8 (2) (1992) 133–144.
[6] K.M. Lam, H. Yan, Locating and extracting the eye in human face images, Pattern Recognition 29 (5) (1996) 771–779.
[7] C.N.S. Ganesh Murthy, Y.V. Venkatesh, Pattern encoding and classification by neural networks, Neural Computing, submitted for publication.
[8] K. Shanmukh, C.N.S. Ganesh Murthy, Y.V. Venkatesh, Classification using self-organizing networks spatially isomorphic to patterns, International Conference on Robotic Vision for Industry and Automation, ROVPIA'96, Malaysia, 1996, pp. 298–303.
[9] R.J. Herrnstein, E.G. Boring (Eds.), A Source Book in the History of Psychology, Harvard University Press, Cambridge, MA, 1966.
[10] C.N.S. Ganesh Murthy, K. Shanmukh, Y.V. Venkatesh, A new method for pattern recognition using self-organizing networks, Proceedings of the Second Asian Conference on Computer Vision (ACCV), Singapore, Vol. 3, 1995, pp. 111–115.
[11] A.A. Amini, S. Tehrani, T.E. Weymouth, Using dynamic programming for minimizing the energy of active contours in the presence of hard constraints, Proceedings of the International Conference on Computer Vision, 1988, pp. 95–99.
[12] L.D. Cohen, On active contour models and balloons, Computer Vision, Graphics Image Process.: Image Understanding 53 (2) (1991) 211–218.
[13] K.F. Lai, R.T. Chin, Deformable contours: modeling and extraction, IEEE Trans. Pattern Anal. Machine Intell. 17 (11) (1995) 1084–1090.
[14] K.F. Lai, R.T. Chin, On regularization and initialization of the active contour models (snakes), Proceedings of the First Asian Conference on Computer Vision, 1993, pp. 542–545.
[15] G. Chiou, J.N. Hwang, A neural network-based stochastic active contour model (NNS-SNAKE) for contour finding of distinct features, IEEE Trans. Image Process. 4 (10) (1995) 1407–1416.
[16] L.H. Staib, J.S. Duncan, Boundary finding with parametrically deformable models, IEEE Trans. Pattern Anal. Machine Intell. 14 (11) (1992) 1061–1075.
[17] R. Malladi, J.A. Sethian, B.C. Vemuri, Shape modeling with front propagation: a level set approach, IEEE Trans. Pattern Anal. Machine Intell. 17 (2) (1995) 158–174.
[18] A.K. Jain, Y. Zhong, S. Lakshmanan, Object matching using deformable templates, IEEE Trans. Pattern Anal. Machine Intell. 18 (3) (1996) 267–278.
[19] E.G. Boring, Sensation and Perception in the History of Experimental Psychology, Appleton-Century-Crofts, New York, 1942.
[20] R. Sekuler, R. Blake, Perception, Alfred A. Knopf, New York, 1985.
[21] M.J. Tovee, An Introduction to the Visual System, Cambridge University Press, England, 1996.
[22] T. Kohonen, Self-Organization and Associative Memory, Springer, Berlin, 1989.
[23] D.H. Ballard, Generalizing the Hough transform to detect arbitrary shapes, Pattern Recognition 13 (2) (1981) 111–122.
[24] M.I. Miller, G.E. Christensen, Y. Amit, U. Grenander, Mathematical textbook of deformable neuroanatomies, Proceedings of the National Academy of Sciences, USA, Vol. 90, 1993, pp. 11944–11948.
About the Author – Y.V. VENKATESH received his Ph.D. from the Indian Institute of Science for a dissertation on the stability analysis of feedback systems. He was an Alexander von Humboldt Fellow at the Universities of Karlsruhe, Freiburg and Erlangen, Germany; a National Research Council Fellow (USA) at the Goddard Space Flight Center, Greenbelt, Maryland; and a Visiting Fellow at the Australian National University, to name a few. His research monograph on the stability and instability analysis of linear-nonlinear time-varying feedback systems has appeared in the Springer-Verlag Physics Lecture Notes Series. His present work is on signal representation using wavelet-link arrays, and reconstruction from partial information (like zero-crossings). He is a professor at the Indian Institute of Science, Bangalore, India, and currently the Chairman of the Division of Electrical Sciences. He is a fellow of the Indian National Science Academy, the Indian Academy of Sciences and the Indian Academy of Engineering.

About the Author – N. RISHIKESH obtained his Bachelor of Engineering at Madurai Kamaraj University, Madurai, India, in 1995. He completed his Master of Science (Research) in the Department of Electrical Engineering, Indian Institute of Science, Bangalore, India, in 1997. Currently, he is a doctoral student working on biological information processing. His research interests include neural networks, computer vision and computational neuroscience.
Pattern Recognition 33 (2000) 1251–1259
A genetic clustering algorithm for data with non-spherical-shape clusters☆

Lin Yu Tseng*, Shiueng Bien Yang

Department of Applied Mathematics, National Chung Hsing University, Taichung, Taiwan 402, People's Republic of China

Received 13 October 1998; received in revised form 10 March 1999; accepted 17 March 1999
Abstract

In solving the clustering problem, traditional methods, for example the K-means algorithm and its variants, usually ask the user to provide the number of clusters. Unfortunately, the number of clusters is in general unknown to the user. The traditional neighborhood clustering algorithm usually needs the user to provide a distance d for the clustering. This d is difficult to decide because some clusters may be compact but others may be loose. In this paper, we propose a genetic clustering algorithm for clustering data whose clusters are not of spherical shape. It can automatically cluster the data according to the similarities and automatically find the proper number of clusters. Experimental results are given to illustrate the effectiveness of the genetic algorithm. © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Clustering; Genetic clustering algorithm; Non-spherical-shape clusters
1. Introduction

The clustering problem is defined as the problem of classifying a collection of objects into a set of natural clusters without any a priori knowledge. Over the years, many clustering methods have been proposed by many researchers; for example, see Refs. [1–6]. These methods can basically be classified into two categories: hierarchical and non-hierarchical. The hierarchical methods can be further divided into the agglomerative methods and the divisive methods. The agglomerative methods merge together the most similar clusters at each level, and the merged clusters will remain in the same cluster at all higher levels. In the divisive methods, initially the set of all objects is viewed as a cluster, and at each level some clusters are divided into smaller clusters. There are also many
☆ This research work was partially supported by the National Science Council of ROC under the contract NSC 87-2213-E-005-002.
* Corresponding author. Tel.: +886-4-2874020; fax: +886-4-2873028.
E-mail address:
[email protected] (L.Y. Tseng).
non-hierarchical methods. Among them, the K-means algorithm is an important one. It is an iterative hill-climbing algorithm, and the solution obtained depends on the initial clustering. Although the K-means algorithm has been applied to many practical clustering problems successfully, it is shown in Ref. [7] that the algorithm may fail to converge to a local minimum under certain conditions. In Ref. [8], a branch and bound algorithm was proposed to find the globally optimum clustering. However, it might take much computation time. In Refs. [9,10], simulated annealing algorithms for the clustering problem were proposed. These algorithms may find a globally optimum solution under some conditions. Most of these clustering algorithms require the user to provide the number of clusters as an input. But the user in general has no idea about the number of clusters. Hence, the user is forced to try different numbers of clusters when using these clustering algorithms. This is tedious, and the clustering result may be poor, especially when the number of clusters is large and not easy to guess. The K-means algorithm is also not suitable for clustering data whose clusters are not of spherical shape. In Ref. [11], a neighborhood clustering algorithm based on the mean distance from an object to its nearest
neighbor was proposed. It can cluster this kind of data. But, just like other neighborhood clustering methods, the threshold distance for grouping objects together is difficult to decide. Some papers, for example Refs. [12,13], have been devoted to this topic. In this paper, we propose a genetic clustering algorithm for data whose clusters may not be of spherical shape. Since the genetic algorithm is good at searching, the clustering algorithm will automatically find the proper number of clusters and classify the objects into these clusters at the same time.

The remaining part of the paper is organized as follows. In Section 2, the basic concept of the genetic approach is introduced. In Section 3, the genetic clustering algorithm for data whose clusters may not be of spherical shape is described. The experimental results are given in Section 4, and we conclude the paper in Section 5.
2. The basic concept of the genetic strategy

The genetic strategy consists of an initialization step and iterative generations. In each generation, there are three phases, namely, the reproduction phase, the crossover phase and the mutation phase. In the initialization step, a set of strings is randomly generated. This set of strings is called the population. Each string consists of 0's and 1's. The meanings of the strings in the algorithm will be described in Section 3. After the initialization step, there is an iteration of generations. The user may specify the number of generations that he/she wants the genetic algorithm to run. After each generation, a set of strings with better fitness will be obtained, and a clustering will thus be derived. The genetic algorithm will run the specified number of generations and retain the best clustering.

The three phases in each generation are introduced in the following (a sketch of the three operators in code is given after this paragraph). In the reproduction phase, the fitness of each string is calculated. The calculation of the fitness is the most important part of our algorithm. After the calculation of the fitness for each string in the population, the reproduction operator is implemented by using a roulette wheel with slots sized according to fitness. In the crossover phase, strings are chosen in pairs. For each chosen pair, two random numbers are generated to decide which pieces of the strings are to be interchanged. If the length of the string is n, each random number is an integer in [1, n]. For example, if the two random numbers are 2 and 5, position 2 to position 5 of this pair of strings are interchanged. For each chosen pair of strings, the crossover operator is applied with probability p_c. In the mutation phase, bits of the strings in the population are chosen with probability p_m. Each chosen bit is changed from 0 to 1 or from 1 to 0. In our experiments, we choose p_c and p_m to be 0.8 and 0.1, respectively.
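A minimal sketch of the three operators, assuming strings are stored as rows of an (N, m) NumPy integer array and that fitness values are non-negative; the function names and the fixed random generator are illustrative. Applying `crossover` to each chosen pair with probability p_c = 0.8 is left to the caller.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative; any seed will do

def roulette_select(pop, fitness):
    # reproduction: a roulette wheel with slots sized according to fitness
    p = fitness / fitness.sum()
    return pop[rng.choice(len(pop), size=len(pop), p=p)]

def crossover(a, b):
    # two random cut positions in [1, n]; the enclosed piece is interchanged,
    # e.g. positions 2 and 5 swap positions 2..5 of the pair
    n = len(a)
    i, j = sorted(rng.integers(1, n + 1, size=2))
    a2, b2 = a.copy(), b.copy()
    a2[i - 1:j], b2[i - 1:j] = b[i - 1:j], a[i - 1:j]
    return a2, b2

def mutate(s, p_m=0.1):
    # each bit is chosen with probability p_m and flipped
    flip = rng.random(len(s)) < p_m
    return np.where(flip, 1 - s, s)
```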
3. The genetic clustering algorithm for data with non-spherical-shape clusters

There are data whose clusters are not of spherical shape. Some examples are given in Figs. 1–3. In this section, we describe a genetic clustering algorithm, CLUSTERING, for these kinds of data. Let there be n objects, O_1, O_2, ..., O_n. The algorithm CLUSTERING consists of two stages. The first stage consists of the following steps.

Step 1: For each object O_i, find the distance between O_i and its nearest neighbor. That is,

    d_NN(O_i) = min_{j ≠ i} ‖O_j − O_i‖,  where ‖O_j − O_i‖ = (Σ_q (O_jq − O_iq)²)^{1/2}.

Step 2: Compute d_av, the average of the nearest-neighbor distances, as follows:
$$d_{av} = \frac{1}{n} \sum_{i=1}^{n} d_{NN}(O_i). \quad \text{Let } d = u \cdot d_{av}.$$
(d is decided by the parameter u, and u is empirically chosen to be 1.5.)

Step 3: View the n objects as nodes of a graph and compute the $n \times n$ adjacency matrix A as follows:
$$A(i, j) = \begin{cases} 1 & \text{if } \|O_i - O_j\| \le d, \\ 0 & \text{otherwise,} \end{cases}$$
where $1 \le j \le i \le n$.

Step 4: Find the connected components of this graph. Let these connected components be denoted by $C_1, C_2, \ldots, C_m$.
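This first stage is straightforward to implement. The sketch below is ours, not the authors' code; it assumes the objects are the rows of a NumPy array and uses SciPy's connected-components routine in place of an explicit scan of the adjacency matrix.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

def initial_clusters(X, u=1.5):
    """Stage one of CLUSTERING: threshold nearest-neighbour distances,
    then take the connected components of the resulting graph as the
    initial clusters.  X is an (n, p) array; returns (m, labels)."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))   # all pairwise ||O_i - O_j||
    np.fill_diagonal(dist, np.inf)            # exclude self-distances
    d_nn = dist.min(axis=1)                   # d_NN(O_i) for each object
    d = u * d_nn.mean()                       # d = u * d_av
    adjacency = dist <= d                     # A(i, j)
    return connected_components(adjacency, directed=False)
```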
The connected components $C_1, C_2, \ldots, C_m$ obtained in the first stage are taken as the initial clusters of the second stage. Basically, the second stage is a genetic algorithm which merges some of these $C_i$'s if they are close enough to one another. We define the $m \times m$ distance matrix D to specify the distance between each pair of clusters $C_i$ and $C_j$:
$$D(i, j) = \min_{O_p \in C_i,\; O_q \in C_j} \|O_p - O_q\|.$$
The initialization step and the three phases of each generation of this genetic algorithm are described in the following.

Initialization step: A population of N strings is randomly generated; in our experiments, N is equal to 50. The length of each string is m, the number of initial clusters obtained in the first stage. Each string represents a subset of $\{C_1, C_2, \ldots, C_m\}$: if $C_i$ is in this subset, the ith position of the string is 1; otherwise, it is 0.
Fig. 1. The "rst set of test data and the clustering results. (a) The original test data. (b) The test data with noises removed. (c) 8 clusters. (d) 7 clusters. (e) 6 clusters. (f) 5 clusters. (g) 4 clusters. (h) 3 clusters. (i) 2 clusters. (j) 4 clusters obtained by applying the genetic algorithm to the test data. (k) 4 clusters obtained by applying K-means algorithm.
For example, suppose string $R_1$ represents the subset $\{C_1, C_2, C_3\}$ and string $R_2$ represents the subset $\{C_1, C_3, C_4\}$; then $R_1$ is 1110...0 and $R_2$ is 1011...0. For each string $R_i$ in the population, two sets $U_i$ and $\bar{U}_i$ are defined as follows:
$$U_i = \{\, j \mid \text{the } j\text{th bit of } R_i \text{ is } 1 \,\}, \qquad \bar{U}_i = \{\, j \mid \text{the } j\text{th bit of } R_i \text{ is } 0 \,\}.$$
That is, $U_i$ contains those indices at which $R_i$ has bit 1, and $\bar{U}_i$ contains those indices at which $R_i$ has bit 0. These two sets are used to define the intra-distance $D_{intra}$ and the inter-distance $D_{inter}$ in the following. Each string $R_i$ represents a subset of $\{C_1, C_2, \ldots, C_m\}$. We define $D_{intra}$ to represent the intra-distance among the clusters in this subset, and $D_{inter}$ to represent the inter-distance between this subset and the set of all other clusters not in this subset.
Fig. 2. The second set of test data and the clustering results. (a) The original test data. (b) 4 clusters. (c) 3 clusters. (d) 2 clusters.
$$D_{intra}(R_i) = \max_{j \in U_i}\; \min_{k \in U_i,\, k \neq j} D(j, k), \qquad D_{inter}(R_i) = \min_{j \in U_i,\, k \in \bar{U}_i} D(j, k).$$
If $R_i$ contains only 0's, both $D_{intra}(R_i)$ and $D_{inter}(R_i)$ are defined to be 0; likewise, if $R_i$ contains only one 1, both are defined to be 0.
Some explanations may be helpful in understanding the definitions of $D_{intra}(R_i)$ and $D_{inter}(R_i)$. Suppose $R_i$ represents $\{C_1, C_2, C_3\}$, a subset of $\{C_1, C_2, \ldots, C_m\}$; for each $C_j$ in the subset there is a $C_k$ in the subset that is nearest to it. Suppose Fig. 4(a) depicts these three clusters; for each such pair of nearest clusters there is a distance, and $D_{intra}$ is simply the maximum of all these distances. In Fig. 4(a), $D_{intra}(R_i)$ is D(2, 3). Therefore, $D_{intra}(R_i)$ is used to measure the nearness of the clusters in the subset represented by $R_i$. As indicated in Fig. 4(b), suppose the clusters outside this subset that are nearest to $C_1$, $C_2$ and $C_3$ are as shown; then $D_{inter}(R_i)$ is defined to be the smallest of these three distances, that is, D(3, 5). So $D_{inter}(R_i)$ is used to measure the degree of separation between the subset represented by $R_i$ and the other clusters.
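Computationally, both quantities come straight from the cluster distance matrix D. The sketch below is ours, with strings again as 0/1 text; the case of a string with no 0's (empty $\bar{U}_i$), which the paper does not discuss, is treated here as $D_{inter} = 0$.

```python
def intra_inter(D, bits):
    """D_intra and D_inter of a merge candidate over the m initial clusters.
    D is the (m, m) symmetric matrix of pairwise cluster distances D(j, k);
    bits is a 0/1 string of length m."""
    inside = [j for j, b in enumerate(bits) if b == '1']
    outside = [j for j, b in enumerate(bits) if b == '0']
    if len(inside) <= 1:                 # only 0's, or a single 1: both are 0
        return 0.0, 0.0
    # for each member, distance to its nearest other member; take the maximum
    d_intra = max(min(D[j][k] for k in inside if k != j) for j in inside)
    # smallest distance from any member to any non-member
    d_inter = min((D[j][k] for j in inside for k in outside), default=0.0)
    return d_intra, d_inter
```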
Fig. 3. The third set of test data and the clustering result. (a) The third set of test data. (b) The clustering result.
Fig. 4. An example to illustrate the definitions of $D_{intra}$ and $D_{inter}$.
de"ned to be the smallest one of these three distances, that is, D(3, 5). So D (R ) is used to measure the degree G of separation among the subset represented by R and G other clusters. Reproduction phase: The "tness of the string R is G de"ned as follows: SCORE(R )"D (R )Hw!D (R ), G G G where w is a weight. If the value of w is small, we emphasize the importance of D (R ). That is, only clus G ters very near to one another are merged. This tends to produce more clusters and each cluster tends to be com-
If the value of w is chosen to be large, we emphasize the importance of $D_{inter}(R_i)$; that is, clusters not very near to one another may be merged in order to make the distances among the merged ones larger. This tends to produce fewer clusters, and each cluster tends to be loose. According to our experience, the value of w is within [1, 3]. After the calculation of the fitness of each string in the population, the reproduction operator is implemented by using a roulette wheel with slots sized according to fitness; that is, $R_i$ is reproduced with probability $SCORE(R_i) / \sum_{j=1}^{N} SCORE(R_j)$.

Crossover phase: For each chosen pair of strings, two random numbers in [1, m] are generated to decide which pieces of the strings are to be interchanged. For example, suppose $R_i = 101001\ldots0$, $R_j = 110011\ldots0$, and the two random numbers chosen are 3 and 6; then position 3 to position 6 of the two strings are interchanged, and the final $R_i$ and $R_j$ are $100011\ldots0$ and $111001\ldots0$, respectively. For each chosen pair of strings, the crossover operator is applied with probability $p_c$, which is equal to 0.8 in our experiments.

Mutation phase: Bits of the strings in the population are chosen with probability $p_m$, which is equal to 0.1 in our experiments. Each chosen bit is changed from 0 to 1 or from 1 to 0. For example, suppose $R_i = 101001\ldots0$ and the second bit of $R_i$ is chosen for mutation; then $R_i = 111001\ldots0$ after the mutation phase.

In each generation of this genetic algorithm, what we really want is not the single string with the best fitness but a set of strings with better fitness. This set represents a good merging method for $C_1, C_2, \ldots, C_m$. By applying this good merging method to merge some of the $C_i$'s, a good clustering will be obtained.
The algorithm Merge-Sets-Finding, which consists of the following four steps, is used to find the sets of $C_i$'s that are to be merged. This algorithm is executed after the calculation of the fitness of each string and before the application of the reproduction operator in the reproduction phase.

Step 1: Sort the fitnesses of the strings in non-increasing order. For the sake of brevity, let us assume $SCORE(R_1) \ge SCORE(R_2) \ge \cdots \ge SCORE(R_N)$. Set $i = 1$ and $U = \emptyset$.

Step 2: Choose $R_i$. Let $V_i = \{C_j \mid j \in U_i\}$, where $U_i = \{j \mid \text{the } j\text{th bit of } R_i \text{ is } 1\}$ as defined earlier in this section. The clusters in $V_i$ are to be merged. Set $U = U \cup U_i$.

Step 3: Choose the smallest $l > i$ such that $U_l \cap U = \emptyset$. If no such l exists, go to Step 4; otherwise set $i = l$ and go to Step 2.

Step 4: End.

An example illustrating the Merge-Sets-Finding algorithm is given as follows. Suppose $SCORE(R_1) \ge SCORE(R_2) \ge \cdots \ge SCORE(R_N)$, $R_1$ represents the subset $\{C_1, C_2, C_3\}$, $R_2$ represents a subset containing at least one of $C_1, C_2, C_3$, $R_3$ represents the subset $\{C_4, C_5\}$, and each of $R_4$ to $R_N$ represents a subset containing at least one of $C_1, C_2, C_3, C_4$ and $C_5$. By first choosing $R_1$, the clusters $C_1, C_2, C_3$ are merged. Since the subset represented by $R_2$ contains a cluster that is already in the subset represented by $R_1$, $R_2$ is discarded. After that, $R_3$ is considered; the subset represented by $R_3$ contains no clusters that are already merged, hence the clusters $C_4$ and $C_5$ in this subset are merged. Since each of $R_4$ to $R_N$ represents a subset containing at least one cluster that is already merged, they are all discarded.
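The scan-and-skip structure of Merge-Sets-Finding translates directly into code. The sketch below is our reading of it, with strings as 0/1 text and the accepted merge groups returned as lists of 0-based cluster indices.

```python
def merge_sets_finding(strings, scores):
    """Accept strings in non-increasing fitness order, skipping any
    string whose subset overlaps a subset that was already accepted."""
    order = sorted(range(len(strings)), key=scores.__getitem__, reverse=True)
    used, merges = set(), []
    for i in order:
        subset = {j for j, b in enumerate(strings[i]) if b == '1'}
        if subset and not subset & used:   # Step 3: U_l must be disjoint from U
            merges.append(sorted(subset))
            used |= subset
    return merges
```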
The time complexity is analyzed as follows. In the first stage, Step 1 takes $O(n^2)$ time to calculate the distances between pairs of objects and to find, for each object, the minimum. Step 2 takes O(n) time to compute the average. Step 3 takes $O(n^2)$ time to derive the adjacency matrix, and Step 4 also takes $O(n^2)$ time to find the connected components by scanning the adjacency matrix. Therefore, the first stage spends $O(n^2)$ time. Before the second stage, we need to calculate D(i, j) for all $C_i$, $C_j$; this takes $O(n^2)$ time in the worst case. The time complexity of the second stage is dominated by the calculation of $D_{intra}(R_i)$ and $D_{inter}(R_i)$, which takes $O(Nm^2)$ time in the worst case. Suppose the genetic algorithm is asked to run k generations; then the time complexity of the second stage is $O(kNm^2)$. Hence, the time complexity of the whole clustering algorithm is $O(n^2 + kNm^2)$. In our experiments, k equals 20 and N equals 50. In general, m is also a small number, so these three numbers can almost be taken as constants, and the time complexity is then $O(n^2)$.
4. Experiments

Three sets of data are used to test the effectiveness of the clustering algorithm. Fig. 1(a) shows the original test data; there is some noise in it. After computing the average $d_{av}$ of the nearest-neighbor distances, any object whose distance to its nearest neighbor exceeds $2 d_{av}$ is taken as noise and discarded. Fig. 1(b) shows the test data with the noise removed. As mentioned before, u is chosen to be 1.5 by experience. In our experiments on the first and second sets of data, two other values of u, namely 1.2 and 2, are also used to illustrate that, with a suitable choice of the value of w, a good clustering can be found with all three values of u. This means that u may be chosen from an interval, e.g. [1.2, 2], and the exact value of u is not crucial to the clustering result. In Table 1, if u is 1.2 there are eight initial clusters, shown in Fig. 1(c); if u is 1.5 or 2 there are five initial clusters, shown in Fig. 1(f). As shown in Table 1, several values of w are chosen from [1, 3] by binary search; for each value of w, the number of clusters N(w) and the maximum intra-distance D(w) are recorded. For example, if u is 1.2 and w is 2, there are four clusters, as shown in Fig. 1(g), and the maximum intra-distance is 12.9. A good clustering can be found by conducting this binary search. The criterion for selecting the good clustering is as follows: find the smallest w such that $N(w) = N(w') + 1$ and $D(w')/D(w) \ge 2$, where $w'$ is the value tried next after w and larger than w; then the clustering obtained by using this w is the good clustering.
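As a worked illustration of this criterion (our reading of it, with $w'$ taken as the next larger value actually tried), the sketch below reproduces the u = 1.5 row of Table 1: the rule fires at w = 1.5, where N drops from 4 to 3 while D jumps from 12.9 to 29.1 (a ratio above 2), selecting the four-cluster result.

```python
def pick_good_w(trials, ratio=2.0):
    """trials: list of (w, N(w), D(w)) for the w values actually tried,
    sorted by increasing w.  Returns the smallest w at which moving to
    the next tried value w' merges exactly one pair of clusters
    (N(w) = N(w') + 1) while D(w')/D(w) >= ratio; None if no such w."""
    for (w, n, d), (_, n2, d2) in zip(trials, trials[1:]):
        if n == n2 + 1 and d2 / d >= ratio:
            return w
    return None

# u = 1.5 row of Table 1 (w = 1.25 and w = 2.5 were not tried):
print(pick_good_w([(1, 5, 8.8), (1.5, 4, 12.9), (1.75, 3, 29.1),
                   (2, 2, 31.2), (3, 2, 31.2)]))   # -> 1.5
```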
Table 1
Numbers of clusters and the maximum intra-distances of the first test set for different values of u and w. Each entry is (N(w), D(w)); * denotes no recorded entry.

u     Initial components   w=1        w=1.25   w=1.5      w=1.75     w=2        w=2.5      w=3
1.2   8                    (8, 6.5)   *        (6, 8.4)   (5, 8.8)   (4, 12.9)  (3, 29.1)  (2, 31.2)
1.5   5                    (5, 8.8)   *        (4, 12.9)  (3, 29.1)  (2, 31.2)  *          (2, 31.2)
2     5                    (5, 8.8)   *        (4, 12.9)  (3, 29.1)  (2, 31.2)  *          (2, 31.2)
Table 2
Numbers of clusters and the maximum intra-distances of the second test set for different values of u and w. Each entry is (N(w), D(w)); * denotes no recorded entry.

u     Initial components   w=1        w=1.5      w=2        w=2.5   w=3
1.2   76                   (76, 6.9)  (3, 9.5)   (2, 19.5)  *       (2, 19.5)
1.5   4                    (4, 8.9)   (3, 9.5)   (2, 19.5)  *       (2, 19.5)
2     3                    (3, 9.5)   *          *          *       (2, 19.5)

If there is no such case, we find all w in [1, 3] such that $N(w) = N(w') + 1$ and $D(w')/D(w) \ge 1.5$; all clusterings corresponding to these w's are output. If there are still no such cases, the clustering obtained by choosing 3 as the value of w is output. Figs. 1(c)-(i) show clusterings with different numbers of clusters. In the first test data, the good clustering has four clusters, as shown in Fig. 1(g). If we apply only the nearest-neighbor clustering method (with u = 2), we obtain a clustering with five clusters, as shown in Fig. 1(f), which is not very proper. If we apply only the genetic algorithm (i.e. the second stage of our clustering algorithm) directly to the test data, four clusters are obtained, as shown in Fig. 1(j); this clustering is also not good. Applying the K-means algorithm with the number of clusters equal to four gives the clustering shown in Fig. 1(k). This result is bad because the K-means algorithm is not suitable for non-spherical-shape clusters. Fig. 2 and Table 2 show the second test data and its clustering results; a good clustering is the case of three clusters shown in Fig. 2(c). Fig. 3 shows the third set of test data and its clustering result.
5. Concluding remarks

A genetic clustering algorithm, CLUSTERING, has been proposed. CLUSTERING is a clustering algorithm for data whose clusters may not be of spherical shape. Unlike the K-means algorithm, which needs the user to provide the number of clusters, CLUSTERING can automatically search for a proper number of clusters. By binary searching a proper interval for the value of w, a proper number of clusters and a good clustering can be found. In general, a natural and steady clustering corresponds to a significantly long interval of w values. The traditional neighborhood clustering algorithm usually needs the user to provide a distance d for the clustering, but a unique d for a set of objects often causes problems, because there may be natural clusters in which the objects are not close to one another within the distance d. CLUSTERING avoids this kind of problem by processing the data in a global view.
6. Summary

The clustering problem is very important and has attracted much attention from many researchers. Some traditional methods, for example the K-means algorithm and its variants, usually ask the user to provide the number of clusters. Unfortunately, the number of clusters is in general unknown to the user, who therefore usually has to try several times in order to get a good clustering. The traditional neighborhood clustering algorithm usually needs the user to provide a distance d for the clustering; this d is difficult to decide because some clusters may be compact while others may be loose. In this paper, a genetic clustering algorithm, CLUSTERING, is proposed for data whose clusters may not be of spherical shape. Unlike the K-means algorithm, CLUSTERING can automatically search for a proper number of clusters. By binary searching a proper interval for the value of w, the weighting factor between the inter-distance and the intra-distance of the clusters, a proper number of clusters and a good clustering can be found. Several experiments are conducted to illustrate the effectiveness of the genetic clustering algorithm.
References

[1] M.R. Anderberg, Cluster Analysis for Applications, Academic Press, New York, 1973.
[2] J.T. Tou, R.C. Gonzalez, Pattern Recognition Principles, Addison-Wesley, Reading, MA, 1974.
[3] J.A. Hartigan, Clustering Algorithms, Wiley, New York, 1975.
[4] K.S. Fu, Communication and Cybernetics: Digital Pattern Recognition, Springer, Berlin, 1980.
[5] R. Dubes, A.K. Jain, Clustering Methodology in Exploratory Data Analysis, Academic Press, New York, 1980.
[6] P.A. Devijver, J. Kittler, Pattern Recognition: A Statistical Approach, Prentice-Hall, London, 1982.
[7] S.Z. Selim, M.A. Ismail, K-means-type algorithms: a generalized convergence theorem and characterization of local optimality, IEEE Trans. Pattern Anal. Mach. Intell. 6 (1) (1984) 81-87.
[8] W.L.G. Koontz, P.M. Narendra, K. Fukunaga, A branch and bound clustering algorithm, IEEE Trans. Comput. C-24 (9) (1975) 908-915.
[9] S.Z. Selim, K.S. Al-Sultan, A simulated annealing algorithm for the clustering problem, Pattern Recognition 24 (10) (1991) 1003-1008.
[10] R.W. Klein, R.C. Dubes, Experiments in projection and clustering by simulated annealing, Pattern Recognition 22 (2) (1989) 213-220.
[11] P.-Y. Yin, L.-H. Chen, A new non-iterative approach for clustering, Pattern Recognition Letters 15 (2) (1994) 125-133.
[12] G.C. Osbourn, R.F. Martinez, Empirically defined regions of influence for clustering analyses, Pattern Recognition 28 (11) (1995) 1793-1806.
[13] P.S. Stephen, Threshold validity for mutual neighborhood clustering, IEEE Trans. Pattern Anal. Mach. Intell. 15 (1) (1993) 89-92.
About the Author - LIN YU TSENG received the B.S. degree in mathematics from the National Taiwan University, Taiwan, in 1975, and the M.S. degree in computer science from the National Chiao Tung University, Taiwan, in 1978. After receiving the M.S. degree, he worked in industry and taught at the university for several years. He received the Ph.D. degree in computer science from National Tsing Hua University, Taiwan, in 1988. He is presently a Professor at the Department of Applied Mathematics, National Chung Hsing University. His research interests include pattern recognition, document processing, speech coding and recognition, neural networks and algorithm design.

About the Author - SHIUENG BIEN YANG received the B.S. degree in 1993 from the Department of Applied Mathematics, National Chung Hsing University. He is now working towards the Ph.D. degree in the same department. His research interests include pattern recognition, speech coding and recognition, image coding and neural networks.
Pattern Recognition 33 (2000) 1261-1262
A slant removal algorithm

E. Kavallieratou*, N. Fakotakis, G. Kokkinakis

Wire Communications Laboratory, University of Patras, 26500 Patras, Greece

Received 29 September 1999; accepted 7 October 1999

* Corresponding author. Tel.: +30-61-991722; fax: +30-61-991855. E-mail address: [email protected] (E. Kavallieratou).
1. Introduction

A robust optical character recognition (OCR) system has to be able to cope with slanted words. Such words may dramatically affect the performance of the segmentation algorithms. Even in cases where segmentation is not a prerequisite, the training procedure of the recognition module is more difficult and complicated when attempting to cover slanted characters or words. Watanabe [1] conducted comparative experiments showing that slant normalization minimizes the recognition error. The majority of recent OCR systems contain a preprocessing stage dealing with slant correction. This stage is usually located before the segmentation module, if one exists, or just before the recognition stage otherwise. The most commonly used method for slant estimation is the calculation of the average angle of near-vertical strokes [2-4]. This approach requires the detection of the edges of the characters, and its accuracy depends on the characters included in the word. Shridar [5] presents two more methods for slant estimation and correction: the first uses the vertical projection profile, while the second makes use of the chain code method. In this paper, we present an accurate slant removal algorithm that can easily be adapted to any system. We make use of the projection profile technique, as above, and the Wigner-Ville distribution. After a short description of the algorithm, some experimental results are given and some conclusions are drawn.
2. The algorithm

The proposed approach employs the projection profile technique and the Wigner-Ville distribution (WVD).
In particular, the intensity of the WVD of the vertical histogram of a word serves as the criterion for its slant estimation. The vertical histogram of a non-slanted word presents higher peaks and more intense alternations between peaks and dips than the histogram of the same word at any other angle. In a word without slant, the gaps between characters are deeper, even if the characters are connected, and, as a consequence, the vertical position of the strokes is represented by higher peaks in the vertical histogram of the word. The WVD is used to capture these alternations and detect the slant of the word. The WVD is a time-frequency distribution of Cohen's class and, owing to its simplicity, is the most popular distribution of the class. The basic steps of our approach are as follows:

1. The word is sheared to the left and right over a range of -45 to +45 degrees with respect to its original position.
2. The vertical projections are extracted and their WVDs are calculated.
3. Finally, the position where the corresponding WVD presents the maximum intensity is selected. Its deviation from the original position is the estimated slant.
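A rough sketch of these three steps is given below. It is our illustration rather than the authors' implementation: the word is assumed to be a binary NumPy array (ink pixels set to 1), the shear is realized by per-row shifts, the WVD is a basic unoptimized discrete implementation, and "maximum intensity" is read here as the peak magnitude of the distribution, one plausible reading that the paper does not pin down.

```python
import numpy as np

def pseudo_wvd(x):
    """Basic discrete Wigner-Ville distribution of a real 1-D signal."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    kernel = np.zeros((n, n))
    for t in range(n):
        tau_max = min(t, n - 1 - t)
        tau = np.arange(-tau_max, tau_max + 1)
        kernel[t, tau % n] = x[t + tau] * x[t - tau]  # instantaneous autocorrelation
    return np.real(np.fft.fft(kernel, axis=1))        # FFT over the lag variable

def shear(word, angle_deg):
    """Shear a binary word image horizontally by angle_deg (row-wise shifts)."""
    h, w = word.shape
    t = np.tan(np.radians(angle_deg))
    pad = int(np.ceil(abs(t) * h)) + 1
    out = np.zeros((h, w + 2 * pad), dtype=word.dtype)
    for y in range(h):
        s = int(round((y - h / 2) * t))
        out[y, pad + s:pad + s + w] = word[y]
    return out

def estimate_slant(word, angles=range(-45, 46)):
    """Return the shear angle whose vertical projection maximizes WVD intensity.
    Note: the sheared images differ slightly in width; this sketch ignores that."""
    def intensity(a):
        projection = shear(word, a).sum(axis=0)       # vertical histogram
        return np.abs(pseudo_wvd(projection)).max()
    return max(angles, key=intensity)
```

Correcting the word then amounts to shearing it by the negative of the estimated angle (up to the sign convention of the shear).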
3. Results and conclusions

There is no objective method to estimate the effectiveness of a slant removal algorithm; the most common way is by sight. Even so, opinions can differ in the case of words that include characters of different orientations. Thus, in the given results we focus more on the general conclusions and less on success rates. The algorithm was tested on both English and Greek words written by about 200 writers, and it dealt quite successfully with all the cases of cursive and hand-printed words. Some problems may arise when a word includes characters with different slants.
Fig. 1. The original word is shown on the left and the corrected word, after the slant removal, is shown on the right.
However, in every case the presented algorithm significantly improves the appearance of the word. In Fig. 1, some experimental results from the application of the algorithm are shown. By including the algorithm in our OCR system, the improvement of the results reaches 3% for cursive words and 1.5% for hand-printed words.
References

[1] M. Watanabe, Y. Hamamoto, T. Yasuda, S. Tomita, Normalization techniques of handwritten numerals for Gabor filters, Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), IEEE, Los Alamitos, CA, Vol. 1, 1997, pp. 303-307.
[2] G. Kim, V. Govindaraju, Efficient chain-code-based image manipulation for handwritten word recognition, Proceedings of SPIE - The International Society for Optical Engineering, Bellingham, WA, Vol. 2660, 1996, pp. 262-272.
[3] S. Knerr, E. Augustin, O. Baret, D. Price, Hidden Markov model based word recognition and its application to legal amount reading on French checks, Computer Vision and Image Understanding 70 (3) (1998) 404-419.
[4] A.W. Senior, A.J. Robinson, An off-line cursive handwriting recognition system, IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (3) (1998) 309-321.
[5] M. Shridar, F. Kimura, Handwritten address interpretation using word recognition with and without lexicon, Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, Piscataway, NJ, Vol. 3, 1995, pp. 2341-2346.