Pattern Recognition 33 (2000) 177–184
A geometric approach to consistent classification

Yoram Baram*

Department of Computer Science, Technion, Israel Institute of Technology, Haifa 32000, Israel

Received 19 February 1998; received in revised form 14 September 1998; accepted 15 January 1999
Abstract

A classifier is called consistent with respect to a given set of class-labeled points if it correctly classifies the set. We consider classifiers defined by unions of local separators (e.g., polytopes) and propose algorithms for consistent classifier reduction. The proposed approach yields a consistent reduction of the nearest-neighbor classifier, relating the expected classifier size to a local clustering property of the data and resolving unanswered questions raised by Hart (IEEE Trans. Inform. Theory IT-14(3) (1968)) with respect to the complexity of the condensed nearest neighbor method. © 1999 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Classification; Local separation; Consistent reduction; Nearest neighbor; Condensed nearest neighbor; Reduced nearest neighbor
1. Introduction

Solutions to the classification problem have been characterized in terms of parameterized separation surfaces. In statistical classification (e.g., [1]), such surfaces may represent the boundaries of the support sets for probability density functions or the intersection surfaces of such functions. Other solutions have been based on the direct construction of separation surfaces. A simple example is that of a hyperplane [2]. The construction of separation surfaces of complexity higher than that of a single hyperplane or a single sphere has been limited to weighted sums of such simple elements (e.g., [3,4]). Probabilistic characterizations of the classification power of such elements were presented by Cover [5] and by Vapnik and Chervonenkis [6]. Yet the order of the separation surface, or the classifier size, has been largely selected in a trial and error fashion. Furthermore, characterizing the classifiability of a data set with respect to its geometric properties, such as clustering, has largely remained an open problem [7,8].

This work was supported in part by the NASA Ames Research Center and in part by the Fund for the Promotion of Research at the Technion.
* Tel.: +972-4-8294356; fax: +972-4-8221128. E-mail address:
[email protected] (Y. Baram)
Parametric classifiers do not necessarily classify the sample points used in their design correctly. A single hyperplane cannot correctly classify a given set of class-labeled points unless the latter are linearly separable. Such misclassification is often intentional, as, in certain cases, the underlying assumption is that the data is 'noisy' and a relatively simple separation surface provides a 'smoothing' effect. However, in many classification problems the data is the only knowledge available, and there is no reason to assume that it represents a simple model of a particular structure. Regarded as the 'truth', the data should then be classified correctly by the classifier. We call classifiers which correctly classify the data consistent. The nearest-neighbor classifier, possessing attractive accuracy properties [9], is perhaps the most popular non-parametric classification method. It is, by definition, consistent. In an attempt to reduce the memory requirement of the nearest-neighbor method, Hart [10] proposed an algorithm which finds a consistent subset of the class-labeled sample points, that is, a subset which, when used as a stored reference set for the nearest-neighbor rule, correctly classifies all of the remaining sample points. Questions on the complexity of the algorithm and the expected size of the resulting consistent subset were raised, but left unanswered. In this paper we present a geometric approach to consistent classification. We observe that the nearest
neighbor criterion imposes a division on R^n which is similar to that of the Voronoi diagram [11]. However, while the Voronoi cells are the minimum-distance domains of each of the points with respect to its neighbors, our basic cells are the minimum-distance local separators of each of the points with respect to its neighbors of the competing class. Such local separators, like Voronoi cells, are multi-linear domains (or polytopes), but, in contrast to Voronoi cells, their number may be reducible without leaving 'holes' in the input space. We call the average number of points of a given class that fall in the minimum-distance local separator of a point of the same class the local clustering degree of the data. It is a measure of the classifiability of the data, and it will make it possible to specify the expected sizes of classifiers. The union of local separators of labeled points of the same class defines a cover for these points, which we call a separator. In the case of minimum-distance local separators, the domain covered by a separator corresponding to a class consists of points of R^n that would be assigned to the same class by the nearest-neighbor method. Moreover, the separator is exclusive: it does not cover any of the points of R^n which are closer to any of the points of the competing class. A new point will be assigned to a class if it falls under the corresponding separator. This is a crude way of performing nearest-neighbor classification. It allows us, however, to find reduced consistent subsets, hence, reduced classifiers. We propose a consistent reduction of the nearest-neighbor classifier and, employing the local clustering degree of the data, derive bounds on its design complexity and on the expected classifier size. The latter is also shown to bound the expected size of Hart's classifier. Observing that the nearest-neighbor method defines a multi-linear separation surface between two classes, we consider, for comparison, direct consistent multi-linear separation. The performance of the proposed algorithms in classifying real data is compared to that of the nearest-neighbor method.
2. Local clustering

Consider a finite set of points X = {x^(i), i = 1, …, N} in some subset of R^n, the real space of dimension n. Suppose that each point of X is assigned to one of two classes, and let the corresponding subsets of X, having N_1 and N_2 points, respectively, be denoted X_1 and X_2. We shall say that the two sets are labeled L_1 and L_2, respectively. It is desired to divide R^n into labeled regions, so that new, unlabeled points can be assigned to one of the two classes. We define a local separator of a point x of X_1 with respect to X_2 as a convex set, s(x|2), which contains x and no point of X_2. A separator family is defined as a rule that produces local separators for class-labeled points.

We call the set of those points of R^n that are closer to a point x ∈ X_1 than to any point of X_2 the minimum-distance local separator of x with respect to X_2. It is generated by the following procedure: Connect each of the points of X_2 to x and place a hyperplane perpendicular to the line segment connecting the two points at the midpoint. Each of these hyperplanes divides the space into two half-spaces. The intersection of the half-spaces containing x defines the minimum-distance local separator of x ∈ X_1 with respect to X_2. The minimum-distance local separators of the points of X_2 with respect to the set X_1 can be found in a similar manner.

The definition of the minimum-distance local separator of a point resembles that of its Voronoi cell, but the two should not be confused. Given a finite number of points x^(1), …, x^(N) ∈ R^n, the Voronoi cell of x^(i) consists of those points of R^n that are at least as close to x^(i) as to any other point x^(j), j ≠ i (see, e.g., [12]). The class assignment of the given points is immaterial. While each of the faces of a Voronoi cell is shared by another cell, the local separators of points of either the same or different classes do not generally share faces. While the Voronoi cell of a point does not contain any of the other given points, the local separator of a point may contain other points of the same class. This difference is crucial to our purposes, since, in contrast to Voronoi cells, local separators may be eliminated without leaving 'holes' in the data space.

Fig. 1 shows the minimum-distance local separator of a point x ∈ X_1, where X_1 is represented by the set of black-filled circles, from the set X_2, represented by the white-filled circles. It should be noted that while only the positive part of R² is considered in this illustration, local separators need not be bounded.

Fig. 1. Separating x ∈ X_1 from X_2 by a minimum-distance local separator.

It is quite obvious that the ability to classify a data set depends somehow on its structure. A correspondence between the classifiability of a data set and the clustering
of points in it has been noted by many authors. The underlying notion is that data points belonging to the same class are grouped, or clustered, in some geometrically consistent manner. However, not only is a formal characterization of this correspondence lacking, but, oddly enough, it appears that the notion of a data cluster has never been formally defined. Anderberg [7] states that 'the term "cluster" is often left undefined and taken as a primitive notion, in much the same way as a "point" is treated in geometry'. Everitt [8] argues that a formal definition of a cluster is 'not only difficult but may even be misplaced'. He notes observations to this effect by Bonner [13], who suggested that the ultimate criterion for evaluating the meaning of such terms is the value judgment of the user, and by Cormack [14] and Gordon [15], who attempted to define a cluster using such properties as 'internal cohesion and external isolation'. Jain and Dubes [16] suggest that cluster analysis 'should be supplemented by techniques for visualizing data'. The general notion seems to be that while we cannot define a cluster, we know it when we see one. However, biological vision is physically confined to two or, at most, three dimensions and, for our classification purposes, we need a cluster characterization that will apply to higher dimensional spaces.

We approach the classifiability characterization problem by defining a clustering property of the data. Consider the minimum-distance local separator, s(x|2), of a point x ∈ X_1 with respect to X_2. Let s be the fraction of points of X_1 which are within s(x|2). Suppose that the points of X_1 and X_2 are independently selected at random from the class domains L_1 and L_2 in R^n. We define the local clustering degree of X_1 with respect to X_2 as the expected value of s and denote it c_{1/2}. It is the probability that, in independent sampling, a random point of L_1 will be closer to a randomly chosen point of X_1 than to any point of X_2. The clustering degree of X_2 with respect to X_1 is defined similarly and denoted c_{2/1}. The clustering degree of the entire data set X is c = min{c_{1/2}, c_{2/1}}.

Suppose, for instance, that X_1 forms a single convex cluster. Then its clustering degree will be close to 1. If, on the other hand, the two sets X_1 and X_2 are highly mixed, their clustering degrees will be close to 0. The clustering degree makes it possible to quantify the classifiability of a given data set. The higher the clustering degree of the data, the more classifiable it is. If the clustering degree is high, then, as we shall see, few local separators will suffice for solving the classification problem. If it is low, then the classification problem is hard and many local separators will be required for solving it.
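The local clustering degree can be estimated directly from a labeled sample. The sketch below, which assumes numpy arrays X1 and X2 holding the points of the two classes (the helper name clustering_degree is ours, not the paper's), uses the fact that a point z lies in s(x|2) exactly when z is closer to x than to every point of X_2.

```python
import numpy as np

def clustering_degree(X1, X2):
    """Empirical estimate of c_{1/2}: the average fraction of X1 points
    falling inside the minimum-distance local separator s(x|2) of a point
    x of X1. A point z lies in s(x|2) iff it is closer to x than to every
    point of X2."""
    # distance from each point of X1 to its nearest point of X2
    d_min2 = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=2).min(axis=1)
    # pairwise distances within X1
    d11 = np.linalg.norm(X1[:, None, :] - X1[None, :, :], axis=2)
    # inside[i, j] is True when X1[i] lies in the separator of X1[j];
    # the diagonal is True, since each point lies in its own separator
    inside = d11 < d_min2[:, None]
    return inside.mean()

# clustering degree of the entire data set:
# c = min(clustering_degree(X1, X2), clustering_degree(X2, X1))
```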
3. Consistent reduction of the nearest-neighbor classifier

The nearest-neighbor classification criterion assigns to L_1 an unlabeled point that is closer to a labeled point of L_1 than to any labeled point of L_2, and vice versa. The rationale behind it is that a point which is close to a labeled point is 'similar' to it and should be labeled the same way. The Condensed Nearest-Neighbor (CNN) algorithm proposed by Hart [10] finds a consistent subset of the data, that is, a subset which, when used as a reference set for the nearest-neighbor classifier, correctly classifies the remaining data points. The algorithm is described below:

Hart's CNN algorithm. Bins called STORE and GARBAGE are set. The data is first placed in GARBAGE while STORE is empty. The first sample is placed in STORE. The ith sample is classified by the nearest-neighbor rule, using as a reference set the current contents of STORE. If classified correctly, it is placed in GARBAGE; otherwise it is placed in STORE. The procedure continues to loop through GARBAGE until termination.

Hart left finding the complexity of the algorithm and the expected size of the resulting consistent subset (that is, the resulting classifier size) as unsolved problems. The following algorithm for finding a consistent subset, which, for distinction, we call the Reduced Nearest-Neighbor (RNN) algorithm, is simpler than the CNN algorithm, and its expected computational complexity and the expected size of the resulting consistent subset are relatively easy to derive.

Algorithm RNN. For convenience, let the data be given in separate records of X_1 and X_2. Let the sets of points of X_1 and X_2 already selected for the consistent subset be denoted X_{1,S} and X_{2,S}, respectively. For each point of X_1 put in X_{1,S}, find the minimum-distance local separator with respect to X_2. Include a new point of X_1 in X_{1,S} only if it does not fall in any of the local separators of the points already in X_{1,S}. The construction of X_{2,S} is similar, with the sets X_1 and X_2 interchanging roles.

A more detailed implementation of the RNN algorithm is given below.

1. A point x_1 of X_1 is placed in X_{1,S}.
2. For the ith point, x_i, of X_1, perform the following:
(a) For the jth point, y_j, of X_{1,S}, find the distance d(i, j) from x_i to y_j.
(b) Find the minimum distance d_min(i|2) from x_i to the set X_2.
(c) If d(i, j) ≤ d_min(i|2), eliminate x_i from X_1; otherwise, increase j by 1 and go to (a).
3. If x_i is not eliminated, add it to X_{1,S}. Increase i by 1 and go to (2), unless X_1 is empty.
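A minimal sketch of one pass of Algorithm RNN follows, assuming numpy arrays X1 and X2 for the two labeled sets (the helper name rnn_pass is ours). Step 2(c) amounts to testing whether a candidate falls inside the minimum-distance local separator of an already-selected point of its own class.

```python
import numpy as np

def rnn_pass(X1, X2):
    """One-sided pass of Algorithm RNN: returns the points of X1 selected
    for the consistent subset X_{1,S}. A candidate x is discarded when it
    is at least as close to an already-selected point of its own class as
    to the nearest point of the competing class X2 (i.e. x lies inside
    that point's minimum-distance local separator)."""
    selected = [X1[0]]                                   # step 1
    for x in X1[1:]:                                     # step 2
        d_min = np.linalg.norm(X2 - x, axis=1).min()     # d_min(i|2), step 2(b)
        if not any(np.linalg.norm(x - y) <= d_min for y in selected):
            selected.append(x)                           # step 3
    return np.array(selected)

# the full consistent subset combines both one-sided passes:
# X_S = rnn_pass(X1, X2), rnn_pass(X2, X1)
```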
The complexity of the algorithm and the expected size of the resulting consistent set are specified by the following result:

Theorem 1. The complexity of Algorithm RNN is O(N²). The expected size of the resulting consistent set is no greater than (1 − c)N + 1.

Proof. For each of the points of X_1, only the distances to some of the other points of X_1 and to all the points of X_2 are required (finding the minimal distances comes at the same cost), which proves the first part of the theorem. The second part follows immediately from the geometric interpretation of the algorithm: Consider the first point, say, x, included in X_{1,S}. Its local separator contains, on average, a fraction c_{1/2} of the points of X_1. Since these will be discarded down the line, a fraction 1 − c_{1/2} will, on average, remain (the remaining set may, of course, be further reduced). Since x itself also remains, the assertion follows. ∎

The analysis of the RNN algorithm, which, in contrast to the CNN algorithm, avoids repetitious calculations, facilitates an analysis of the CNN algorithm, which is summarized by the following result:

Theorem 2. The complexity of the CNN algorithm is O(N³). The expected size of the consistent subset generated by the CNN algorithm is no greater than (1 − c)N + 1.

Proof. It is not difficult to see that there cannot be more than N runs through GARBAGE, with the number of points in GARBAGE reduced by at least one on each iteration. On each iteration, the distance between each of the points in GARBAGE (at most N) and each of the points in STORE (at most N) is calculated (finding the shortest distance comes at the same cost). It follows that the complexity of CNN is O(N³). The expected size of the resulting consistent subset is, however, smaller than that of the RNN algorithm. This is a consequence of the fact that the RNN algorithm adds a point of the first class to the consistent subset if it is closer to the set of all the points of the second class than to the points of the first class already included in the subset, while the CNN algorithm adds a point of the first class to the consistent subset if it is closer to the smaller set of points of the second class already included in the subset than to the points of the first class already included in the subset. Since, in independent sampling, a larger set of random points is likely to be closer to a random point than a smaller set, the RNN algorithm is likely to include more points in the consistent subset than the CNN algorithm. It follows that the expected size of the consistent subset generated by the CNN algorithm is no greater than (1 − c)N + 1, as asserted. ∎

The higher complexity of the CNN algorithm is explained by the fact that, since data points that are correctly classified are returned to GARBAGE, there are
repetitions of distance calculations for the same pairs of points and, subsequently, comparisons between the same distances. The RNN algorithm avoids such repetitions.
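For comparison with the RNN sketch above, a sketch of Hart's CNN rule as described in the previous section (the helper name is ours; X is an (N, n) array and labels holds the class assignments):

```python
import numpy as np

def cnn_consistent_subset(X, labels):
    """Hart's CNN rule: STORE starts with the first sample; the remaining
    samples cycle through GARBAGE until a complete pass transfers nothing."""
    store_x, store_y = [X[0]], [labels[0]]
    garbage = list(range(1, len(X)))
    changed = True
    while changed and garbage:
        changed = False
        kept = []
        for i in garbage:
            d = np.linalg.norm(np.array(store_x) - X[i], axis=1)
            if store_y[int(d.argmin())] == labels[i]:
                kept.append(i)             # correctly classified: back to GARBAGE
            else:
                store_x.append(X[i])       # misclassified: move to STORE
                store_y.append(labels[i])
                changed = True
        garbage = kept
    return np.array(store_x), np.array(store_y)
```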
4. Consistent multi-linear classification

We define a separator S(1|2) of X_1 with respect to X_2 as a set that includes X_1 and excludes X_2. Given a separator family, the union of the local separators s(x^(i)|2) of the points x^(i), i = 1, …, N_1, of X_1 with respect to X_2,

S(1|2) = ∪_{x^(i) ∈ X_1} s(x^(i)|2),   (1)

is a separator of X_1 with respect to X_2. It consists of N_1 local separators. Let X_{1,c} be a subset of X_1. The set

S_c(1|2) = ∪_{x^(i) ∈ X_{1,c}} s(x^(i)|2)   (2)
will be called a consistent separator of X_1 with respect to X_2 if it contains all the points of X_1. The set X_{1,c} will then be called a consistent subset with respect to the given separator family. Given a separator, S(1|2), a new point will be assigned to X_1 if it is included in S(1|2). It should be noted that S(1|2) and S(2|1) may, but need not, be either mutually exclusive or complementary, even when they are based symmetrically on the same rules, interchanging the roles of X_1 and X_2. Consequently, the separators S(1|2) and S(2|1) need not produce the same results. Employing minimum-distance local separators, the nearest-neighbor method will assign a point to L_1 if it is covered by S(1|2) (note that in this case S(1|2) and S(2|1) are mutually exclusive and complementary). This, however, is not an efficient approach to nearest-neighbor classification. The memory requirement for each of the minimum-distance separators of the points of X_1 with respect to X_2 is N_2. Since there are N_1 points in X_1, the memory requirement of S(1|2) is O(N_1 N_2) = O(N²). In contrast, the memory requirement for the direct nearest-neighbor method (finding the minimum distance from a new point to the labeled set) is O(N_1 + N_2) = O(N). Yet, the approach has a conceptual value for understanding reduced classifiers of the form S_c(1|2). Suppose that we have the Voronoi cells of the combined set X = X_1 ∪ X_2 (that is, all points are treated equally, regardless of class assignment). Then, clearly, the union of the Voronoi cells of the points of X_1 defines the domain of points which will be assigned to L_1 according to the nearest-neighbor criterion. Any point outside this union will be assigned to L_2. Since the Voronoi cells are disjoint, none can be eliminated without leaving 'holes' in the input space (including, of course, the point whose
cell is eliminated). Since the number of data points may be large, the use of Voronoi cells in classification may be impractical. In addition, the complexity of constructing a Voronoi diagram in high-dimensional space may be considerably higher than that of constructing a minimum-distance separator. (It is known [17] that the complexity of constructing a Voronoi diagram is O(N log₂ N) in R² and O(N² log₂ N) in R³. Results for higher dimensional spaces do not appear to be available.)

A reduced minimum-distance (RMD) classifier may be produced by selecting a consistent subset of the labeled points corresponding to one of the two classes, whose minimum-distance local separators cover the entire set. This may be done by selecting an arbitrary labeled point for the consistent subset, eliminating all the labeled points of the same class that are within its local separator, selecting an arbitrary point from the remaining points of the same class, etc. Such a selection is illustrated by Fig. 2. It shows that the minimum-distance separators of three out of the five points of X_1 suffice for constructing a separator for X_1, and the local separators of three of the four points of X_2 cover X_2. In each case, although the sample set is reduced, the remaining points, constituting a consistent subset, still produce a separating surface between the two sets of class-labeled points. The surface is the boundary of the union of their local separators. It can be seen that the boundaries produced for the two cases are different. Each may be used for classification, producing possibly different results for new points. If all the minimum-distance separators were used, the two boundaries would be identical.

The nearest-neighbor criterion and the CNN, RNN and RMD classifiers impose multi-linear separations between the classes. We note that multi-linear separation as a sole criterion has been used in previous classification concepts, such as the perceptron [2] and neural networks consisting of linear threshold elements [18]. Next, we present a direct, seemingly natural construction of a consistent multi-linear classifier. This classifier may not possess good predictive abilities, and its memory requirement may be higher than that of the nearest-neighbor method. Yet, it will be of value as a reference for performance evaluation (showing that consistency does not guarantee quality).

Fig. 2. (a) The minimum-distance local separators of x_1, x_2 and x_3 and the associated separator of X_1 with respect to X_2. (b) The minimum-distance local separators of y_1, y_2 and y_3 and the associated separator of X_2 with respect to X_1.

A multi-linear local separator (MLLS) of a point x of X_1 with respect to X_2, denoted s(x|2), is generated by the following algorithm:

Algorithm MLLS.
1. Let X̃_2 = X_2.
2. Find the point of X̃_2 nearest to x.
3. Place a hyperplane perpendicular to the line segment connecting the two points at the midpoint.
4. Let the new X̃_2 be the set of points of the old X̃_2 which are on the side of the hyperplane which includes x.
5. While X̃_2 is non-empty, go to (2).

The local multi-linear separator is the intersection of the half-spaces on the sides of the hyperplanes placed in step (3) which include x. The local multi-linear separator of a point of X_2 with respect to the set X_1 can be found in a similar manner. The union of the multi-linear local separators of the points x^(i), i = 1, …, N_1, of X_1 with respect to X_2 is a separator of X_1 with respect to X_2. It consists of N_1 local separators. A reduced separator, consisting, generally, of fewer local separators, is generated by the following Multi-Linear Classifier (MLC) design algorithm, which is written in terms of S(1|2), but is applicable to S(2|1), with an obvious change of variables.

Algorithm MLC.
1. Let X̃_1 = X_1.
2. Select an arbitrary point of X̃_1. Find its multi-linear separator with respect to X_2, employing MLLS.
3. Let the new X̃_1 be the set of points of the old X̃_1 which are outside the separator found in (2).
4. While X̃_1 is non-empty, go to (2).

The separator is the union of the local separators found in step (2). In the rest of this paper we shall use the terms multi-linear local separator and multi-linear separator to represent the objects generated by algorithms MLLS and MLC, respectively.

Fig. 3 shows the multi-linear local separators of points of X_1 and X_2, and the resulting separators (indicated by the thick lines). These can be seen to be different from the ones shown in Fig. 2 for reduced minimum-distance classification. We define the size of the multi-linear classifier as the total number of local separator faces, which represents the total amount of memory needed for implementing the classifier. Bounds on the complexity of the algorithm and on the expected size of the classifier are specified next.
Fig. 3. (a) The multi-linear local separators of x_1, x_2 and x_3 and the associated separator of X_1 with respect to X_2. (b) The multi-linear separators of y_1 and y_2 and the associated separator of X_2 with respect to X_1.
Theorem 3. The complexity of Algorithm MLC is O(N²). The expected size of the resulting classifier is smaller than 0.25[(1 − c)N + 2]².
Proof. Consider first the procedure of finding the separator of an arbitrary point x of X_1. Finding the point y of X_2 nearest to x for constructing the first face of the separator is, clearly, O(N_2). Each of the points of X_2 is now checked for being on the side opposite to x of the hyperplane separating x from y, in which case it is eliminated from the remainder of this procedure. This is another O(N_2) process. On average, fewer than N_2(1 − c_{2/1}) of the points of X_2 remain after this step (since the expected total number of points of X_2 outside the separator of y is N_2(1 − c_{2/1}), there are, on average, fewer than that number of points of X_2 on the side of the hyperplane which includes x). This procedure has to be repeated for the subset of X_1 which falls outside the local separator of x. The size of this subset is, clearly, smaller than N_1, and its expected value is N_1(1 − c_{1/2}). Since N_1 N_2 ≤ 0.25N², and since, on the one hand, the face corresponding to y belongs to the separator of x, and, on the other, x belongs to the consistent subset defining the separator of X_1, the assertion follows. ∎

The following modification of the MLC algorithm will generally reduce the size of the classifier.

Algorithm MLCR. Same as Algorithm MLC, but at step (2), instead of randomly selecting a point from X̃_1, select the one whose separator has the largest volume of all the separators of the points of X̃_1. Since the separator volumes are generally hard to calculate, select the point whose separator contains the maximal number of points of X̃_1.

The classifier constructed by the MLCR algorithm can be expected to have a smaller size than the one employing random selection, since each separator selected for S(1|2) will potentially contain more points of X_1, and fewer separators will be needed for covering the set X_1.
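A sketch of Algorithms MLLS and MLC follows, assuming numpy arrays and representing each local separator as a list of half-space constraints (n, b), where n·z < b means z lies on x's side of the corresponding face; the helper names are ours.

```python
import numpy as np

def mlls(x, X2):
    """Algorithm MLLS: the multi-linear local separator of x with respect
    to X2, returned as a list of half-space constraints (n, b)."""
    faces, remaining = [], X2.copy()
    while len(remaining):
        y = remaining[np.linalg.norm(remaining - x, axis=1).argmin()]  # step 2
        n = y - x                              # normal of the bisecting hyperplane
        b = n @ (x + y) / 2.0                  # hyperplane through the midpoint (step 3)
        faces.append((n, b))
        remaining = remaining[remaining @ n < b]   # keep x's side only (step 4)
    return faces

def covered(z, faces):
    """z is inside the separator iff it satisfies every face constraint."""
    return all(z @ n < b for n, b in faces)

def mlc(X1, X2, seed=0):
    """Algorithm MLC: a reduced separator S(1|2), built as the union of the
    local separators of a randomly selected consistent subset of X1."""
    rng = np.random.default_rng(seed)
    separators, remaining = [], X1.copy()
    while len(remaining):
        faces = mlls(remaining[rng.integers(len(remaining))], X2)   # step 2
        separators.append(faces)
        keep = [z for z in remaining if not covered(z, faces)]      # step 3
        remaining = np.array(keep).reshape(-1, X1.shape[1])
    return separators
```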
5. Examples

Example 1. Given a sequence of k daily trading ('close') values of a stock, it is desired to predict whether the next day will show an increase or a decrease with respect to the last day in the sequence. An initial value of k = 3 is selected, but it is increased if there are ambiguities (different predictions for the same inputs) in the training data. Records for 10 different stocks, each containing, on average, 1260 daily values, were used. About 60% of the data were used for training and the rest for testing. The results achieved by the nearest-neighbor (NN), the condensed nearest-neighbor (CNN), the reduced nearest-neighbor (RNN) and the multi-linear (ML) classifiers are given in Fig. 4, for each of the stocks, whose code names are specified in the leftmost column. The memory reduction rates achieved by the CNN and the RNN algorithms with respect to the nearest-neighbor method were 37.3 and 35.3%, respectively. The ML algorithm transformed the data into a set of, on average, 0.2N separators, each possessing five faces, hence, no data reduction. The table shows that, on average, the nearest-neighbor method has produced the best results. The performances of the CNN and the RNN classifiers (the latter producing only slightly better results) are somewhat lower and that of the ML classifier is lower yet. This shows that different consistent classifiers do not necessarily produce results of similar quality. While the average performances are of certain interest, the individual results of each of the algorithms for each of the stocks are of practical significance, since in actual stock trading, one has the choice of both the prediction method and the stock to be traded.

Example 2. The Pima Indians Diabetes Database [19] has 768 instances of eight real-valued measurements, corresponding to 268 ill patients and 500 healthy ones. These two classes were found to be highly mixed and difficult to characterize, examining any subset or linear combination of the measurements, including the principal directions of the covariance. Five different training sets were defined, as shown in Fig. 5, with the rest of the data serving as test sets. The success rates achieved by the different classifiers are given in Fig. 5 for each of the cases, along with the average values. The results are similar in nature to those of the previous example.
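Returning to Example 1, the paper leaves the exact pattern construction open; the following sketch shows one plausible reading (the helper name and the labeling convention are our assumptions, not taken from the paper):

```python
import numpy as np

def stock_patterns(closes, k=3):
    """Each pattern is a window of k consecutive closing values, labeled 1
    if the next day's value rises with respect to the last day in the
    window, and 0 otherwise (our assumed encoding)."""
    X, y = [], []
    for t in range(len(closes) - k):
        X.append(closes[t:t + k])
        y.append(int(closes[t + k] > closes[t + k - 1]))
    return np.array(X), np.array(y)
```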
6. Conclusion

Solutions to the consistent classification problem have been specified in terms of local separators of data points of one class with respect to the other. Reduced consistent classifiers for the nearest-neighbor and the multi-linear separation criteria have been presented, and their design
Fig. 4. Success rates in the prediction of rise and fall in stock values.
Fig. 5. Success rates in the classification of the diabetes database.
complexities and expected sizes have been specified in terms of the local clustering degree of the data.
7. Summary

The nearest-neighbor classification criterion imposes a division on R^n which is similar to that of the Voronoi diagram. However, while the Voronoi cells are the minimum-distance domains of each of the points with respect to its neighbors, the basic cells of the nearest-neighbor classification method are the minimum-distance local separators of each of the points with respect to its neighbors of the competing class. Such local separators, like Voronoi cells, are multi-linear domains (or polytopes), but, in contrast to Voronoi cells, their number may be reducible without leaving 'holes' in the input space. The average number of points of a given class that fall in the minimum-distance local separator of a point of the same class is called the local clustering degree of the data. It is a measure of the classifiability of the data, and it makes it possible to specify the expected sizes of classifiers. The union of local separators of labeled points of the same class defines a cover for these points, which is called a separator. In the case of minimum-distance local separators, the domain covered by a separator corresponding to a class consists of points of R^n that would be assigned to the same class by the nearest-neighbor method. Moreover, the separator is exclusive: it does not cover any of the points of R^n which are closer to any of the points of the competing class. A new point will be assigned to a class if it falls under the corresponding separator. This is a crude way of performing nearest-neighbor classification. It allows us, however, to find reduced consistent subsets, hence, reduced classifiers. A consistent reduction of the nearest-neighbor classifier is proposed and, employing the local clustering degree of the data, bounds on its design complexity and on the expected classifier size are derived. The latter is also shown to bound the expected size of Hart's condensed nearest-neighbor classifier (1968). The existence of reduced consistent versions of the nearest-neighbor classifier, which are likely to produce higher error rates, supports an objection raised by Webb (1996) against a previously proposed utility of Occam's razor in classification. An observation that the nearest-neighbor method defines
a multi-linear separation surface between two classes suggests direct consistent multi-linear separation. The performance of the proposed algorithms in predicting stock behavior is compared to that of the nearest-neighbor method, providing yet further experimental evidence against the utility of Occam's razor.

Acknowledgements

The author acknowledges Dr. Amir Atiya of Cairo University for providing the stock data used in the examples and for valuable discussions of the corresponding results.

References

[1] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, San Diego, 1990.
[2] F. Rosenblatt, The perceptron: a probabilistic model for information storage and organization in the brain, Psychol. Rev. 65 (1958) 386–408.
[3] G. Cybenko, Approximation by superposition of sigmoidal functions, Math. Control Signals Systems 2 (1989) 303–314.
[4] Y. Baram, Classification by balanced binary representation, Neurocomputing 10 (1996) 347–357.
[5] T.M. Cover, Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition, IEEE Trans. Electron. Comput. EC-14 (1965) 326–334.
[6] V.N. Vapnik, A.Ya. Chervonenkis, On the uniform convergence of relative frequencies of events to their probabilities, Theory Probab. Appl. 16 (2) (1971) 264–280 (first published in Russian, May 1969).
[7] M.R. Anderberg, Cluster Analysis for Applications, Academic Press, New York, 1973.
[8] B.S. Everitt, Cluster Analysis, third ed., Edward Arnold, London, 1993.
[9] T.M. Cover, P.E. Hart, Nearest-neighbor pattern classification, IEEE Trans. Inform. Theory IT-13 (1) (1967) 21–27.
[10] P.E. Hart, The condensed nearest-neighbor rule, IEEE Trans. Inform. Theory IT-14 (3) (1968) 515–516.
[11] D.T. Lee, F.P. Preparata, Computational geometry: a survey, IEEE Trans. Comput. C-33 (12) (1984) 1072–1101.
[12] J.H. Conway, N.J.A. Sloane, Sphere Packings, Lattices and Groups, Springer, New York, 1988.
[13] R.E. Bonner, On some clustering techniques, IBM J. Res. Dev. 8 (1964) 22–32.
[14] R.M. Cormack, A review of classification, J. Roy. Statist. Soc. 134 (1971) 321–367.
[15] A.D. Gordon, Classification, Chapman & Hall, London, 1980.
[16] A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, NJ, 1988.
[17] I.Ya. Akimova, Application of Voronoi diagrams in combinatorial problems (a survey), Tekh. Kibern. 22 (2) (1984) 102–109; Eng. Cybernet. 22 (4) (1984) 6–12.
[18] E.B. Baum, On the capabilities of multilayer perceptrons, J. Complexity 4 (1988) 193–215.
[19] University of California at Irvine: Machine Learning Data Bases (www.ics.uci.edu/AI/ML/Machine-Learning.html).
About the Author: YORAM BARAM received the B.Sc. degree in aeronautical engineering from the Technion - Israel Institute of Technology, Haifa, the M.Sc. degree in aeronautics and astronautics, and the Ph.D. degree in electrical engineering and computer science, both from the Massachusetts Institute of Technology, Cambridge, in 1972, 1974, and 1976, respectively. In 1974–1975 he was with the Charles Stark Draper Laboratory, Cambridge, MA. In 1977–1978 he was with the Analytic Sciences Corporation, Reading, MA. In 1978–1983 he was a faculty member at the Department of Electronic Systems, School of Engineering, Tel-Aviv University, and a consultant to the Israel Aircraft Industry. Since 1983 he has been with the Technion - Israel Institute of Technology, where he is an Associate Professor of Computer Science. In 1986–1988 he was a Senior Research Associate of the National Research Council at the NASA-Ames Research Center, Moffett Field, CA, where he has also spent the following summers. His current research interests are in pattern recognition and neural networks.
Pattern Recognition 33 (2000) 185–194
Adaptive linear dimensionality reduction for classification

Rohit Lotlikar, Ravi Kothari*

Artificial Neural Systems Laboratory, Department of Electrical and Computer Engineering and Computer Science, University of Cincinnati, P.O. Box 210030, Cincinnati, OH 45221-0030, USA

Received 7 August 1998; accepted 3 February 1999
Abstract

Dimensionality reduction is the representation of high-dimensional patterns in a low-dimensional subspace based on a transformation which optimizes a specified criterion in the subspace. For pattern classification, the ideal criterion is the minimum achievable classification error (the Bayes error). Under strict assumptions on the pattern distribution, the Bayes error can be analytically expressed. We use this as a starting point to develop an adaptive algorithm that computes a linear transformation based on the minimization of a cost function that approximates the Bayes error in the subspace. Using kernel estimators we then relax the assumptions and extend the algorithm to more general pattern distributions. Our simulations with three synthetic and one real-data set indicate that the proposed algorithm substantially outperforms Fisher's Linear Discriminant. © 1999 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Dimensionality reduction; Classification; Adaptive algorithms
1. Introduction

Input variables of high-dimensional patterns are often correlated such that the intrinsic dimensionality of these patterns is significantly lower than the input dimensionality. Reducing the dimension of the input patterns removes redundant information from the patterns, and allows for more reliable classification in the subspace with limited patterns. For some classifiers, especially those which are sensitive to the diluting effect of extraneous information (e.g. nearest-neighbor classifiers), there is often an improvement in classification accuracy. Consequently, classification of high-dimensional patterns is often preceded by mapping these patterns to a lower dimension subspace. The transformation from the high-dimensional space to a lower dimensional subspace is constructed to optimize a specified criterion in the subspace.
* Corresponding author. Tel.: +513-556-4766; fax: +513-556-7326. E-mail address:
[email protected] (R. Kothari)
Broadly, the majority of the techniques proposed for pattern classification can be categorized as supervised or unsupervised depending on whether or not the class label is used in arriving at the subspace. Other categorizations such as linear and non-linear, or parametric and non-parametric, are also possible. Probably the most well-known example of (linear) unsupervised dimensionality reduction is based on principal component analysis (PCA) [1]. PCA provides the optimum representation (in a mean-squared error sense) for a given number of features, though it may lead to identifying combinations which are entirely unsuitable from the perspective of classification. Although unsupervised techniques do not use a class label, they may impose other constraints to find the lower dimension subspace. For example, (non-linear) Sammon's mapping [2], self-organizing feature maps (SOFM) [3], and curvilinear component analysis (CCA) [4] attempt to preserve the local structure by imposing a greater emphasis on the preservation of shorter distances over longer distances. Supervised dimensionality reduction techniques, on the other hand, take advantage of the class label and typically can project the data such that it is more amenable to
classi"cation. Discriminant analysis, for example, uses the concept of a within-class scatter matrix S and a bew tween-class scatter matrix S to maximize a separation b criterion, such as J"tr(S~1S ). (1) w b The advantage of linear discriminant analysis is that it is non-recursive, though as given above it is not applicable for multi-modal distributions. It is worth noting that the criterion in Eq. (1) is not directly related to classi"cation accuracy. The above-mentioned techniques along with some neural network methods for dimensionality reduction are discussed and compared by Mao and Jain [5]. For pattern classi"cation, the ideal criteria is the minimum achievable classi"cation error or the Bayes error. The increase in Bayes error is a measure of the loss of relevant information when dimensionality is reduced. When reducing dimensionality, the Bayes error rate in the reduced-dimension space is therefore an ideal measure to assess the classi"ability of the projected data. However, Bayes error cannot be easily expressed analytically for arbitrary pattern distributions and hence there have been few studies that use Bayes error as the criteria. Buturovic` [6] uses a k-NN method to obtain an estimate of the Bayes error, computed on the training data set projected to a low-dimensional subspace using conventional methods of dimensionality reduction [6]. The simplex method is then used as the optimization algorithm. An optimal value of k is not easy to "nd and the estimate of the Bayes error is directly dependent on k. The authors suggest estimating it based on standard methods of error estimation such as leave-one-out and re-substitution methods. However, the approach is computationally expensive. On the other hand, the reverse procedure of "rst selecting a classi"er and then using the decision boundaries produced by the classi"er to "nd suitable subspaces has also been considered [7]. The fundamental drawback of this technique is that the quirks of the selected classi"er precludes optimal extraction of features. The approach we adopt in this paper is to derive an adaptive algorithm for dimensionality reduction based on some assumptions about the pattern distribution. Under these assumptions, the Bayes error is analytically obtained allowing for an algorithm which provides nearoptimal performance when the assumed pattern distribution matches the true pattern distribution. We present this algorithm in Section 2. In Section 3, we extend this approach using kernel estimators for the case when the pattern distribution does not satisfy the assumed pattern distribution. We present results using three synthetic and one real-world data sets in Section 4. We "nd that the basic algorithm performs nearly as well as the algorithm extended to deal with arbitrary pattern distributions if a whitening transform is used to preprocess the data. We present our conclusions in Section 5.
2. Reducing dimensionality

We begin the development of the algorithm by considering multi-category exemplars in an n-dimensional real space R^n. Two assumptions are made regarding the distribution of samples:

(A1) Each class has a normal distribution characterized by a Gaussian density function and a covariance matrix Σ = σ²I (where I is the identity matrix). The covariance matrix is the same for all classes but each class may have a different mean.

(A2) The classes have equal a priori probabilities.

These assumptions are unlikely to hold for real-world problems; however, they are not overly restrictive. We use them to develop the basic algorithm and establish it on a firm theoretical foundation. They simplify our analysis and also allow us to build a classifier with an error equal to the Bayes error in the case when the pattern distribution satisfies the assumed distribution. Later in the paper we relax these assumptions.

Our objective is to find, among all m-dimensional linear subspaces of R^n, where m has been fixed a priori (m ≤ n), a subspace S ⊆ R^n in which the Bayes error is minimized. We also wish to find a corresponding transformation T : R^n → S. Since the Bayes error depends only upon the subspace, T can be any linear transformation with range space S. We assume, without loss of generality, that T is orthonormal and is parameterized by an orthonormal n×m matrix W. The columns of W will form an orthonormal basis for S. Under such an orthonormal transformation, a Gaussian distribution with (n×n) covariance matrix Σ = σ²I remains Gaussian and its (m×m) covariance matrix in the subspace is Σ̂ = σ²I.¹

When there are only two classes, the Bayesian boundary is a hyperplane and is the orthogonal bisector of the line segment joining the two class centers. The probability that samples from class 1 are incorrectly affiliated to class 2 can be expressed in terms of the marginal distribution p_m(t) of class 1 along the line joining the two centers
e = ∫_{d̂/2}^{∞} p_m(t) dt,   (2)

where d̂ is the distance between the class centers in the output subspace. This distance depends on the distance between the centers in the input space and the transformation W (we will make the relationship explicit shortly). The marginal distribution is a Gaussian with
¹ As done here, for each input space variable, the corresponding output space variable is denoted by a 'hat' above the variable.
variance σ², so that

e = (1/(√(2π) σ)) ∫_{d̂/2}^{∞} exp(−t²/(2σ²)) dt.   (3)
When the two classes do not have identical covariance matrices, the Bayesian boundary is not a hyperplane and the overlap between the pdfs of the two classes cannot be expressed in terms of a marginal distribution. When we have c classes {C_k : k = 1, …, c}, the Bayesian boundaries form a Voronoi tessellation of the output space. In this case the probability of misclassification for class C_k is the probability mass of class C_k that lies outside the Voronoi polygon it is associated with. This quantity cannot be expressed in a simple analytic form for c > 2. Instead, as an approximation, we consider each pair of classes and minimize the sum total of the pairwise errors e_ij.

Let μ^(i) be the mean of class i in the input space. Let μ̂^(i) denote its mean in the output space. The two are related through μ̂^(i) = W^T μ^(i). Let d̂_ij = ||μ̂^(i) − μ̂^(j)|| denote the Euclidean distance between the means of class i and class j in the subspace. The sum total of pairwise errors forms our objective function and is given by

J = 2 Σ_{i=1}^{c} Σ_{j=i+1}^{c} (1/(√(2π) σ)) ∫_{d̂_ij/2}^{∞} exp(−t²/(2σ²)) dt.   (4)
The objective function of Eq. (4) is highly non-linear; therefore, a closed-form solution for W that minimizes it cannot be obtained. So we resort to a gradient descent scheme for minimizing J. Differentiating Eq. (4) with respect to W we obtain

∂J/∂W = −(1/(√(2π) σ)) Σ_{i=1}^{c} Σ_{j=i+1}^{c} exp(−d̂_ij²/(8σ²)) ∂d̂_ij/∂W.   (5)
Denote v^(ij) ≡ μ^(i) − μ^(j), so that

d̂_ij = ||W^T v^(ij)||,   (6)

whereby

∂d̂_ij/∂W = (1/(2 d̂_ij)) v^(ij) (v^(ij))^T W.   (7)

Combining Eqs. (5) and (7) we obtain
∂J/∂W = −(1/(2√(2π) σ)) [Σ_{i=1}^{c} Σ_{j=i+1}^{c} (1/d̂_ij) exp(−d̂_ij²/(8σ²)) v^(ij) (v^(ij))^T] W   (8)

and the new W is given by

W_new = W_old − η (∂J/∂W),   (9)
where η is the step size. Before repeating this step to reduce J further, we note that there is no constraint to ensure that the updates will result in an orthonormal transformation matrix W. However, it is important that each iteration be started off with an orthonormal W, because our assumption that the class-specific covariance matrices remain fixed (Σ̂ = σ²I) is dependent upon it. This may be achieved by the following procedure. After each update, we find an orthonormal basis for the subspace spanned by the columns of W and construct a new orthonormal W by using those basis vectors for its columns. Since selection of an orthonormal basis does not change the subspace, the procedure leaves the cost J unaffected. We use this W to start off the next iteration. The adaptive subspace theorem [3] ensures that the subspace spanned by the columns of W will always converge, provided that η is sufficiently small. We could also add penalty terms to the cost function that penalize J when W deviates from orthonormality, i.e.

J̃ = J + λ_1 Σ_{i=1}^{m} Σ_{j≠i}^{m} (W^(i))^T W^(j) + λ_2 Σ_{i=1}^{m} (||W^(i)|| − 1)².   (10)

Here W^(i) is the ith column of W. λ_1 controls the emphasis on orthogonality and λ_2 controls the emphasis on normality of the columns of W. These penalty terms are at best soft constraints and do not guarantee that orthonormality is achieved. Therefore, the orthonormalization procedure is always required. In our simulations we have not used such a term because of the added complications in appropriately determining λ_1 and λ_2. The final update rule for W therefore is
W_new = W_old + η [Σ_{i=1}^{c} Σ_{j=1, j≠i}^{c} (1/d̂_ij) exp(−d̂_ij²/(8σ²)) v^(ij) (v^(ij))^T] W_old,   (11)

where the constant factor 1/(2√(2π)σ) is absorbed into η. Each update of W using Eq. (11) rotates the subspace so that d̂_ij, the length of the projection of v^(ij) on the subspace, is increased unless v^(ij) ⊥ W. It may also be noted that it is possible to obtain at most (c − 1) linear features with this technique, because the sum of the outer products v^(ij) (v^(ij))^T has a rank of at most (c − 1).

The procedure for computing the optimal transformation W : R^n → R^m (m ≤ n) may be summed up in the following steps. The available information includes the class centers μ^(i), i = 1, …, c, and the value of σ².

1. Generate an n×m matrix W with random entries.
2. Repeat the following sequence until J no longer decreases:
(a) Select an orthonormal basis for the subspace spanned by the columns of W. This may be done by using the Gram-Schmidt orthonormalization procedure.
(b) Project the class centers in the input space (μ^(i)) onto the output space. The class centers in the output space are given by μ̂^(i) = W^T μ^(i).
(c) Compute the cost J using Eq. (4).
(d) Back in the input space, find the vector differences v^(ij) ≡ μ^(i) − μ^(j) of the class means.
(e) Use Eq. (11) to update W.

As the algorithm proceeds, the cost J typically decreases rapidly at first and then slowly as J reaches its steady-state value. The decrease in cost is monotonic, with occasional small upward jumps that are a consequence of finite step size. The algorithm is terminated when the cost J reaches a steady-state value. Getting stuck in local minima is a standard weakness of gradient descent algorithms; however, our experiments indicate that the algorithm is fairly robust to local minima. Nevertheless, it is necessary to run the algorithm a few times with different randomly chosen initial conditions (i.e., starting W) and select the final W from the run that gave the lowest J. Also, one can apply various heuristics that have been traditionally applied to gradient descent algorithms to improve their convergence properties. Eq. (11) thus provides an adaptive rule that ensures that, when the pattern distribution matches the assumed distribution, a near-optimal subspace representation is achieved. When the assumed and true pattern distributions differ, the representation is less than optimal, and we present a simple extension in the next section.
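The complete procedure is compact enough to sketch (assuming numpy; the function name is ours, and QR factorization stands in for the Gram-Schmidt step, since both return an orthonormal basis of the column space):

```python
import numpy as np

def adaptive_subspace(mu, sigma, m, eta=0.05, iters=500, seed=0):
    """mu is a (c, n) array of class means, sigma the common within-class
    standard deviation, m the target dimension. Returns an orthonormal
    (n, m) matrix W found by gradient descent on the cost J of Eq. (4),
    using the update rule of Eq. (11). Constant factors (and the pair
    double-counting of Eq. (11)) are absorbed into eta."""
    rng = np.random.default_rng(seed)
    c, n = mu.shape
    W = rng.standard_normal((n, m))            # step 1: random initialization
    for _ in range(iters):
        W, _ = np.linalg.qr(W)                 # step 2(a): orthonormal basis
        G = np.zeros((n, n))
        for i in range(c):
            for j in range(i + 1, c):
                v = mu[i] - mu[j]                          # step 2(d)
                d = max(np.linalg.norm(W.T @ v), 1e-12)    # projected distance
                G += np.exp(-d ** 2 / (8 * sigma ** 2)) / d * np.outer(v, v)
        W = W + eta * G @ W                    # step 2(e): update of Eq. (11)
    return np.linalg.qr(W)[0]
```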
3. Extension to other distributions

When the data are such that assumptions (A1) and (A2) are not satisfied, we model the density function using a kernel estimator as a superposition of Gaussians. A Gaussian with a covariance matrix σ²I, where σ is the kernel width, is placed centered at each training point of that class. The modified cost function is then

J = Σ_{i=1}^{q} Σ_{j=i+1}^{q} (1 − δ_ij) (1/(√(2π) σ)) ∫_{d̂_ij/2}^{∞} exp(−t²/(2σ²)) dt,   (12)

where we have q training points x^(i), i = 1, …, q, from c (c < q) classes, d̂_ij is the distance between two samples with indices i and j, and δ_ij is 1 when patterns i and j are of the same class. The modified update equation is then given by

∂J/∂W = −(1/(2√(2π) σ)) [Σ_{i=1}^{q} Σ_{j=i+1}^{q} (1 − δ_ij) (1/d̂_ij) exp(−d̂_ij²/(8σ²)) v^(ij) (v^(ij))^T] W,   (13)
where v^(ij) is the corresponding vector difference of the two samples in the input space.
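The change relative to the basic algorithm is confined to the gradient: class means are replaced by individual training samples and same-class pairs are skipped. A sketch (names are ours):

```python
import numpy as np

def extended_gradient(X, y, W, sigma):
    """Ascent direction corresponding to Eq. (13): the sum runs over all
    pairs of training samples belonging to different classes. Use as
    W_new = W_old + eta * G @ W_old, followed by re-orthonormalization."""
    n = X.shape[1]
    G = np.zeros((n, n))
    for i in range(len(X)):
        for j in range(i + 1, len(X)):
            if y[i] == y[j]:
                continue                   # the (1 - delta_ij) factor
            v = X[i] - X[j]
            d = max(np.linalg.norm(W.T @ v), 1e-12)
            G += np.exp(-d ** 2 / (8 * sigma ** 2)) / d * np.outer(v, v)
    return G
```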
The issue here is the choice of the width σ of the kernel estimator. A kernel width that is optimal in the sense of having the minimum mean-square error in its estimate of the density may not be optimal from the view of classification performance. This is because most of the misclassification typically occurs at the tails of the distributions, where sample densities are low. Also, the surface of the cost function will be riddled with local minima, particularly for small σ. We therefore suggest that W be first computed using only the class centers, and that this procedure be subsequently used to 'fine-tune' W using a small learning rate. σ may be chosen by trial and error; the mean within-class intersample distance in regions where misclassification is observed can be used as a guide for the initial choice. When the data satisfy assumptions (A1) and (A2), use of this extension will most likely result in increased classification error. There are two contributing factors: one is the error in the estimate of the density function, and the other is that the pairwise error becomes increasingly less representative of the Bayes error as the number of kernels grows.
4. Simulations

The algorithm was applied to four data sets, three of which are synthetic and one of which is a real data set. For the real data set, a k-NN (nearest neighbor) classifier was used to classify the data projected onto a lower dimension space. The value of k was selected by using leave-one-out cross-validation. To put the performance of the proposed dimensionality reduction technique into perspective, we include the results obtained with Fisher's linear discriminant (FLD) [8]. We also refer to the algorithm of Section 2 as the adaptive algorithm and to the extension of the algorithm using kernel estimators as the extended adaptive algorithm. In the simulations, the columns of W were randomly initialized, corresponding to a random orientation in the m-dimensional space. Typically, the minimum value of J is reached in a few hundred iterations. The initial learning rate was chosen as η = 0.05. While we did not do so, η may be adapted to obtain faster convergence. We performed five repeat trials with different initial conditions (for W) for each case. The 'best' W among all five runs is retained as the end result. Here 'best' implies the W that resulted in the lowest J. For Simulation 2 (which used the image segmentation data set [9]) we also report the results obtained in each of the five runs to provide an indication of the variability introduced through different initial conditions.

4.1. Simulation 1

This simulation is intended to serve as an illustrative example. The data consist of six classes in
three-dimensional space. Each class has a Gaussian distribution with a covariance matrix of 0.04I, where I is the identity matrix. There are 50 samples per class. The six class centers are at c_1 = [−3 0 0]^T, c_2 = [3 0 0]^T, c_3 = [0 −2 0]^T, c_4 = [0 2 0]^T, c_5 = [0 0 −1]^T, c_6 = [0 0 1]^T. The mean of the data is the origin. Quite clearly, FLD would produce the three projection directions [1 0 0]^T, [0 1 0]^T, and [0 0 1]^T, in decreasing order of importance. Now if we wished to reduce the data to two dimensions, class 5 and class 6 would be projected to the same region in the two-dimensional space. This is what happens, as shown in Fig. 1. In contrast, the projection obtained with the adaptive technique is also shown in Fig. 1 and is superior to that obtained using FLD in that the six classes are well separated.
4.2. Simulation 2

For this simulation, we used the image segmentation data set available from the UCI machine learning repository [9]. The data contain instances drawn randomly from a database of seven outdoor images. Nineteen continuous-valued features are extracted for each 3×3 region, and the class label is obtained by manual segmentation. Attribute 3 is a constant by definition and was deleted from the data set, leaving 18 features. The classes are brickface, sky, foliage, cement, window, path, and grass. There are a total of 210 training patterns and 2100 testing patterns, with 30 patterns for each of the seven classes in the training data and 300 patterns per class in the test data. The only preprocessing used was to whiten the data set. The whitening transformation, when applied to a data set consisting of c classes {C_i : i = 1, …, c}, converts the within-class covariance matrix

S_w = Σ_{i=1}^{c} Σ_{x^(j) ∈ C_i} (x^(j) − μ^(i)) (x^(j) − μ^(i))^T   (14)

into an identity matrix. This transformation is necessary because of our assumption that the individual-class covariance matrices are identity.
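The whitening step can be realized by mapping the data through S_w^{-1/2} (a sketch, assuming S_w is nonsingular; the function name is ours):

```python
import numpy as np

def whiten(X, y):
    """Transform the patterns so that the within-class scatter matrix of
    Eq. (14) becomes the identity; returns the whitened data and the map."""
    n = X.shape[1]
    Sw = np.zeros((n, n))
    for c in np.unique(y):
        Xc = X[y == c]
        Sw += (Xc - Xc.mean(axis=0)).T @ (Xc - Xc.mean(axis=0))
    vals, vecs = np.linalg.eigh(Sw)
    A = vecs @ np.diag(vals ** -0.5) @ vecs.T   # symmetric S_w^{-1/2}
    return X @ A, A                             # A^T S_w A = I
```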
Fig. 1. The lower dimensional representation for the data set of Simulation 1 as obtained with FLD (left) and the adaptive algorithm (right).
Table 1
Classification accuracy for the image segmentation data set over the five runs with the adaptive algorithm. The optimal value of k as determined using cross-validation is identified as k_opt

                     Trial #1    Trial #2    Trial #3    Trial #4    Trial #5
J at convergence     0.0019292   0.0019297   0.0019371   0.0019382   0.0019222
Training accuracy    88.9        89.5        88.6        88.5        88.5
Testing accuracy     88.5        88.9        88.8        89.0        88.8
k_opt                5           3           5           5           5
Fig. 2. The lower dimensional representation for the image segmentation data set as obtained with the adaptive algorithm in the first four trials reported in Table 1.

Table 2
Classification accuracy for the image segmentation data set with FLD and the adaptive algorithm. m denotes the dimensionality of the subspace

                           m = 2               m = 3
                           FLD     Adaptive    FLD     Adaptive
    Training accuracy      77.6    88.5        85.7    91.4
    Testing accuracy       73.7    88.8        86.8    90.4
    k_opt                  7       5           3       1
This transformation is necessary because of our assumption that the individual-class covariance matrices are identity. Unless all classes have identical covariance matrices, the individual-class covariance matrices will not be identity after whitening; the average class-conditional covariance matrix, however, will be. The algorithm nevertheless performs fairly well. Table 1 lists the training accuracy, testing accuracy, the final value of J, and the optimal value of k (as determined through cross-validation) obtained in each of the five runs when the dimensionality was reduced to 2. One may observe that, since the data distribution does not satisfy
Fig. 3. The lower dimensional representation for the image segmentation data set as obtained with FLD (left) and with the "best" W (Trial #5 of Table 1) obtained with the adaptive algorithm.
Fig. 4. The lower dimensional representation for the modified image segmentation data set. There are a total of seven classes. The lower dimensional representation as obtained with FLD (left) and the adaptive technique (right).
the assumptions made by the adaptive algorithm, the "best" W (the one that led to the lowest J) does not minimize the classification error. The variability, however, is small. For this simulation the "best" W corresponds to Trial #5 in Table 1.

Fig. 2 shows the representations obtained in the first four trials. Fig. 3 shows the representation obtained in the fifth trial (corresponding to the "best" W) along with the representation obtained using FLD.
Fig. 5. The original data set (top left); the reduced-dimension representation obtained with FLD (top right), the proposed adaptive algorithm (bottom left), and the extended adaptive algorithm (bottom right).
Table 2 compares the classification accuracy obtained using FLD and the adaptive algorithm for subspace dimensionalities of 2 and 3. The results for m = 2 correspond to Trial #5 in Table 1.

4.3. Simulation 3

In Simulation 2, the data did not satisfy the assumption that each class has a Gaussian pdf with an identity covariance matrix. In this simulation we take the means of the seven classes of the data set of Simulation 2 and artificially generate samples around each mean, so that each class has a Gaussian distribution with covariance matrix 0.04·I. Four hundred patterns are generated per class. After computing the transformation matrix W, the patterns and their means are projected into the lower-dimensional subspace. Classification consists of assigning to each pattern the class label of the true mean that is closest to it (a short sketch of this rule is given at the end of this section). Fig. 4 shows the projected patterns for the case of m = 2 dimensions. It may be observed that there is considerable overlap in five of the seven classes in the reduced-dimension representation obtained using FLD. In contrast, the adaptive algorithm allows clear separation of all seven classes. In our plots, for the sake of clarity, we display only a subset of 50 points randomly chosen from each class.

4.4. Simulation 4

This simulation considers the modification of the algorithm for arbitrary distributions, specifically a distribution which satisfies neither the assumption of equal a priori probabilities nor that of class covariance matrices which are scalar multiples of the identity matrix. For visualization purposes, we start with a data set which has three classes in two dimensions (see Fig. 5). Two of the classes have identity covariance matrices and equal a priori probabilities; the third class has a non-identity covariance matrix and an a priori probability three times that of either of the other two classes. The centers of the three classes are uniformly spaced, and under the assumption of equal a priori probabilities and class covariance matrices that are multiples of the identity, the W vector (for projection to one dimension) is the -45° line. The histogram of the projection obtained by reducing the dimensionality from 2 to 1 is shown in Fig. 5. With FLD, the projection shows significant overlap, while that obtained with the adaptive algorithm has minimal overlap. When the suggested modification for arbitrary distributions (i.e., the extended adaptive algorithm) is used, with p = 0.5, the W vector is rotated clockwise by a few degrees, increasing the spacing between the centers of class 1 and class 3 and reducing the spacing between the centers of class 1 and class 2. This results in a further decrease of the total overlap.
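For reference, the classification rule used in Simulation 3 (each projected pattern receives the label of the closest projected true mean) can be written as the following sketch; the transformation matrix W is assumed to have been computed beforehand by the adaptive algorithm, and the names are ours.

    import numpy as np

    def classify_nearest_mean(X, means, W):
        Z = X @ W                                   # projected patterns
        M = means @ W                               # projected true class means
        d2 = ((Z[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)                    # index of the closest projected mean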
5. Conclusions

In this paper we introduced an approach for finding a subspace of a priori chosen dimensionality in which the Bayes error is minimized when the patterns follow a specified distribution. We also extended the algorithm to make it applicable to more general pattern distributions. The adaptive algorithm performed quite well even when the pattern distribution did not agree with the assumed one. Additional improvement in performance was always obtained with the extended adaptive algorithm, though the amount of improvement may vary with the problem.

Computationally, both the adaptive algorithm and the extended adaptive algorithm are more expensive than
FLD. Both algorithms are based on a gradient-descent procedure, making it difficult to predict the time to convergence; in our simulations, however, the minimum J was found in fewer than 500 iterations. In practical terms, the adaptive algorithm took about a minute to converge for the image segmentation data set used in this paper, while the extended adaptive algorithm took about 5 min to converge for the data set of Simulation 4 on a SPARCstation 5.

Further insight may be obtained by observing that the weight update equation in the adaptive algorithm is based on c(c-1)/2 summations, where c is the total number of classes. This dependence on the number of classes (as opposed to the number of training points) keeps the algorithm computationally efficient. In the extended adaptive algorithm, however, the number of kernels we chose was equal to the number of training points, resulting in considerably slower performance. For large problems, it may be worthwhile to reduce the number of kernels used in the kernel estimator.

The Bayes error as an optimization criterion, as used in this paper, is superior to criteria based on first- and second-order statistics of the data, which are not directly related to classification accuracy.
References

[1] I.T. Jolliffe, Principal Component Analysis, Springer, New York, 1986.
[2] J.W. Sammon, A non-linear mapping algorithm for data structure analysis, IEEE Trans. Comput. 19 (1969) 401–409.
[3] T. Kohonen, Self-Organizing Maps, Springer, Berlin, 1997.
[4] P. Demartines, J. Herault, Curvilinear component analysis: a self-organizing neural network for nonlinear mapping of data sets, IEEE Trans. Neural Networks 8 (1997) 1197–1206.
[5] J. Mao, A.K. Jain, Artificial neural networks for feature extraction and multivariate data projection, IEEE Trans. Neural Networks 7 (1995) 296–317.
[6] L.J. Buturović, Towards Bayes-optimal linear dimension reduction, IEEE Trans. Pattern Anal. Mach. Intell. 16 (1994) 420–424.
[7] C. Lee, D. Landgrebe, Feature extraction based on decision boundaries, IEEE Trans. Pattern Anal. Mach. Intell. 15 (1993) 388–400.
[8] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, New York, 1990.
[9] C.J. Merz, P.M. Murphy, UCI Repository of Machine Learning Databases, http://www.ics.uci.edu/~mlearn/MLRepository.html, University of California, Department of Information and Computer Science, Irvine, CA, 1996.
About the Author: ROHIT LOTLIKAR received the B.Tech (1993) and M.Tech (1995) degrees in Electrical Engineering from the Indian Institute of Technology, Bombay, India. In 1995 he was a senior research assistant at the Center for Studies in Resources Engineering, Bombay, India. Since 1995 he has been a Ph.D. student at the University of Cincinnati. His current research interests include pattern recognition, neural networks, computer vision, and image analysis. His current research focuses on dimensionality reduction techniques for pattern classification.

About the Author: RAVI KOTHARI received his B.E. degree (with Distinction) in 1986 from Birla Institute of Technology (India), his M.S. from Louisiana State University in 1988, and his Ph.D. from West Virginia University in 1991, all in Electrical Engineering. He is currently an Associate Professor in the Department of Electrical and Computer Engineering and Computer Science at the University of Cincinnati and Director of the Artificial Neural Systems Laboratory there. His research interests include artificial neural networks, pattern recognition, and image analysis. He received the Eta Kappa Nu Outstanding Professor of the Year Award in 1995, and the William E. Restemeyer Teaching Excellence Award in 1994 from the Department of Electrical and Computer Engineering and Computer Science at the University of Cincinnati. Dr. Kothari serves on the Editorial Board of the Pattern Analysis and Applications Journal (Springer-Verlag), and is a member of the IEEE, the International Neural Network Society, and the Eta Kappa Nu and Phi Kappa Phi honor societies.
Pattern Recognition 33 (2000) 195–208

Skew detection and reconstruction based on maximization of variance of transition-counts

Yi-Kai Chen^a, Jhing-Fa Wang^{a,b,*}
^a Institute of Information Engineering, National Cheng Kung University, Tainan, Taiwan, Republic of China
^b Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan, Republic of China

Received 31 March 1997; received in revised form 30 April 1998; accepted 12 January 1999
Abstract

Input document images with skew can be a serious problem for optical character recognition systems. This paper proposes a robust method for skew detection and reconstruction in document images which may contain sparse text areas, heavy noise, tables, figures, flow-charts, and photos. The basic idea of our approach is the maximization of the variance of transition counts for skew detection and text-orientation determination. Once the skew angle is determined, the scanning-line model is applied to reconstruct the skewed image. 103 documents of great variety have been tested and successfully processed. The average detection time for an A4-size image is 4.86 s and the reconstruction time is 5.52 s. The proposed approach is also compared with existing algorithms published in the literature, and our method obtains significant improvements in skew detection and reconstruction. © 1999 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Skew detection; Skew reconstruction; Scanning line; Transition-counts; Transition-counts variance; SNR; Scanning-line model
1. Introduction

Document image processing by computer has become more and more useful in many kinds of applications. When a document image is scanned, it may be skewed for various reasons. A skewed image causes serious problems in document analysis, such as incorrect character segmentation or recognition. To solve these problems, skew detection and reconstruction should be performed before document analysis.

The Hough transform is a common technique for detecting the skew angle of a document image [2-5]. It is well known that the Hough transform is time consuming; moreover, methods based on it are sensitive to non-text areas and noise, so if the input document contains too many non-text areas or too much noise, their results will be unpredictable. Some other methods are based on the features of local text areas [6-8]. Baird [1] proposed an algorithm that determines the skew angle using an energy function on sets of projection counts of character locations. These methods are likewise sensitive to non-text areas and may fail if a local region of text cannot be found. Another class of methods uses connected-component analysis and line fitting to detect the skew angle [9,10]; these are also time consuming and may incur large errors. A few other methods are based on mathematical theory [11,12], but they are complex and inefficient.

In this paper, a new approach based on the maximization of the variance of transition counts is used to detect the skew angle and reconstruct the skewed image efficiently. With our method, the document image may contain sparse text areas, heavy noise, tables, figures, flow-charts, and photos. In addition, the page orientation (column-wise or row-wise) can also be detected.

* Corresponding author. Tel.: +886-6-2746867; fax: +886-6-2746867. E-mail address: [email protected] (J.-F. Wang)
The approach in this paper consists of two main parts:

1. maximization of the variance of transition counts for skew detection and page orientation determination, and
2. skew reconstruction using the scanning-line model.

The proposed algorithm is described in detail below.
Fig. 2. An example of computing transition-counts.
2. Maximization of variance of transition-counts for skew detection and page orientation determination
2.1. Scanning-line model

Any digital straight line of slope θ = tan^{-1}(A/B) can be expressed approximately [13] by a chain code of the form PP...QPP...Q = (P^m Q)^n, where

    P = NINT(A/B),
    Q = P + 1  if A - B×P > 0,
    Q = P - 1  if A - B×P < 0,
    m = NINT(B / |A - B×P|) - 1,
    n = |A - B×P|,

and "NINT" stands for "nearest integer". An example is shown in Fig. 1: for a straight line of 30° skew with A = 256 and B = 148, the modeling method above yields the chain code (2^3 1)^40.

Fig. 1. An example of a digitized line with 30° skew.

2.2. Maximization of variance of transition-counts for skew detection

The transition count (TC) of a scanning line is defined as the number of transitions (pixel changes from black to white or from white to black) along the line; an example is shown in Fig. 2. The transition-counts variance (TCV) at each angle from -45° to +45° can then be computed by the following equations (subscripts h and v denote the horizontal and vertical orientations, respectively):

    TCV_h[θ] = (1/H) Σ_{i=0}^{H-1} (TC_h[θ][i] - TM_h[θ])²,
    TCV_v[θ] = (1/W) Σ_{j=0}^{W-1} (TC_v[θ][j] - TM_v[θ])²,

where

    TM_h[θ] = (1/H) Σ_{i=0}^{H-1} TC_h[θ][i],
    TM_v[θ] = (1/W) Σ_{j=0}^{W-1} TC_v[θ][j],

TCV_h[θ] and TCV_v[θ] are the horizontal and vertical transition-counts variances at θ°, TC_h[θ][i] is the horizontal transition count of the i-th row at θ°, TC_v[θ][j] is the vertical transition count of the j-th column at θ°, TM_h[θ] and TM_v[θ] are the horizontal and vertical transition-counts means at θ°, H is the image height, W is the image width, and θ = -45°, ..., +45°.

We first choose a strip of 256 pixels in the horizontal direction at the middle of the document, as well as a strip of the same width in the vertical direction, and compute the total transition count in each strip. If the total count exceeds an experimentally determined threshold, the strips are regarded as containing enough text and are used to compute the transition-counts variance for skew detection; otherwise, the strips are shifted in the horizontal and vertical directions until they are confirmed to contain enough text.

After the horizontal and vertical TCV at each angle have been computed, the maximum TCV in the horizontal and vertical directions are labeled TCV_max_h and TCV_max_v, respectively. If TCV_max_h is larger than TCV_max_v, the text is regarded as row-wise and the skew angle is the angle attaining TCV_max_h; otherwise, the text is column-wise and the skew angle is the angle attaining TCV_max_v.
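Both halves of this section are easy to make concrete. The sketch below derives the chain-code parameters P, Q, m, n of Section 2.1 (reproducing the (2^3 1)^40 example) and evaluates the transition counts and their variance for horizontal scanning lines; scanning at a candidate angle θ would follow the digital line (P^m Q)^n instead of the image rows, and the function names are ours.

    import numpy as np

    def chain_code_params(A, B):
        # digital line of slope tan^-1(A/B) as the chain code (P^m Q)^n  [13]
        P = int(round(A / B))                       # NINT(A/B)
        Q = P + 1 if A - B * P > 0 else P - 1
        n = abs(A - B * P)
        m = (int(round(B / n)) - 1) if n else 0     # NINT(B/|A - B*P|) - 1
        return P, Q, m, n

    print(chain_code_params(256, 148))              # -> (2, 1, 3, 40), i.e. (2^3 1)^40

    def tcv(img):
        # transition counts along each horizontal scanning line of a 0/1 image,
        # and their variance TCV_h (the theta = 0 case of the equations above)
        tc = np.abs(np.diff(img.astype(int), axis=1)).sum(axis=1)
        return ((tc - tc.mean()) ** 2).mean()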
Fig. 4. The transition-counts variance of each angle in horizontal direction.
Fig. 5. The transition-counts variance of each angle in vertical direction.

Table 1
Experimental results of skew detection on documents with different SNR

                Image with SNR < 1    Image with SNR >= 1
    Success     2                     14
    Failure     5                     0

Fig. 3. The test document image with +19° skew.

Fig. 6. (a) A skew document with SNR = 5.7; (b) the image reconstructed by our approach.
Fig. 7. (a) A skew document with SNR = 1.2; (b) the image reconstructed by our approach.

Fig. 8. (a) A skew document with SNR = 0.8; (b) the image reconstructed by our approach.
In this way, the skew angle and the page orientation can be detected at the same time.

The basic idea of using the transition-counts variance to detect the skew angle is that the peaks of the transition-counts histogram appear periodically at the skew angle, while the transition-counts histograms at other angles look flat. In other words, the transition-counts variance is maximal at the skew angle, and we can therefore apply this concept to detect the skew angle and the page orientation.
For example, the transition-counts variance of the image in Fig. 3 is plotted in Fig. 4 (horizontal) and Fig. 5 (vertical). It can be seen that TCV_max_h is 16.2 at the angle +19°, while TCV_max_v is 5.9 at the angle -19°. The skew angle of the image in Fig. 3 is therefore +19° and the page orientation is row-wise.

We also ran experiments on 21 noisy documents with known SNR. The signal-to-noise ratio (SNR) in this paper is defined as

    SNR = (transition count of the original image) / (transition count of the noise).

Fig. 6 is a document with SNR = 5.7 and Fig. 7 is another with SNR = 1.2. Table 1 lists our experimental results, which justify the noise robustness of our algorithm. Since most noise is random, its influence on the transition-counts variance is almost the same at every angle, so the maximum of the transition-counts variance remains at the skew angle. According to our experiments, if the SNR of the document is greater than 1, our algorithm still works properly; if the SNR is less than 1, some documents are detected successfully and some are not. Fig. 8 shows a document with SNR = 0.8 for which our algorithm still works well.
Fig. 9. (a) A row-wise document with -13° skew; (b) the document reconstructed by our approach; (c) the document reconstructed by PhotoStyler.
3. Skew reconstruction using the scanning-line model

After the skew angle has been detected by the method described above, the scanning-line model at the skew angle is used to calculate the vertical and horizontal offsets for skew reconstruction. Using the chain code of the scanning line at the skew angle, with the parameters P, Q, m, and n defined in Section 2, the vertical offset Y_shift_j of the j-th column is computed as

    Y_shift_j = b×(m+1) + ⌊r/P⌋,  if ⌊r/P⌋ < m,
    Y_shift_j = b×(m+1) + m,      if ⌊r/P⌋ >= m,

where b is the quotient of (j-1)/(m×P+Q), r = (j-1) mod (m×P+Q), and ⌊x⌋ stands for the greatest integer not greater than x. In the same manner, the horizontal offset X_shift_i of the i-th row is computed as

    X_shift_i = b×(m+1) + ⌊r/P⌋,  if ⌊r/P⌋ < m,
    X_shift_i = b×(m+1) + m,      if ⌊r/P⌋ >= m,

where b is the quotient of (i-1)/(m×P+Q) and r = (i-1) mod (m×P+Q).
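A direct transcription of these offset formulas; the function name is ours, the same routine yields X_shift_i when called with a row index i, and it assumes P >= 1 (i.e., the skew is large enough that NINT(A/B) is at least 1).

    def y_shift(j, P, Q, m):
        # vertical offset of the j-th column; the pattern repeats with period m*P + Q
        period = m * P + Q
        b, r = divmod(j - 1, period)                # b = quotient, r = remainder
        step = r // P                               # greatest integer <= r/P
        return b * (m + 1) + min(step, m)           # the two-case formula above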
Fig. 10. (a) A column-wise document with +7° skew; (b) the document reconstructed by our approach; (c) the document reconstructed by PhotoStyler.
Table 2
Some classifications of our test samples

                                       Chinese  English  Chinese/English mixed  Japanese  Total
    Row-wise mode                      31       32       5                      6         74
    Column-wise mode                   25       0        0                      4         29
    Document with noises               17       14       1                      0         32
    Document with large skew angle     8        7        1                      0         16
    Document with less text area       8        10       0                      2         20
    Tabular form                       9        0        2                      0         11
Table 3
Comparison of performance

    Paper     Platform          Range of          Skew       Skew          Handles less-text,   Text          Algorithm
    no.                         detected angle    detection  reconstr.     high-noise, tabular,  orientation   complexity
                                                  time       time          hand-written docs     determination analysis
    [2]       Sun 4-280         N.A.              67 s       N.A.          Fair                  Yes           O(N³)
    [3]       DELL 486D/50 PC   +15° to -15°      3.8 s      N.A.          Fair                  Yes           O(N³)
    [4]       MC68020           +20° to -20°      14.5 s     11.9 s        Fair                  No            O(N³)
    [5]       486DX2-66         +45° to -45°      3.5 s      N.A.          Fair                  No            O(N³)
    [7]       Sun SPARC-2       +30° to -30°      1.4 s      N.A.          Fair                  No            O(N²)
    [8]       HP 9000/720       +15° to -15°      0.32 s     N.A.          Fair                  No            N.A.
    [11]      Sun SPARC-10      +5° to -5°        10 s       N.A.          Fair                  No            N.A.
    [14]      Pentium 133       +11° to -11°      ~real time ~real time    Fair                  No            N.A.
    Ours      Pentium 133       +45° to -45°,     4.86 s     5.52 s        Good                  Yes           O(N²)
                                horiz. and vert.

N.A. = not available.
After skew reconstruction, the reconstructed image becomes larger than the original image. The new width and height of the reconstructed image are

    new_width  = old_width  + X_shift_{old_height},
    new_height = old_height + Y_shift_{old_width},

where X_shift_{old_height} and Y_shift_{old_width} are the horizontal and vertical offsets at the positions old_height and old_width, respectively. The pixel at the i-th row and j-th column of a skewed document with positive angle is then mapped to the (i + Y_shift_j)-th row and the (j - X_shift_i + X_shift_{old_height})-th column of the reconstructed image. For a document with negative angle, it is instead mapped to the (i - Y_shift_j + Y_shift_{old_width})-th row and the (j - X_shift_i + X_shift_{old_height})-th column.
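Putting the offsets together, a sketch of the mapping for a document with a positive skew angle, reusing the y_shift helper from the previous sketch (for a negative angle the row-index formula above would be used instead):

    import numpy as np

    def deskew_positive(img, P, Q, m):
        H, W = img.shape
        xs_last = y_shift(H, P, Q, m)               # X_shift at row old_height
        ys_last = y_shift(W, P, Q, m)               # Y_shift at column old_width
        out = np.zeros((H + ys_last, W + xs_last), dtype=img.dtype)
        for i in range(1, H + 1):
            xs = y_shift(i, P, Q, m)                # horizontal offset X_shift_i
            for j in range(1, W + 1):
                ys = y_shift(j, P, Q, m)            # vertical offset Y_shift_j
                out[i + ys - 1, j - xs + xs_last - 1] = img[i - 1, j - 1]
        return out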
Here we compare our approach with PhotoStyler, a well-known image-processing package. The examples in Figs. 9b and 10b are reconstructed by our approach, while Figs. 9c and 10c are processed by PhotoStyler. The average processing time of our approach for an A4-size document is 5.52 s, whereas the average processing time of PhotoStyler is 67 s. We thus save much processing time without reducing the quality of the reconstructed image.
4. Testing and results

The algorithm described above was implemented on a PC (Pentium 133). The program was tested on 103 document images scanned from magazines, newspapers, manuals, and other sources. Some classifications of these documents are shown in Table 2. All of the document
Fig. 11. (a) A hand-written document with +7° skew; (b) the document reconstructed by our approach.
images were scanned in 300 dpi binary format. The range of skew angles is ±45° in both the horizontal and vertical orientations. Some of the test samples are shown in Figs. 6-11. The average processing time is 4.86 s for skew detection and 5.52 s for reconstruction. Our approach has the advantage of handling large skew angles and successfully processes documents with sparse text areas, heavy noise, tabular forms, and hand-written characters.

Table 3 compares the performance of our method with others. In Table 3, papers [2-4] are based on the Hough transform; non-text areas and noise in the document may cause them to err or fail. Paper [5] can also process large skew angles with very fast skew detection, but the method requires left-aligned documents, a limitation our approach does not share. The algorithms of papers [7] and [8] are mainly based on analysis of text areas, so non-text areas and noise may again cause errors or failure. Paper [11] is based on mathematical theory and is too time consuming. No. [14] is a commercial software package whose processing time is almost
real time, but 53 of our 103 test samples failed with it, whereas only 5 documents failed with our approach. Table 3 also lists the complexity of each algorithm. In summary, the overall performance of our approach is superior.
5. Conclusion

The algorithms for skew and page-orientation detection are based on the maximization of the variance of transition counts, and the skew reconstruction approach is derived from scanning-line modeling. The algorithms in this paper are reliable, robust, and fast, and are insensitive to noise and non-text areas in the skewed document image. Experiments and tests on documents of wide variety have given good results, showing that our approach achieves good performance and efficiency in skew detection, page orientation determination, and skew reconstruction. In the future, we will try to apply these concepts to color document images.

References

[1] H.S. Baird, The skew angle of printed documents, Proceedings of SPIE Symposium on Hybrid Imaging Systems, Rochester, NY, 1987, pp. 21–24.
[2] S.C. Hinds, J.L. Fisher, D.P. D'Amato, A document skew detection method using run-length encoding and the Hough transform, Proceedings of the 10th International Conference on Pattern Recognition, 1990, pp. 464–468.
[3] D.S. Le, G.R. Thoma, H. Wechsler, Automated page orientation and skew angle detection for binary document images, Pattern Recognition, 1994, pp. 1325–1344.
[4] Y. Nakano, Y. Shima, H. Fujisawa, J. Higashino, M. Fujiwara, An algorithm for the skew normalization of document image, Proceedings of the 10th International Conference on Pattern Recognition, vol. 2, 1986, pp. 8–13.
[5] H.F. Jiang, C.C. Han, K.C. Fan, A fast approach to detect and correct the skew document, OCRDA, 1996, pp. 67–68.
[6] H. Yan, Skew correction of document images using interline cross-correlation, CVGIP, 1993, pp. 538–543.
[7] K. Toshiba-cho, S. Ku, Document skew detection based on local region complexity, Proc. IEEE, 1993, pp. 125–132.
[8] R. Smith, A simple and efficient skew detection algorithm via text row accumulation, ICDAR, 1995, pp. 1145–1148.
[9] F. Hones, J. Licher, Layout extraction of mixed mode document, Machine Vision Appl., 1994, pp. 237–246.
[10] C.L. Yu, Y.Y. Tang, C.Y. Suen, Document skew detection based on the fractal and least squares method, ICDAR, 1995, pp. 1149–1152.
[11] S. Chen, R.M. Haralick, I.T. Phillips, Automatic text skew estimation in document image, ICDAR, 1995, pp. 1153–1156.
[12] H.K. Aghajan, B.H. Khalaj, T. Kailath, Estimation of skew angle in text-image analysis by SLIDE: subspace-based line detection.
[13] S.-X. Li, M.H. Loew, Analysis and modeling of digitized straight-line segments, Proc. IEEE, 1988, pp. 294–296.
[14] Sequoia Data Corporation, ScanFix Image Optimizer, V2.30 for Windows.
About the Author: JHING-FA WANG received the Ph.D. degree in electrical engineering and computer science from Stevens Institute of Technology, Hoboken, in 1983. He is an IEEE fellow and was elected general chairman of the Chinese Image Processing and Pattern Recognition Society in 1993. He was the director of the Institute of Information Engineering at National Cheng Kung University from 1990 to 1996. At present he is a professor in the Department of Electrical Engineering and Institute of Information Engineering at National Cheng Kung University. He is also the present Chairman of the Taiwan Information Software Association and the Chairman of the Computer Center of National Cheng Kung University. His current research interests include graph theory, CAD/VLSI, neural nets for image processing, neural nets for computer speech processing, and optical character recognition.
About the Author*YI-KAI CHEN received the B.S. and M.S. degrees in electrical engineering from National Cheng Kung University in 1994 and in 1996, respectively. His interests include image processing and optical character recognition.
Pattern Recognition 33 (2000) 209–224
Appearance-based object recognition using optimal feature transforms

Joachim Hornegger*, Heinrich Niemann, Robert Risack
Lehrstuhl für Mustererkennung (Informatik 5), Universität Erlangen-Nürnberg, Martensstr. 3, D-91058 Erlangen, Germany

Received 9 December 1997; accepted 7 January 1999
Abstract

In this paper we discuss and compare different approaches to appearance-based object recognition and pose estimation. Images are considered as high-dimensional feature vectors which are transformed in various manners: we use different types of non-linear image-to-image transforms composed with linear mappings to reduce the feature dimensions and to beat the curse of dimensionality. The transforms are selected such that special objective functions are optimized and the available image data provide some invariance properties. The paper mainly concentrates on the comparison of preprocessing operations combined with different linear projections in the context of appearance-based object recognition. The experimental evaluation provides recognition rates and pose estimation accuracy. © 1999 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Appearance-based object recognition; Pose estimation; Feature transform; Manifold models; Statistical modeling
1. Introduction

Even today, the efficient and robust recognition and localization of arbitrary 3-D objects in gray-level images is a largely open problem. There exists no unified technique which allows the reliable recognition of arbitrarily shaped objects in cluttered scenes; the available algorithms are mostly restricted to special types of objects. Standard identification and pose estimation techniques use segmentation operations to detect geometrical features like corners or lines [1]. The classification itself is based on geometrical relationships between observations and suitable models, for instance geometric representations which use wire-frame or
* Corresponding author. Tel.: +49-9131-852-7894; fax: +49-9131-303811. E-mail address: [email protected] (J. Hornegger)

This work was funded partially by the Deutsche Forschungsgemeinschaft (DFG) under grant numbers Ho 1791/2-1 and GRK 244/1-96. The authors are solely responsible for the contents.
CAD models [1]. The main problems of these approaches are the automatic generation of models from training samples and the robust detection of geometric features.

Recently, appearance-based methods have become more and more popular and are used for object recognition tasks [2,3]. These techniques consider the appearance of objects in the sensor signal instead of reconstructing geometrical properties. This overcomes quite a few problems of the standard approaches, for example the geometric modeling of fairly complex objects and the required feature segmentation. Preliminary comparative studies prove the power and competitiveness of appearance-based approaches to recognition problems [2] and suggest further research and experiments. Well-known classical pattern recognition algorithms can now be used for computer vision purposes: feature selection methods [4,5], feature transforms [4,6], and even more recent results from statistical learning theory [7]. This paper considers and compares different transforms of high-dimensional feature vectors for object recognition and pose estimation purposes.
2. Contribution and organization of the paper

Appearance-based approaches treat images as feature vectors. If we consider M×N images, the dimension of the associated feature vectors is m := NM. Obviously, such high-dimensional feature vectors do not allow the implementation of efficient recognition algorithms [8], and the curse of dimensionality prohibits classification [5]. For that reason, transforms are necessary to reduce the dimension of the features. Commonly used transforms are the principal component analysis [9-11] or, in more recent publications, the Fisher transform [12]. Variations of the feature vectors dependent on different illumination or pose parameters are modeled by interpolating between feature vectors and considering the resulting manifolds as object models [9]. These models are called eigenfaces or Fisherfaces, depending on the chosen transform.

This work extends the existing appearance-based methods with respect to different linear transforms of feature vectors. The considered linear transforms are based on optimization criteria known from the standard pattern recognition literature [8]. In addition, we consider various types of non-linear preprocessing operations which eliminate, for instance, noise or illumination dependencies. The main contribution of this paper is therefore twofold and includes

- the comparison of different preprocessing operations, and
- the application of various feature transforms for the reduction of dimensions.

The experimental evaluation provides an extensive characterization of the distinct feature transforms. We summarize several methods for improving the recognition rates and pose estimation accuracy of existing algorithms for 3-D object recognition; the final judgement of the methods rests on recognition rates and pose estimation errors.

The paper is organized as follows: the next section gives a brief overview of related work and discusses parallels and differences to the main contributions of this paper. Before introducing mathematical and technical details, we clarify and specify the general formal framework (Section 4). The restriction of previously published approaches to the principal component analysis for reducing the features' dimension motivates us to consider and experimentally compare different types of linear projections from the high-dimensional image space into lower-dimensional feature spaces; feature transforms and the efficient solution of the optimization problems related to these projections form the main part of Sections 5 and 6. Some non-linear image transforms which can be applied in a preprocessing stage, instead of using raw gray-level images as features, are summarized in Section 5.2. Computational aspects of the involved algorithms are
included in Section 7. The experimental evaluation of the introduced concepts is summarized in Section 9: the recognition and pose estimation experiments are evaluated with various combinations of image transforms. The paper ends with a summary, draws some conclusions, and gives several hints at further unsolved research problems concerning appearance-based recognition. Mathematical details which are less essential for a basic understanding of the proposed techniques are provided in the appendix.
3. Related work

Appearance-based approaches discussed in the literature are mostly restricted to the principal component analysis for mapping gray-level images to low-dimensional feature vectors, and they neglect preprocessing operations [2]. Here, the considered feature transforms are incorporated into an optimization-based framework which allows geometrical interpretations within the feature space. The mathematical tools required for a practical implementation are provided by Murase and Lindenbaum [13]. Fields of application are medical image processing, face recognition [14], and 3-D object recognition and pose estimation.

The major problems related to appearance-based methods are due to unknown background objects and occlusion. Classification in cluttered scenes is discussed and adequately solved in Ref. [15]. The application of appearance-based methods in the presence of occlusion is considered in Ref. [16], whereas the influence of varying illumination on eigenfaces is experimentally evaluated in Ref. [17]; the authors of Ref. [17] show that 5-D vectors in the eigenspace are sufficient for modeling different lighting conditions. For these reasons, this work does not discuss methods for cluttered backgrounds and occlusion, but concentrates on the comparison of different optimization criteria for the computation of linear projections.
4. General framework

A digital image f is mathematically represented by a matrix [f_{i,j}]_{1<=i<=N, 1<=j<=M}, where the range of f_{i,j} is determined by the quantization of the intensity values, and N and M denote the numbers of rows and columns. Let us assume we have K different classes Ω_1, Ω_2, ..., Ω_K of objects. Examples of different objects are shown in Fig. 1; these objects are assigned to the pattern classes Ω_1, Ω_2, ..., Ω_5. The classification procedure is thus a discrete mapping which assigns an image showing one of these objects to the pattern class the present object corresponds to. If we compute the pose parameters, the position and orientation of the object
Fig. 1. Object classes considered in the experiments (Columbia images).
Fig. 2. Structure of feature computations: images and preprocessed sensor data are transformed into feature vectors.
with respect to the world coordinate system are calculated. Usually, there exists no closed-form analytical description of these mappings, and most systems decompose this function into a sequence of mostly independent procedures [18].

It is natural to consider the images [f_{i,j}] as feature vectors f ∈ R^m, where m = NM; because of the geometric nature of objects, however, this is not self-evident. Due to the dimension of N×M images, classifiers using these high-dimensional features directly will not provide efficient algorithms, for several reasons: in high-dimensional vector spaces the definition of similar vectors is problematic, since nearly all vectors are considered to be neighbors [18]; furthermore, the comparison of vectors is the most frequent operation within the classification module and should be as efficient as possible, a requirement the use of high-dimensional feature vectors contradicts [8]. To reduce the data, it is important to select or project features from the gray-level image. Especially for object recognition, traditional methods segment (hopefully discriminating) features in the image, like the edges or corners of an
object. These features allow the explicit use of geometrical relationships between 3-D models and 2-D observations; the geometry of object transforms and the projection of 3-D models into the 2-D image plane are well understood and mathematically formalized [19]. But the use of segmentation results shows some substantial disadvantages, which partially confine their practical use:

- the quality of segmentation highly depends on the chosen algorithm, on the illumination conditions, and on the selected viewpoints,
- robust, optimal, and reliable detection of features is far from implemented, and finally
- the huge amount of data reduction induces a loss of information, which might also decrease the discriminative power of the resulting classifiers.

Appearance-based approaches to object recognition, in contrast, forgo the use of geometry, but their algorithms do not depend on a reliable and accurate detection of points or lines. The computation of features directly from gray-level images can be done by different types of mappings:
1. The "rst stage might transform the gray-level image into another image which shows special properties. The discrete Fourier transform and spectrum [20], for example, results in features which are invariant with respect to translations of the objects in the image plane and thus reduce the search space with respect to pose parameters. Other "ltering operations, like highpass "lters, abate dependencies on illumination. These transforms again lead to large feature vectors for input signals and do not reduce dimensions. 2. For e$cient algorithms, however, it is essential that features are projected or selected to obtain small but still discriminating feature vectors. Because selecting the subset of best features is an NP complete problem, only an approximation of the optimal set can be computed within practical applications [18,21]. The transforms which reduce the dimension of features can have two di!erent motivations: one might be the application of some heuristic principles which project the feature vectors and show satisfying recognition results. Here we will consider a restricted class of transforms which have the property to be optimal with respect to objective functions. The applied objective functions f are based on basic assumptions concerning the distribution of feature vectors, f are comparatively easy to calculate, and f induce e$cient algorithms for the analytical computation of optimal feature transforms. Fig. 2 summarizes the general idea the subsequent analysis is based on.
5. Gray-level features, non-linear preprocessing, and feature selection

The classification and pose estimation task is generally formalized as a sequence of mappings which assign an image f ∈ R^m showing one object to a preprocessed image h ∈ R^m, then to a feature vector c ∈ R^n (n << m), and finally to the class Ω_i of the observed pattern. Furthermore, the related pose parameters, defined by three rotations and the three components of the translation vector, have to be computed. The classification and localization of objects crucially depend on postulates which are the basic requirements of most pattern recognition algorithms. These postulates, as far as they are relevant for our application, are summarized in the following subsection; they form the basis of all subsequent linear feature transforms.

5.1. Postulates

Feature vectors suitable for 3-D recognition are expected to show a high discriminating power and to
allow reliable classi"cation as well as pose estimation. For that reasons, features have to satisfy basic postulates for decision making. f Similarity: objects belonging to the same pattern class show similar feature vectors independent of the associated classes. f Distinction: objects of distinct pattern classes have different feature vectors, which provide a high discriminating power. f Smoothness: small variations in pose or illumination induce small variations in associated features. Using these basic assumptions for the construction of good features, we derive di!erent types of linear feature transforms from high-dimensional into lower-dimensional feature spaces. The basic idea here is to select the transform such that the resulting features are optimal with respect to above postulates. In detail we will consider transforms which: f maximize the distance of all features among each other independent of pattern classes, f maximize the distance of features belonging to di!erent pattern classes (interclass distance), f minimize the distance of features belonging to the same pattern class (intraclass distance), and f optimize combinations of the above measures. However, it should be clear to the reader that linear transforms will not improve the recognition rate of classi"ers, even if we choose m"n. 5.2. Non-linear image transforms Before we transform the image matrix into lowerdimensional vectors, we transform sensor data into images, which show distinguished properties [22]. Examples for preprocessing operations are high-pass "lters, low-pass "lters, the application of the 2-D Fourier transform or the use of segmented images. The application of segmented images gives also a fundamental hint how recognition results are in#uenced by segmentation. A comparison of gray-level and feature-based identi"cation as well as pose estimation is possible based on this approach (cf. Section 9). Other approaches to object recognition do not allow comparative studies of that kind. Within this work we use the following preprocessing operations: the absolute values of the 2-D discrete Fourier transform (spectrum), the result of Gaussian "ltering, the absolute values of second derivatives (Laplace), the edge strength of pixels computed by the operators due to Nevatia}Babu and Sobel. Finally, we also use binary images, where edge pixels are black and the rest white (edge images).
5.3. Optimization-based feature selection and transform

The basic idea of feature transforms is to look for mappings which reduce the dimension of the feature vectors while optimizing optimality criteria related to the cited postulates of pattern recognition systems. The search for the optimal transform requires a restriction to special parametric types of transforms, since it seems computationally prohibitive to search for the best transform without any constraints on its properties. For that reason, we restrict the subsequent discussion to linear transforms which map the m-dimensional vectors ^i f of the sample set ω = {^i f ∈ R^m | i = 1, ..., r} to n-dimensional features ^i c ∈ R^n. The linear transform is completely characterized by the matrix Φ ∈ R^{n×m} which maps the m-dimensional preprocessed image vector h to the n-dimensional feature vector c, i.e.,

    c = Φh.    (1)

The nm components of the matrix are considered to be the variables of the search process, and thus the search space for suitable transforms is restricted to an nm-dimensional vector space. This makes the search problem tractable and, as we will see in the following subsections, induces optimization problems which can be solved using basic and well-understood techniques of linear algebra. The computation of the optimal linear transform Φ* requires objective functions which are optimized with respect to the parameters, i.e., the components of Φ. In the following, we define different objective functions

    s_i : R^{n×m} → R,  Φ ↦ s_i(Φ),  i = 1, 2, ...,    (2)

according to the postulates summarized in Section 5.1. A transform Φ*_i is called optimal with respect to s_i(Φ) if

    Φ*_i = argmax_Φ s_i(Φ),    (3)

presupposing s_i has to be maximized, and

    Φ*_i = argmin_Φ s_i(Φ)    (4)

if s_i has to be minimized. Since scaling of the matrix Φ would also affect the value of the objective, the matrices are restricted to those composed of unit-length vectors. In the following we give illustrative motivations for the different objectives by considering distributions of feature vectors.

5.4. Principal component analysis

The most often used linear transform of this type results from the principal component analysis and is the so-called Karhunen-Loève transform (KLT) [10,23].
Fig. 3. KLT and the ADIDAS-problem.
The idea of this transform is to reduce the dimension of the original image vectors h using a linear mapping Φ such that the resulting feature vectors c show pairwise maximum distance. For this transform, the objective function s_1(Φ) is the mean squared distance of all sample feature vectors ^i c = Φ ^i h from each other, i.e.,

    s_1(Φ) = (1/r²) Σ_{i=1}^{r} Σ_{j=1}^{r} (^i c - ^j c)^T (^i c - ^j c).    (5)

The use of the KLT has both advantages and disadvantages: the computation of the optimal linear transform Φ* with respect to s_1(Φ) does not require a classification of the sample vectors, and feature vectors resulting from the KLT allow the reconstruction of images with minimal mean quadratic error [9]. Problems occur, however, if the distribution of the features is such that the principal axes of all classes are parallel to each other. A 2-D example, where the features are projected onto the x-axis, is shown in Fig. 3 (the ADIDAS problem, Ref. [24]): the projected features allow no discrimination of the classes, so the optimal linear mapping related to s_1 does not induce discriminating features, whereas the projection onto the y-axis would. This simple example shows that objective functions other than s_1(Φ) may be useful or necessary for reducing the features' dimension while providing a higher discriminating power.

5.5. Maximizing interclass distance

Another plausible optimization criterion, which does not show the disadvantages of the KLT, results from the distinction property: features of one pattern class should have maximum distance to the features of the other pattern classes. In contrast to the KLT, however, this transform requires a classified sample set. The original sample data are partitioned, i.e., ω = ∪_i ω_i (a disjoint union), where ω_i = {^j f_i | j = 1, ..., r_i} consists of all samples belonging to pattern class Ω_i. Thus the following objective function
can be applied only to sample sets where such a labeling is available. Let ^k c_i denote the k-th sample vector belonging to class Ω_i. Of course, the numbers r_i of samples of the classes may differ, i.e., r_i ≠ r_j in general. The associated objective function based on the above criterion is defined by

    s_2(Φ) = (2/(K(K-1))) Σ_{i=2}^{K} Σ_{j=1}^{i-1} (1/(r_i r_j)) Σ_{k=1}^{r_i} Σ_{l=1}^{r_j} (^k c_i - ^l c_j)^T (^k c_i - ^l c_j),    (6)

where K denotes the number of pattern classes.

Using the classified sample data, we can also define a criterion which combines the ideas of s_1 and s_2. For each class Ω_i we compute the mean vector μ_i, i = 1, 2, ..., K, and substitute the mean vectors for the feature vectors of s_1. The objective s_3 is thus defined by

    s_3(Φ) = (2/(K(K-1))) Σ_{i=2}^{K} Σ_{j=1}^{i-1} (μ_i - μ_j)^T (μ_i - μ_j).    (7)

If we optimize s_3 with respect to the linear transform Φ, the distance between the class centers is maximized. Consequently, the above ADIDAS problem can be solved using either s_2 or s_3.

5.6. Minimizing intraclass distance

The objective functions discussed so far maximize distances of features. In order to take the similarity postulate into account, we define an objective which yields a measure for the density of the features belonging to the same pattern class. Features of the same pattern class should have minimum distance, and therefore we suggest minimizing the intraclass distance defined by

    s_4(Φ) = (1/K) Σ_{i=1}^{K} (1/r_i²) Σ_{k=1}^{r_i} Σ_{l=1}^{r_i} (^k c_i - ^l c_i)^T (^k c_i - ^l c_i).    (8)

The use of this objective function also requires a set of classified training data. The optimal feature transform w.r.t. s_4 results from solving

    Φ*_4 = argmin_Φ s_4(Φ).    (9)

The trivial solution Φ = 0 is excluded because Φ has to be composed of unit-length vectors. As we will see later, the matrix Φ*_i is composed of eigenvectors of a kernel matrix Q^(i). In this application the number of sample image vectors ^j f_i is much smaller than their dimension; therefore, the matrix Q^(4) has a fairly large and hence non-trivial null space, and projection onto this null space minimizes the objective with s_4 = 0. In this space, as Fig. 4 shows, each class is represented by a single point.
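The objectives of Eqs. (7) and (8) can be evaluated directly on a labeled set of projected features; a minimal sketch with our own function names (s_2 is analogous, with the inner sums running over pairs drawn from two different classes):

    import numpy as np

    def s3_s4(C, y):
        # C: projected features (one row per sample), y: class labels
        classes = np.unique(y)
        K = len(classes)
        mus = np.array([C[y == c].mean(axis=0) for c in classes])
        s3 = sum(((mus[i] - mus[j]) ** 2).sum()     # Eq. (7)
                 for i in range(K) for j in range(i)) * 2.0 / (K * (K - 1))
        s4 = 0.0                                    # Eq. (8)
        for c in classes:
            Cc = C[y == c]
            d2 = ((Cc[:, None, :] - Cc[None, :, :]) ** 2).sum(axis=2)
            s4 += d2.mean()                         # (1/r_i^2) * sum over all pairs
        return s3, s4 / K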
Fig. 4. Projection to null space.
If further feature reduction has to be applied, a proper subspace must be selected to allow a good separation of the class points; this can be done by another KLT. Due to the high dimension of the null space and numerical problems in evaluating eigenvectors to the eigenvalue zero, we only consider combined objectives which maximize the interclass and minimize the intraclass distance at the same time.

5.7. Combination of inter- and intraclass distance

The simplest way of combining the inter- and intraclass distance measures is the use of linear combinations or fractions of s_2, s_3, and s_4; we could, for instance, define

    s_5(Φ) = s_3(Φ) + θ s_4(Φ),    (10)

    s_6(Φ) = s_2(Φ) + θ s_4(Φ),

    or  s_7(Φ) = s_2(Φ) / s_4(Φ),    (11)

and compute linear transforms using these objectives. Herein, the weighting factor θ is a free variable which has to be chosen by the user. The following considerations are restricted to the first definition, s_5. An experimental comparison of s_7 (the Fisher transform) and s_1 can be found in Ref. [12].

6. Optimization of objective functions

The optimal linear transforms Φ*_i, i = 1, 2, 3, 4, 5, with respect to the introduced objective functions s_1, s_2, ..., s_5 are not obvious, considering the complicated sums in Eqs. (5)-(8) and (10). Of course, we could start a brute-force exhaustive optimization procedure, but a simplification of the related optimization tasks results from a reorganization of the summations and multiplications. Indeed, all objective functions can be written as the following sum of quadratic forms:

    s_i(Φ) = 2 Σ_{l=1}^{n} φ_l^T Q^(i) φ_l,    (12)
where the kernel matrix Q^(i) corresponding to the i-th objective function s_i(Φ) is implicitly defined, and φ_l, l = 1, 2, ..., n, denote the row vectors of the transform Φ, i.e., Φ = (φ_1, φ_2, ..., φ_n)^T with φ_l ∈ R^m.

The introduction of kernel matrices has one crucial advantage: due to the quadratic forms involved in Eq. (12), the optimization of the introduced objectives s_i is reduced to the computation of eigenvectors and eigenvalues. It is a well-known result from linear algebra that quadratic terms are minimal (resp. maximal) if the vectors φ_l are eigenvectors corresponding to the minimal (resp. maximal) eigenvalues. The computation of the optimal scatter matrix Φ*_i thus proceeds as follows:

1. compute the eigenvalues and eigenvectors of the involved kernel matrix Q^(i),
2. sort the eigenvalues, and
3. define the n rows of the scatter matrix Φ*_i to be the eigenvectors related to the n highest eigenvalues if the objective function has to be maximized; otherwise use the eigenvectors corresponding to the n smallest eigenvalues.

The remaining problem is the explicit definition of the kernel matrices; the implementation of the proposed feature transforms additionally requires some numerical and computational considerations. We prove the validity of Eq. (12) exemplarily for s_3 in the Appendix by explicitly computing the kernel matrix Q^(3); the computations of the other kernel matrices are technically quite similar and left to the reader. In the following, we present only the final kernel matrices related to the above objective functions, since these will be required for formalizing the optimization algorithms.

6.1. Kernel matrix of s_1

Elementary algebraic transforms show that for the objective function s_1 the kernel matrix Q^(1) is simply the covariance matrix of the sample set, which is defined by

    Q^(1) = (1/r) Σ_{j=1}^{r} ^j f (^j f)^T - ((1/r) Σ_{j=1}^{r} ^j f) ((1/r) Σ_{j=1}^{r} ^j f)^T
          = (1/r) Σ_{j=1}^{r} (^j f - μ)(^j f - μ)^T,    (13)

where μ denotes the mean vector of the non-classified sample data, i.e.,

    μ = (1/r) Σ_{j=1}^{r} ^j f.    (14)

6.2. Kernel matrices of s_2 and s_3

Considering the interclass distance and the related objective function s_2, we get the explicit kernel matrix

    Q^(2) = (1/K) Σ_{i=1}^{K} (1/r_i) Σ_{j=1}^{r_i} ^j f_i (^j f_i)^T
            - (1/(K(K-1))) Σ_{i=2}^{K} Σ_{j=1}^{i-1} (1/(r_i r_j)) ( Σ_{k=1}^{r_i} ^k f_i Σ_{l=1}^{r_j} (^l f_j)^T + Σ_{l=1}^{r_j} ^l f_j Σ_{k=1}^{r_i} (^k f_i)^T )
          = (2/(K(K-1))) Σ_{i=1}^{K} (μ_i - μ̄)(μ_i - μ̄)^T,    (15)

where

    μ_i = (1/r_i) Σ_{j=1}^{r_i} ^j f_i   and   μ̄ = (1/K) Σ_{i=1}^{K} μ_i.    (16)

The reorganization of the arithmetic operations in s_3 results in the explicit kernel matrix

    Q^(3) = (1/(K(K-1))) Σ_{i=2}^{K} Σ_{j=1}^{i-1} (μ_i - μ_j)(μ_i - μ_j)^T.    (17)

Obviously, the matrix Q^(3) is the kernel matrix of a KLT based on the mean vectors instead of the feature vectors. In contrast to the other kernels, the rank of this matrix is bounded not by the cardinality of the feature set but by the number of classes.

6.3. Kernel matrix of s_4

The kernel matrix Q^(4) is given by

    Q^(4) = (1/K) Σ_{i=1}^{K} ( (1/r_i) Σ_{j=1}^{r_i} ^j f_i (^j f_i)^T - μ_i μ_i^T ).    (18)

This result shows a connection between the kernel matrices of s_2, s_3, and s_4; indeed the identity

    Q^(4) = Q^(2) - Q^(3)    (19)

is valid.

6.4. Kernel matrices of s_5

The explicit kernel matrices for linear combinations of objective functions are trivial: due to the linear nature of the involved mappings, they are linear combinations of the kernel matrices of their summands,

    Q^(5) = Q^(2) + θ Q^(4) = Q^(3) + θ̃ Q^(4),    (20)

where θ and θ̃ denote the weighting factors of Q^(4) (by Eq. (19), the two forms agree with θ̃ = θ + 1).
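To make the kernel route concrete, the following sketch builds Q^(3) (via the equivalent mean-centered form implied by Eq. (17)) and Q^(4) of Eq. (18), and then extracts the optimal transform following the three-step recipe above. Variable names are ours; for image-sized data the implicit computation of Section 7 should be used instead.

    import numpy as np

    def kernels(F, y):
        # F: sample images as row vectors (r x m), y: class labels
        classes = np.unique(y)
        K = len(classes)
        mus = np.array([F[y == c].mean(axis=0) for c in classes])
        D = mus - mus.mean(axis=0)
        Q3 = (D.T @ D) / (K - 1)                    # equals Eq. (17) after centering
        Q4 = np.zeros((F.shape[1], F.shape[1]))
        for c, mu in zip(classes, mus):
            Xc = F[y == c] - mu
            Q4 += (Xc.T @ Xc) / len(Xc)
        return Q3, Q4 / K                           # Q4 as in Eq. (18)

    def optimal_transform(Q, n, maximize=True):
        evals, evecs = np.linalg.eigh(Q)            # ascending eigenvalues
        order = np.argsort(evals)[::-1] if maximize else np.argsort(evals)
        return evecs[:, order[:n]].T                # rows of Phi* are unit eigenvectors

    # combined kernel of Eq. (20): Q = Q3 + theta * Q4, where a negative theta
    # penalizes intraclass scatter when the objective is maximized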
7. Computational considerations

The direct calculation of the eigenvalues and eigenvectors of Q^{(i)} is computationally prohibitive due to the storage requirements. If we, for instance, assume images of size 128x128, which is quite a low resolution for common computer vision purposes, we get m = 16 384 rows and columns for the kernel matrices Q^{(i)}. If we suppose, for example, that the entries of the matrix are double-precision values (i.e., eight bytes for each entry), this matrix has storage requirements of about 2 GB. This simple numerical example shows that there is a need for more sophisticated methods to compute the optimal linear transforms related to the objective functions s_1, s_2, s_3, s_4 and s_5, as well as the associated kernel matrices.

7.1. Implicit computation of eigenvectors

The storage requirements can be reduced using a result of singular-value decomposition theory. Let us assume we have to compute the eigenvalues of a matrix Q \in R^{m \times m} which can be factorized as follows:

Q = F F^T,   (21)

where F \in R^{m \times p}, p < m. As already mentioned, the size of the matrix is intractable for the main memory of our computer; we are interested in computing the eigenvectors and eigenvalues, but a straightforward computation is prohibited. Instead of considering Q directly, we define, according to Murase and Lindenbaum [25], the implicit matrix

\hat{Q} = F^T F,   (22)

and observe that there is a remarkable relation between the eigenvalues and eigenvectors of Q and \hat{Q}. Let \hat{u}_l denote the eigenvectors and \hat{\lambda}_l the eigenvalues of the implicit matrix \hat{Q}. The eigenvectors and eigenvalues are defined by

\hat{Q} \hat{u}_l = \hat{\lambda}_l \hat{u}_l;   (23)

using (22) we thus get

F^T F \hat{u}_l = \hat{\lambda}_l \hat{u}_l.   (24)

In the next step, we multiply both sides by F; this yields

F F^T (F \hat{u}_l) = \hat{\lambda}_l (F \hat{u}_l)   (25)

and thus we get

Q (F \hat{u}_l) = \hat{\lambda}_l (F \hat{u}_l).   (26)

The last equation shows that each eigenvalue of \hat{Q} is also an eigenvalue of Q, and that the eigenvectors are related by the linear transform F. This result proves that the eigenvalues and eigenvectors of the kernel matrices Q^{(i)} can be computed with low memory requirements, provided that p \ll m and that the matrices can be factorized in the form [26]

Q^{(i)} = F^{(i)} F^{(i)T}.   (27)

For that reason, the following subsections will derive the required factorizations of the involved kernel matrices.

7.2. Reorganization of Q^{(1)}

The kernel matrix Q^{(1)} is the covariance matrix of the given sample data, i.e.,

Q^{(1)} = \frac{1}{r} \sum_{i=1}^{r} ({}^{i}f - \mu)({}^{i}f - \mu)^T.   (28)

We define

F^{(1)} = \frac{\sqrt{2}}{\sqrt{r}} ({}^{1}f - \mu, ..., {}^{r}f - \mu) \in R^{m \times r},   (29)

and it is obvious that

Q^{(1)} = F^{(1)} F^{(1)T}.   (30)

This concrete example shows that there is a trade-off between the storage requirements of implicit kernel matrices and the size of the sample set: here we have p = r, i.e., the higher r, the higher the reliability of the resulting models; higher p-values, however, increase the storage requirements. Before we compute the factorization of Q^{(2)}, it is advantageous to consider the decompositions of Q^{(3)} and Q^{(4)} (see Eq. (19)).

7.3. Reorganization of Q^{(3)}

Analogously to Q^{(1)}, we get for the class centers the decomposition

Q^{(3)} = F^{(3)} F^{(3)T},   (31)

where

F^{(3)} = \frac{\sqrt{2}}{\sqrt{K}} (\mu_1 - \bar{\mu}, ..., \mu_K - \bar{\mu}) \in R^{m \times K}.   (32)

The scaling factor \sqrt{2}/\sqrt{K} is important if we use the combined distance measures; otherwise this factor can be neglected.
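The implicit computation of Eqs. (21)-(27) is straightforward to state in code. The following is a minimal sketch (our own, with illustrative names), which forms only the p x p matrix F^T F and maps its eigenvectors back through F:

```python
import numpy as np

def eig_via_implicit(F, n):
    """Top-n eigenpairs of Q = F F^T obtained from Q_hat = F^T F.

    Only the p x p implicit matrix is ever formed (Eqs. (21)-(26)),
    which is feasible when p << m.
    """
    Q_hat = F.T @ F                              # Eq. (22)
    lams, U_hat = np.linalg.eigh(Q_hat)          # Eq. (23), ascending order
    lams, U_hat = lams[::-1][:n], U_hat[:, ::-1][:, :n]
    U = F @ U_hat                                # Eq. (26): F u_hat solves Q u = lam u
    U /= np.linalg.norm(U, axis=0)               # renormalize the mapped eigenvectors
    return lams, U

# small self-check against the explicit m x m computation
rng = np.random.default_rng(0)
F = rng.standard_normal((500, 20))               # m = 500 "pixels", p = 20 samples
lams, U = eig_via_implicit(F, 3)
assert np.allclose((F @ F.T) @ U, U * lams, atol=1e-6)
```

The mapped vectors F \hat{u}_l have norm \sqrt{\hat{\lambda}_l} rather than 1, hence the renormalization step.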
7.4. Reorganization of Q^{(4)}

The kernel matrix

Q^{(4)} = \frac{1}{K} \sum_{i=1}^{K} \frac{1}{r_i} \sum_{j=1}^{r_i} ({}^{j}f_i - \mu_i)({}^{j}f_i - \mu_i)^T   (33)

can also be factorized in the required manner. The similarity to Q^{(1)} is evident, and analogously to Eq. (29) we define the class-dependent matrices

F_i = \frac{\sqrt{2}}{\sqrt{r_i}} ({}^{1}f_i - \mu_i, ..., {}^{r_i}f_i - \mu_i) \in R^{m \times r_i},   (34)

where i = 1, 2, ..., K. The factor \sqrt{2}/\sqrt{r_i} is necessary because r_i varies for the different classes \Omega_i. The summation of matrix products can be written as a single matrix multiplication, i.e.,

Q^{(4)} = \frac{1}{K} \sum_{i=1}^{K} F_i F_i^T = \frac{1}{K} (F_1, ..., F_K)(F_1, ..., F_K)^T = \frac{1}{K} F^{(4)} F^{(4)T}.   (35)

7.5. Reorganization of Q^{(2)}

Using Eq. (19) we obviously get

Q^{(2)} = F^{(2)} F^{(2)T} = (F^{(3)}, F^{(4)}) (F^{(3)}, F^{(4)})^T.   (36)

7.6. Reorganization of Q^{(5)}

The kernel matrix of the combined objective s_5 is

Q^{(5)} = Q^{(4)} + h Q^{(3)} = (F^{(4)}, \sqrt{h} F^{(3)}) (F^{(4)}, \sqrt{h} F^{(3)})^T = F^{(5)} F^{(5)T}.   (37)

The weight factor h has to be positive because of the square root in the definition of F^{(5)}.

The theoretical part has introduced objective functions which are used to compute optimal linear transforms and which are motivated by the basic postulates of pattern recognition. The required linear mapping is efficiently computed by reducing the objectives to quadratic forms and solving eigenvalue problems. The related problems with the storage requirements of the involved computations were solved by the introduction of implicit matrices. In the following sections we will compare these transforms and techniques experimentally. Before that, it is necessary to define the models and decision rules the experimental evaluation is based on.

8. Models and decision rules

The classification of objects is based on the introduced features of the eigenspace. Samples of the training set are represented by points within the eigenspace. Since the lighting conditions as well as the position and orientation of objects vary, the feature vectors differ within the eigenspace. Here we distinguish two different types of models:

- manifold models as suggested by Murase and Nayar [9], and
- Gaussian models.

More recent classification methods using support vector machines are omitted [3].

8.1. Manifold models

Objects have several degrees of freedom. Different rotation angles, for instance, result in different feature vectors. It is therefore natural to use parametric models (curves) with respect to these variables. Manifold models result from sample feature vectors by interpolation. Fig. 5 shows 3-D feature vectors and the interpolated manifold model. These manifolds are computed for each object class. The class decision is based on the minimization of the distance between an observed feature vector and the manifold models. The parameter vector associated with the manifold point which has the lowest distance to the observation defines the pose parameters.

Fig. 5. Example of a manifold model with one degree of freedom (rotation angle) resulting from KLT features. The model corresponds to the second object shown in Fig. 1. The grey-level image is not preprocessed.

Fig. 6. Clusters of features belonging to the four classes shown in Fig. 7. The feature transforms use the combined objective s_5 with h = 10^{-4}. Here, the original image matrix was transformed into the Fourier spectrum.
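The manifold decision rule of Section 8.1 can be sketched as follows. This is our own illustrative reconstruction (linear interpolation of a closed one-parameter curve, dense resampling, nearest-point search), not the SLAM implementation used by the authors:

```python
import numpy as np

def build_manifold(angles, feats, steps=3600):
    """Resample an interpolated one-parameter manifold (Section 8.1).

    angles: (s,) rotation angles of the training views in degrees.
    feats:  (s, n) eigenspace feature vectors of those views.
    """
    order = np.argsort(angles)
    a = np.r_[angles[order], angles[order][0] + 360.0]   # close the curve
    f = np.vstack([feats[order], feats[order][:1]])
    dense_a = np.linspace(0.0, 360.0, steps, endpoint=False)
    dense_f = np.column_stack(
        [np.interp(dense_a, a, f[:, d]) for d in range(f.shape[1])])
    return dense_a, dense_f

def classify(x, manifolds):
    """Pick the class whose manifold lies closest to the observed feature x;
    the angle of the nearest manifold point is the pose estimate."""
    best = (np.inf, None, None)
    for label, (dense_a, dense_f) in manifolds.items():
        d = np.linalg.norm(dense_f - x, axis=1)
        i = int(d.argmin())
        if d[i] < best[0]:
            best = (d[i], dense_a[i], label)
    return best[2], best[1]      # (class label, pose angle)
```

With the Columbia training data, one manifold per class would be built from the 36 training features and their known rotation angles.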
Fig. 7. Industrial objects captured by uncalibrated cameras.
8.2. Gaussian densities

Simpler modeling schemes characterize all sample features assigned to one class by a single probability density function. Here we use multivariate Gaussians for modeling and decide for the class with the highest a posteriori probability. Of course, these statistical models do not allow for pose estimation. Therefore, they are especially useful for applications or training samples where no pose information is required or available for model generation. Fig. 6 shows four clusters of features belonging to the object classes shown in Fig. 7. Within the chosen probabilistic framework, each cluster is characterized by a 3-D Gaussian density.
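A minimal sketch of this decision rule (our own illustration; equal class priors are assumed, so the maximum a posteriori decision reduces to the maximum class-conditional density):

```python
import numpy as np

def fit_gaussians(feats, labels):
    """One multivariate Gaussian per class (Section 8.2); equal priors assumed."""
    models = {}
    for c in np.unique(labels):
        X = feats[labels == c]          # needs at least two samples per class
        models[c] = (X.mean(axis=0), np.cov(X, rowvar=False))
    return models

def classify(x, models):
    """Decide for the class with the highest (log-)density at x."""
    def log_density(x, mu, cov):        # assumes a non-singular covariance
        d = x - mu
        _, logdet = np.linalg.slogdet(cov)
        return -0.5 * (d @ np.linalg.solve(cov, d) + logdet
                       + len(x) * np.log(2 * np.pi))
    return max(models, key=lambda c: log_density(x, *models[c]))
```

Unequal priors could be accommodated by adding a log-prior term to the score.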
9. Experimental results

The experimental evaluation provides a comparative empirical study of the introduced transforms \Phi_i^*. Before we describe detailed results, we give a brief overview of the experimental setup and the used image data.

9.1. Experimental setup and image data

The experimental evaluation is done on an HP 9000/735 (99 MHz, 124 MIPS) using 128x128 images. Within the experiments we use two different image databases. To provide the capability of comparing the introduced feature transforms with other methods and different approaches to 3-D object recognition, we discuss some experiments using the standard images of the Columbia University image database.1 We restrict these recognition experiments to the five object classes which were already shown in Fig. 1. For each object 36 training and 36 test views are available. The images show single 3-D objects with homogeneous background, rotated in steps of 5°. Rotations of 0°, 10°, 20°, ..., 360° are used for training.
1 See http://www.cs.columbia.edu/CAVE/coil-20.html.
The recognition experiments run on images showing rotations of 5°, 15°, ..., 355°. Training and test sets are disjoint and contain images showing objects of varying pose. Occlusion, except self-occlusion, does not occur. For each training and test view the pose parameters, i.e. the single rotation angle, are known. Illumination conditions are constant for all samples. In addition to these idealized images (homogeneous black background) we also consider industrial parts from a real application, captured with an uncalibrated camera.2 We use four objects, which are shown in Fig. 7. Of each object 200 different views are available, including partially occluded objects. Planar rotations and translations as well as lighting are chosen randomly. The set of 2-D views is partitioned into training and test sets of equal cardinality. In contrast to the above-mentioned image database, the pose parameters are not available.

9.2. Varied parameters and evaluation criteria

The computation of features has several degrees of freedom. Within the experiments, we varied the following parameters:

- the dimension of the used features,
- different preprocessing methods, and
- different objective functions.

The basic criteria for the experimental evaluation are the recognition rates, the errors in pose estimates, and the run time. The used models are both manifold models as suggested in Ref. [9], which also consider the pose parameters, and simple statistical models. The statistical models assume normally distributed feature vectors for each class and do not use pose information within the training data. The experiments related to pose estimation accuracy are restricted to manifold models and therefore to images of the Columbia database.
2 These images are available via URL: http://www5.informatik.uni-erlangen.de.
9.3. Pose estimation results

We tested the pose estimation accuracy using manifold models. The considered object transforms are rotations around a single coordinate axis. The features are transformed by the linear mappings induced by the discussed objective functions. Table 1 summarizes the obtained errors with respect to rotations around the z-axis of the world coordinate system. Obviously, the best results are achieved by the combined objectives. The overall improvement with respect to the standard principal component analysis, however, is minor.

Table 2 summarizes the errors based on different preprocessing operations and a subsequent principal component analysis in 10 dimensions. If non-bijective mappings are used, we expect a reduction of accuracy. Indeed, the experiments show that the best pose estimates result from the immediate use of the grey-level image. The worst accuracy is obtained using edge images. These examples prove that the appearance-based approach does not provide reliable pose estimates if segmented images are used. Using images containing lines only decreases the accuracy of pose estimates
drastically. Appearance-based pose estimation techniques should therefore not be applied to this type of preprocessed images.

9.4. Recognition results

In the following experiments we compare various preprocessing operations and linear transforms with respect to the resulting recognition rates.

9.4.1. Columbia images

Using the Columbia images (see Fig. 1, five classes) we compare manifold models and statistical models based on simple multivariate Gaussians. The graphs shown in Fig. 8 summarize the recognition results for varying linear transforms and different dimensions of the used feature vectors. These experiments prove that the recognition rate is 100% for all transforms if the dimension of the eigenvectors is at least 3 and manifold models are used. For lower-dimensional features, s_3 dominates with respect to both manifold and Gaussian models. The recognition results using combined objectives with different weights are summarized in Fig. 9.
Table 1
Mean errors and standard deviations of the estimated rotation angles, based on 10-D feature vectors for s_1, s_2 and s_5, and on 4-D feature vectors for s_3

Objective function    Mean error (deg)    Standard deviation (deg)
s_1                   0.71                0.78
s_2                   0.71                0.79
s_3                   8.45                43.32
s_5 (h = 10^-4)       0.69                0.77
s_5 (h = 0.1)         0.70                0.78
s_5 (h = 0.5)         0.67                0.74

Table 2
Mean error in pose estimates using s_1 as objective function and different preprocessing operations. The chosen dimension of the eigenvectors is 10

Filter                Error (deg)
No filtering          0.70
Spectrum              0.96
Gaussian filtering    0.74
Edge detection        14.84
Laplace               3.81
Nevatia               2.96
Sobel                 1.73
Fig. 8. Comparison of different linear image transforms using s_1, s_2, and s_3 (Columbia images) and different models: manifold models (left) and Gaussian models (right).
Fig. 9. Combined objective s_5 = s_3 + h s_4, where h = 10^-4, 10^-1, 1/2, 1 (Columbia images).
Fig. 10. Recognition rates using s_1 and different preprocessing operations.
Recognition results using different preprocessing operations are summarized in Fig. 10, where objective s_1 is used, and in Fig. 11, where we have used s_3. It is conspicuous that the optimization criterion s_3 combined with the spectrum shows the highest recognition rates, independently of the selected model. The main reason for this is the invariance of the spectrum with respect to object translations in the image plane.

All examples show that manifold models provide higher recognition rates than Gaussian models. However, manifold models require pose information within the training samples; probabilistic models using multivariate Gaussians do not.

9.4.2. Industrial objects

The next experiments use images for which no pose information is available (see Fig. 7, four classes). Therefore, we only consider probabilistic models, and we analyze the recognition in the presence of occlusion. The recognition rates also vary with the dimension of the eigenvectors.
Fig. 11. Recognition rates using s_3 and different preprocessing operations.
Table 3
Recognition rates using images of the industrial objects shown in Fig. 7. The dimension of the eigenspace is 20. The linear transform is based on s_1; the columns show recognition rates using no preprocessing (*), Gaussian filtering (GF) and segmenting the background (BG)

           No occlusion            Occlusion
Class      *      GF     BG        *      GF     BG
Omega_1    25     41     40        10     10     10
Omega_2    87     85     80        30     20     30
Omega_3    1      0      3         0      0      0
Omega_4    57     48     64        80     80     80
Average    43     44     47        30     28     30
In contrast to the previous experiments, we restrict the dimension of the eigenspace to 10 and 20. Table 3 shows the low recognition rates for the industrial objects based on linear transforms using s_1, even if different preprocessing operations are used. Obviously, partially occluded objects cannot be classified using high-dimensional eigenvectors and this approach: the recognition rates are comparable to random guessing. The use of the linear transforms introduced above also does not essentially improve the recognition results. Tables 4 and 5 also show the curse of dimensionality: an increasing dimension of the feature vectors does not necessarily improve the recognition results.

The main reason for the low recognition rates is the presence of translations in the image plane. If we detect the object and consider only pixels belonging to the object, we observe a remarkable improvement of the recognition rates. Alternatively, we can use the spectrum of the images (the absolute values of the 2-D Fourier transform), which is known to be invariant with respect to translations; with spectral features we get 100% even in the presence of occlusion. These experiments show that, in contrast to translations, rotations do not influence the accuracy of recognition. The segmentation of objects, i.e. the bi-partition of image points into object and background pixels, or the usage of the Fourier transform for object classification, is thus advantageous for recognition if no pose information is available within the training data.
Table 4
Recognition rates based on 10-dimensional eigenvectors. The images are preprocessed such that background and object pixels are separated

           No occlusion                              Occlusion
Class      s_1    s_2    s_5(0.5)  s_5(10^-4)        s_1    s_2    s_5(0.5)  s_5(10^-4)
Omega_1    90     84     88        61                30     30     30        40
Omega_2    99     98     98        88                100    100    100       100
Omega_3    92     93     92        62                60     60     60        40
Omega_4    87     87     87        62                100    100    100       100
Average    92     90     91        68                72     72     72        70
Table 5
Recognition rates based on 20-dimensional eigenvectors. The images are preprocessed such that background and object pixels are separated

           No occlusion                              Occlusion
Class      s_1    s_2    s_5(0.5)  s_5(10^-4)        s_1    s_2    s_5(0.5)  s_5(10^-4)
Omega_1    40     35     47        61                10     10     10        30
Omega_2    80     80     83        88                30     30     30        40
Omega_3    3      9      25        62                0      0      0         0
Omega_4    64     65     63        62                80     90     100       80
Average    47     47     54        68                30     33     35        38
9.5. Run time

The run time behaviour of the complete system is summarized in Tables 6-9. All numbers are based on the Columbia image database, including 180 training images of size 128x128. Table 7 shows the time required for training using all images of the training set; most of this time is obviously required for the computation of the eigenvectors. Table 9 shows the relation between the time for classification and the dimension of the eigenspace.

Table 6
Recognition rates using 20-dimensional feature vectors

Method                                  Recognition rate (%)
                                        No occlusion    Occlusion
Non-invariant features                  47              30
Separated object/background pixels      99              73
Spectrum                                100             100

Table 7
Run time of the learning stage dependent on the dimension of the used eigenspaces (180 images)

Dimension of      Computation of            Training (ms)
eigenvectors      eigenvectors (min:s)      Gauss      Manifold
5                 3:34                      <10        <10
10                3:55                      <10        <10
20                4:18                      40         <10

Table 8
Run time of the eigenvalue computations (10-dimensional eigenspace) dependent on the number of training images

Number of images    Time (min:s)
45                  0:38
90                  1:37
135                 2:34
180                 3:55

Table 9
Run time of the classification module. The images are represented as vectors

Dimension of     Projection (ms)    Classification (ms)
eigenspace                          Gauss      Manifold
5                30                 <10        560
10               60                 <10        650
20               110                <10        790

10. Summary and conclusions

Standard linear feature transforms, which are broadly used in pattern recognition and speech recognition, are
successfully applied to solve object recognition and pose estimation problems in the field of 3-D computer vision using grey-level images. This paper has summarized various objective functions for the computation of optimal feature transforms: the principal component analysis, the interclass distance, the intraclass distance, and various combinations thereof. We have shown how the associated optimization problems are reduced to the computation of eigenvectors. A two-stage reorganization of the considered objective functions leads to computationally practical solutions:

1. the transformation of the objective functions into sums of quadratic forms, which reduces the optimization problem to the computation of eigenvectors, and
2. the factorization of the kernel matrices into products of matrices and their transposes, which lowers the storage requirements for computing eigenvalues and eigenvectors.

The experimental evaluation provides a comparison of the new types of feature transforms. Based on a standard image database, we show empirically that the best pose estimation results are provided by a transform which maximizes a combination of intra- and interclass distances. The recognition results show the highest accuracy if the distance between class-specific mean vectors is maximized. Regarding the selected dimension of the feature vectors, we have shown that a dimension of 4 already leads to recognition results of 100% correctness. Instead of manifolds, we have also tested the recognition rates under the assumption of normally distributed feature vectors. Using spectral features, which are invariant to translations in the image plane, we also observed recognition rates of 100% on the industrial objects, where the training set includes no pose information. Considering these results, we conclude that appearance-based object recognition systems can compete with standard geometrical approaches, both with respect to recognition rates and run time behaviour. The introduction of implicit kernel matrices has reduced the storage requirements. The problems which are not yet solved sufficiently are the explicit modeling of occlusion, the analysis of
multiple object scenes, and the construction of object models in the presence of background features. The application of the considered transforms to classify and localize objects with heterogeneous background is straightforward using the hierarchical framework introduced in Ref. [15].
Acknowledgements

The authors gratefully acknowledge S. Nene, H. Murase and S.K. Nayar for kindly permitting the use of their Software Library for Appearance Matching (SLAM).
Appendix A

We consider the criterion s_3 and get the following quadratic form:

s_3 = \frac{2}{K(K-1)} \sum_{i=2}^{K} \sum_{j=1}^{i-1} (\Phi\mu_i - \Phi\mu_j)^T (\Phi\mu_i - \Phi\mu_j)
    = \frac{2}{K(K-1)} \sum_{i=2}^{K} \sum_{j=1}^{i-1} (\mu_i - \mu_j)^T \Phi^T \Phi (\mu_i - \mu_j)
    = \frac{2}{K(K-1)} \sum_{i=2}^{K} \sum_{j=1}^{i-1} tr(\Phi^T \Phi (\mu_i - \mu_j)(\mu_i - \mu_j)^T)
    = 2\,tr\Bigl(\Phi^T \Phi \Bigl[\frac{1}{K(K-1)} \sum_{i=2}^{K} \sum_{j=1}^{i-1} (\mu_i - \mu_j)(\mu_i - \mu_j)^T\Bigr]\Bigr)
    = 2\,tr(\Phi^T \Phi Q^{(3)})   (38)
    = 2 \sum_{i=1}^{n} u_i^T Q^{(3)} u_i.

The kernel matrix for this case thus is [8]

Q^{(3)} = \frac{1}{K(K-1)} \sum_{i=2}^{K} \sum_{j=1}^{i-1} (\mu_i - \mu_j)(\mu_i - \mu_j)^T.   (39)
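The chain of identities leading to Eqs. (38) and (39) can be verified numerically. The following self-contained check is our own construction, with random class means and a random transform \Phi:

```python
import numpy as np

rng = np.random.default_rng(1)
K, n, m = 5, 4, 12                    # classes, transform rows, input dimension
mus = rng.standard_normal((K, m))     # class means mu_i
Phi = rng.standard_normal((n, m))     # transform Phi with rows u_i^T

# s_3 as the averaged squared distance between transformed class means
s3 = 2.0 / (K * (K - 1)) * sum(
    (Phi @ (mus[i] - mus[j])) @ (Phi @ (mus[i] - mus[j]))
    for i in range(1, K) for j in range(i))

# kernel matrix Q(3) of Eq. (39)
Q3 = sum(np.outer(mus[i] - mus[j], mus[i] - mus[j])
         for i in range(1, K) for j in range(i)) / (K * (K - 1))

assert np.isclose(s3, 2.0 * np.trace(Phi.T @ Phi @ Q3))       # Eq. (38)
assert np.isclose(s3, 2.0 * sum(u @ Q3 @ u for u in Phi))     # row-wise form
```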
References

[1] A.K. Jain, P.J. Flynn (Eds.), Three-Dimensional Object Recognition Systems, Elsevier, Amsterdam, 1993.
[2] J. Ponce, A. Zisserman, M. Hebert (Eds.), Object Representation in Computer Vision, Lecture Notes in Computer Science, vol. 1144, Springer, Heidelberg, 1996.
[3] M. Pontil, A. Verri, Support vector machines for 3D object recognition, IEEE Trans. Pattern Anal. Machine Intell. (PAMI) 20 (1998) 637-646.
[4] B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, 1996.
[5] C.M. Bishop, Neural Networks for Pattern Recognition, Clarendon Press, Oxford, 1995.
[6] E. Trucco, A. Verri, Introductory Techniques for 3-D Computer Vision, Prentice-Hall, New York, 1998.
[7] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer, Heidelberg, 1996.
[8] H. Niemann, Klassifikation von Mustern, Springer, Heidelberg, 1983.
[9] H. Murase, S.K. Nayar, Visual learning and recognition of 3-D objects from appearance, Int. J. Comput. Vision 14 (1) (1995) 5-24.
[10] K. Karhunen, Über lineare Methoden in der Wahrscheinlichkeitsrechnung, Ann. Acad. Sci. Fenn. Ser. AI (1947) 37.
[11] Y.T. Chien, K.S. Fu, Selection and ordering of feature observations in a pattern recognition system, Inform. Control 12 (1968) 395-414.
[12] P.N. Belhumeur, J.P. Hespanha, D.J. Kriegman, Eigenfaces vs. Fisherfaces: recognition using class specific linear projection, IEEE Trans. Pattern Anal. Machine Intell. 19 (7) (1997) 711-720.
[13] H. Murase, M. Lindenbaum, Spatial temporal adaptive method for partial eigenstructure decomposition of large images, Tech. Report 6527, Nippon Telegraph and Telephone Corporation, March 1992.
[14] M. Turk, A. Pentland, Eigenfaces for recognition, J. Cognitive Neurosci. 3 (1) (1991) 71-86.
[15] H. Murase, S.K. Nayar, Detection of 3D objects in cluttered scenes using hierarchical eigenspace, Pattern Recognition Lett. 18 (5) (1997) 375-384.
[16] H. Bischof, A. Leonardis, Robust recovery of eigenimages in the presence of outliers and occlusions, Int. J. Comput. Inform. Technol. 4 (1) (1996) 25-38.
[17] R. Epstein, P.W. Hallinan, A.L. Yuille, 5±2 eigenimages suffice: an empirical investigation of low-dimensional lighting models, Proceedings of the IEEE Workshop on Physics Based Modeling in Computer Vision, Boston, June 1995, pp. 108-116.
[18] H. Niemann, Pattern Analysis and Understanding, Springer Series in Information Sciences, vol. 4, Springer, Heidelberg, 1990.
[19] O.D. Faugeras, New steps toward a flexible 3d-vision system for robotics, Proceedings of the Eighth International Conference on Pattern Recognition, Montreal, 1987, pp. 796-805.
[20] A.V. Oppenheim, R.W. Schafer, Digital Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, 1975.
[21] A.K. Jain, D. Zongker, Feature selection: evaluation, application, and small sample performance, IEEE Trans. Pattern Anal. Machine Intell. 19 (2) (1997) 153-158.
[22] W.K. Pratt, The PIKS Foundation C Programmers Guide, Manning, Greenwich, 1995.
[23] K. Karhunen, I. Selin (trans.), On linear methods in probability theory. Translation of [10], T-131, The Rand Corporation, August 1960.
[24] E.G. Schukat-Talamazzini, Automatische Spracherkennung, Vieweg, Wiesbaden, 1995.
[25] H. Murase, M. Lindenbaum, Spatial temporal adaptive method for partial eigenstructure decomposition of large images, IEEE Trans. Image Process. 4 (5) (1995) 620-629.
[26] H. Murakami, V. Kumar, Efficient calculation of primary images from a set of images, IEEE Trans. Pattern Anal. Machine Intell. 4 (5) (1982) 511-515.
About the Author: JOACHIM HORNEGGER graduated (1992) and received his Ph.D. degree in Computer Science (1996) at the Universität Erlangen-Nürnberg, Germany, for his work on statistical object recognition. Joachim Hornegger was a research and teaching associate at the Universität Erlangen-Nürnberg, a visiting scientist at the Technion, Israel, and at the Massachusetts Institute of Technology, USA, and a visiting scholar at Stanford University, USA. His major research interests are 3D computer vision, 3D object recognition and statistical methods applied to image analysis problems. Joachim has taught computer vision and pattern recognition at the Universität Erlangen-Nürnberg, Germany, at the University of Seville, Spain, and at Stanford University, USA. He is the coauthor of a monograph on pattern recognition and image processing in C++. Currently Joachim is with Siemens Medical Systems, working on 3-D reconstruction, and is a lecturer at the Universität Erlangen-Nürnberg, Germany.

About the Author: HEINRICH NIEMANN obtained the degree of Dipl.-Ing. in Electrical Engineering and Dr.-Ing. at Technical University Hannover in 1966 and 1969, respectively. During 1966/67 he was a graduate student at the University of Illinois, Urbana. From 1967 to 1972 he was with the Fraunhofer Institut für Informationsverarbeitung in Technik und Biologie, Karlsruhe, working in the field of pattern recognition and biological cybernetics. During 1973-1975 he was teaching at Fachhochschule Giessen in the Department of Electrical Engineering. Since 1975 he has been Professor of Computer Science at the University of Erlangen-Nürnberg, and since 1988 he has also been head of the research group 'Knowledge Processing' at the Bavarian Research Institute for Knowledge Based Systems (FORWISS), where he also served on the board of directors for six years. During 1979-1981 he was dean of the Engineering Faculty of the University; in 1982 he was program chairman of the 6th International Conference on Pattern Recognition in München, Germany; in 1987 he was director of the NATO Advanced Study Institute on 'Recent Advances in Speech Understanding and Dialog Systems'; in 1992 he was program chairman of the conference track on 'Computer Vision and Applications' at the 11th International Conference on Pattern Recognition in The Hague, The Netherlands; and he was program co-chairman of the International Conference on Acoustics, Speech, and Signal Processing 1997 in München. His fields of research are speech and image understanding and the application of artificial intelligence techniques in these fields. He is on the editorial boards of Signal Processing, Pattern Recognition Letters, Pattern Recognition and Image Analysis, and the Journal of Computing and Information Technology. He is the author or coauthor of 6 books and about 250 journal and conference contributions, as well as editor or coeditor of 23 proceedings volumes and special issues. He is a member of ESCA, EURASIP, GI, IEEE, and VDE.

About the Author: ROBERT RISACK received his M.Sc. degree in Computer Science (Diplom-Informatiker) from the Universität Erlangen-Nürnberg, Germany. Since May 1997 Robert has been a Ph.D. student at the Fraunhofer Institut für Informations- und Datenverarbeitung, Karlsruhe. He is working on the design and implementation of computer vision systems.
Pattern Recognition 33 (2000) 225-236
Adaptive document image binarization

J. Sauvola*, M. Pietikäinen

Machine Vision and Media Processing Group, Infotech Oulu, University of Oulu, P.O. Box 4500, FIN-90401 Oulu, Finland

Received 29 April 1998; accepted 21 January 1999
Abstract

A new method is presented for adaptive document image binarization, where the page is considered as a collection of subcomponents such as text, background and picture. The problems caused by noise, illumination and many source type-related degradations are addressed. Two new algorithms are applied to determine a local threshold for each pixel. The performance evaluation of the algorithm utilizes test images with ground truth, evaluation metrics for the binarization of textual and synthetic images, and a weight-based ranking procedure for the final result presentation. The proposed algorithms were tested with images including different types of document components and degradations. The results were compared with a number of known techniques in the literature. The benchmarking results show that the method adapts and performs well in each case, both qualitatively and quantitatively. © 1999 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Adaptive binarization; Soft decision; Document segmentation; Document analysis; Document understanding
1. Introduction

Most document analysis algorithms are built on taking advantage of underlying binarized image data [1]. The use of bi-level information decreases the computational load and enables the use of simplified analysis methods, compared to 256 levels of grey-scale or colour image information. Document image understanding methods require that the logical and semantic content be preserved during thresholding. For example, letter connectivity must be maintained for optical character recognition and textual compression [2]. This requirement limits the use of a global threshold in many cases.

Binarization has been a subject of intense research interest during the last ten years. Most of the developed algorithms rely on statistical methods, not considering the special nature of document images. However, recent developments in document types, for example documents with mixed text and graphics, call for more specialized binarization techniques.

In current techniques, the binarization (threshold selection) is usually performed either globally or locally.

* Corresponding author. Tel.: +358-40-5890652.
E-mail address: [email protected] (J. Sauvola)
Some hybrid methods have also been proposed. The global methods use one calculated threshold value to divide image pixels into object or background classes, whereas the local schemes can use many different adapted values, selected according to local area information. Hybrid methods use both global and local information to decide the pixel label.

The main situations in which a single global threshold is not sufficient are caused by changes in illumination, scanning errors and resolution, poor quality of the source document, and complexity in the document structure (e.g. graphics mixed with text). When character recognition is performed, merged pixel clusters (characters) are easily misinterpreted if the binarization labelling has not successfully separated the clusters. Other misinterpretations occur easily if intended clusters are wrongly divided.

Fig. 1 depicts our taxonomy (called MSLG) and the general division of thresholding techniques according to the level of semantics and the locality of processing used. The MSLG can be applied in pairs, for example (ML), (SL), (MG) and (SG). The most conventional approach is a global threshold, where one threshold value (single threshold) is selected for the entire image according to global/local information.
Fig. 1. Taxonomy of thresholding schemes.
Fig. 2. Examples of document analysis problem types in binarization.
In local thresholding, the threshold values are determined locally, e.g. pixel by pixel or region by region. A specified region can then have a 'single threshold' that changes from region to region according to the threshold candidates selected for the given area. Multi-thresholding is a scheme where image semantics are evaluated; each pixel can then have more than one threshold value, depending on the connectivity or other semantic dependency related to the physical, logical or pictorial contents.

Many binarization techniques used in processing tasks are aimed at simplifying and unifying the image data at hand. The simplification is performed to benefit the subsequent processing characteristics, such as computational load, algorithm complexity and real-time requirements in industrial-like environments. One of the key reasons why the binarization step fails to provide high-quality data for subsequent processing is the different types and degrees of degradation introduced to the source image. The reasons for the degradation may vary from a poor source type and the image acquisition process to an environment that directly harms the image quality. Since degradation is unquestionably one of the main reasons for processing to fail, it is very important to design the binarization technique to detect and filter possible imperfections, preventing them from becoming the subject of processing and a potential cause of errors in post-processing steps. Most degradation types in document images affect both the physical and the semantic understandability in document analysis tasks, such as page segmentation, classification and optical character recognition. Therefore, the result after all the desired processing steps can be entirely unacceptable, just because of a poorly performed binarization.

Fig. 2 depicts two types of typical degradation when dealing with scanned grey-scale document images. In Fig. 2a the threshold 'base line' is changing due to an illumination effect or an implanted (designed) entity. Each object then has a different base level that affects the object/non-object separation decision when selecting threshold(s). In Fig. 2b a general type of 'stain problem' is presented. In this case, the background and object levels fluctuate from clear separation to non-clear separation with a small level difference between object and non-object. The optimal threshold lines are drawn in both images to depict the base line that a successful binarization algorithm should mimic. Fig. 3 presents another type of problem, frequently occurring in scanned document images: more than two different levels are visible in textual areas due to the transparency of the next page. A binarization algorithm should then cope with at least two different threshold candidates: background-transparent text and background-text. The binarized example presents a correct binarization result.

Fig. 3. Example of good binarization on a degraded sample image.

1.1. Survey of document image binarization techniques
The research on binarization techniques originates from the traditional 'scene image' processing needs to optimize the image processing tasks in terms of the image data at hand. While the image types have become more complex, the algorithms developed have gained wider theoretical grounds. The current trend seems to move towards image domain understanding-based binarization and the control of different source image types and qualities. The state-of-the-art techniques are able to adapt to some degree of errors in a defined category, and focus on a few image types. In images needing multi-thresholding, the problem seems ever harder to solve, since the complexity of image contents, including textual documents, has increased rapidly.

Some document-directed binarization algorithms have been developed. O'Gorman [3] proposes a global approach calculated from a measure of local connectivity information. The thresholds are found at the intensity levels aiming to preserve the connectivity of regions. Liu et al. [4] propose a method for document image binarization focused on noisy and complex background problems. They use grey-scale and run-length histogram analysis in a method called 'object attribute thresholding'. It identifies a set of global thresholds using global techniques, which is used for final threshold selection utilizing local features. Yang et al.'s [5] thresholding algorithm uses a statistical measurement, called the 'largest static state difference'. The method aims to track changes in the statistical signal pattern, dividing the level changes into static or transient according to a grey-level variation. The threshold value is calculated from the static and transient properties separately at each pixel. Stroke connectivity preservation issues in textual images are examined by Chang et al. in Ref. [6]. They propose an algorithm that uses two different components: background noise elimination using grey-level histogram equalization, and enhancement of the grey-levels of characters in the neighbourhood using an edge image composition technique. The 'binary partitioning' is made according to smoothed and equalized histogram information calculated in five different steps. Pavlidis [7] presents a technique based on the observation that, after blurring a bi-level image, the intensity of the original pixels is related to the sign of the curvature of the pixels of the blurred image. This property is used to construct the threshold selection from partial histograms in locations where the curvature is significant.

Rosenfeld and Smith [8] presented a global thresholding algorithm that deals with the noise problem using an iterative probabilistic model for separating background and object pixels. A relaxation process is used to reduce errors by first classifying pixels probabilistically and then adjusting their probabilities using the neighbouring pixels. This process is iterated, leading to a threshold selection in which the probabilities of the background and the object pixels are increased and the pixels are assigned accordingly to the non-object and object classes. The thresholding algorithm by Perez and Gonzalez [9] was designed to manage situations where imperfect illumination occurs in an image. The bimodal reflectance distribution is utilized to present the grey-scale with two components: reflectance r and illumination i, as also used in homomorphic filtering. The algorithm is based on a Taylor series expansion model and uses no a priori knowledge of the image. The illumination is assumed to be relatively smooth, whereas the reflectance component is used to track down changes. The threshold value is chosen by a probabilistic criterion from a two-dimensional threshold selection function, which can be calculated in raster-scan fashion. The illumination problem is also emphasized in the thresholding algorithm called 'edge level thresholding', presented by Parker et al. in Ref. [10]. Their approach uses the principle that objects provide high spatial frequencies, while illumination consists mainly of low spatial frequencies. The algorithm first identifies objects using the Shen-Castan edge detector. The grey-levels are then examined in small windows to find the highest and lowest values that indicate object and background. The average of these values is used to determine the threshold. The selected values are then fitted as a surface to all pixels; values above the surface are judged to be part of an object, and values below the surface belong to the background. Shapiro et al. [11] introduce a global thresholding scheme which stresses independence of the object/background area ratio, the intensity transition slope, the object/background shape, and noise. The threshold selection is done by choosing a value that maximizes the global non-homogeneity. This is obtained as an integral of weighted local deviations, where the weight function assigns a higher standard-deviation weight to background/object transitions than to homogeneous areas. Pikaz and Averbuch [12] propose an algorithm to perform thresholding for scenes containing distinct
objects. A sequence of graphs is constructed using the size of connected objects in pixels as a classifier. The threshold selection is obtained by calculating stable states on the graph. The algorithm can be adapted to select multi-level thresholds by selecting the highest stable-state candidate in each level. Henstock and Chelberg [13] propose a statistical model-based threshold selection. A weighted sum of two gamma densities, used instead of normal distributions to decrease the computational load, is fitted to the sum of the edge and non-edge density functions using a five-parameter model. The parameters are estimated using an expectation-maximization-style two-step algorithm. The fitted weighted densities separate the edge pixels from the non-edge pixels of intensity images. An enhanced-speed entropic threshold selection algorithm is proposed in Ref. [14] by Chen et al. They reduce the image grey-scale levels by quantization and produce a global threshold candidate vector from the quantized image. The final threshold selection is estimated only from the reduced image using the candidate vector. The computational complexity is reduced to the order of O(G^{8/3}), where G is the number of grey-scale values, using O-notation. The quality of the binarization is sufficient for preliminary image segmentation purposes.

Yanowitz and Bruckstein [15] proposed an image segmentation algorithm based on adaptive binarization, where different image quality problems were taken into consideration. Their algorithm aims to separate objects under unevenly illuminated or degraded conditions. The technique uses varying thresholds, whose values are determined by edge analysis combined with grey-level information and the construction of an interpolated threshold surface. The image is then segmented using the obtained threshold surface, identifying the objects by post-validation. The authors indicated that the validation can be performed with most of the segmentation methods.

1.2. Our approach

For document image binarization, we propose a new method that first performs a rapid classification of the local contents of a page into background, pictures and text. Two different approaches are then applied to define a threshold for each pixel: a soft decision method (SDM) for background and pictures, and a specialized text binarization method (TBM) for textual and line-drawing areas. The SDM includes noise filtering and signal tracking capabilities, while the TBM is used to separate text components from the background in bad conditions, caused by uneven illumination or noise. Finally, the outcomes of these algorithms are combined.

Utilizing proper ways to benchmark the algorithm results against ground truth and other measures is important for guiding the algorithm selection process and the directions that future research should take. A well-defined performance evaluation shows which capabilities of the algorithm still need refinement and which capabilities are sufficient for a given situation. The result of benchmarking offers information on the suitability of the technique for certain image domains and qualities. However, it is not easy to see the algorithm quality directly from a set of performance values. In this paper we use a goal-directed evaluation process with specially developed document image binarization metrics and measures for comparing the results against a number of well-known and well-performing techniques in the literature [16].
2. Overview of the binarization technique

Our binarization technique is aimed to be used as a first stage in various document analysis, processing and retrieval tasks. Therefore, special document characteristics, like textual properties, graphics, line-drawings and complex mixtures of their layout semantics, should be included in the requirements. On the other hand, the technique should be simple while taking all the document analysis demands into consideration.

Fig. 4 presents the general approach of the binarization processing flow. Since typical document segmentation and labelling for content analysis is out of the question in this phase, we use a rapid hybrid switch that dispatches the small, resolution-adapted windows to the textual (1) and non-textual (2) threshold evaluation techniques. The switch was developed to cover the most generic appearances of typical document layout types and can easily be modified for others as well. The threshold evaluation techniques are adapted to textual and non-textual area properties, with special tolerance of, and detection for, the different basic defect types that are usually introduced to images.
Fig. 4. Overview of the binarization algorithm.
Fig. 5. Interpolation options for binarization computation.
The outcome of these techniques is a threshold value proposed for each pixel, or for every nth pixel, as decided by the user. These values are used by a threshold control module to collect the final outcome of the binarization. The technique also enables the utilization of multiple thresholds region by region or globally, if desired.
3. Adaptive binarization

The document image contains different surface (texture) types that can be divided into uniform, differentiating and transiently changing. The texture contained in pictures and background can usually be classified into the uniform or differentiating categories, while text, line drawings, etc. have more transient properties by nature. Our approach is to analyse the local document image surface in order to decide on the binarization method needed (Fig. 4). During this decision, a 'hybrid switching' module selects one of two specialized binarization algorithms to be applied to the region. The goal of the binarization algorithms is to produce an optimal threshold value for each pixel. A fast option is to first compute a threshold for every nth pixel and then use interpolation for the rest of the pixels (Fig. 5).

The binarization method can also be set to bypass the hybrid switch phase. The user can then choose which algorithm is selected for thresholding. All other modules function in the same way as under hybrid conditions. The following subsection describes the region type and switching algorithms. The two different binarization algorithms are then discussed in detail. The final binarization is performed using the proposed threshold values; this process is depicted in the last subsection.

3.1. Region analysis and switching

Threshold computation is preceded by the selection of the proper binarization method, based on an analysis of local image properties. First, the document image is tiled into equal-sized rectangular windows of 10-20 pixels wide, corresponding to a resolution that varies linearly between >75 and <300 dpi. Two simple features are then computed for each window; these results are used to select the method. The first feature is simply the average grey value of a window. The second feature, the 'transient difference', measures local changes in contrast (Eq. (4)). The difference values are accumulated in each subwindow and then scaled between 0 and 1. Using limits of 10, 15 and 30% of the scaled values, the transient difference property is defined as 'uniform', 'near-uniform', 'differing' or 'transient'. This coarse division is made according to the average homogeneity of the surface. According to these labels, a vote is given to the corresponding binarization method to be used in a window. The labels 'uniform' and 'near-uniform' correspond to background and 'scene' pictures, and give votes to the SDM. The labels 'differing' and 'transient' give their votes to the TBM. The selection of a binarization algorithm is then performed as the following example rules (1, 2) show:

1. If the average is high, and a global histogram peak is in the same quarter of the histogram, and the transient difference is transient, then use the SDM.
2. If the average is medium, and a global histogram peak is not in the same quarter of the histogram, and the transient difference is uniform, then use the TBM.

An example result of image partitioning is shown in Fig. 6. The white regions are guided to the SDM algorithm, while the grey regions are binarized with the TBM algorithm.

Fig. 6. Example of region partitioning for algorithm (SDM/TBM) selection.
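The switching step can be sketched in code as follows. This is our own simplified illustration: it computes the window mean and the transient difference of Eq. (4) (defined in Section 3.2.2) and casts the vote by the coarse labels alone; the histogram-peak conditions of the example rules (1) and (2) are omitted, and applying the 10/15/30% limits directly to the scaled value is our reading of the text.

```python
import numpy as np

def window_features(P, L=256):
    """Average grey value and transient difference TD of window P (Eq. (4))."""
    n = P.shape[0]
    P = P.astype(np.float64)
    # accumulate |2*P(i,j) - P(i-1,j) - P(i,j-1)| over the window interior
    acc = np.abs(2.0 * P[1:, 1:] - P[:-1, 1:] - P[1:, :-1]).sum()
    return P.mean(), acc / (L * n) ** 2

def transient_label(td):
    """Coarse label from the 10/15/30% limits of Section 3.1."""
    if td < 0.10:
        return 'uniform'
    if td < 0.15:
        return 'near-uniform'
    if td < 0.30:
        return 'differing'
    return 'transient'

def choose_method(P):
    """'uniform'/'near-uniform' windows vote for the SDM, the rest for the TBM."""
    _, td = window_features(P)
    return 'SDM' if transient_label(td) in ('uniform', 'near-uniform') else 'TBM'
```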
3.2. Binarization of non-textual components

As in soft control applications, our algorithm first analyses the window surface by calculating descriptive characteristics. Then, the soft control algorithm is applied to every nth pixel (Fig. 5). The result is a local threshold based on local region characteristics. To ensure the local adaptivity of threshold selection, we use two different types of locally calculated features: the 'weighted bound' and the 'transient difference'. The membership function issues, the soft decision rules and the defuzzification algorithm are presented in the following paragraphs.

3.2.1. Weighted bound calculation

Histogram-based analysis schemes and features are often used in binarization methods. In document analysis the histogram is very useful for detecting and differentiating domains in physical and logical analysis. We use a new approach developed for the local detection and weighting of bounds in grey-scale image texture. A new feature called the weighted bound (W_b) is introduced and utilized in the soft control algorithm. The W_b is used for the characterization of local pixel value profiles by tracking low, medium and high pixels in a small area. In a given surface area of n x n pixels, where n is the window width obtained from the non-overlapping region analysis tile size (see Section 3.1), three different measures are calculated. The values are collected in a two-dimensional table used to weight and simplify the three envelope curves in the soft control membership functions. The measures are the minimum, medium and maximum averages given in Eqs. (1)-(3):

A_{min} = \frac{1}{100/n} \sum_{k=0}^{100/n} \min(P(i, j)),   (1)

where P(i, j) is the document image region, i is the width and j is the height; n is the static number obtained from the average window size (see Section 3.1).

A_{med} = \frac{1}{100/n} \sum_{k=0}^{100/n} med(P(i, j)),   (2)

A_{max} = \frac{1}{100/n} \sum_{k=0}^{100/n} \max(P(i, j)).   (3)

These values are stored in an n x n x 3 table, called the weighted average table (WAT). Using Eqs. (1)-(3), three different histograms are formed, where the values are added to their respective bin values (value = bin index). These histograms are then separately partitioned into ten horizontal and three vertical sections, where the number of peaks from the histograms is calculated for each section according to the sectioning limits. The horizontal borders are set between bins 0 and 255 with the formula int((256/10)*m), where m = 1, 2, ..., 9. The number of borders was set to ten; a smaller number could also be selected, but the penalty is that the original histogram is aliased more. Ten borders equal 25 bins of grey-scale each. The two vertical borders are set between 0 and the maximum, representing the number of votes calculated for each horizontal bin, so that the limits are set to 80% and to 40% of the maximum number of votes, respectively. These limits were set according to tests performed with a large set of images. The higher limit is relatively insensitive to a ±10% change. Lowering the lower limit brings more votes to the medium peak calculation, thus enhancing the envelope curve in bins where a medium peak appears. After the peaks are calculated in a 3 x 10 table, the weighting is performed (Fig. 7). The result is a W_b envelope curve that is used in the soft decision process.

Fig. 7. An example of W_b membership function calculation using the A_{min} histogram.
The three W_b curves, calculated from A_{min}, A_{med} and A_{max}, are used as membership functions.

3.2.2. Transient difference calculation

The transient difference is aimed at extracting the average amount of variation occurring between neighbouring pixels (contrast difference) in an n x n area, i.e. at following local surface changes. The differences between adjacent pixels are accumulated: the transient difference (TD) of the horizontally and vertically adjacent pixel values is calculated and summed. The obtained value is then scaled between 0 and 1 (Eq. (4)), where L represents the number of grey-levels in the image:

TD = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} |2 P(i, j) - [P(i-1, j) + P(i, j-1)]|}{(Ln)^2}.   (4)

The TD value is used in the soft decision making to expose uniform, differential and transient area types when calculating the control value for threshold selection.

3.2.3. Membership function generation

Two different membership functions are used according to the extracted feature values for a given pixel: the weighted bound (W_b) and the transient difference (TD_m). The first one is calculated dynamically from the image; the transient difference uses predefined membership functions. Fig. 8 depicts these functions, using the ideal functions for W_b and the actual membership functions for TD_m.
3.2.4. Soft decision rules and defuzzification

In the soft decision process, we use nine different rules derived from the feature analysis and membership management. For W_b the terms are (LOW, MIDDLE, HIGH), denoting the local histogram properties. For TD_m we use (UNIFORM, DIFFERING, TRANSIENT), describing the local region property. The rule set is shown in Fig. 9. As in soft control problems, the rules are expressed with clauses, for example:

If W_b is <P(i, j)> and TD_m is <TD(i, j)> then T_c(i, j) = <0, 255>.

The current rule set is designed for pictorial and background-type image regions. Using this set, noise and most illumination defects can be adaptively corrected in the processed areas. For defuzzification we use Mamdani's method [17]. The result of the defuzzification is a unique threshold value for each processed pixel.
3.3. Binarization of textual components

For text binarization we use a modified version of Niblack's algorithm [18]. The idea of Niblack's method is to vary the threshold over the image based on the local mean, m, and the local standard deviation, s, computed in a small neighbourhood of each pixel. A threshold for each pixel is computed from T = m + k*s, where k is a user-defined parameter that takes negative values.
Fig. 8. Input and output membership functions: W_b (ideal), TD_m and T_c.
does not work well for cases in which the background contains light texture, as the grey values of these unwanted details easily exceed the threshold values. This results in costly postprocessing, as demonstrated in Ref. [19]. In our modification, the threshold is computed using the dynamic range of the standard deviation, R. Furthermore, the local mean is used to multiply the terms R and a fixed value k. This has the effect of amplifying the contribution of the standard deviation in an adaptive manner. Consider, for example, dark text on a light, dirty-looking background (e.g., stains in a bad copy), Fig. 2. The m-coefficient decreases the threshold value in background areas, which efficiently removes the effect of stains in the thresholded image. In our experiments, we used R = 128 with 8-bit grey-level images and k = 0.5 to obtain good results; the algorithm is not overly sensitive to the value of the parameter k. Eq. (5) presents the textual binarization formula:

T(x, y) = m(x, y) · [1 + k · (s(x, y)/R − 1)],   (5)

where m(x, y) and s(x, y) are as in Niblack's formula, R is the dynamic range of the standard deviation, and the parameter k takes positive values. Fig. 10 shows an example threshold line adapted to an original degraded document image.

Fig. 9. Example of soft decision rules for threshold candidate T_c(i, j).

3.4. Interpolative threshold selection
After thresholding guided by the surface type, the final thresholds are calculated for background, textual, graphics and line-drawing regions. A fast option is to compute a threshold first for every nth pixel and then use interpolation for the remaining pixels. The control algorithm has two modes, depending on the value of n. If n = 1, the threshold values obtained from the SDM and TBM algorithms are combined directly. If n > 1, threshold values for non-base pixels are calculated from the surrounding threshold values. We have two options for calculating the non-base pixel thresholds: bilinear interpolation and simple averaging. In the interpolation method, the threshold value for a non-base pixel is obtained by computing the distances from the surrounding base pixels to the current one and using these values as weights, Fig. 11a. This approach gives a more precise, weighted threshold value for each pixel. In the simple averaging method, the average of the four surrounding base-pixel threshold candidates is used as the final threshold for each non-base pixel between the selected base pixels, Fig. 11b. This approach lowers the computational load and is suitable for most images, especially those with random noise and n larger than five pixels.
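The modified Niblack formula of Eq. (5) is straightforward to prototype. The following minimal NumPy sketch computes a per-pixel threshold; the function names, the window size w, and the brute-force local-statistics loop are our illustrative choices, not part of the paper (a production version would use integral images, and the fast option above would evaluate the threshold only at every nth base pixel and interpolate in between).

import numpy as np

def sauvola_threshold(img, w=15, k=0.5, R=128.0):
    """Eq. (5): T = m * (1 + k * (s / R - 1)), with local mean m and
    standard deviation s computed over a w x w neighbourhood."""
    img = img.astype(np.float64)
    pad = w // 2
    padded = np.pad(img, pad, mode='reflect')
    T = np.empty_like(img)
    for y in range(img.shape[0]):
        for x in range(img.shape[1]):
            win = padded[y:y + w, x:x + w]
            m, s = win.mean(), win.std()
            T[y, x] = m * (1.0 + k * (s / R - 1.0))
    return T

def binarize(img, T):
    # Object (text) pixels are those darker than the local threshold.
    return np.where(img <= T, 0, 255).astype(np.uint8)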
4. Experiments The proposed binarization algorithm was tested with the benchmarking technique and various scenarios
Fig. 10. Example of threshold candidate selection for a sample scanline.
Fig. 11. Two interpolation choices for threshold selection of non-base pixels.
J. Sauvola, M. Pietika( inen / Pattern Recognition 33 (2000) 225}236
233
Fig. 12. Visual and numeric results on the comparison algorithms applied to illuminated, textual images.
against several known binarization techniques from the literature [18,20–22]. Using the environment factors (such as different degradations) and the available document and test image databases, the algorithms were evaluated and benchmarked against each other and against ground-truth knowledge by visual and benchmark event evaluation processes. The focus was set on documents with textual content and on multi-content documents, i.e. documents containing text, graphics, line drawings and halftone. The test images were selected from a special database of document image categories comprising over 1000 categorized document images (e.g. article, letter, memo, fax, journal, scientific, map, advertisement, etc.) [23]. The numerical tests and results presented were obtained using binarization metrics emphasizing the performance in textual image region binarization. Fig. 12 presents an example benchmarking scene performed on a database of 15 textual document images with illumination defects: visual results for a sample input image having a 20% centred illumination defect, an example of a ground-truth image map, and the results of the proposed and comparison binarization algorithms. The results show good behaviour of Sauvola's, Niblack's and Eikvil's algorithms when the limit is set to 80% performance, i.e. the limit where the OCR performance drop is less than 10% using the Caere Omnipage OCR package [24]. Bernsen's algorithm suffered from noise introduced into the binarized result image, while Eikvil's threshold ruled some of the darkest areas to belong to object pixels. Parker's algorithm adapted poorly to even small changes in illumination, but gave sufficient results on relatively 'clean' grey-scale document images. The visual tests performed on a synthetic test image database were based on ranking according to different
objectives set for these types of images. The purpose of the synthetic image database is to allow visual analysis of the nature and behaviour of the benchmarking technique in different kinds of situations, e.g. edge preservation, object uniformity preservation, and changing or varying background. This is aimed at aiding the selection of suitable algorithms for differing environmental conditions in terms of adaptability to changes, shape management, object preservation, homogeneity of region preservation, and so on. An example of the visual results on synthetic images is shown in Fig. 13, which presents the results of our algorithm and the comparison algorithms applied to synthetic grey-scale images having different kinds of backgrounds, objects, lines, directions and shapes complying with certain simple test setting rules. As the input grey-scale images were synthetically generated, a set of ground-truth images was generated focusing on different areas of interest in measuring algorithm performance and behaviour. Therefore, the benchmark results depend on the selection of the ground-truth set used, i.e. on the performance group targeted by the algorithm behaviour. For example, the ground-truth criteria of object uniformity and edge preservation were tested using the ground-truth image in Fig. 13a. Object edge and background/object uniformity were used as weight criteria, with the Euclidean distance used as the distance measure between the result and ground-truth pixel maps. Fig. 13b shows a situation where the synthetic image has a background gliding uniformly from white to black, and thin lines whose grey-scale values glide in the opposite direction to the background. The test evaluation criterion was set on differentiating the lines from the background and on the uniformity of the background. Since the results are highly dependent on the target aims of the binarization,
Fig. 13. Results on the comparison algorithms applied to the synthetic graphical images.
Fig. 14. Overall benchmarked binarization and example profile results on the 'text only' document database.
the results are also presented visually. Using the criteria of uniformity and object shape preservation, the proposed algorithm behaves robustly compared to the other techniques. Since most of the pixels in synthetic images are judged by the soft control method, the threshold between object and non-object candidates appears very clear. Fig. 14 shows benchmarking results for the textual image database with small amounts of clean and mixed illumination and noise types. An example performance profile for the noise degradation component is shown for all the comparison algorithms. The degree of noise degradation represents the percentage of Gaussian and random noise introduced into the textual image; performance is measured using combined pixel-map and OCR metrics with equal weight factors. The performance of the proposed and comparison algorithms, excluding Parker's, is sufficient up to 20% noise penetration. The performance profile clearly shows that the performance of the comparison algorithms drops between 20 and 30% penetration, while the proposed algorithm tolerated severe noise, up to 45%, with an 80% threshold limit for an acceptable value. Fig. 15 shows the overall results of the proposed and comparison algorithms for various document categories on a large database of document images. The test images comprise simple textual documents with
and without various degradation types and degrees, and documents with mixed textual and graphical properties, where the benefits of the hybrid approach of the proposed algorithm can be clearly seen. The methods of Eikvil and Niblack performed best against the proposed algorithm, but they still suffered from poor adaptation to various degradation types; for example, small font sizes used in the textual parts caused characters to merge. The Bernsen algorithm shows good results on clean documents and tolerated a small amount of a single defect type; when the degradation was higher, its performance decreased rapidly in both visual and numerical evaluation. Parker's algorithm shows sufficient results on clean document images, but the result quality dropped with even a small amount of any defect type. Algorithm execution times were not measured in this comparison, where only the quality of the result was benchmarked against the metrics in a weighted (textual, graphics, character) process. The computing times of all the evaluated algorithms were tolerable, for example for use as a preprocessing step in optical character recognition engines. One question in performing the benchmarking is the arrangement of parametrization. The proposed algorithm had no parameters to set during testing, while Niblack had one, Bernsen two, Eikvil
Fig. 15. Overall benchmarked binarization results on textual document database.
used "rst Otsu's technique with one parameter and their postprocessing with one parameter, Parker's algorithm had four parameters to set. Each algorithm with parameters that needed manual tuning was computed with di!erent parameters, whose result were evaluated and the best was selected to "nal comparison presented in this paper. When the higher adaptation is required from the algorithm, the number of manually tunable parameters should not exceed two, otherwise the amount of manual work increases too much and cause instability where automated preprocessing is required. The overall results show good evaluated performance to the proposed, Niblack's and Eikvil's algorithms. The difference if these approaches lies in overall adaptability, the need for manual tunability, target document category domain and environment, where the algorithm is utilized, and "nally the threshold performance set for the binarization process. In the latter case the proposed and Niblack's algorithms performance and adaptivity was highest in all test categories in graphical and textual cases.
5. Conclusions
Document image binarization is an important basic task needed in most document analysis systems. The quality of the binarization result affects subsequent processing by offering pre-segmented objects in precise form (object/non-object). In this paper we proposed a new technique for document image binarization, using a hybrid approach and taking document region class properties into consideration. Our technique is aimed at generic document types and copes with severe cases of different types of degradation. Quality validation (i.e. benchmarking against other algorithms and ground truth) is an important part of the algorithm development process. The proposed algorithm underwent large tests using test image databases comprising textual, pictorial and synthetically generated document images with
ground-truths and degradations. The results show especially good adaptation to different defect types such as illumination, noise and resolution changes. The algorithm showed robust behaviour in most situations, even with severe degradation, and performed well against the comparison techniques.
6. Summary
This paper presents a new algorithm for document image binarization, using an adaptive approach to manage different situations in an image. The proposed technique uses rapid image surface analysis for algorithm selection and adaptation according to document contents. The contents are used to select the algorithm type and the need for parametrization, if any, and to compute and propose the threshold value for each or every nth pixel (interpolative approach). The document content is used to guide the binarization process: pictorial content is subjected to a different type of analysis than textual content. Degradations, such as illumination and noise, are managed within each algorithm structure to effectively filter out the imperfections. The results of the thresholding processes are combined into a binarized image using either a fast option, i.e. computing binarization for every nth pixel and interpolating the threshold value for the in-between pixels, or a pixel-by-pixel option that computes a threshold value for each pixel separately. The tests were run on a large database of document images comprising 15 different document types and a number of representative images of each type. Each image was processed in the presence of various amounts of different degradations to evaluate the efficiency of the proposed algorithm. The results were compared with those obtained with some of the best-known algorithms in the literature. The proposed algorithm clearly outperformed its competitors and behaved robustly in difficult degradation cases with different document types.
Acknowledgements
The support of the Academy of Finland and the Technology Development Centre is gratefully acknowledged. We also thank Dr. Tapio Seppänen and Mr. Sami Nieminen for their contributions.
References
[1] J. Sauvola, M. Pietikäinen, Page segmentation and classification using fast feature extraction and connectivity analysis, International Conference on Document Analysis and Recognition, ICDAR '95, Montreal, Canada, 1995, pp. 1127–1131.
[2] H. Baird, Document image defect models, Proceedings of the IAPR Workshop on Syntactic and Structural Pattern Recognition, 1990, pp. 38–46.
[3] L. O'Gorman, Binarization and multithresholding of document images using connectivity, CVGIP: Graph. Models Image Processing 56 (6) (1994) 496–506.
[4] Y. Liu, R. Fenrich, S.N. Srihari, An object attribute thresholding algorithm for document image binarization, International Conference on Document Analysis and Recognition, ICDAR '93, Japan, 1993, pp. 278–281.
[5] J. Yang, Y. Chen, W. Hsu, Adaptive thresholding algorithm and its hardware implementation, Pattern Recognition Lett. 15 (2) (1994) 141–150.
[6] M. Chang, S. Kang, W. Rho, H. Kim, D. Kim, Improved binarization algorithm for document image by histogram and edge detection, International Conference on Document Analysis and Recognition, ICDAR '95, Montreal, Canada, 1995, pp. 636–643.
[7] T. Pavlidis, Threshold selection using second derivatives of the gray scale image, International Conference on Document Analysis and Recognition, ICDAR '93, Japan, 1993, pp. 274–277.
[8] A. Rosenfeld, R.C. Smith, Thresholding using relaxation, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-3 (5) (1981) 598–606.
[9] A. Perez, R.C. Gonzalez, An iterative thresholding algorithm for image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. PAMI-9 (6) (1987) 742–751.
[10] J.R. Parker, C. Jennings, A.G. Salkauskas, Thresholding using an illumination model, ICDAR '93, Japan, 1993, pp. 270–273.
[11] V.A. Shapiro, P.K. Veleva, V.S. Sgurev, An adaptive method for image thresholding, Proceedings of the 11th ICPR, 1992, pp. 696–699.
[12] A. Pikaz, A. Averbuch, Digital image thresholding, based on topological stable-state, Pattern Recognition 29 (5) (1996) 829–843.
[13] P.V. Henstock, D.M. Chelberg, Automatic gradient threshold determination for edge detection, IEEE Trans. Image Processing 5 (5) (1996) 784–787.
[14] W. Chen, C. Wen, C. Yang, A fast two-dimensional entropic thresholding algorithm, Pattern Recognition 27 (7) (1994) 885–893.
[15] S.D. Yanowitz, A.M. Bruckstein, A new method for image segmentation, CVGIP 46 (1989) 82–95.
[16] S. Nieminen, J. Sauvola, T. Seppänen, M. Pietikäinen, A benchmarking system for document analysis algorithms, Proc. SPIE 3305 Document Recognition V 3305 (1998) 100–111.
[17] S.T. Welstead, Neural Network and Fuzzy Logic Applications in C/C++, Wiley, New York, 1994, p. 494.
[18] W. Niblack, An Introduction to Image Processing, Prentice-Hall, Englewood Cliffs, NJ, 1986, pp. 115–116.
[19] O.D. Trier, A.K. Jain, Goal-directed evaluation of binarization methods, IEEE Trans. Pattern Anal. Mach. Intell. 17 (12) (1995) 1191–1201.
[20] L. Eikvil, T. Taxt, K. Moen, A fast adaptive method for binarization of document images, International Conference on Document Analysis and Recognition, ICDAR '91, France, 1991, pp. 435–443.
[21] J. Bernsen, Dynamic thresholding of grey-level images, Proceedings of the Eighth ICPR, 1986, pp. 1251–1255.
[22] J. Parker, Gray level thresholding on badly illuminated images, IEEE Trans. Pattern Anal. Mach. Intell. 13 (8) (1991) 813–819.
[23] J. Sauvola, S. Haapakoski, H. Kauniskangas, T. Seppänen, M. Pietikäinen, D. Doermann, A distributed management system for testing document image analysis algorithms, 4th ICDAR, Germany, 1997, pp. 989–995.
[24] Caere Omnipage OCR, Users Manual, Caere Corp., 1997.
About the Author. JAAKKO SAUVOLA is a Professor and Director of the Media Team research group in the University of Oulu, Finland, and a member of the affiliated faculty at the LAMP Laboratory, Center for Automation Research, University of Maryland, USA. Dr. Sauvola is also a Research Manager in Nokia Telecommunications, where his responsibilities cover value-adding telephony services. Dr. Sauvola is a member of several scientific committees and programs. His research interests include computer-telephony integration, media analysis, mobile multimedia, media telephony and content-based retrieval systems.
About the Author. MATTI PIETIKÄINEN received his Doctor of Technology degree in Electrical Engineering from the University of Oulu, Finland, in 1982. From 1980 to 1981 and from 1984 to 1985 he was a visiting researcher in the Computer Vision Laboratory of the University of Maryland, USA. Currently, he is a Professor of Information Technology, Scientific Director of Infotech Oulu research center, and Director of the Machine Vision and Media Processing Group at the University of Oulu. His research interests cover various aspects of image analysis and machine vision, including texture analysis, color machine vision and document analysis. His research has been widely published in journals, books and conferences. He was the editor (with L.F. Pau) of the book "Machine Vision for Advanced Production", published by World Scientific in 1996. Prof. Pietikäinen is one of the founding Fellows of the International Association for Pattern Recognition (IAPR) and a Senior Member of IEEE, and serves as a Member of the Governing Board of IAPR. He also serves on the program committees of several international conferences.
Pattern Recognition 33 (2000) 237–249
Adaptive window method with sizing vectors for reliable correlation-based target tracking
Sung-Il Chien*, Si-Hun Sung
School of Electronic and Electrical Engineering, Kyungpook National University, 1370 Sankyuk-dong, Puk-gu, Taegu 702-701, South Korea
Received 5 August 1998
Abstract
We propose an adaptive window method that provides a tracker with a tight reference window by adaptively adjusting its window size independently in all four side directions, enhancing the reliability of correlation-based image tracking in complex cluttered environments. When the size and shape of a moving object change in an image, a correlator often accumulates walk-off error. The success of correlation-based tracking depends largely on choosing a suitable window size and position and thus transferring the proper reference image template to the next frame. We generate sizing vectors from the corners and sides, and then decompose the sizing vector from each corner into the two corresponding sides. Since our tracker is capable of adjusting the reference image size more properly, stable tracking has been achieved, minimizing the influence of complex background and clutter. We tested the performance of our method using 39 artificial image sequences made of 4260 images and 45 real image sequences made of more than 3400 images, and obtained satisfactory results for most of them. © 1999 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.
Keywords: Adaptive window; Sizing vector; Error correction; Target tracking; Correlation; Motion detection; Tracking feature extraction
1. Introduction
Correlation-based tracking [1–5] acquires a reference template, called a reference image, from the previous frame and searches for the most similar area to estimate the target position in the current frame. Although the correlator is said to be robust against cluttered noise, its real application has some limitations, too. Usually, it is desirable that the search area be chosen small because of the large computation involved. Another problem is its tendency to accumulate walk-off error, especially when the object of interest changes in size, shape, or orientation from frame to frame.
* Corresponding author. Tel.: +82-53-950-5545; fax: +82-53-950-5505. E-mail address: [email protected] (S-I. Chien)
If walk-off error accumulates beyond a certain critical point, correlation-based tracking often fails. It is quite important that the size and position of the window be determined precisely to guarantee that a proper reference image can be transferred to the next frame. To increase correlation reliability, the transferred reference image should have a high occupancy rate of the object, which means that the window encloses the object properly. For this, it is quite desirable that the window adjust its size as the circumstances of the object change. The concept of an adaptive window can also be found in stereo matching. In stereo matching, the disparity boundaries become sharp for a smaller window, but the computed disparity usually becomes noisy; a larger window means that the computed disparity becomes clean, but the disparity boundaries can be blurred. Kanade and Okutomi [6] determined the adaptive window size using local intensity and disparity patterns
to minimize uncertainty in the disparity computed at each point. Lotti and Giraudon [7] made four non-centered adaptive windows associated with each image point in the thresholded edge image. In correlation-based tracking, adaptive window studies based on estimating the object size are not much reported in the technical literature. To automatically adapt the reference area, Hughes and Moy [1] designed the edge walking algorithm for boundary detection operating on a segmented binary image. This algorithm scans the binary image in a raster fashion looking for illuminated pixels; once an illuminated pixel has been found, the algorithm searches for other illuminated pixels connected to the initial pixel. They used this boundary detection method to estimate the size of an object and thus to determine the size of a window that would enclose the object. Similarly, Montera et al. [2] determined an object region by expanding from an inner point of the object outwards in the image. To obtain the boundary of an object, they searched for the areas where pixel values pass from above a threshold to below it. However, both methods can be difficult to apply to an object having internal edges in a non-homogeneous cluttered environment. Chodos et al. [3] developed a window algorithm embedded in a tracker, which adjusts the track gate size in four directions using a gradient function formed from the correlation weighting function. However, we expect this method to be unsuitable for a large object moving fast near an observer, since it can adjust only by one pixel in each direction. An adaptive window without a proper sizing mechanism can hardly accommodate itself to environment variations when the window size is much larger or smaller than the object size or when the object size changes abruptly. To adjust the window size more rapidly and efficiently, we propose an adaptive window method that can control the size continuously with four directional sizing vectors in a complex background. Our method introduces eight sizing vectors estimated from eight districts (four side districts and four corner districts), and decomposes each sizing vector from a corner district into two orthogonal vectors to estimate the final sizing vectors in the four side directions.
In the proposed window method, the positive difference of edges (PDOE) image, rather than the conventional edge image, is adopted as the basic correlation feature, since it is found to be quite useful in suppressing background components. A detailed description of the PDOE is beyond the scope of this paper; we briefly introduce the PDOE and the applied correlator in Section 2. In Section 3, we detail the structure and procedure of the proposed window method. In Section 4, we provide experimental results using artificial and real image sequences. Finally, we conclude in Section 5.
2. Applied image tracker architecture
The image tracking block diagram we propose is described in Fig. 1. The overall system is largely divided into the PDOE, the correlator for the main tracking process, the proposed adaptive window block, and the recursive updating loop. First, we acquire the background-reduced image using the PDOE as a tracking feature and then track the object by applying the correlator. Finally, the adaptive window method determines a reference image region tightly enclosing the object to be used in the next frame. For consecutive tracking, the size and position of the reference image are suitably updated.

2.1. Positive difference of edges (PDOE) for feature extraction
Conceptually, the PDOE has spatial and temporal motion components as shown in Fig. 2. We can represent the PDOE image PDOE_n(x, y) in the nth frame as

D_n(x, y) = E_n(x, y) − E_{n−1}(x, y),   (1)

PDOE_n(x, y) = D_n(x, y) if D_n(x, y) > 0, and 0 otherwise,   (2)

where E_n(x, y) is an edge component at a point (x, y) and E_{n−1}(x, y) is obtained from the previous frame. We use the Sobel operator to detect an edge E(x, y).
Fig. 1. Overall tracking block diagram.
Fig. 2. Block diagram of the PDOE.
Conventionally, difference methods in image processing use the absolute value of D(x, y). The PDOE, however, removes the negative components of D(x, y). As shown in Fig. 3, the PDOE extracts a single edge component for the target, while an absolute difference method produces double moving components for a moving target. This is useful for removing background components and detecting motion components more stably.
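A compact sketch of Eqs. (1)-(2) is given below. It assumes SciPy's Sobel filter as the edge operator E(x, y); the function names are ours, not the paper's.

import numpy as np
from scipy.ndimage import sobel

def edge_magnitude(frame):
    # Sobel gradient magnitude, used here as E_n(x, y).
    gx = sobel(frame.astype(np.float64), axis=1)
    gy = sobel(frame.astype(np.float64), axis=0)
    return np.hypot(gx, gy)

def pdoe(curr_frame, prev_frame):
    """Eqs. (1)-(2): keep only the positive part of the difference of
    edge images, suppressing static background edges."""
    d = edge_magnitude(curr_frame) - edge_magnitude(prev_frame)
    return np.maximum(d, 0.0)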
Fig. 4. Correlation layout for f (x, y) and g(x, y) at point (i, j).
2.2. Correlator
Correlation-based tracking searches for the point in the current frame that best matches the reference image acquired from the previous frame. Let g(x, y) be a reference image of size m×n and f(x, y) an image of size M×N to be searched, and assume that m ≤ M and n ≤ N. The elements of the correlation surface R between f(x, y) and g(x, y) in Fig. 4 are given by

R(i, j) = Σ_x Σ_y f(x, y) g(x − i, y − j),   (3)

where i = 0, 1, 2, …, M − 1, j = 0, 1, 2, …, N − 1, and the summation is taken over the image region where f and g overlap. Thus, the matched point (i*, j*) is estimated by

(i*, j*) = arg max_{i, j} R(i, j).   (4)

This point indicates the position where g(x, y) best matches f(x, y).
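For illustration, Eqs. (3)-(4) can be realized with a brute-force search. As a simplifying assumption, the sketch below restricts the search to offsets where the reference lies fully inside the search image, rather than handling partial overlap as the general formula allows.

import numpy as np

def match_position(f, g):
    """Eqs. (3)-(4): slide the reference g over the search image f and
    return the offset (i*, j*) maximising the correlation surface R."""
    M, N = f.shape
    m, n = g.shape
    best, best_ij = -np.inf, (0, 0)
    for i in range(M - m + 1):
        for j in range(N - n + 1):
            r = np.sum(f[i:i + m, j:j + n] * g)
            if r > best:
                best, best_ij = r, (i, j)
    return best_ij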
Fig. 3. Moving component extraction for two artificial noisy images and two real images: (a) previous frame, (b) current frame, (c) absolute difference of edge images, and (d) PDOE image.
Fig. 5. Overall block diagram of the adaptive window method using sizing vectors.
A correlator is regarded as more robust against a complex cluttered environment than a centroid-based tracker, which relies on the centre of geometric gravity of the tracking window. However, walk-off error often occurs when the situation of the target changes rapidly, and a general correlator easily accumulates such error during tracking. Furthermore, it is quite difficult for a human operator to enclose the target tightly when setting the initial reference window for correlation-based tracking. Therefore, to reduce correlator error, adaptively adjusting the size of the reference window is highly desirable.
3. Proposed adaptive window method
Fig. 5 describes an overall block diagram of our proposed adaptive window scheme. The feature used for adjusting the adaptive window is the PDOE image described previously. Our method consists of several main steps: adaptive window setting, sizing vector estimation from corners and sides, window size determination, and reference centre point relocation. Here, the adaptive window can expand or shrink independently in the four side directions, each side having the sizing magnitude and direction derived from the final sizing vector S. In Fig. 5, we present several sizing parameters: S_S is a side sizing vector estimated from a side, S_C is a corner sizing vector from a corner, and S_CH and S_CV are the orthogonally decomposed sizing vectors contributed by corner-to-side conversion.
3.1. Adaptive window setting
For adaptively controlling the reference window and relocating the centre point, we construct three regions: inner, middle, and outer. First, we define the outside boundary of the middle region as the window boundary given by the previous tracking stage. We design the inner region for extracting information within the object boundary and the extended outer region for obtaining useful clues about the background. The information extracted from the middle region is used as a criterion for determining whether pixels in the middle region are part of the object or not. Thus, in order to extract more accurate information near the object boundary, the area of the middle region is fixed to be smaller than those of the other regions. Second, eight overlapped districts, consisting of four side districts and four corner districts, are defined. The side sizing vector S_S acts as the dominant parameter in finally determining the sizing direction of the window. A corner district evaluates the edge distribution to provide a relevant corner sizing vector S_C, which is decomposed into the two corresponding side sizing vectors S_CH and S_CV. For this, the area of a corner district is designed to be similar to that of a side district. Finally, each district is further divided into three non-overlapping subdistricts: the inner (I), middle (M), and outer (O) zones. The statistical information from these zones is used to identify the various situations leading to the determination of a suitable sizing vector, as detailed in the following sections. Fig. 6 describes the layout of the left side district of the four side districts and the layout of the top right corner district of the four corner districts. The remaining side and corner districts are defined similarly.
Fig. 6. The layout of the left side district and that of the top right corner district.
Fig. 7. Eight reference unit vectors for four sides and four corners.
3.2. Sizing vector estimation in each district
3.2.1. Reference direction of window sizing
In order to describe the detailed procedure, we introduce eight reference unit vectors that represent the expanding directions of window movement: u_S^i for the four sides and u_C^i for the four corners, as shown in Fig. 7, defined as

u_S^i = cos φ_S i + sin φ_S j and u_C^i = cos φ_C i + sin φ_C j,   (5)

where

φ_S = (π/2) i and φ_C = (π/2) i + π/4 for i = 0, 1, 2, 3.   (6)
3.2.2. Direction and magnitude of the sizing vector
Here, we estimate a sizing vector by comparing the means evaluated from the three zones in each district. The detailed flowchart for determining a sizing vector in a district is shown in Fig. 8, where μ_I, μ_M, and μ_O are the means of the grey levels in the inner, middle, and outer zones, respectively. We heuristically use three conditions to determine a sizing vector, sketched in code below. Condition I considers only the case where the mean in the inner zone is smaller than that in the outer zone of a district. Usually, in a normal situation, the mean in the inner zone (whose area is also larger than the area of the outer zone) is larger than that in the outer zone, especially when the PDOE, which significantly reduces background components, is used. Condition I alerts the tracker to an abnormal situation, in which undesirable strong clutter or other moving components might exist around the outer zone, and thus turns off the adaptive procedure in this district. Condition II represents the situation of decreasing the window size, in which the absolute mean difference between the inner and middle zones is larger than that between the middle and outer zones, with some marginal value included. Condition III detects the opposite situation. Additionally, to reduce the adjusting sensitivity around the boundary of the middle zone, we set a marginal factor α to 20% of μ_M in Conditions II and III. For the side districts, however, we found that such a marginal factor is not needed. When we determine the magnitudes of the sizing vectors as in Fig. 8, we should consider two issues. One issue is to assign a weight between the side and the corner based on the frequency of occurrence of objects. When the window boundary is located on the border between the object and background components, as often happens in tracking, we found that an object exists more frequently in a side district than in a corner district. Thus, we set the corner weighting factor w_C to 0.7 to balance the contributions of a side sizing vector and its related two corner sizing vectors.
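The zone-mean comparison just described can be sketched as follows. The authoritative decision logic is the flowchart in Fig. 8; the code below is only our reading of Conditions I-III from the prose, with the marginal factor α applied in the corner districts only.

def district_sizing_sign(mu_i, mu_m, mu_o, corner=False):
    """Decision logic of Fig. 8 for one district, returning
    +1 (expand), -1 (shrink) or 0 (no change / abnormal situation)."""
    alpha = 0.2 * mu_m if corner else 0.0  # marginal factor, corners only
    if mu_i < mu_o:                        # Condition I: clutter alarm
        return 0
    if abs(mu_i - mu_m) > abs(mu_m - mu_o) + alpha:   # Condition II
        return -1                          # shrink the window
    if abs(mu_m - mu_o) > abs(mu_i - mu_m) + alpha:   # Condition III
        return +1                          # expand the window
    return 0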
Fig. 8. Flowchart (a) and three conditions (b) for determining a sizing vector in a district.
We also note that the final assignment (as in Fig. 8) results in one of three cases: no change occurs in sizing, that is, S_S = 0 or S_C = 0; the window is recommended to expand, that is, S_S = u_S or S_C = w_C u_C; or the window is to shrink, that is, S_S = −u_S or S_C = −w_C u_C.
Another consideration is the data loss problem due to a temporary sizing fault. If the window fails to enclose the whole object by reducing its size too lavishly, the tracker might lose valuable information, which leads to a quite undesirable situation. On the other hand, increasing the window size is more tolerant of sizing error, since the window still encloses the whole object. We therefore make the sizing rate of increase twice as large as that of decrease and reflect this difference by the sign weighting factor w_S, which is detailed in Section 3.4. This means that our window system is designed to be generous in expanding, but somewhat cautious in shrinking.

3.3. Final sizing vector determination using corner-to-side decomposition in the four side directions
We first decompose a corner sizing vector S_C into horizontally and vertically decomposed vectors S_CH and S_CV:

S_C = S_CH + S_CV.   (7)

Eventually, we determine the final sizing vector for each side by performing the vector sum of the original side component and the two components from the neighbouring corners. The final sizing vector S^i for a horizontal direction, i = 0, 2, i.e., the left and right of the window, is given by

S^i = S_S^i + S_CH^i + S_CH^{mod 4(i−1)}   (8)

and, similarly, the final sizing vector for the vertical direction, i = 1, 3, i.e., top and bottom, is given by

S^i = S_S^i + S_CV^i + S_CV^{mod 4(i−1)}.   (9)

A conceptual example is given in Fig. 9; the four final sizing vectors, with information about the magnitudes and directions of the four sides, are shown in Fig. 9c.
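A sketch of Eqs. (5)-(9) follows. The district ordering and the choice of corner mod 4(i−1) as the second neighbour of side i follow our reading of Fig. 7; these index conventions, like the function names, are assumptions rather than the paper's explicit statements.

import numpy as np

# Unit vectors of Eqs. (5)-(6): u_s[i] for the four sides and
# u_c[i] for the four corners, rotated by pi/4.
phi_s = np.pi / 2 * np.arange(4)
phi_c = phi_s + np.pi / 4
u_s = np.stack([np.cos(phi_s), np.sin(phi_s)], axis=1)
u_c = np.stack([np.cos(phi_c), np.sin(phi_c)], axis=1)

def final_sizing_vectors(sign_s, sign_c, w_c=0.7):
    """Eqs. (7)-(9): decompose each corner sizing vector into horizontal
    and vertical parts and add them to the neighbouring side vectors.
    sign_s, sign_c: length-4 NumPy arrays of per-district decisions
    in {-1, 0, +1} (see the flowchart of Fig. 8)."""
    S_s = sign_s[:, None] * u_s                        # side sizing vectors
    S_c = sign_c[:, None] * (w_c * u_c)                # weighted corner vectors
    S_ch = np.stack([S_c[:, 0], np.zeros(4)], axis=1)  # horizontal parts
    S_cv = np.stack([np.zeros(4), S_c[:, 1]], axis=1)  # vertical parts
    S = np.empty((4, 2))
    for i in range(4):
        dec = S_ch if i % 2 == 0 else S_cv             # horizontal for i = 0, 2
        S[i] = S_s[i] + dec[i] + dec[(i - 1) % 4]
    return S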
Fig. 10. Fine tuning of tracking position for correcting the position error: positions before window sizing procedure (left) and after window sizing procedure (right).
3.4. Reference window relocation with the final sizing vector
Here, the final sizing vectors obtained above should be converted to true sizing units in pixels, and for this, the window size of the previous frame should also be considered. First, let the basic sizing weighting factor w_B^i be defined as

w_B^i = max(1, w_i / B),   (10)

where the previous window size w_i in Fig. 10 is given by

w_i = m for i = 0, 2 and w_i = n for i = 1, 3.   (11)

The parameter B is fixed to 50 in Eq. (10) to properly balance the magnitude of the sizing vector against the size of the previous window; still, we do not want w_B^i to drop below 1.0. Next, we put the origin of coordinates at the centre of the window.
Fig. 9. Conceptual example of the proposed method for determination of the final sizing vector: (a) sizing vectors originally estimated from the eight districts, (b) decompositions of the corner sizing vectors, and (c) the final sizing vectors resulting from the sum of the side sizing vectors and their decomposed sizing vectors.
Then we eventually obtain the window sizing magnitude ΔS^i in the ith direction. ΔS^i, now defined in pixel units, is evaluated as

ΔS^i = w_S^i w_B^i ‖S^i‖,   (12)

where the sign weighting factor w_S^i is given by

w_S^i = +2 if S^i · u_S^i > 0, and −1 otherwise.   (13)

Here, w_S^i is employed to avoid the data loss risk referred to in Section 3.2.2. When the corner weighting factor w_C is 0.7 and the horizontal size of the previous window is about 100 pixels, the magnitude of the final sizing vector S^i with w_S^i is between −2 and +4; we found that the sizing magnitude ΔS^i ranges from about −4 pixels to +8 pixels. Finally, we relocate the centre point according to the change of window size. The new window positions P̃^i, for i = 0, 1, 2, 3, from the origin of coordinates are simply given by

P̃^i = P^i + sgn(P^i) ΔS^i,   (14)

where

sgn(k) = +1 if k > 0, and −1 if k < 0,   (15)

and P^i is the coordinate value of the previous window position before the window sizing procedure. Fig. 10 illustrates such a relocation of the centre of the tracking window using Eq. (14).
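Eqs. (10)-(15) can be collected into a small routine. The side-direction vectors and the signed border coordinates P^i below are our conventions, chosen to match Fig. 7 and Fig. 10 as we read them; they are not spelled out in the paper.

import numpy as np

def relocate_window(S, P, m, n, B=50.0):
    """Eqs. (10)-(15): convert the final sizing vectors S[i] into pixel
    magnitudes dS and move the four window borders P[i] accordingly.
    P holds signed border coordinates w.r.t. the window centre; m, n are
    the previous window width and height."""
    u_s = np.array([[1, 0], [0, 1], [-1, 0], [0, -1]], float)  # side directions
    P_new = np.empty(4)
    for i in range(4):
        w_prev = m if i % 2 == 0 else n            # Eq. (11)
        w_b = max(1.0, w_prev / B)                 # Eq. (10)
        w_s = 2.0 if S[i] @ u_s[i] > 0 else -1.0   # Eq. (13): grow fast, shrink slowly
        dS = w_s * w_b * np.linalg.norm(S[i])      # Eq. (12)
        P_new[i] = P[i] + np.sign(P[i]) * dS       # Eq. (14); sgn as in Eq. (15)
    return P_new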
4. Experimental results
We have applied our proposed method, based on four independent sizing vectors, to 45 real image sequences comprising more than 3400 images and 39 artificial image sequences comprising 4260 images, and obtained satisfactory results for most of them. The aim of this section is to evaluate the performance of our tracker with adaptive window sizing. When we performed tracking experiments based on a fixed-size window for the many cases in which the size of the object image underwent rapid change, the tracking failed so often that we did not bother to include those experimental results. Hence, we set up two comparison methods designed with only side districts; both lack some of the sophisticated details of the corner-to-side decomposition of our final tracker. In Method 1, the w_B^i information and only the direction of S^i are preserved by setting S_C^i = 0 in Eqs. (8) and (9), and ‖S^i‖ is fixed at 2, which makes the magnitude of the resultant sizing vector comparable to that of the
proposed method. In this case, ΔS^i ∈ {−2 w_B^i, 0, +4 w_B^i}. In Method 2, the information about the previous window size (absorbed in w_B^i) as well as the magnitude of the final sizing vector ‖S^i‖ is further ignored. Here, w_B^i = 1, ‖S^i‖ = 1, and only the direction information is retained, i.e., ΔS^i ∈ {−1, 0, +2}, which means that the tracker can expand by two-pixel steps or shrink by one-pixel steps. The key idea of Method 2 is quite similar to the gate sizing algorithm proposed by Chodos et al. [3]. For objective comparison, the initial position and the initial size of the adaptive window are set to be the same in all experiments.

4.1. Objects in artificial image sequences
Here, we selected artificial image sequences with 11 different signal-to-noise ratios (SNRs) for the quantitative evaluation of the proposed method. The SNR of the images and two error measures, evaluating the centre position error and the size error of the window, are discussed below. The artificial image, based on a Markov random field (MRF) [8], is constructed using stochastic features such as the brightness mean, the standard deviation, and the correlation coefficient of the target and the background, with Gaussian noise then added. The SNR of the generated image is defined as

SNR = 20 log(|μ_T − μ_B| / σ_N) (dB),   (16)

where |μ_T − μ_B| is the absolute difference of the brightness means between the target and the background, and σ_N is the standard deviation of the added Gaussian noise. We quantify the centre position error E_P and the window size error E_S as follows:

E_P = sqrt((x − x̃)² + (y − ỹ)²),   (17)

E_S = |A − Ã| / Ã,   (18)

where the point (x, y) is the target position estimated by the tracker and the point (x̃, ỹ) is the actual target position in the image, which is available at the time of generating the image; A is the area of the adjusted tracking window and Ã is the actual area of the known target. Fig. 11 shows part of the test image sequence with an SNR of 0.0 dB. The final tracking result is denoted by the white solid rectangle. The target, a non-homogeneously filled rectangle, is designed to move clockwise along the boundary of a rhombus in order to produce much variation within the limited size of 256×256 pixels.
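Both error measures of Eqs. (17)-(18) are one-liners; a minimal sketch (function names ours):

import math

def center_position_error(x, y, x_true, y_true):
    """Eq. (17): Euclidean distance between estimated and true centres."""
    return math.hypot(x - x_true, y - y_true)

def window_size_error(area, area_true):
    """Eq. (18): relative difference between the adjusted window area
    and the known target area."""
    return abs(area - area_true) / area_true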
Fig. 11. Sampled tracking example of the proposed method for an artificial image sequence using the MRF and Gaussian noise. The SNR considered here is 0.0 dB.
The size of the moving target also changes slowly from the initial size of 50×50 pixels to a maximum size of 100×100 pixels and then back to 50×50 pixels. This variation in size is intended to emulate a situation in which a target approaches an observer to a certain point and then retreats again. In order to evaluate the sizing capability of the tracker, the initial tracking window size is intentionally set to 121×121 pixels, much larger than the initial target size. Furthermore, to test how fast the tracker responds to misalignment of the centre
of the initial window with that of the target, the centre of the tracking window is also initially set to deviate by 20×20 pixels from that of the target. This deviation amounts to an E_P of 28.28. It was found in the test sequence of Fig. 11 that the tracker could accomplish the correction of sizing and centre misalignment by about Frame 23. Fig. 12 describes the evaluation results in terms of the two error measures E_S and E_P for the same sequence as in
Fig. 11. E_S for each frame is illustrated in Fig. 12a and E_P in Fig. 12b. Here, we can see that Method 2 is slower in reducing the window size error (E_S) and the centre position error (E_P) in the initial portion of tracking. This is because Method 2 can expand by only two pixels or shrink by one pixel, and so cannot achieve fast correction of the initial erroneous setting of the window. Now consider Method 1, which is, in the aspect of sizing, somewhat similar to the proposed method but lacks the subtle adjustment of the window using corner-to-side decomposition. It can be concluded from Fig. 12a and b that this method adjusts as fast but shows more oscillatory behaviour than the proposed method. Table 1 shows the averages of E_P over all frames using the artificial image sequences with SNRs varying from 0.0 to 10.0 dB.

Fig. 12. Window size error and center position error evaluation with respect to the separate frames of Fig. 11 (SNR is 0.0 dB): (a) window size error variation and (b) center position error variation.

Table 2
Average of window size errors (E_S) over all frames with respect to SNR, under normal conditions with a proper initial window size and no centre position error

SNR (dB)    Proposed method    Method 1    Method 2
0.0         0.37               0.53        0.54
1.0         0.37               0.45        0.44
2.0         0.37               0.40        0.43
3.0         0.34               0.39        0.39
4.0         0.34               0.40        0.37
5.0         0.34               0.43        0.38
6.0         0.35               0.37        0.40
7.0         0.34               0.36        0.31
8.0         0.34               0.35        0.32
9.0         0.33               0.39        0.35
10.0        0.33               0.36        0.36
Table 1
Average of center position errors (E_P) over all frames with respect to SNR, with the large initial window of 121×121 pixels

            Initial E_P = 0.00               Initial E_P = 14.14              Initial E_P = 28.28
SNR (dB)    Proposed  Method 1  Method 2     Proposed  Method 1  Method 2     Proposed  Method 1  Method 2
0.0         4.72      8.41      6.82         5.26      14.41     10.80        7.22      10.44     13.77
1.0         3.10      5.55      3.66         4.07      8.66      8.18         6.68      8.13      13.67
2.0         3.12      5.68      3.71         4.10      8.40      6.31         6.29      7.69      12.25
3.0         2.60      5.65      2.84         4.42      6.80      6.08         6.10      7.87      11.73
4.0         2.89      6.02      2.95         4.39      7.59      7.67         5.75      7.38      11.36
5.0         2.95      6.45      2.13         4.35      5.82      6.87         5.66      6.66      10.84
6.0         2.80      5.97      2.43         3.98      7.89      6.29         5.57      6.92      11.33
7.0         2.23      4.44      1.45         4.00      5.52      5.57         5.44      7.71      9.27
8.0         2.60      4.72      1.62         3.91      7.43      6.18         5.35      7.09      9.84
9.0         2.86      4.80      1.94         3.93      6.05      6.21         5.68      7.12      9.88
10.0        2.47      4.45      2.18         3.89      5.65      5.62         5.23      6.70      9.74
We classify the initial testing conditions by three initial E_P values: 0.00, 14.14, and 28.28. For instance, when the initial E_P is 28.28 and the SNR of the image is 0.0 dB, as in Fig. 11, the proposed method, with an average E_P of 7.22, is clearly superior to the other methods under the same conditions. The averages of E_P were found to vary significantly according to the initial E_P. In fact, we found that E_P falls to 2.91 when averaged only from Frame 23, at which the tracking window enclosed the target properly as described in Fig. 12a, to the last frame. Meanwhile, Table 2 is included to show the sizing performance of the tracker under normal conditions. To simulate these conditions, we let the initial E_P be 0.00 and the initial window size be slightly bigger than, but similar to, the size of the target. Our proposed method, averaged over all SNRs in Table 2, attains 0.35 and is also more effective than the other two methods. If we convert this average to the more intuitive pixel unit, we find that our tracker usually keeps a margin of about five to six pixels between the window border and the target border, as shown in the latter part of Fig. 11. Referring to Tables 1 and 2, we can conclude that the proposed method is more robust to variation of the SNR than Methods 1 and 2.

4.2. Objects in real image sequences

Fig. 13. Real image sequence in which a helicopter goes over a ridge against a complex background: (a) sampled tracking results by the proposed method, (b) window size error and centre position error introduced in the tracking windows of Methods 1 and 2 due to crossing over the complex background, (c) window size variation in the horizontal and (d) in the vertical direction.
We evaluated the performance of the proposed method using real image sequences with various situations, acquired with a CCD camera and a frame grabber. Moving objects such as an airplane, a fighter, a helicopter, an automobile, a bike, and a human being were chosen as targets. Usually, these targets move near or through clutter in complex circumstances, without any constraint imposed on their moving patterns. Fig. 13 presents the tracking results for a helicopter receding fast through clutter across a ridge. The initial window was manually chosen to be larger than the target. An intentional position error was included to measure the speed of the window sizing and also to reflect the real situation in which an operator cannot always set the initial window tightly. The proposed method could enclose the target properly, as in Fig. 13a. From Fig. 13b, magnified three times, however, we found that in Methods 1 and 2 centre position error as well as window size error had been induced, as shown in Frames 19 and 20, respectively, by the inclusion of part of the clutter inside the tracking window. Here, we found that these two methods were more susceptible to momentary sizing error due to the complex clutter. Fig. 13c and d describe the horizontal and vertical size variations of the window in this sequence of 25 frames. Method 1 performed badly in tracking because severe error had accumulated from about Frame 16, that is, the point at which the target began to intersect the ridge. This accumulation can be observed in Fig. 13c and d. Method 2 also could not cope with the rapidly shrinking target and thus failed to enclose the target as desired. Our proposed method was the most successful, tracking the indicated target stably by adjusting the window size properly. The tracking results for a vehicle going over a sloping road with huge trees standing nearby are illustrated in Fig. 14. The wooded forest makes it hard for the tracker to obtain reliable tracking performance. We tested a case in which the initial tracking window was made wide in the horizontal direction, as shown in Fig. 14a. Nevertheless, the proposed method could follow the target correctly by rapidly controlling the window size. The window size errors for Frame 38 and the final frame of Methods 1 and 2 are described in Fig. 14b. As for
Fig. 14. Real image sequence in which a vehicle runs along an inclined road near a forest: (a) sampled tracking results by the proposed method and (b) window size error and centre position error of the tracking window occurring in Methods 1 and 2.
Method 1, the window size could not be adjusted suitably from the beginning of the sequence. Although the sizing efficiency of Method 2 was better than that of Method 1, Method 2 was still not as accurate as the proposed method around the end of the sequence.
5. Conclusions
We have presented an image tracking architecture employing a four-direction adaptive window method with independent sizing vectors for enhancing the performance of correlation-based tracking in cluttered surroundings. Our method can control the sizing magnitudes in four directions fast enough to reduce the influence of the background and increase the occupancy rate of the target. At the onset of tracking, a human operator generally establishes an initial window with a size much larger than the object. In this case, the enclosing speed of the window with our proposed sizing vectors is found to be over three times faster than that of a window system without the adaptive sizing unit (Method 2). By virtue of the proposed method, capable of adjusting the window size rapidly and properly by generating sizing vectors in four directions, we could minimize the influence of complex background and clutter in the correlation-based tracking process. Moreover, this method can finely tune the size and position of the reference window after the main correlation-based tracking routine has terminated. Besides these advantages, we can obtain circumstantial judgment of a moving object (rotation, advance or retreat, and other moving patterns), and this history information can be useful for aim-point estimation and other related applications. Since the adaptive window tightly adjusts the window size around the object, the computation time for correlation can also be optimized.

References
[1] D. Hughes, A.J.E. Moy, Advances in automatic electro-optical tracking systems, Proc. SPIE 1697 (1992) 353–366.
[2] D.A. Montera, S.K. Rogers, D.W. Ruck, M.E. Oxley, Object tracking through adaptive correlation, Opt. Eng. 33 (1) (1994) 294–302.
[3] S.L. Chodos, G.T. Pope, A.K. Rue, R.P. Verdes, Dual Mode Video Tracker, U.S. Patent No. 4849906, 1989.
[4] R.L. Brunson, D.L. Boesen, G.A. Crockett, J.F. Riker, Precision trackpoint control via correlation track referenced to simulated imagery, Proc. SPIE 1697 (1992) 325–336.
[5] S. Sung, S. Chien, M. Kim, J. Kim, Adaptive window algorithm with four-direction sizing factors for robust correlation-based tracking, Proceedings of the Ninth IEEE International Conference on Tools with Artificial Intelligence, 1997, pp. 208–215.
[6] T. Kanade, M. Okutomi, A stereo matching algorithm with an adaptive window: theory and experiment, IEEE Trans. Pattern Anal. Mach. Intell. 16 (9) (1994) 920–932.
[7] J.L. Lotti, G. Giraudon, Adaptive window algorithm for aerial image stereo, Proceedings of the Twelfth International Conference on Pattern Recognition, 1994, pp. 701–703.
[8] A.K. Jain, Fundamentals of Digital Image Processing, Prentice-Hall, Englewood Cliffs, NJ, 1989.
About the Author. SUNG-IL CHIEN received the B.S. degree from Seoul National University, Seoul, Korea, in 1977, the M.S. degree from the Korea Advanced Institute of Science and Technology, Seoul, Korea, in 1981, and the Ph.D. degree in Electrical and Computer Engineering from Carnegie Mellon University in 1988. Since 1981, he has been with the School of Electronic and Electrical Engineering, Kyungpook National University, Taegu, Korea, where he is currently a professor. His research interests are pattern recognition, computer vision, and neural networks.
About the Author. SI-HUN SUNG received the B.S. and M.S. degrees in Electronic Engineering from Kyungpook National University, Taegu, Korea, in 1995 and 1997, respectively. He is currently working towards the Ph.D. degree in Electronic Engineering at Kyungpook National University as a research assistant. His research interests include field applications of computer and machine vision, image processing, pattern recognition, and neural networks. He is a member of the SPIE, the IEEE, and the Institute of Electronics Engineers of Korea.
Pattern Recognition 33 (2000) 251–261
Point-based projective invariants
Tomáš Suk*, Jan Flusser
Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic, Pod vodárenskou věží 4, 182 08 Praha 8, Czech Republic
Received 23 January 1998; accepted 3 February 1999
Abstract
The paper deals with features of a 2-D point set which are invariant with respect to a projective transform. First, projective invariants for five-point sets, which are simultaneously invariant to the projective transform and to permutation of the points, are derived. They are expressed as functions of five-point cross-ratios. Then, the invariants for more than five points are derived. An algorithm for finding the correspondence between the points of two 2-D point sets is presented. The algorithm is based on the comparison of two projective and permutation invariants of five-tuples of points. The best-matched five-tuples are then used to compute candidate projective transformations, and the one with the maximum number of corresponding points is used. Stability and discriminability of the features and the behavior of the matching algorithm are demonstrated by numerical experiments. © 1999 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.
Keywords: Projective transform; Point set matching; Point-based invariants; Projective invariants; Permutation invariants; Registration; Control points
1. Introduction
One of the important tasks in image processing and computer vision is the recognition of objects in images captured under different viewing angles. This problem cannot be solved in the general case [1]. Nevertheless, if we restrict ourselves to planar objects only, the distortion between two frames can be described by the projective transform (sometimes called perspective projection)

x′ = (a_0 + a_1 x + a_2 y)/(1 + c_1 x + c_2 y),
y′ = (b_0 + b_1 x + b_2 y)/(1 + c_1 x + c_2 y),   (1)

where x and y are the coordinates in the first frame and x′ and y′ are the coordinates in the second one. Feature-based recognition of such objects requires features invariant to the projective transform (1). Several different approaches to this problem have been published in recent works. One of them is based on the assumption that the non-linear term of the perspective projection is relatively small and thus the projective transform can be approximated by an affine transform. This assumption is true if the distance from the sensor to the object is much greater than the size of the object. In such cases, various affine invariants can be applied, such as moment invariants [2,3] or Fourier descriptors [4,5]. However, in some cases the projection cannot be approximated by the affine transform, and therefore the use of exact projective invariants is required. The invariants which have been developed for this purpose can be categorized into two groups: differential invariants and point-based ones. Differential invariants are applicable only if the object boundary is a smooth continuous curve. A set of invariants based on boundary derivatives up to the sixth order was presented by Weiss [6]. Unfortunately, these invariants are not defined for such important curves as straight lines or conics. Weiss's invariants are numerically unstable because of the high-order derivatives. To overcome this difficulty, several improvements were presented [7–9].

* Corresponding author. Tel.: +420-2-6605-2231; fax: +420-2-688-4903. E-mail addresses: [email protected] (T. Suk), [email protected] (J. Flusser).
The second group of invariants is defined on point sets [10], on sets composed of both points and straight lines [11,12] and on triangle pairs [13]. A detailed survey of the point-based methods can be found in Ref. [14]. Another problem to be solved is to establish the correspondence between two point sets which are projectively deformed. To calculate the invariants, we have to order those sets somehow. A solution for the case when the points are vertices of a polygon has been published [15]. Another solution is to use features that are also invariant to the order or labeling of the points. Five-point projective and permutation invariants were presented in Refs. [2,16]. This approach is also used in this paper.

The plane transformed by projective transform (1) contains a straight line

$$1 + c_1 x + c_2 y = 0, \qquad (2)$$

which is not mapped into the second plane (more precisely, it is mapped to infinity) and which divides the plane into two parts. If all elements of our point set lie in one half-plane, then some additional theorems about the topology of the set hold for the transform; e.g., the convex hull is preserved during the transform in that case. This fact can be used to derive invariants with lower computational complexity [17]. This paper deals with the general case of the projective transform, when the points can lie in both parts of the plane. The convex hull is not preserved under the transform and all possible positions of the points must be taken into account. The only assumption is that the points do not lie directly on the straight line (2).

A projective invariant can be defined for at least five points. The simplest one is the five-point cross-ratio

$$\rho(1,2,3,4,5) = \frac{P(1,2,3)\,P(1,4,5)}{P(1,2,4)\,P(1,3,5)}, \qquad (3)$$

where P(A, B, C) is the area of the triangle with vertices A, B and C. Point No. 1 is included in all four triangles and is called the common point of the cross-ratio. Reiss [2] proposes to use the median of all possible values of $\rho$. A more precise description of the relations between the various cross-ratio values under permutations of the given points can be found in Ref. [16]. After the correspondence between the individual points in both sets has been established, we use them as the control points for image-to-image registration. However, there are often some points having no counterpart in the other image. An approach to this problem can be found in Ref. [18], but that method becomes unstable as the number of 'wrong' points increases.

The goal of this paper is to derive projective and permutation invariants of point sets. Five-point projective and permutation invariants are derived in Section 2; they are generalized to more than five points in Section 3, and sets with wrong points are discussed in Section 4. Experiments showing the numerical properties of the invariants as well as their usage for image registration are presented in Section 5.
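For concreteness, the cross-ratio of Eq. (3) can be computed directly from point coordinates via signed triangle areas. The following sketch is ours, not the paper's; the helper names are illustrative:

import numpy as np

def triangle_area(a, b, c):
    """Signed area P(A, B, C) of the triangle with vertices a, b, c."""
    return 0.5 * ((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1]))

def cross_ratio(pts):
    """Five-point cross-ratio rho(1,2,3,4,5) of Eq. (3).

    pts is a list of five 2-D points; pts[0] is the common point.
    """
    p1, p2, p3, p4, p5 = pts
    num = triangle_area(p1, p2, p3) * triangle_area(p1, p4, p5)
    den = triangle_area(p1, p2, p4) * triangle_area(p1, p3, p5)
    return num / den

Since the areas appear only in ratios, the choice of sign convention cancels; the value is undefined when one of the denominator triplets is collinear, which motivates the stabilized function introduced in Section 2.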
2. Five-point permutation invariants

2.1. The derivation of the invariants

First we derive permutation invariants in the simplest way; a more detailed analysis will be performed afterwards. The main principle is to use addition or multiplication (or another symmetric function) of all possible values of the projective invariants over the permutations of the points. Permutations change only the order of the terms and factors, so the result stays invariant.

To obtain permutation invariants, we can employ various functions of the cross-ratio (3). Reiss [2] used the function $\rho + \rho^{-1}$, which is unstable near zero: if some triplet of the five points in Eq. (3) is collinear, then $\rho + \rho^{-1}$ is infinite. A more suitable function is therefore

$$t = \frac{2}{\rho + \rho^{-1}} = \frac{2\rho}{\rho^2 + 1}.$$

If $\rho$ or $\rho^{-1}$ is zero, then t is zero. The function t can take only three distinct values under permutations of the last four points, so the functions

$$F'_+(1,2,3,4,5) = t(1,2,3,4,5) + t(1,2,3,5,4) + t(1,2,4,5,3),$$
$$F'_\times(1,2,3,4,5) = t(1,2,3,4,5)\cdot t(1,2,3,5,4)\cdot t(1,2,4,5,3) \qquad (4)$$

are invariant to the labeling of the last four points, but point No. 1 must be the common point of all the cross-ratios. To obtain full invariance to the labeling, we must alternate all five points as the common one:

$$I_{s_1,s_2}(1,2,3,4,5) = F_{s_1}(1,2,3,4,5)\; s_2\; F_{s_1}(2,3,4,5,1)\; s_2\; F_{s_1}(3,4,5,1,2)\; s_2\; F_{s_1}(4,5,1,2,3)\; s_2\; F_{s_1}(5,1,2,3,4), \qquad (5)$$

where $s_1$ and $s_2$ each stand for either the sign $+$ or $\cdot$.

A set of n points has 2n degrees of freedom and the projective transform has eight parameters. Therefore, the set can have only

$$m = 2n - 8 \qquad (6)$$

independent invariants to the projective transform. That is why only two of the four invariants $I_{++}$, $I_{\times+}$, $I_{+\times}$ and $I_{\times\times}$ can be independent.
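A direct transcription of Eqs. (4) and (5), reusing the hypothetical cross_ratio helper sketched above (again ours, not the paper's):

def t_func(rho):
    """Stabilized function t = 2*rho / (rho**2 + 1)."""
    return 2.0 * rho / (rho ** 2 + 1.0)

def F_prime(pts, s1):
    """F'_+ or F'_x of Eq. (4); pts[0] is the common point, s1 is '+' or '*'."""
    p1, p2, p3, p4, p5 = pts
    vals = [t_func(cross_ratio([p1, p2, p3, p4, p5])),
            t_func(cross_ratio([p1, p2, p3, p5, p4])),
            t_func(cross_ratio([p1, p2, p4, p5, p3]))]
    return sum(vals) if s1 == '+' else vals[0] * vals[1] * vals[2]

def I_invariant(pts, s1, s2):
    """I_{s1,s2} of Eq. (5): cycle every point through the common position."""
    vals = [F_prime(pts[i:] + pts[:i], s1) for i in range(5)]
    if s2 == '+':
        return sum(vals)
    out = 1.0
    for v in vals:
        out *= v
    return out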
2.2. The roots of the invariants

Lenz and Meer [16] dealt with the five-point projective and permutation invariants in detail. They discovered that if the common point stays the same and the other points are permuted, then the possible values of the cross-ratio are

$$\rho_1 = \rho, \quad \rho_2 = \frac{1}{\rho}, \quad \rho_3 = 1 - \rho, \quad \rho_4 = \frac{1}{1-\rho}, \quad \rho_5 = \frac{\rho}{\rho - 1}, \quad \rho_6 = \frac{\rho - 1}{\rho}. \qquad (7)$$

If we construct a function $F(\rho)$ which has the same value for all these values of $\rho$ ($\rho_1, \rho_2, \ldots, \rho_6$), it is invariant to the permutation of the four points. If we change the common point, we obtain another value of the cross-ratio, say p, and the function

$$K(\rho, p) = F(\rho) + F(p) + F\!\left(\frac{\rho(p-1)}{p(\rho-1)}\right) + F\!\left(\frac{\rho-1}{p-1}\right) + F\!\left(\frac{\rho}{p}\right) \qquad (8)$$

is a five-point projective and permutation invariant. As our study implies, if the function $F(\rho)$ is constructed as the quotient of two polynomials and some $\rho$ is a root of one of them, then each of the values $\rho_1, \rho_2, \ldots, \rho_6$ must also be a root of that polynomial. We can therefore consider it in the form

$$F(\rho) = \frac{P_1(\rho)}{P_2(\rho)}, \qquad (9)$$

$$P_1(\rho) = (\rho - \rho_1)(\rho - \rho_2)(\rho - \rho_3)(\rho - \rho_4)(\rho - \rho_5)(\rho - \rho_6), \qquad (10)$$

and similarly for $P_2(\rho)$ (when all roots differ from zero). It is advantageous if $F(\rho)$ is defined for any real $\rho$; thus $P_2(\rho)$ should not have real roots. Two such invariants were proposed in Ref. [16]:

$$F_{14} = \frac{2\rho^6 - 6\rho^5 + 9\rho^4 - 8\rho^3 + 9\rho^2 - 6\rho + 2}{\rho^6 - 3\rho^5 + 3\rho^4 - \rho^3 + 3\rho^2 - 3\rho + 1}, \qquad (11)$$

$$F_{15} = \frac{(\rho^2 - \rho + 1)^3}{\rho^6 - 3\rho^5 + 5\rho^3 - 3\rho + 1}, \qquad (12)$$
where there are the following relations between the roots:

$$a_2 = 1 - a_1, \qquad b_2 = b_1, \qquad b_3 = \frac{b_1}{(a_1 - 1)^2 + b_1^2} = \frac{b_2}{a_2^2 + b_2^2}. \qquad (13)$$

The following theorem describes the properties of the roots of $P_2(\rho)$.

Theorem 1. If the roots of $P_2(\rho)$ are imaginary, then they lie on the following curves:

$$a_1^2 + b_1^2 = 1, \qquad (a_2 - 1)^2 + b_2^2 = 1, \qquad a_3 = \tfrac{1}{2}, \qquad (14)$$

where $a_i$ and $b_i$, $i = 1, 2, 3$, are the real and imaginary parts of the roots. The theorem is illustrated in Fig. 1.

Fig. 1. Illustration of Theorem 1. The circles and the straight line show the possible positions of the roots of the invariants in the complex plane (a is the real part and b the imaginary part).

Proof. If we express $P_2(\rho)$ in the form

$$P_2(\rho) = (\rho - a_1 - b_1 i)(\rho - a_1 + b_1 i)(\rho - a_2 - b_2 i)(\rho - a_2 + b_2 i)(\rho - a_3 - b_3 i)(\rho - a_3 + b_3 i), \qquad i = \sqrt{-1}, \qquad (15)$$
then there are $6! = 720$ possible assignments between $a_1 \pm b_1 i$, $a_2 \pm b_2 i$, $a_3 \pm b_3 i$ and $\rho_1$-$\rho_6$ in Eq. (7). If we use the assignment

$$a_1 + b_1 i = \rho_1, \qquad (16)$$

$$a_1 - b_1 i = \rho_2, \qquad (17)$$

$$a_2 + b_2 i = \rho_6, \qquad (18)$$

$$a_2 - b_2 i = \rho_3, \qquad (19)$$

$$a_3 + b_3 i = \rho_4 \qquad (20)$$
and

$$a_3 - b_3 i = \rho_5, \qquad (21)$$

then from Eqs. (7) and (16)

$$\rho_2 = \frac{1}{\rho_1} = \frac{1}{a_1 + b_1 i} = \frac{a_1 - b_1 i}{a_1^2 + b_1^2} \qquad (22)$$

and from Eq. (17)

$$a_1 = \frac{a_1}{a_1^2 + b_1^2}, \qquad -b_1 = \frac{-b_1}{a_1^2 + b_1^2}. \qquad (23)$$

Therefore the first two roots must lie on the circle $a_1^2 + b_1^2 = 1$. From Eq. (7)

$$\rho_6 = \frac{\rho_1 - 1}{\rho_1} = \frac{a_1 - 1 + b_1 i}{a_1 + b_1 i} = \frac{a_1^2 - a_1 + b_1^2 + b_1 i}{a_1^2 + b_1^2} \qquad (24)$$

and

$$\rho_3 = 1 - \rho_1 = 1 - a_1 - b_1 i; \qquad (25)$$

from Eqs. (18), (23) and (24)

$$a_2 = \frac{a_1^2 - a_1 + b_1^2}{a_1^2 + b_1^2} = 1 - a_1, \qquad b_2 = \frac{b_1}{a_1^2 + b_1^2} = b_1 \qquad (26)$$

and from Eqs. (19) and (26)

$$a_2 = 1 - a_1, \qquad b_2 = b_1; \qquad (27)$$

therefore the second two roots must lie on the circle $(a_2 - 1)^2 + b_2^2 = 1$. From Eq. (7)

$$\rho_4 = \frac{1}{1 - \rho_1} = \frac{1}{1 - a_1 - b_1 i} = \frac{1 - a_1 + b_1 i}{(1 - a_1)^2 + b_1^2} \qquad (28)$$

and

$$\rho_5 = \frac{\rho_1}{\rho_1 - 1} = \frac{a_1 + b_1 i}{a_1 - 1 + b_1 i} = \frac{a_1^2 - a_1 + b_1^2 - b_1 i}{(1 - a_1)^2 + b_1^2}; \qquad (29)$$

from Eqs. (20), (23) and (28)

$$a_3 = \frac{1 - a_1}{(1 - a_1)^2 + b_1^2} = \frac{1}{2}, \qquad b_3 = \frac{b_1}{(1 - a_1)^2 + b_1^2} = \frac{b_2}{a_2^2 + b_2^2} \qquad (30)$$

and from Eqs. (21) and (29)

$$a_3 = \frac{a_1^2 - a_1 + b_1^2}{(1 - a_1)^2 + b_1^2} = \frac{1}{2}, \qquad b_3 = \frac{b_1}{(1 - a_1)^2 + b_1^2} = \frac{b_2}{a_2^2 + b_2^2}; \qquad (31)$$

therefore the third two roots must lie on the straight line $a_3 = \frac{1}{2}$.

We cannot investigate all the other 719 possibilities here because of insufficient space. The number of cases can be reduced significantly if we consider only the mutual relations among the roots; then we have to deal with $5! = 120$ cases only, the other 600 cases being just permutations. We treated all 120 cases and proved that each individual case falls into one of the following categories:

1. The result is some permutation of the previous case.
2. The result is only a finite set of values, typically $\frac{1}{2} \pm \frac{\sqrt{3}}{2} i$.
3. The case has no solution.

Thus, there is no other solution and the theorem has been proven. □

In this context our invariants have the form

$$F'_+ = \frac{2\rho}{\rho^2 + 1} + \frac{2(1-\rho)}{(1-\rho)^2 + 1} + \frac{2\rho/(\rho-1)}{\rho^2/(\rho-1)^2 + 1} = 2\,\frac{\rho^6 - 3\rho^5 + 3\rho^4 - \rho^3 + 3\rho^2 - 3\rho + 1}{2\rho^6 - 6\rho^5 + 11\rho^4 - 12\rho^3 + 11\rho^2 - 6\rho + 2},$$

$$F'_\times = \frac{-8\rho^2(1-\rho)^2}{(1+\rho^2)(2 - 2\rho + \rho^2)(1 - 2\rho + 2\rho^2)} = \frac{-8\rho^2(1-\rho)^2}{2\rho^6 - 6\rho^5 + 11\rho^4 - 12\rho^3 + 11\rho^2 - 6\rho + 2}. \qquad (32)$$

The choice of the invariant has one degree of freedom: we can choose one root of the denominator on one of the curves from Fig. 1, the other roots are then determined by Theorem 1, and the numerator defines the range of values of the invariant. Since both $F'_+$ and $F'_\times$ have the same denominator (with roots $1 \pm i$, $\pm i$ and $0.5 \pm 0.5i$), it is suitable to change one of them. The other denominator can be $(\rho^2 - \rho + 1)^3$, with roots $\frac{1}{2} \pm \frac{\sqrt{3}}{2} i$. Then, if we want the range of values of the resulting invariants $I_{++}$ and $I_{\times+}$ to run from 0 to 1, our invariants will be

$$F_+ = \frac{8}{5}\,\frac{\rho^2(1-\rho)^2}{(\rho^2 - \rho + 1)^3}, \qquad F_\times = \frac{3\rho^2(1-\rho)^2}{2\rho^6 - 6\rho^5 + 11\rho^4 - 12\rho^3 + 11\rho^2 - 6\rho + 2} \qquad (33)$$

and the relations to the original invariants will be $F_+ = 16\left(\frac{1}{5} - \frac{1}{6 - F'_+}\right)$ and $F_\times = -\frac{3}{8} F'_\times$.

2.3. The normalization of the invariants

The invariants $I_1 = I_{++}$ and $I_2 = I_{\times+}$ corresponding to the functions $F_+$ and $F_\times$ utilize the feature space inefficiently (see Fig. 2). A better result can be reached by their sum and difference (see Fig. 3),

$$I'_1 = (I_1 + I_2)/2, \qquad I'_2 = (I_1 - I_2 + 0.006) \cdot 53, \qquad (34)$$

but the best utilization of the given range is reached by subtraction and division by polynomials (see Fig. 4):

$$I''_1 = I'_1, \qquad I''_2 = \frac{1 - I'_2 + p(I'_1)}{d(I'_1)}. \qquad (35)$$

The exact coefficients of the polynomials $p(I'_1)$ and $d(I'_1)$ are given in Appendix B. This normalization is not necessary, but it makes it possible to use a simpler classifier.
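Before moving on, a quick numerical check (ours, not from the paper) confirms the two key claims of Section 2.2: $F'_+$ is constant on the six cross-ratio values of Eq. (7), and the roots of the common denominator in Eq. (32) lie on the curves of Theorem 1:

import numpy as np

def F_plus_prime(r):
    """F'_+ of Eq. (32) as a rational function of the cross-ratio."""
    num = r**6 - 3*r**5 + 3*r**4 - r**3 + 3*r**2 - 3*r + 1
    den = 2*r**6 - 6*r**5 + 11*r**4 - 12*r**3 + 11*r**2 - 6*r + 2
    return 2 * num / den

rho = 0.37  # arbitrary test value
orbit = [rho, 1/rho, 1 - rho, 1/(1 - rho), rho/(rho - 1), (rho - 1)/rho]
assert np.allclose([F_plus_prime(r) for r in orbit], F_plus_prime(rho))

# Roots of the common denominator: expected 1 +/- i, +/- i, 0.5 +/- 0.5i.
roots = np.roots([2, -6, 11, -12, 11, -6, 2])
for r in roots:
    a, b = r.real, r.imag
    assert (np.isclose(a**2 + b**2, 1)           # circle |rho| = 1
            or np.isclose((a - 1)**2 + b**2, 1)  # circle |rho - 1| = 1
            or np.isclose(a, 0.5))               # line Re(rho) = 1/2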
Fig. 2. Possible range of values of the invariants $I_1$, $I_2$. It was acquired numerically by computing the invariants for all combinations of five points with integer coordinates from 0 to 511.

Fig. 3. Possible range of values of the invariants $I'_1$, $I'_2$.

Fig. 4. Possible range of values of the invariants $I''_1$, $I''_2$.

3. Invariants for more than five points

There is a number of approaches to the problem of generalizing the invariants from the previous section to n points (n > 5). One of them, yielding good experimental results, consists in summing powers of the invariants $I''_1$, $I''_2$ over all possible combinations $C^n_5$ of 5 points from n.

Theorem 2.

$$I_{1,k} = \sum_{Q \in C^n_5} I''^{\,k}_1(Q), \qquad I_{2,k} = \sum_{Q \in C^n_5} I''^{\,k}_2(Q), \qquad k = 1, 2, \ldots, n - 4, \qquad (36)$$

are projective and permutation invariants of a set of n points.

Proof. $I''_1$ and $I''_2$ are projective invariants (see Eq. (3)), and an arbitrary function of invariants is also invariant (if it does not depend on the parameters of the transform); therefore $I_{1,k}$ and $I_{2,k}$ are also projective invariants. $I''_1$ and $I''_2$ are also five-point permutation invariants, and summation over all combinations guarantees the permutation invariance of $I_{1,k}$ and $I_{2,k}$. □

The number of these invariants is chosen as $2n - 8$ in accordance with Eq. (6). The computing complexity is approximately $\binom{n}{5} T$, i.e. $O(n^5)$, where T is the computing complexity of one five-point invariant.
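A direct (if slow) transcription of Eq. (36); the argument inv5 is assumed to be the normalized five-point invariant $I''_s$ (s = 1 or 2), whose exact form depends on the normalization of Eqs. (34) and (35):

from itertools import combinations

def n_point_invariants(points, inv5):
    """I_{s,k} of Eq. (36): sums of k-th powers of a five-point invariant
    over all five-point subsets, for k = 1, ..., n - 4.

    inv5(five_points) -> float is the normalized five-point invariant.
    """
    n = len(points)
    vals = [inv5(list(q)) for q in combinations(points, 5)]
    return {k: sum(v ** k for v in vals) for k in range(1, n - 4 + 1)}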
However, the number of terms is very high, and that is why a normalization of the results is suitable. To keep the values inside acceptable intervals, we can use the average and the k-th root:

$$I'_{s,k} = \sqrt[k]{\binom{n}{5}^{-1} \sum_{Q \in C^n_5} I''^{\,k}_s(Q)}, \qquad s = 1, 2. \qquad (37)$$

Another, perhaps more sophisticated, normalization is the following. We can consider the five-point invariants $I''_1$, $I''_2$ as independent random variables with uniform distribution in the interval from 0 to 1. Then the distribution function $F_k(x)$ of the k-th power of the invariant is

$$F_k(x) = \begin{cases} 0, & -\infty < x \le 0, \\ x^{1/k}, & 0 \le x \le 1, \\ 1, & 1 \le x < \infty, \end{cases} \qquad (38)$$

with mean value $\mu_k = 1/(1+k)$ and variance $\sigma_k^2 = k^2/((1+k)^2(1+2k))$. The number of terms in sum (36) is relatively high, and the Central Limit Theorem implies that the distribution of the sum is approximately Gaussian; its mean value is the sum of the means $\mu_k$ and its variance is the sum of the variances $\sigma_k^2$. The given range is utilized best when the resulting invariants are uniformly distributed, and therefore we can normalize the invariants by the Gaussian distribution function

$$G(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{x} e^{-(\xi - \mu)^2 / 2\sigma^2}\, d\xi \qquad (39)$$

with mean value $\mu = \binom{n}{5}/(1+k)$, variance $\sigma^2 = \binom{n}{5}\, k^2/((1+k)^2(1+2k))$ and standard deviation $\sigma = \frac{k}{1+k}\sqrt{\binom{n}{5}/(1+2k)}$:

$$I''_{s,k} = G(I_{s,k};\, \mu, \sigma). \qquad (40)$$

An approximate substitution of the function G is used in practice.
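In code, G can be evaluated with the standard error function, so no numerical integration is needed; a small sketch under the paper's distribution assumptions:

import math

def gaussian_normalize(I_sk, n, k):
    """Normalize the n-point invariant I_{s,k} by Eq. (40).

    Uses the Gaussian CDF G(x; mu, sigma) of Eq. (39) with the mean and
    standard deviation implied by the Central Limit Theorem argument.
    """
    n_choose_5 = math.comb(n, 5)
    mu = n_choose_5 / (1.0 + k)
    sigma = (k / (1.0 + k)) * math.sqrt(n_choose_5 / (1.0 + 2.0 * k))
    # Gaussian CDF expressed via the error function.
    return 0.5 * (1.0 + math.erf((I_sk - mu) / (sigma * math.sqrt(2.0))))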
4. Point matching

The problem we are dealing with in this section can be formulated as follows. We have two point sets selected from two projectively deformed images. The sets can include some wrong points, i.e. points without a counterpart in the other set. We look for the parameters of the projective transform needed to register the images. We deal with methods which do not consider the image functions, only the positions of the points.

4.1. Full search algorithm

The simplest algorithm is a full search over all possible correspondences. We can examine each four points from the input image against each four points from the reference image. If we have n points in the input image and l points in the reference one, we must examine $\binom{n}{4}\binom{l}{4}\,4!$ possibilities.

An examination means the computation of the projective transform, the transformation of the points in the input image, and a judgment of the quality of the transform. We performed this judgment in the following way. The two nearest points are found and removed, and again the two nearest points among the remaining ones are found. The search is complete when the distance between the nearest points exceeds a suitable threshold. The number of corresponding points is used as the criterion of the quality of the transform; if the number is the same, the distance of the last two points is used. The best transform according to this criterion is used as the solution. The threshold must correspond to the precision of the selection of the points; a threshold of 5 pixels proved suitable in usual situations. If n and l are approximately the same and high, this algorithm has computing complexity $O(n^{11})$, and in our experience it is too time consuming.

4.2. Pairing algorithm by means of the projective and permutation invariants

We can compute the distance in the feature space between the invariants of each five points from the input image and the invariants of each five points in the reference image. Nevertheless, it was found that wrong five-tuples often match one another randomly; such a false match can be better than the correct one, so we must search not only the best match but every good match. We carried out experiments with a number of searching algorithms and consider the following the best one. We find the first b best matches, and the full search algorithm from the previous section is applied to each pair of five-tuples corresponding to each match. The number b was chosen as $\binom{\max(n,l)}{5}$, but this number is not critical. Note that the total number of pairs of five-tuples is $\binom{n}{5}\binom{l}{5}$.

In Ref. [19] the convex hull constraint is proposed. It is based on the assumption that the sets lie in one half-plane of Eq. (2) and that the projective transform then preserves the position of points on or inside the convex hull; pairs of five-tuples with different numbers of points on the convex hull need not be considered. As stated in the introduction, we consider the general case of the projective transform, and therefore this constraint is not used. In the same work the idea of partial search is proposed: the authors randomly chose about one-fifteenth of all pairs and searched them only, finding the decrease in reliability relatively small. This constraint could be used in our algorithm too, but the following numerical experiment used the general algorithm without it, because the amount of time saved is relatively small.
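The greedy quality judgment used by both the full search and the pairing algorithm can be sketched as follows (illustrative names; the paper gives no pseudocode, and the computation of the projective transform itself is assumed done elsewhere):

import numpy as np

def match_quality(warped_pts, ref_pts, threshold=5.0):
    """Greedily pair globally nearest points until the nearest remaining
    pair is farther than `threshold`; return (number of paired points,
    distance of the last pair), the criterion of Section 4.1."""
    a = [np.asarray(p, float) for p in warped_pts]
    b = [np.asarray(p, float) for p in ref_pts]
    count, last_dist = 0, 0.0
    while a and b:
        dists = [(np.linalg.norm(p - q), i, j)
                 for i, p in enumerate(a) for j, q in enumerate(b)]
        d, i, j = min(dists)
        if d > threshold:
            break
        count, last_dist = count + 1, d
        a.pop(i)
        b.pop(j)
    return count, last_dist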
5. Numerical experiments

How do we investigate the stability and discriminability of the invariants? Let us imagine the following situation. We have two similar sets of points and would like to distinguish one from the other, but one of them can be distorted by noise in its coordinates, and we would like the noise not to influence the recognition. The following numerical experiment was carried out to observe the behavior of invariants (37) and (40) in this situation.

Let us have three sets of 11 points. One of them was created by a pseudo-random generator; the point coordinates are uniformly distributed from 0 to 511. The second set was created from the first one by moving one point a relatively large distance; the coordinates of the movement are also defined by a pseudo-random generator, but with a Gaussian distribution. The third set was created from the first one by adding small noise to all point coordinates; the noise is independent, with zero-mean Gaussian distribution and gradually increasing standard deviation. The standard deviation $\sigma_1$ of the movement in the second set increased from 0 to 190 with step 10, and the standard deviation $\sigma_2$ of the noise in the third set increased from 0 to 9.5 with step 0.5. The distances d between the original and the noisy set in the space of the invariants were computed. Since one experiment would not be representative enough, 10, 20, 100 and 1000 experiments were gradually carried out for each standard deviation. The curve of the dependency of the distance on the noise standard deviation was obtained as the average over 1000 experiments, because
Fig. 5. The distance between the first and second sets (solid line) and between the first and third sets (dashed line) in the Euclidean space of the invariants normalized by the average and root, as a function of the noise standard deviation.
Fig. 6. The distance between the first and second sets (solid line) and between the first and third sets (dashed line) in the Euclidean space of the invariants normalized by the Gaussian distribution function, as a function of the noise standard deviation.
the average of fewer values was too dependent on the concrete realization of the noise.

The result for invariants (37), normalized by the average and root, is given in Fig. 5. The scale of the horizontal axis differs between the two cases so that the curves would be as similar as possible; more precisely, the area of the square of the difference between them was minimized. The ratio of the scales is 7.36; this means that if two sets differ by only one point, then the displacement of that point must be at least approximately 7.36 times greater than the noise standard deviation for the two sets to be distinguishable. In other words, if the ratio of the standard deviations is 7.36 and their values are such that the dashed line lies under the solid one, the sets will be recognized correctly. If the ratio increases, the limit of correct recognition increases too; but if the noise standard deviation is greater than approximately 9, i.e. about 2% of the coordinate range, then the sets cannot be distinguished at all, because the dashed line is always above the solid one.

The results for invariants (40), normalized by the Gaussian distribution function, are given in Fig. 6. The result is similar to the previous case; the ratio of the scales is 11.18, which means slightly worse discriminability. The main difference is the scale on the vertical axis, which is about twice as large; this means these invariants utilize the given range better.

The second experiment demonstrates the use of the pairing algorithm by means of the projective and permutation invariants. A cut of a Landsat Thematic Mapper image of north-east Bohemia (surroundings of the town Trutnov) from 29 August 1990 (256x256) was used as the
Fig. 7. The satellite image used as the reference one (x: the control points with counterparts in the input image, +: the points without counterparts).
reference image (see Fig. 7), and an aerial image from 1984 (180x256) with a relatively strong projective distortion was used as the input one (see Fig. 8). Sixteen points were selected in the input image (see their coordinates in Table 2), 18 points were selected in the reference one (see their coordinates in Table 1), and 10 points in each image had counterparts in the other image (numbers 1-10 in the input correspond to numbers 9-18 in the reference). The first $\binom{18}{5} = 8568$ best matches were examined and the 476th one was correct. All 10 pairs of control points were found; the distance of the tenth pair was 2.23 pixels. The result is shown in Fig. 9. The final parameters of the transform were computed from all 10 control points by the least-squares method. The deviations at the control points ranged from 0.17 to 1.71 pixels; the average was 0.94 pixels. The search for the best matches took about an hour and a half on an HP 9000/700 workstation, and the examination of all 8568 matches took about two and a half hours, but the correct 476th match was found within 8 minutes.

The method assumes planar point sets, which is satisfied on these images only approximately. In our experience, if the height of the flight is significantly greater than the altitude differences between hills and valleys, then the influence of the terrain causes only small perturbations of the point coordinates. Owing to the robustness of the algorithm, we can handle such cases satisfactorily.
Fig. 8. The aerial image used as the input one (x: the control points with counterparts in the reference image, +: the points without counterparts).

Table 1. The coordinates of the points marked by x and + in the reference image in Fig. 7

No.    x     y
 1     35    42
 2    233   106
 3    104   166
 4    253   147
 5     16   243
 6    202   235
 7     73    39
 8    130    40
 9     55   111
10    176    67
11    172   197
12    152   215
13     47   122
14     72    86
15    126   118
16    114   146
17    113   181
18    155   182
Table 2. The coordinates of the points marked by x and + in the input image in Fig. 8

No.    x     y
 1     22    26
 2    161    27
 3    117   189
 4     64   214
 5      8    31
 6     50    15
 7     90    52
 8     61    73
 9     35   114
10     96   142
11     65   143
12     14   116
13     35   153
14    145   149
15    139    51
16    161    75
Fig. 9. The registered image.

6. Conclusion

The roots of the polynomials in the five-point projective and permutation invariants have one degree of freedom: we can choose one of them, and the others must lie symmetrically on certain curves. The normalization of these invariants is suitable for improving the numerical stability of subsequent computations with them. The normalization of the invariants for more than five points is also suitable, because then they can be used for recognition in a Euclidean feature space without any additional weights. The normalization by the Gaussian distribution function is suitable in the case of smaller noise, for better distinguishing of the sets.

We can also use the invariants for registration of images by means of control points. If an affine or simpler transform can approximate the distortion between the images, other methods are more suitable; in the case of strong projective distortion between the images, the described algorithm is one possible solution of the task. The minimum number of corresponding pairs is six, and a correspondence between point sets with fewer pairs of points cannot, in principle, be found. With only six corresponding pairs, a wrong correspondence was found just once during tens of experiments; with more than six corresponding pairs, no error was found. This means that with more than six corresponding pairs the chance of a successful result is very high.

7. Summary

The paper deals with features of a point set which are invariant with respect to a projective transform. First, projective invariants for five-point sets, which are simultaneously invariant to the projective transform and to permutation of the points, are derived. They are expressed as functions of five-point cross-ratios. The roots of the polynomials in the five-point projective and permutation invariants have one degree of freedom: we can choose one of them, and the others must lie symmetrically on certain curves. The normalization of these invariants is suitable for improving the numerical stability of subsequent computations with them.

The invariants for more than five points are then derived. Their normalization is also suitable, because then they can be used for recognition in a Euclidean feature space without any additional weights. The normalization by the Gaussian distribution function is suitable in the case of smaller noise, for better distinguishing of the sets; otherwise the normalization by the average and the root should be used.

An algorithm for searching for the correspondence between the points of two 2-D point sets is presented. The algorithm is based on the comparison of two projective and permutation invariants of five-tuples of the points. The best-matched five-tuples are then used for the computation of the projective transformation, and the transformation with the maximum number of corresponding points is used. The stability and discriminability of the features and the behavior of the searching algorithm are demonstrated by numerical experiments.
Acknowledgements

This work has been supported by Grants No. 102/98/P069 and No. 102/96/1694 of the Grant Agency of the Czech Republic.

References
[1] J.B. Burns, R.S. Weiss, E.M. Riseman, The non-existence of general-case view-invariants, in: J.L. Mundy, A. Zisserman (Eds.), Geometric Invariance in Computer Vision, MIT Press, Cambridge, MA, 1992, pp. 120-131.
[2] T.H. Reiss, Recognizing planar objects using invariant image features, Lecture Notes in Computer Science, vol. 676, Springer, Berlin, 1993.
[3] J. Flusser, T. Suk, Pattern recognition by affine moment invariants, Pattern Recognition 26 (1993) 167-174.
[4] C.C. Lin, R. Chellappa, Classification of partial 2-D shapes using Fourier descriptors, IEEE Trans. Pattern Anal. Mach. Intell. 9 (1987) 686-690.
[5] K. Arbter, W.E. Snyder, H. Burkhardt, G. Hirzinger, Application of affine-invariant Fourier descriptors to recognition of 3-D objects, IEEE Trans. Pattern Anal. Mach. Intell. 12 (1990) 640-647.
[6] I. Weiss, Projective invariants of shapes, Proceedings of the Image Understanding Workshop, Cambridge, MA, USA, 1988, pp. 1125-1134.
[7] C.A. Rothwell, A. Zisserman, D.A. Forsyth, J.L. Mundy, Canonical frames for planar object recognition, Proceedings of the Second ECCV, Springer, Berlin, 1992, pp. 757-772.
[8] A.M. Bruckstein, R.J. Holt, A.N. Netravali, T.J. Richardson, Invariant signatures for planar shape recognition under partial occlusion, Proceedings of the 11th International Conference on Pattern Recognition, The Hague, The Netherlands, 1992, pp. 108-112.
[9] I. Weiss, Differential invariants without derivatives, Proceedings of the 11th International Conference on Pattern Recognition, The Hague, The Netherlands, 1992, pp. 394-398.
[10] P. Meer, I. Weiss, Point/line correspondence under 2D projective transformation, Proceedings of the 11th International Conference on Pattern Recognition, The Hague, The Netherlands, 1992, pp. 399-402.
[11] T.H. Reiss, Object recognition using algebraic and differential invariants, Signal Process. 32 (1993) 367-395.
[12] D. Forsyth, J.L. Mundy, A. Zisserman, C. Coelho, A. Heller, C. Rothwell, Invariant descriptors for 3-D object recognition and pose, IEEE Trans. Pattern Anal. Mach. Intell. 13 (1991) 971-991.
[13] S. Linnainmaa, D. Harwood, L.S. Davis, Pose determination for a three-dimensional object using triangle pairs, IEEE Trans. Pattern Anal. Mach. Intell. 10 (1988) 634-647.
[14] J.L. Mundy, A. Zisserman (Eds.), Geometric Invariance in Computer Vision, MIT Press, Cambridge, MA, 1992.
[15] T. Suk, J. Flusser, Vertex-based features for recognition of projectively deformed polygons, Pattern Recognition 29 (1996) 361-367.
[16] R. Lenz, P. Meer, Point configuration invariants under simultaneous projective and permutation transformations, Pattern Recognition 27 (1994) 1523-1532.
[17] N.S.V. Rao, W. Wu, C.W. Glover, Algorithms for recognizing planar polygonal configurations using perspective images, IEEE Trans. Robotics Automat. 8 (1992) 480-486.
[18] P.J. Besl, N.D. McKay, A method for registration of 3-D shapes, IEEE Trans. Pattern Anal. Mach. Intell. 14 (1992) 239-256.
[19] P. Meer, S. Ramakrishna, A. Lenz, Correspondence of coplanar features through P2-invariant representations, Applications of Invariance in Computer Vision, Lecture Notes in Computer Science, vol. 825, Springer, Berlin, 1994, pp. 473-492.
Appendix A

Sometimes we need to save and load information about combinations to and from memory. Given the combinations of k elements from n, we can save this information in the following way:

a := 0
for i1 := 0; i1 < n
  for i2 := i1 + 1; i2 < n
    ...
      for ik := i(k-1) + 1; ik < n
        { m[a] := information(i1, i2, ..., ik)
          a := a + 1 }

When the information about a k-tuple (i1, i2, ..., ik) is required, we need to compute the address a from the k-tuple. If we sort the indices by size so that i1 < i2 < ... < ik, then this address can be computed as

$$a = \binom{n}{k} - 1 + \sum_{j=1}^{k} \sum_{m=0}^{k-j+1} (-1)^{m+1} \binom{n}{k-j-m+1} \binom{i_j + m}{m}. \qquad (41)$$
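Equivalently (our sketch, not the paper's), the address is the lexicographic rank of the k-tuple, which can be computed with single binomial terms and checked against the storage loop of Appendix A by brute force:

import math
from itertools import combinations

def address(idx, n):
    """Lexicographic rank of the sorted k-tuple idx among the C(n, k)
    combinations of {0, ..., n-1}; agrees with Eq. (41)."""
    k = len(idx)
    return math.comb(n, k) - 1 - sum(
        math.comb(n - 1 - i, k - j + 1) for j, i in enumerate(idx, start=1)
    )

# Brute-force check against the enumeration order of the storage loop.
n, k = 7, 3
for a, combo in enumerate(combinations(range(n), k)):
    assert address(combo, n) == a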
Appendix B

$$p(I'_1) = 10.110488\, I'^{6}_1 - 27.936483\, I'^{5}_1 + 31.596612\, I'^{4}_1 - 16.504259\, I'^{3}_1 - 0.32251158\, I'^{2}_1 + 3.0473587\, I'_1 - 0.66901966.$$

If $I'_1 < 0.475$ then

$$d(I'_1) = 17.575974\, I'^{4}_1 - 16.423212\, I'^{3}_1 + 9.111527\, I'^{2}_1 - 0.43942294\, I'_1 + 0.016542258,$$

else

$$d(I'_1) = 3.9630392\, I'^{4}_1 - 13.941518\, I'^{3}_1 + 21.672754\, I'^{2}_1 - 17.304971\, I'_1 + 5.6198814.$$
About the Author - TOMÁŠ SUK was born in Prague, Czech Republic, on April 30, 1964. He received the M.Sc. degree in Technical Cybernetics from the Czech Technical University, Prague, Czech Republic, in 1987 and the Ph.D. degree in Computer Science from the Czechoslovak Academy of Sciences in 1992. Since 1987 he has been with the Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic, Prague. His current research interests include image pre-processing, geometric invariants and remote sensing. He has authored or coauthored more than 20 scientific publications in these areas.
About the Author - JAN FLUSSER was born in Prague, Czech Republic, on April 30, 1962. He received the M.Sc. degree in Mathematical Engineering from the Czech Technical University, Prague, Czech Republic, in 1985 and the Ph.D. degree in Computer Science from the Czechoslovak Academy of Sciences in 1990. Since 1985 he has been with the Institute of Information Theory and Automation, Academy of Sciences of the Czech Republic, Prague. Since 1995 he has held the position of head of the Department of Image Processing. Since 1991 he has also been affiliated with the Faculty of Mathematics and Physics, Charles University, Prague, where he gives courses on Digital Image Processing. His current research interests include digital image processing, pattern recognition and remote sensing. He has authored or coauthored more than 30 scientific publications in these areas. Dr Flusser is a member of the Pattern Recognition Society, the IEEE Signal Processing Society, the IEEE Computer Society and the IEEE Geoscience and Remote Sensing Society.
Pattern Recognition 33 (2000) 263-280
Boundary detection by contextual non-linear smoothing

Xiuwen Liu(a,b,*), DeLiang L. Wang(a,c), J. Raul Ramirez(b)

(a) Department of Computer and Information Science, The Ohio State University, 2015 Neil Avenue, Columbus, OH 43210, USA
(b) Center for Mapping, The Ohio State University, 1216 Kinnear Road, Columbus, OH 43212, USA
(c) Center for Cognitive Science, The Ohio State University, 1961 Tuttle Park Place, Columbus, OH 43210, USA

Received 7 April 1998; received in revised form 16 November 1998; accepted 18 February 1999
Abstract

In this paper we present a two-step boundary detection algorithm. The first step is a nonlinear smoothing algorithm based on an orientation-sensitive probability measure. By incorporating geometrical constraints through the coupling structure, we obtain a robust nonlinear smoothing algorithm from which many nonlinear algorithms can be derived as special cases. Even when noise is substantial, the proposed smoothing algorithm can still preserve salient boundaries. Compared with anisotropic diffusion approaches, the proposed nonlinear algorithm not only performs better in preserving boundaries but also has a non-uniform stable state, whereby reliable results are available within a fixed number of iterations independent of images. The second step is simply a Sobel edge detection algorithm without non-maximum suppression and hysteresis tracking. Due to the proposed nonlinear smoothing, salient boundaries are extracted effectively. Experimental results using synthetic and real images are provided. © 1999 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Nonlinear smoothing; Contextual information; Anisotropic diffusion; Edge detection; Boundary detection
1. Introduction

One of the fundamental tasks in low-level machine vision is to locate discontinuities in images corresponding to physical boundaries between regions. A common practice is to identify local maxima in local gradients of images; methods of this kind are collectively known as edge detection algorithms. The Sobel edge detector [1] consists of two 3x3 convolution kernels, which respond maximally to vertical and horizontal edges, respectively. Local gradients are estimated by convolving the image with the two kernels, and thresholding is then applied to get rid of noisy responses. The Sobel edge detector is computationally efficient but sensitive to noise. To make the estimation of gradients more reliable, the image can be convolved with a low-pass filter before estimation, and two
* Corresponding author. Tel.: +1-614-292-7402; fax: +1-614-688-0066. E-mail address: [email protected] (X. Liu)
influential methods are due to Marr and Hildreth [2] and Canny [3]. By convolving the image with a Laplacian-of-Gaussian kernel, the resulting local maxima, which are assumed to correspond to meaningful edge points, are zero-crossings in the filtered image [2]. Canny [3] derived an optimal step edge detector using variational techniques starting from some optimality criteria, and used the first derivative of a Gaussian as a good approximation of the derived detector. Edge points are then identified using non-maximum suppression and hysteresis thresholding for better continuity of edges. As noticed by Marr and Hildreth [2], edges detected at a fixed scale are not sufficient, and multiple scales are essentially needed in order to obtain good results. By formalizing the multiple-scale approach, Witkin [4] and Koenderink [5] proposed Gaussian scale space. The original image is embedded in a family of gradually smoothed images controlled by a single parameter, which is equivalent to solving a heat equation with the input as the initial condition [5]. While Gaussian scale space has nice properties and is widely used in machine vision [6], a major limitation is that Gaussian smoothing inevitably blurs edges and
other important features due to its low-pass nature. To overcome this limitation, anisotropic diffusion, which was proposed by Cohen and Grossberg [7] in modeling the primary visual cortex, was formulated by Perona and Malik [8]:

$$\frac{\partial I}{\partial t} = \mathrm{div}\,\big(g(\|\nabla I\|)\,\nabla I\big). \qquad (1)$$

Here div is the divergence operator, g is a nonlinear monotonically decreasing function, and $\nabla I$ denotes the gradient. By making the diffusion conductance depend explicitly on local gradients, anisotropic diffusion prefers intra-region smoothing over inter-region smoothing, resulting in immediate localization while noise is reduced [8]. Because it produces visually impressive results, anisotropic diffusion has generated much theoretical as well as practical interest (see Ref. [9] for a recent review). While many improvements have been proposed, including spatial regularization [10] and edge-enhancing anisotropic diffusion [11], the general framework remains the same. As shown by You et al. [12], anisotropic diffusion given by Eq. (1) is the steepest gradient descent minimizer of the following energy function:

$$E(I) = \int_{\Omega} f(\|\nabla I\|)\, d\Omega \qquad (2)$$

with

$$g(\|\nabla I\|) = \frac{f'(\|\nabla I\|)}{\|\nabla I\|}.$$

Under some general conditions, the energy function given by Eq. (2) has a unique and trivial global minimum, where the image is constant everywhere, and thus interesting results exist only within a certain period of diffusion. An immediate problem is how to determine the termination time, which we refer to as the termination problem.
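A minimal sketch of the common four-neighbor discretization of Eq. (1); this is the standard Perona-Malik scheme rather than code from the paper, and the conductance g and step size lam are typical choices:

import numpy as np

def anisotropic_diffusion_step(I, K=10.0, lam=0.2):
    """One four-neighbor Perona-Malik update of Eq. (1).

    Conductance g(s) = exp(-(s/K)**2); lam <= 0.25 for stability.
    """
    I = I.astype(float)
    p = np.pad(I, 1, mode='edge')  # replicated border
    dN = p[:-2, 1:-1] - I          # differences to the four neighbors
    dS = p[2:, 1:-1] - I
    dW = p[1:-1, :-2] - I
    dE = p[1:-1, 2:] - I
    g = lambda d: np.exp(-(d / K) ** 2)
    return I + lam * (g(dN) * dN + g(dS) * dS + g(dW) * dW + g(dE) * dE)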
Fig. 1. An example with non-uniform boundary gradients and substantial noise. (a) A noise-free synthetic image. Gray values in the image: 98 for the left '[' region, 138 for the square, 128 for the central oval, and 158 for the right ']' region. (b) A noisy version of (a) with Gaussian noise of sigma = 40. (c) Local gradient map of (b) using the Sobel operators. (d)-(f) Smoothed images from an anisotropic diffusion algorithm [13] at 50, 100, and 1000 iterations. (g)-(i) Corresponding edge maps of (d)-(f), respectively, using the Sobel edge detector.
While there are some heuristic rules on how to choose the stopping time [10,11], in general it corresponds to the open problem of automatic scale selection. As in Gaussian scale space, a fixed time would not be sufficient to obtain good results. Another problem of anisotropic diffusion is that the diffusion conductance is a deterministic function of local gradients, which, similar to non-maximum suppression in edge detection algorithms, makes the implicit assumption that larger gradients are due to true boundaries. When noise is substantial and gradients due to noise and boundaries cannot be distinguished based on magnitude, the approach tends to fail to preserve meaningful region boundaries. To illustrate the problems, Fig. 1a shows a noise-free image, where the gradient magnitudes along the central square change considerably. Fig. 1b shows a noisy version of Fig. 1a obtained by adding Gaussian noise with zero mean and sigma = 40, and Fig. 1c shows its local gradient magnitude obtained using the Sobel operators [1]. While the three major regions in Fig. 1b may be perceived, Fig. 1c is very noisy and the strong boundary fragment is barely visible. Fig. 1d-f show the images smoothed by an anisotropic diffusion algorithm [13] with the specified numbers of iterations. Fig. 1g-i show the edge maps of Fig. 1d-f, respectively, using the Sobel edge detection algorithm. While at the 50th iteration the result is still noisy, the result becomes meaningless at the 1000th iteration. Even though the result at the 100th iteration is visually good, the boundaries are still fragmented and it is not clear how to identify a 'good' number of iterations automatically. These problems, to a large extent, are due to the assumption that local maxima in gradient images are good edge points. In other words, due to noise, responses from true boundaries and those from noise are not distinguishable based on magnitude. To overcome these problems, contextual information, i.e., responses from neighboring pixels, should be incorporated in order to reduce ambiguity, as in relaxation labeling and related methods [14-17]. In general, relaxation labeling methods use a pair-wise compatibility measure, determined from a priori models associated with labels, and their convergence is not known and is often very slow in numerical simulations [18]. In this paper, by using an orientation-sensitive probability measure, we incorporate contextual information through geometrical constraints on the coupling structure. Numerical simulations show that the resulting nonlinear algorithm has a non-uniform stable state, and good results can be obtained within a fixed number of iterations independent of the input images. Also, the oriented probability measure is defined on the input data, and thus no a priori models need to be assumed. In Section 2, we formalize our contextual nonlinear smoothing algorithm and show that many nonlinear smoothing algorithms can be treated as special cases. Section 3 gives some theoretical results as well as numerical simulations regarding the stability and convergence
of the algorithm. Section 4 provides experimental results using synthetic and real images. Section 5 concludes the paper with further discussions.
2. Contextual nonlinear smoothing algorithm

2.1. Design of the algorithm

To design a statistical algorithm, with no prior knowledge, we assume a Gaussian distribution within each region. That is, given a pixel $(i_0, j_0)$ and a window $R_{(i_0,j_0)}$ at pixel $(i_0, j_0)$, consisting of a set of pixel locations, we assume that

$$P(I_{(i_0,j_0)}, R) = \frac{1}{\sqrt{2\pi}\,\sigma_R} \exp\!\left\{-\frac{(I_{(i_0,j_0)} - \mu_R)^2}{2\sigma_R^2}\right\}, \qquad (3)$$

where $I_{(i,j)}$ is the intensity value at pixel location $(i, j)$. To simplify notation, without confusion, we use R to stand for $R_{(i_0,j_0)}$. Intuitively, $P(I_{(i_0,j_0)}, R)$ is a measure of compatibility between the intensity value at pixel $(i_0, j_0)$ and the statistical distribution in window R. To estimate the unknown parameters $\mu_R$ and $\sigma_R$, consider the pixels in R as n realizations of Eq. (3), where $n = |R|$. The likelihood function of $\mu_R$ and $\sigma_R$ is [19]

$$L(R; \mu_R, \sigma_R) = \left(\frac{1}{\sqrt{2\pi}\,\sigma_R}\right)^{n} \exp\!\left(-\frac{1}{2\sigma_R^2} \sum_{(i,j)\in R} (I_{(i,j)} - \mu_R)^2\right). \qquad (4)$$

By maximizing Eq. (4), we get the maximum likelihood estimators for $\mu_R$ and $\sigma_R$:

$$\hat{\mu}_R = \frac{1}{n} \sum_{(i,j)\in R} I_{(i,j)}, \qquad (5a)$$

$$\hat{\sigma}_R = \sqrt{\frac{1}{n} \sum_{(i,j)\in R} (I_{(i,j)} - \hat{\mu}_R)^2}. \qquad (5b)$$

To do nonlinear smoothing, similar to selective smoothing filters [20,21], suppose that there are M windows $R^{(m)}$, $1 \le m \le M$, around a central pixel $(i_0, j_0)$. These $R^{(m)}$'s can be generated from one or several basis windows through rotation, motivated by the experimental findings of orientation selectivity in the visual cortex [22]. Simple examples are elongated rectangular windows (refer to Fig. 6), which are used throughout this paper for synthetic and real images. The probability that pixel $(i_0, j_0)$ belongs to $R^{(m)}$ can be estimated from Eqs. (3), (5a) and (5b). By assuming that the weight of each $R^{(m)}$ should be proportional to the probability, as in relaxation labeling [14,15], we obtain an iterative nonlinear smoothing filter:

$$I^{t+1}_{(i_0,j_0)} = \frac{\sum_m P(I^t_{(i_0,j_0)}, R^{(m)})\, \hat{\mu}^t_{R^{(m)}}}{\sum_m P(I^t_{(i_0,j_0)}, R^{(m)})}. \qquad (6)$$
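A sketch of the per-window statistics and the weighted update of Eq. (6) at a single pixel (illustrative; window offsets are assumed given, and border handling is omitted):

import numpy as np

def smooth_pixel(I, i0, j0, windows):
    """One application of Eq. (6) at pixel (i0, j0).

    `windows` is a list of M windows, each a list of (di, dj) offsets
    defining R(m) relative to the central pixel.
    """
    num, den = 0.0, 0.0
    for offsets in windows:
        vals = np.array([I[i0 + di, j0 + dj] for di, dj in offsets])
        mu, sigma = vals.mean(), max(vals.std(), 1e-6)  # Eqs. (5a), (5b)
        # Compatibility of the central pixel with window R(m), Eq. (3).
        p = (np.exp(-(I[i0, j0] - mu) ** 2 / (2 * sigma ** 2))
             / (np.sqrt(2 * np.pi) * sigma))
        num += p * mu
        den += p
    return num / den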
A problem with this filter is that it is not sensitive to weak edges, due to the linear combination. To generate more semantically meaningful results and increase the sensitivity even to weak edges, we apply a nonlinear function to the weights, which is essentially the same as in anisotropic diffusion [8]:

$$I^{t+1}_{(i_0,j_0)} = \frac{\sum_m g(P(I^t_{(i_0,j_0)}, R^{(m)}))\, \hat{\mu}^t_{R^{(m)}}}{\sum_m g(P(I^t_{(i_0,j_0)}, R^{(m)}))}. \qquad (7)$$

Here g is a nonlinear monotonically increasing function.(1) A good choice for g is an exponential function, which is widely used in nonlinear smoothing and anisotropic diffusion approaches:

$$g(x) = \exp(x^2/K). \qquad (8)$$

Here the parameter K controls the sensitivity to edges [23]. Eq. (7) provides a generic model for a wide range of nonlinear algorithms, the behavior of which largely depends on the sensitivity parameter K. When K is large, Eq. (7) reduces to the equally weighted average smoothing filter. When K is around 0.3, g is close to a linear function on [0, 1] and Eq. (7) then reduces to Eq. (6). When K is a small positive number, Eq. (7) will be sensitive to all discontinuities. No matter how small the weight of one window may be, theoretically speaking, if it is nonzero, then as $t \to \infty$ the system will reach a uniform stable state. Similar to anisotropic diffusion approaches, the desired results will be time-dependent, and the termination problem becomes a critical issue for autonomous solutions. To overcome this limitation, we restrict smoothing to the window with the highest probability, similar to selective smoothing [20,21]:

$$m^* = \arg\max_{1 \le m \le M} P(I_{(i_0,j_0)}, R^{(m)}). \qquad (9)$$

The nonlinear smoothing through Eq. (9) is desirable in regions that are close to edges. By using appropriate $R^{(m)}$'s, Eq. (9) encodes discontinuities implicitly. But in homogeneous regions, Eq. (9) may produce artificial block effects due to intensity variations. Under the proposed statistical formulation, there is an adaptive method to detect homogeneity. Based on the assumption that there are M windows $R^{(m)}$ around a central pixel $(i_0, j_0)$, where each window has a Gaussian distribution, consider the mean of each window as a new random variable:

$$\mu(m) = \frac{1}{|R^{(m)}|} \sum_{(i,j)\in R^{(m)}} I_{(i,j)}. \qquad (10)$$

Because $\mu(m)$ is a linear combination of random variables with a Gaussian distribution, $\mu(m)$ also has a Gaussian distribution with the same mean and a standard deviation given by

$$\sigma_\mu(m) = \frac{1}{\sqrt{|R^{(m)}|}}\, \sigma_{R^{(m)}}. \qquad (11)$$

This provides a probability measure of how likely it is that the M windows are sampled from one homogeneous region. Given a confidence level $\alpha$, for each pair of windows $R^{(m_1)}$ and $R^{(m_2)}$ we have

$$|\mu(m_1) - \mu(m_2)| \le \min\!\left(\sqrt{\frac{\log(1/\alpha)}{|R^{(m_1)}|}}\, \hat{\sigma}_{R^{(m_1)}},\; \sqrt{\frac{\log(1/\alpha)}{|R^{(m_2)}|}}\, \hat{\sigma}_{R^{(m_2)}}\right). \qquad (12)$$

(1) Because the probability measure given by Eq. (3) is inversely related to the gradient measure used in most nonlinear smoothing algorithms, Eq. (8) is an increasing function instead of a decreasing function in our method.
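The homogeneity test of Eq. (12) compares window means against a confidence-scaled threshold; a small sketch with hypothetical inputs (per-window means, standard deviations and sizes are assumed precomputed):

import math
from itertools import combinations

def windows_homogeneous(means, stds, sizes, alpha=0.05):
    """Check Eq. (12) for every pair of the M windows; return True if all
    pairs pass, i.e. the windows are likely one homogeneous region."""
    c = math.log(1.0 / alpha)
    for m1, m2 in combinations(range(len(means)), 2):
        thresh = min(math.sqrt(c / sizes[m1]) * stds[m1],
                     math.sqrt(c / sizes[m2]) * stds[m2])
        if abs(means[m1] - means[m2]) > thresh:
            return False
    return True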
If all the pairs satisfy Eq. (12), the M windows are likely from one homogeneous region with confidence $\alpha$. Intuitively, under the assumption of a Gaussian distribution, when we have more samples, i.e., the window $R^{(m)}$ is larger, the estimation of the mean is more precise and so the threshold should be smaller. In a region with a larger standard deviation, the threshold should be larger because larger variations are allowed.

The nonlinear smoothing algorithm outlined above works well when noise is not very large. When the signal-to-noise ratio is very low, the probability measure given in Eq. (3) would be unreliable because pixel values change considerably. This problem can be alleviated by using the mean value of pixels sampled from R which are close to the central pixel $(i_0, j_0)$, or along a certain direction, to make the algorithm more orientation sensitive.

To summarize, we obtain the following nonlinear smoothing algorithm. We define M oriented windows which can be obtained by rotating one or more basis windows. At each pixel, we estimate parameters using Eqs. (5a) and (5b). If all the M windows belong to one homogeneous region according to Eq. (12), we do the smoothing using all the M windows. Otherwise, the smoothing is done only within the most compatible window, given by Eq. (9).

2.2. A generic nonlinear smoothing framework

In this section we show how several widely used nonlinear algorithms can be derived from the nonlinear smoothing algorithm outlined above. Several early nonlinear filters [20,21] do the smoothing in the window whose standard deviation is smallest. These filters can be obtained by simplifying Eq. (3) to

$$P(I_{(i_0,j_0)}, \hat{\mu}, \hat{\sigma}) = C\,\frac{1}{\sqrt{2\pi}\,\hat{\sigma}}, \qquad (13)$$

where C is a constant. Then the solution to Eq. (9) is the window with the smallest deviation. Recently, Higgins
and Hsu [24] extended the principle of choosing the window with the smallest deviation for edge detection.

Another nonlinear smoothing filter is the gradient-inverse filter [25]. Suppose that there is one window, i.e., M = 1, consisting of the central pixel $(i_0, j_0)$ itself only; the estimated deviation for a given pixel $(i, j)$ in Eq. (5b) now becomes

$$\hat{\sigma} = |I_{(i,j)} - I_{(i_0,j_0)}|. \qquad (14)$$

Eq. (14) is a popular way to estimate local gradients. Using Eq. (13) as the probability measure, Eq. (6) becomes exactly the gradient-inverse nonlinear smoothing filter [25].

The Smallest Univalue Segment Assimilating Nucleus (SUSAN) nonlinear smoothing filter [26] is based on the SUSAN principle. It is formulated as

$$I^{t+1}_{(i_0,j_0)} = \frac{\sum_{(di,dj)\neq(0,0)} W(i_0, j_0, di, dj)\, I^t_{(i_0+di,\, j_0+dj)}}{\sum_{(di,dj)\neq(0,0)} W(i_0, j_0, di, dj)}, \qquad (15)$$

where

$$W(i_0, j_0, di, dj) = \exp\!\left(-\frac{r^2}{2\sigma^2} - \frac{(I^t_{(i_0+di,\, j_0+dj)} - I^t_{(i_0,j_0)})^2}{T^2}\right).$$

Here $(i_0, j_0)$ is the central pixel under consideration, and $(di, dj)$ defines a local neighborhood. Essentially, it integrates Gaussian smoothing in the spatial and brightness domains. The parameter T is a threshold for intensity values. It is easy to see from Eq. (15) that the weights are derived from pair-wise intensity differences. The SUSAN filter would be expected to perform well when images consist of relatively homogeneous regions and within each region the noise is smaller than T. When noise is substantial, it fails to preserve structures due to the pair-wise difference calculation, where no geometrical constraints are incorporated. This is consistent with the experimental results, which will be discussed later. To get the SUSAN filter, we define one window including the central pixel itself only. For a given pixel $(i, j)$ in its neighborhood, Eq. (3) can be simplified to

$$P(I_{(i,j)}, R) = C \exp\!\left\{-\frac{(I_{(i,j)} - \hat{\mu}_R)^2}{T^2}\right\}, \qquad (16)$$

where C is a scaling factor. Because $\hat{\mu}_R$ is now $I_{(i_0,j_0)}$, Eq. (6) with the probability measure given by Eq. (16) is equivalent to the Gaussian smoothing in the brightness domain in Eq. (15).

Now consider anisotropic diffusion given by Eq. (1). By discretizing Eq. (1) in the image domain with four nearest-neighbor coupling [13] and rearranging terms, we have

$$I^{t+1}_{(i,j)} = g^t_{(i,j)} I^t_{(i,j)} + \lambda \sum_m g(P(I^t_{(i,j)}, R^{(m)}))\, \hat{\mu}^t_{R^{(m)}}. \qquad (17)$$

If we have four singleton regions, Eq. (17) is essentially a simplified version of Eq. (7) with an adaptive learning rate.

3. Analysis

3.1. Theoretical results

One of the distinctive characteristics of the proposed algorithm is that it imposes spatial constraints among responses from neighboring locations through the coupling structure, as opposed to a pair-wise coupling structure. Fig. 2 illustrates the concept using a manually constructed example. Fig. 2a shows the oriented windows in a 3x3 neighborhood, and Fig. 2b shows the coupling structure if we apply the proposed algorithm to a small image patch. The directed graph is constructed as follows: there is a directed edge from $(i_1, j_1)$ to $(i_0, j_0)$ if and only if $(i_1, j_1)$ contributes to the smoothing of $(i_0, j_0)$ according to Eqs. (12) and (9). By doing so, the coupling structure is represented as a directed graph, as shown in Fig. 2b. Connected components and strongly connected components [27] of the directed graph can be used to analyze the temporal behavior of the proposed algorithm. A strongly connected component is a set of vertices, or pixels here, where there is a directed path from any vertex to all the other vertices in the set. We obtain a connected component if we do not consider the direction of edges along a path. In the example shown in Fig. 2b, all the black pixels form a strongly connected component and so do all the white pixels. Also, there are obviously two connected components.

Essentially, our nonlinear smoothing algorithm can be viewed as a discrete dynamic system, whose behavior is complex due to the spatial constraints imposed by coupling windows and the adaptive coupling structure determined by probabilistic grouping. We now prove that a constant region satisfying certain geometrical constraints is a stable state of the smoothing algorithm.

Theorem. If a region S of a given image I satisfies
$$(i_1, j_1) \in S \text{ and } (i_2, j_2) \in S \;\Rightarrow\; I_{(i_1,j_1)} = I_{(i_2,j_2)}, \qquad (18a)$$

$$\forall (i, j) \in S \;\Rightarrow\; \exists m:\; R^{(m)}_{(i,j)} \subseteq S, \qquad (18b)$$

then S is stable with respect to the proposed algorithm.

Proof. Condition (18a) states that S is a constant region, and the standard deviation is zero if $R^{(m)}$ is within S, according to Eq. (5b). Consider a pixel $(i_0, j_0)$ in S. Inequality (12) is satisfied only when all $R^{(m)}$'s are within S; in this case, the smoothing algorithm does not change the intensity value at $(i_0, j_0)$. Otherwise, $R^{(m^*)}$ according to Eq. (9) must be within S, because there exists at least one such window according to Eq. (18b), and thus the smoothing
algorithm does not change the intensity value at $(i_0, j_0)$ either. So S is stable. □

When its pixels are constant, a maximum connected component of the constructed graph is stable; thus the maximum connected components of the constructed graph form a piecewise constant stable solution of the proposed algorithm. For the image patch given in Fig. 2b, for example, a stable solution is one in which the pixels in each of the two connected components are constant. The noise-free image in Fig. 1a is also a stable solution by itself, as we will demonstrate through numerical simulations later on.

Fig. 2. Illustration of the coupling structure of the proposed algorithm. (a) Eight oriented windows and a fully connected window defined on a 3x3 neighborhood. (b) The resulting coupling structure. There is a directed edge from $(i_1, j_1)$ to a neighbor $(i_0, j_0)$ if and only if $(i_1, j_1)$ contributes to the smoothing of $(i_0, j_0)$ according to Eqs. (12) and (9). Filled circles represent black pixels, empty circles represent white pixels, and hatched circles represent gray pixels. Ties in Eq. (9) are broken according to the left-right and top-down preference of the oriented windows in (a).
It is easy to see from the proof that any region which satisfies conditions (18a) and (18b) during temporal evolution will stay unchanged. In addition, due to the smoothing nature of the algorithm, a local maximum at iteration t cannot increase under the smoothing kernel of Eq. (12) or Eq. (9), and similarly, a local minimum cannot decrease. We conjecture that any given image approaches an image that is almost covered by homogeneous regions. Due to the spatial constraints given by Eq. (18b), it is not clear whether the entire image converges to a piecewise constant stable state. Within each resulting homogeneous region, Eq. (18b) is satisfied and thus the region becomes stable. For pixels near boundaries, corners, and junctions, it is possible that Eq. (18b) is not uniquely satisfied within one constant region, and small changes may persist. The whole image in this case attains a quasi-equilibrium state. This is supported by the following numerical simulations using synthetic and real images. While there are pixels which do not converge within 1000 iterations, the smoothed image as a whole does not change noticeably at all. The two maximum strongly connected components in Fig. 2b satisfy condition (18b); both of them are actually uniform regions and thus are stable. Gray pixels are grouped into one of the two stable regions according to pixel-value similarity and spatial constraints.

Fig. 3. Temporal behavior of the proposed algorithm with respect to the amount of noise. Six noisy images are obtained by adding zero-mean Gaussian noise with sigma of 5, 10, 20, 30, 40 and 60, respectively, to the noise-free image shown in Fig. 1a. The plot shows the deviation from the ground-truth image with respect to iterations for the noise-free image and the six noisy images.

3.2. Numerical simulations

Because it is difficult to derive the speed of convergence analytically, we use numerical simulations to demonstrate the temporal behavior of the proposed algorithm. Since smoothing is achieved using an equally weighted average within selected windows, the algorithm should converge rather quickly in homogeneous regions. To obtain quantitative estimates, we define two measures similar to variance. For synthetic images, where a noise-free image is available, we define the deviation from the ground truth image as
$$D_{(I)} = \sqrt{\frac{\sum_i \sum_j (I_{(i,j)} - I^{gt}_{(i,j)})^2}{|I|}}. \qquad (19)$$

Here I is the image to be measured and $I^{gt}$ is the ground truth image. The deviation gives an objective measure of how good the smoothed image is with respect to the true image. To measure the convergence, we define the relative variance for image I at time t:

$$\sqrt{\frac{\sum_i \sum_j (I^t_{(i,j)} - I^{t-1}_{(i,j)})^2}{|I|}}. \qquad (20)$$
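Both measures are root-mean-square differences and are straightforward to compute; a minimal sketch (ours), where |I| is the number of pixels:

import numpy as np

def deviation_from_truth(I, I_gt):
    """Deviation D_(I) of Eq. (19) from the ground-truth image."""
    return np.sqrt(np.sum((I - I_gt) ** 2) / I.size)

def relative_variance(I_t, I_prev):
    """Relative variance of Eq. (20) between consecutive iterations."""
    return np.sqrt(np.sum((I_t - I_prev) ** 2) / I_t.size)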
We have applied the proposed algorithm to the noise-free image shown in Fig. 1a and to six noisy images generated from it by adding zero-mean Gaussian noise with sigma from 5 to 60. Fig. 3 shows the deviation from the ground-truth image over the iterations, and Fig. 4 shows the relative variance of the noise-free image and four selected noisy images (four, to keep the figure readable). As we can see from Fig. 3, the noise-free image is a stable solution by itself: its deviation is always zero. For the noisy images, the deviation from the true image stabilizes within a small number of iterations, independent of the amount of noise. Fig. 4 shows that the relative variance is bounded by a small upper limit after 10 iterations. This residual variance is due to the pixels close to boundaries, corners and junctions that do not belong to any resulting constant region. As discussed before, because the spatial constraints cannot be satisfied within one homogeneous region, these pixels have connections from pixels belonging to different homogeneous regions, and thus fluctuate. These pixels are in general a small fraction of the input image, and thus the fluctuations do not affect the
Fig. 4. Relative variance of the proposed algorithm for the noise-free image shown in Fig. 1a and four noisy images with zero-mean Gaussian noise with σ of 5, 20, 40 and 60, respectively.
Fig. 5. Relative variance of the proposed algorithm for the real images shown in Figs. 9–12.
quality of the smoothed images noticeably. As shown in Fig. 3, the deviation stabilizes quickly. Real images are generally more complicated than synthetic images, both statistically and structurally, and we have therefore also applied our algorithm to the four real images shown in Figs. 9–12, which include a texture image. Fig. 5 shows the relative variance over 100 iterations; the variance is bounded after 10 iterations regardless of the image. This indicates that the proposed algorithm behaves similarly for synthetic and real images.
4. Experimental results

4.1. Results of the proposed algorithm

The nonlinear smoothing algorithm formalized in this paper integrates discontinuity and homogeneity through the orientation-sensitive probability framework: Eq. (9) represents discontinuity implicitly and Eq. (12) encodes homogeneity explicitly. Because of the probability measure, initial errors in choosing smoothing windows due to noise can be overcome by the coupling structure. Essentially, only when the majority of the pixels in one window make a wrong decision would the final result be affected. As illustrated in Fig. 2, the coupling structure is robust.

Fig. 6. The oriented bar-like windows used throughout this paper for synthetic and real images. The size of each kernel is approximately 3×10 pixels.

To achieve optimal performance, the size and shape of the oriented windows are application dependent. However, due to the underlying coupling structure, the proposed algorithm gives good results for a wide range of parameter values. For example, the same oriented windows are used throughout the experiments in this paper. As shown in Fig. 6, these oriented windows are generated by rotating two rectangular basis windows of size 3×10 pixels. The preferred orientation of each window is consistent with the orientation sensitivity of cell responses in the visual cortex [22]. Asymmetric window shapes are used so that 2-D features such as corners and junctions can be preserved.
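A rough sketch of how such oriented bar-like windows could be generated is given below. It is only an illustration of the idea behind Fig. 6: the number of orientations, the interpolation order, and the use of a single symmetric bar (the paper rotates two rectangular, asymmetric basis windows) are our assumptions.

```python
import numpy as np
from scipy.ndimage import rotate

def oriented_windows(height=3, width=10, n_orientations=8):
    """Generate normalized bar-like averaging kernels at several orientations."""
    base = np.ones((height, width), dtype=float)
    windows = []
    for k in range(n_orientations):
        angle = 180.0 * k / n_orientations           # preferred orientation
        w = rotate(base, angle, reshape=True, order=1)
        w = np.clip(w, 0.0, 1.0)                     # keep weights in [0, 1]
        windows.append(w / w.sum())                  # equally weighted average
    return windows
```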
As is evident from numerous simulations, the proposed algorithm generates stable results at around 10 iterations regardless of the input image. Thus, all the boundaries for the proposed algorithm are generated from the smoothed images at the 11th iteration. As stated above, boundaries are detected using the Sobel edge detector because of its efficiency. Fig. 7 shows the results of applying the proposed algorithm to a set of noisy images obtained from the noise-free image shown in Fig. 1a by adding Gaussian noise with σ of 10, 40, and 60, respectively. The same smoothing parameters are used for the three images. When the noise is relatively small, the proposed algorithm preserves boundaries accurately, as well as corners and junctions, as shown in Fig. 7a. When the noise is substantial, the coupling structure keeps the proposed algorithm robust and salient boundaries are well preserved. Because only local information is used in the system, the boundaries are expected to be less accurate when the noise is larger. This uncertainty is an intrinsic property of the proposed algorithm, because reliable estimation becomes more difficult as noise increases, as shown in Fig. 7b and c. The results seem consistent with our perceptual experience. Fig. 8 shows the result for another synthetic image, which was used extensively by Sarkar and Boyer [28]. As shown in Fig. 8b, noise is reduced greatly and boundaries as well as corners are well preserved. Even using the simple Sobel edge detector, the result is better than the best result from the optimal infinite impulse response filters [28] obtained using several parameter combinations with hysteresis thresholding. This is because their edge detector does not consider the responses of neighboring pixels, but rather takes local maxima as good edge points. Fig. 9 shows an image of a grocery store advertisement which was used throughout the book by Nitzberg et al. [29]. To obtain good boundaries, they first applied an edge detector and then several heuristic algorithms to close gaps and delete noise edges. In our system, the details and noise are smoothed out by the coupling structure while the salient boundaries, corners and junctions are preserved. The result shown in Fig. 9c is comparable with the result after several post-processing steps shown on p. 43 of the book. Fig. 10 shows a high-resolution satellite image of a natural scene, consisting of a river, soil land, and a forest. As shown in Fig. 10b, the river boundary, which is partially occluded by the forest, is delineated. The textured forest is smoothed into a homogeneous region. The major boundaries between different types of features are detected correctly. Fig. 11 shows an image of a woman which includes detail features and shading effects; its color version was used by Zhu and Yuille [30]. In their region competition algorithm, Zhu and Yuille [30] used a mixture-of-Gaussians model: a nonconvex energy function consisting of several constraint terms was formulated in a Bayesian framework, and the algorithm, derived using variational principles, is guaranteed to converge only to a local minimum. For our nonlinear algorithm, as shown in Fig. 11b, the details are smoothed out while important boundaries are preserved. The final result in Fig. 11c is comparable with the result of the region competition algorithm [30] applied to the color version after 130 iterations. Compared with the region competition algorithm, the main advantage of our approach is that local statistical properties are extracted and utilized effectively in the oriented probabilistic framework, instead of fitting the image to a global model which, in general, cannot be guaranteed to fit the given data well.
Fig. 7. The smoothed images at the 11th iteration and detected boundaries for three synthetic images obtained by adding the specified Gaussian noise to the noise-free image shown in Fig. 1a. The top row shows the input images, the middle row the smoothed images at the 11th iteration, and the bottom row the boundaries detected using the Sobel edge detector.
Fig. 8. The smoothed image at the 11th iteration and detected boundaries for a synthetic image with corners.
To further demonstrate the effectiveness of the proposed algorithm, we have also applied it to the texture image shown in Fig. 12a. As shown in Fig. 12b, the boundaries between different textures are preserved while most of the detail features are smoothed out. Fig. 12c shows the boundaries detected by the Sobel edge detector. While there are some noisy responses due to the texture patterns, the main detected boundaries are connected. A simple region growing algorithm would segment the smoothed image into four regions.
Fig. 9. The smoothed image at the 11th iteration and detected boundaries for a grocery store advertisement. Details are smoothed out while major boundaries and junctions are preserved accurately.
Fig. 10. The smoothed image at the 11th iteration and detected boundaries for a natural satellite image with several land-use patterns. The boundaries between different regions are formed from noisy segments due to the coupling structure.
Fig. 11. The smoothed image at the 11th iteration and detected boundaries for a woman image. While the boundaries between large features are preserved and detected, detail features such as facial features are smoothed out.
Fig. 12. The smoothed image at the 11th iteration and detected boundaries for a texture image. The boundaries between different textured regions are formed while details due to textures are smoothed out.
While this example is not intended to show that our algorithm can process texture images, it demonstrates that the proposed algorithm generalizes to distributions other than the Gaussian assumed when formalizing the algorithm.

4.2. Comparison with nonlinear smoothing algorithms

In order to evaluate the performance of the proposed algorithm relative to existing nonlinear smoothing methods, we have conducted a comparison with three recent methods. The SUSAN nonlinear filter [26] has been claimed to be the best, integrating smoothing in both the spatial and brightness domains. The original anisotropic model by Perona and Malik [8] is still widely used and studied. The edge-enhancing anisotropic diffusion model proposed by Weickert [9,11] incorporates true anisotropy using a diffusion tensor calculated from a Gaussian kernel, and is probably the most sophisticated diffusion-based smoothing algorithm to date. An objective comparison using real images is difficult because there is no universally accepted ground truth. Here we use synthetic images, where the ground truth is known and the deviation calculated by Eq. (19) gives an objective measure of the quality of the smoothed images. We have also tuned parameters to achieve the best possible results for the methods being compared. For the SUSAN algorithm, we used several different values of the critical parameter T in Eq. (15). For the Perona–Malik model, we tried different nonlinear functions g in Eq. (1) with different parameters. For the Weickert model, we chose a good set of parameters for diffusion tensor estimation. In addition, we selected each method's best results in terms of deviation from the ground truth, which were then used for boundary detection. Because the three methods and the proposed algorithm can all be applied iteratively, we first compare their
temporal behavior. We apply each of them to the image shown in Fig. 7b for 1000 iterations and calculate the deviation and relative variance with respect to the number of iterations using Eqs. (19) and (20). Fig. 13 shows the deviation from the ground-truth image. The SUSAN filter quickly reaches a best state, but also quickly converges to a uniform state because of the Gaussian smoothing term in the filter (see Eq. (15)). The temporal behavior of the Perona–Malik model and the Weickert model is quite similar, although the Weickert model converges more rapidly to, and stays longer in, good results. The proposed algorithm converges and stabilizes quickly to a non-uniform state, and thus the smoothing can be terminated after several iterations. Fig. 14 shows the relative variance of the four methods along the iterations. Because the SUSAN algorithm converges to a uniform stable state, its relative variance goes to zero after a number of iterations. The relative variance of the Perona–Malik model is closely related to the g function in Eq. (1). Due to the spatial regularization using a Gaussian kernel, the Weickert model changes continuously and the diffusion lasts much longer, which accounts for why good results persist for a longer period than in the Perona–Malik model. As shown in Figs. 4 and 5, the proposed algorithm generates bounded small ripples in the relative variance measure. These ripples do not affect the smoothing results noticeably, as the deviation from the ground truth, shown in Fig. 13, stabilizes quickly. Next we compare the effectiveness of the four methods in preserving meaningful boundaries. Following Higgins and Hsu [24], we use two quantitative performance metrics to compare the edge detection results: P(AEDTE), the probability of a true edge pixel being correctly detected by a given method, and P(TEDAE), the probability of a detected edge pixel being a true edge pixel. Due to the uncertainty in edge localization, a detected edge pixel is considered correct if it is within two pixels of a ground-truth edge point obtained from the noise-free image.
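The two metrics can be computed from binary edge maps as follows. This sketch is our reading of the protocol: "within two pixels" is approximated by a (2·tol+1)×(2·tol+1) neighborhood (Chebyshev distance), which the paper does not specify precisely.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def edge_metrics(detected, true_edges, tol=2):
    """Return (P_AEDTE, P_TEDAE) for boolean edge maps of equal shape."""
    struct = np.ones((2 * tol + 1, 2 * tol + 1), dtype=bool)
    true_zone = binary_dilation(true_edges, structure=struct)
    det_zone = binary_dilation(detected, structure=struct)
    # P(AEDTE): fraction of true edge pixels with a detection within tol pixels.
    p_aedte = (true_edges & det_zone).sum() / max(true_edges.sum(), 1)
    # P(TEDAE): fraction of detected pixels within tol pixels of a true edge.
    p_tedae = (detected & true_zone).sum() / max(detected.sum(), 1)
    return p_aedte, p_tedae
```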
Fig. 13. Deviations from the ground-truth image for the four nonlinear smoothing methods. Dashed line: the SUSAN filter [26]; dotted line: the Perona–Malik model [8]; dash-dotted line: the Weickert model of edge-enhancing anisotropic diffusion [11]; solid line: the proposed algorithm.
Fig. 14. Relative variance of the four nonlinear smoothing methods. Dashed line: the SUSAN filter [26]; dotted line: the Perona–Malik diffusion model [8]; dash-dotted line: the Weickert model [11]; solid line: the proposed algorithm.
For each method, the threshold on the gradient magnitude of the Sobel edge detector is adjusted to achieve the best trade-off between detecting true edge points and rejecting false edge points. For the proposed algorithm, we use the result at the 11th iteration, because the proposed algorithm converges within several iterations. As mentioned before, for the other three methods we tune the critical parameters and
choose the smoothed images with the smallest deviation. Fig. 15 shows the smoothed images along with the boundaries detected using the Sobel edge detector for the image shown in Fig. 7a, where the added noise is Gaussian with zero mean and σ = 10. Table 1 summarizes the quantitative performance metrics. All four methods perform well, and the proposed method gives the best numerical scores.
Fig. 15. Smoothing results and detected boundaries of the four nonlinear methods for the synthetic image shown in Fig. 7a. Here the noise is not large and all of the methods perform well in preserving boundaries.
Table 1
Quantitative comparison of the boundary detection results shown in Fig. 15

Metric      SUSAN [26]   Perona–Malik [8]   Weickert [9,11]   Our method
P(TEDAE)    0.960        0.963              0.877             0.988
P(AEDTE)    0.956        0.964              0.880             0.979
Average     0.958        0.963              0.878             0.983
The boundary of the square is preserved accurately. For the central oval, the proposed algorithm gives a better-connected boundary while the other three leave gaps. The proposed algorithm also generates the sharpest edges, whereas the edges from the Weickert model are the most blurred, resulting in the worst numerical metrics among the four methods. Fig. 16 shows the results for the image in Fig. 7b, where the noise is substantial, and Table 2 shows the quantitative performance metrics. As shown in Fig. 16a, the SUSAN filter tends to fail to preserve boundaries, resulting in noisy boundary fragments. The Perona–Malik model produces good but fragmented boundaries. Because only the local gradient is used, the Perona–Malik model is noise-sensitive and thus generates more false responses than the other methods in this case. The false responses substantially lower the quantitative metrics of the model, making it the worst among the four methods. The Weickert model produces good boundaries for strong segments, but weak segments are blurred considerably. The proposed algorithm preserves the connected boundary of the square, and also partially fragmented boundaries of the central oval, yielding the best numerical metrics among the four methods.
As shown in Fig. 13, the smoothed image of the Weickert model has a smaller deviation than the result from our algorithm, yet its detected boundaries are fragmented. This is because our algorithm produces sharp boundaries, which incur larger penalties under Eq. (19) when not accurately located. Comparing Tables 1 and 2, one can see that our proposed method is the most robust: its average performance is degraded by only about 13%. The Perona–Malik model is the most noise-sensitive, with performance degraded by about 35%. For the SUSAN and Weickert models, the average performance is degraded by about 24% and 19%, respectively. We have also applied the four methods to the natural satellite image shown in Fig. 10 (Fig. 17). The result from the proposed algorithm is at the 11th iteration, as already shown in Fig. 10b. The results from the other three methods are selected manually for the best possible results. Due to the termination problem, results from most nonlinear smoothing algorithms have to be chosen manually, making them difficult to use automatically.
Fig. 16. Smoothing results and detected boundaries of the four nonlinear methods for the synthetic image with substantial noise shown in Fig. 7b. The proposed algorithm generates sharper and better-connected boundaries than the other three methods.
Table 2
Quantitative comparison of the boundary detection results shown in Fig. 16

Metric      SUSAN [26]   Perona–Malik [8]   Weickert [9,11]   Our method
P(TEDAE)    0.720        0.609              0.692             0.853
P(AEDTE)    0.713        0.618              0.688             0.854
Average     0.717        0.613              0.690             0.853
Fig. 17. Smoothing results and detected boundaries for the natural-scene satellite image shown in Fig. 10a. The smoothed image of the proposed algorithm is at the 11th iteration, while the smoothed images of the other three methods are chosen manually. While the other three methods generate similarly fragmented boundaries, the proposed algorithm forms the boundaries between different regions due to its coupling structure.
The results from the other three methods are similar, and the boundaries between different regions are not formed. In contrast, our algorithm generates connected boundaries separating the major regions.
5. Conclusions

In this paper we have presented a two-step robust boundary detection algorithm. The first step is a nonlinear smoothing algorithm based on an orientation-sensitive probability measure. This algorithm is motivated by the orientation sensitivity of cells in the visual cortex [22]. By incorporating geometrical constraints through the coupling structure, the algorithm is robust to noise while preserving meaningful boundaries. Even though the algorithm was formulated based on a Gaussian distribution, it performs well for real and even textured images, showing its generalization capability. It is also easy to see that the formalization of the proposed algorithm extends to other known distributions by changing Eqs. (3), (4), (5a) and (5b) accordingly. One such extension would be to use a mixture of Gaussian distributions [31], so that the model may be able to describe arbitrary probability distributions.

Compared with recent anisotropic diffusion methods, our algorithm approaches a non-uniform stable state, and reliable results can be obtained after a fixed number of iterations. In other words, it provides a solution to the termination problem. When noise is substantial, our algorithm preserves meaningful boundaries better than the diffusion-based methods, because the coupling structure employed is more robust than a pairwise coupling structure.

Scale is an intrinsic parameter in machine vision, as interesting features may exist only in a limited range of scales. Scale spaces based on linear and nonlinear smoothing kernels do not represent semantically meaningful structures explicitly [32]. A solution could be to use the parameter K in Eq. (8) as a control parameter [23], which is essentially a threshold on gray values. Under this formalization, Eq. (12) could offer adaptive parameter selection. With the robust coupling structure, our algorithm with adaptive parameter selection may be able to provide a robust multiscale boundary detection method. Another advantage of the probability-measure framework is that there is no need to assume a priori knowledge about each region, which is necessary in relaxation labeling [14,15], and comparison across windows with different sizes and shapes is feasible. This could lead to an adaptive window selection that preserves small but important features, which cannot be handled well by the current implementation.

There is one intrinsic limitation common to many smoothing approaches, including the proposed one. After
smoothing, the only available feature is the average gray value, resulting in a loss of information for further processing. One way to overcome this problem is to apply the smoothing in feature spaces derived from the input images [33]. Another disadvantage of the proposed algorithm is its relatively intensive computation due to the use of oriented windows. Each oriented window takes roughly as long per iteration as the edge-enhancing diffusion method [11]. On the other hand, because our algorithm is entirely local and parallel, computation time would not be a problem on parallel and distributed hardware. Computation on serial computers could be reduced dramatically by decomposing the oriented filters hierarchically, so that oriented windows would be used only around discontinuities rather than in homogeneous regions. The decomposition techniques for steerable and scalable filters [34] could also help to reduce the number of necessary convolution kernels.
Acknowledgements

The authors would like to thank Dr. Ke Chen and Erdogan Cesmeli for useful discussions, and Dr. Ke Chen for letting us use his implementation of the edge-enhancing anisotropic diffusion algorithm [11]. The authors would also like to thank an anonymous reviewer for valuable comments. This work was partially supported by an NSF grant (IRI-9423312) and an ONR Young Investigator Award (N00014-96-1-0676) to DLW. The authors also thank the National Imagery and Mapping Agency and the Ohio State University Center for Mapping for partially supporting this research. Thanks to K. Boyer, S. Zhu, and M. Nitzberg for making their test images available on-line.
References

[1] K.R. Castleman, Digital Image Processing, Prentice Hall, Englewood Cliffs, NJ, 1996.
[2] J. Canny, A computational approach to edge detection, IEEE Trans. Pattern Anal. Mach. Intell. 8 (1986) 679–698.
[3] D. Marr, E. Hildreth, Theory of edge detection, Proc. Roy. Soc. London B 207 (1980) 187–217.
[4] A. Witkin, Scale-space filtering, in: Proceedings of the Eighth International Joint Conference on Artificial Intelligence, 1983, pp. 1023–1026.
[5] J. Koenderink, The structure of images, Biol. Cybernet. 50 (1984) 363–370.
[6] T. Lindeberg, Scale-Space Theory in Computer Vision, Kluwer Academic Publishers, Dordrecht, Netherlands, 1994.
[7] M.A. Cohen, S. Grossberg, Neural dynamics of brightness perception: features, boundaries, diffusion and resonance, Perception Psychophys. 36 (1984) 428–456.
[8] P. Perona, J. Malik, Scale space and edge detection using anisotropic diffusion, IEEE Trans. Pattern Anal. Mach. Intell. 12 (1990) 16–27.
[9] J. Weickert, A review of nonlinear diffusion filtering, in: Proceedings of the First International Conference on Scale-Space, 1997, pp. 3–28.
[10] F. Catte, P.-L. Lions, J.-M. Morel, T. Coll, Image selective smoothing and edge detection by nonlinear diffusion, SIAM J. Numer. Anal. 29 (1992) 182–193.
[11] J. Weickert, Theoretical foundations of anisotropic diffusion in image processing, in: W. Kropatsch, R. Klette, F. Solina (Eds.), Theoretical Foundations of Computer Vision, Computing Supplement, vol. 11, Springer, Wien, 1996, pp. 221–236.
[12] Y.L. You, W. Xu, A. Tannenbaum, M. Kaveh, Behavioral analysis of anisotropic diffusion in image processing, IEEE Trans. Image Process. 5 (1996) 1539–1553.
[13] P. Perona, T. Shiota, J. Malik, Anisotropic diffusion, in: B.M. ter Haar Romeny (Ed.), Geometry-Driven Diffusion in Computer Vision, Kluwer Academic Publishers, Dordrecht, Netherlands, 1994, pp. 73–92.
[14] A. Rosenfeld, R.A. Hummel, S.W. Zucker, Scene labeling by relaxation operations, IEEE Trans. Systems Man Cybernet. 6 (1976) 420–433.
[15] R.A. Hummel, S.W. Zucker, On the foundations of relaxation labeling processes, IEEE Trans. Pattern Anal. Mach. Intell. 5 (1983) 267–287.
[16] C.H. Li, C.K. Lee, Image smoothing using parametric relaxation, Graphical Models Image Process. 57 (1995) 161–174.
[17] M.W. Hansen, W.E. Higgins, Relaxation methods for supervised image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 19 (1997) 949–962.
[18] D. Marr, Vision, W.H. Freeman, San Francisco, 1982.
[19] H. Stark, J.W. Woods, Probability, Random Processes and Estimation Theory for Engineers, Prentice-Hall, Englewood Cliffs, NJ, 1994.
[20] F. Tomita, S. Tsuji, Extraction of multiple regions by smoothing in selected neighborhoods, IEEE Trans. Systems Man Cybernet. 7 (1977) 107–109.
[21] M. Nagao, T. Matsuyama, Edge preserving smoothing, Comput. Graphics Image Process. 9 (1979) 394–407.
[22] D.H. Hubel, Eye, Brain, and Vision, W.H. Freeman and Company, New York, 1988.
[23] P. Saint-Marc, J.-S. Chen, G. Medioni, Adaptive smoothing: a general tool for early vision, IEEE Trans. Pattern Anal. Mach. Intell. 13 (1991) 514–529.
[24] W.E. Higgins, C. Hsu, Edge detection using two-dimensional local structure information, Pattern Recognition 27 (1994) 277–294.
[25] D.C.C. Wang, A.H. Vagnucci, C.C. Li, Gradient inverse weighted smoothing scheme and the evaluation of its performance, Comput. Graphics Image Process. 15 (1981) 167–181.
[26] S.M. Smith, J.M. Brady, SUSAN – a new approach to low level image processing, Int. J. Comput. Vision 23 (1997) 45–78.
[27] T.H. Cormen, C.E. Leiserson, R.L. Rivest, Introduction to Algorithms, MIT Press, Cambridge, MA, 1997.
[28] S. Sarkar, K.L. Boyer, On optimal infinite impulse response edge detection filters, IEEE Trans. Pattern Anal. Mach. Intell. 13 (1991) 1154–1171.
[29] M. Nitzberg, D. Mumford, T. Shiota, Filtering, Segmentation, and Depth, Springer, Berlin, 1993.
[30] S.C. Zhu, A. Yuille, Region competition: unifying snakes, region growing, and Bayes/MDL for multiband image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 18 (1996) 884–900.
[31] B.S. Everitt, D.J. Hand, Finite Mixture Distributions, Chapman & Hall, London, 1981.
[32] M. Tabb, N. Ahuja, Multiscale image segmentation by integrated edge and region detection, IEEE Trans. Image Process. 6 (1997) 642–655.
[33] R. Whitaker, G. Gerig, Vector-valued diffusion, in: B.M. ter Haar Romeny (Ed.), Geometry-Driven Diffusion in Computer Vision, Kluwer Academic Publishers, Dordrecht, Netherlands, 1994, pp. 93–134.
[34] P. Perona, Deformable kernels for early vision, IEEE Trans. Pattern Anal. Mach. Intell. 17 (1995) 488–499.
About the Author*XIUWEN LIU received the B.Eng. degree in Computer Science in 1989 from Tsinghua University, Beijing, China and the M.S. degrees in Geodetic Science and Surveying in 1995 and Computer and Information Science in 1996 both from the Ohio State University, Columbus, OH. From August 1989 to February 1993, he was with the Department of Computer Science and Technology at Tsinghua University. Since September 1994, he has been with the Center for Mapping at the Ohio State University working on various satellite image understanding projects. Now he is a Ph.D. candidate in the Department of Computer and Information Science at the Ohio State University. His current research interests include image segmentation, neural networks, pattern recognition with applications in automated map revision. He is a student member of IEEE, IEEE Computer Society, IEEE Signal Processing Society, ACSM, and ASPRS.
About the Author*DELIANG L. WANG was born in Anhui, the People's Republic of China in 1963. He received the B.Sc. degree in 1983 and the M.Sc. degree in 1986 from Peking (Beijing) University, Beijing, China, and the Ph.D. degree in 1991 from the University of Southern California, Los Angeles, CA, all in Computer Science. From July 1986 to December 1987 he was with the Institute of Computing Technology, Academia Sinica, Beijing. Since 1991, he has been with the Department of Computer and Information Science and Center for Cognitive Science at the Ohio State University, Columbus, OH, where he is currently an Associate Professor. His present research interests include neural networks for perception, neurodynamics, neuroengineering, and computational neuroscience. He is a member of IEEE, AAAS, IEEE Systems, Man, and Cybernetics Society, and the International Neural Network Society. He is a recipient of the 1996 US O$ce of Naval Research Young Investigator Award.
About the Author*J. RAUL RAMIREZ is a Senior Research Scientist at the Ohio State University Center for Mapping. His research interests include Cadastral Information Systems, Cartographic Automation, Cartographic Expert Systems, Cartographic Generalization, Digital Data Exchange, Geographic Information Systems, Land Information Systems, Quality of Digital Spatial Data, Spatial Data Revision, Theoretical Cartography, and Visualization of Spatial Information. Dr. Ramirez received his M.S. and Ph.D. from the Ohio State University and his B.S. from the Universidad Distrital of Bogota, Colombia, his country of origin. Dr. Ramirez directed the GISOM (Generating Information from Scanning Ohio Maps) project.
Pattern Recognition 33 (2000) 281–293
A global energy approach to facet model and its minimization using weighted least-squares algorithm

C.H. Li*, P.K.S. Tam

Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hung Hom, Hong Kong

Received 6 April 1998; received in revised form 27 October 1998; accepted 5 January 1999
Abstract

A global energy approach and a weighted least-squares facet algorithm have been developed using the facet model. The global energy method gives a near-optimal estimation of the ideal image as predicted by the facet model using a regularization framework. The weighted least-squares facet algorithm improves the estimation of the facet parameters by taking into account the variances of estimations from different resolution cells. Experimental results show the validity of the global energy approach and the superiority of the WLS facet algorithms on both synthetic images and real range images. © 1999 Published by Elsevier Science Ltd. All rights reserved.

Keywords: Image filtering; Facet model; Global energy approach; Weighted least squares
1. Introduction

Facet models are important tools in computer vision and image processing. In the facet model, the observed digital image is considered a noisy realization of an underlying piecewise continuous intensity surface. Popular forms of the facet model include the flat facet model, the sloped facet model and the quadratic facet model. A large variety of applications have been developed around the facet model, including image filtering [1], edge detection [2,3], image segmentation [4], shape from shading [5] and topography analysis [6]. In the application to edge detection, Huang and Tseng [7] found that the sloped facet model of Haralick and Shapiro gives a better interpretation of local intensity changes than other techniques. In this paper, we are interested in the iterated facet algorithm for estimating gray-level values developed by Haralick [1,2]. The theoretical simplicity and intuitive nature of the facet model allow an easy understanding of the algorithm and of its suitability for different scenarios. The strong edge-retaining ability of the facet model makes the algorithm especially suitable as a pre-processing routine in image analysis applications.

While the iterated facet algorithm gives a good estimation when the image is not heavily contaminated, its performance degrades rapidly as the noise variance increases. The performance of noise filtering can be improved by increasing the size of the facets, at the expense of losing the resolution of fine details in the image. Thus the size of the facet has to be chosen as a compromise between noise reduction and resolution limitations. This resolution versus noise-filtering trade-off can be resolved by developing an appropriate energy approach to the facet model. The iterated facet algorithm developed by Haralick can be viewed as a local winner-take-all minimizer of the global facet energy. The local estimation can be improved by a weighted least-squares estimation instead of a winner-take-all estimation. Experimental results show the validity of the global facet energy formulation and the superiority of the weighted least-squares estimation method over the iterated facet algorithm.
* Corresponding author. Tel.: +852-2766-6043; fax: +852-2346389. E-mail address: [email protected] (C.H. Li).

2. Iterated facet model

The iterated facet model for filtering was proposed by Haralick in Refs. [1,2]. The iterated model for image data
assumes that the image can be partitioned into connected regions called facets, which satisfy a predefined model. The pixels of the image are lexicographically indexed by the set $S=\{1,\ldots,m\}$, where m is the number of pixels in the image. The gray values of the pixels are denoted by $g=\{g_1,\ldots,g_m\}$. A resolution cell $W_{ik}$ is a group of neighboring pixels with specific geometry and size, where $k=\{1,\ldots,K\times K\}$. The resolution cells overlap with each other; thus each pixel is contained in more than one resolution cell. For example, if the resolution cell is defined as a square of $K\times K$ pixels, then any interior pixel of the image is contained in up to $K\times K$ resolution cells. The pixels inside a resolution cell are assumed to follow a facet model. For example, if a sloped facet model is assumed for a resolution cell $W_{ik}$, any pixel j inside $W_{ik}$ is assumed to obey

$$\hat g_{j_{ik}} = a_{ik}r + b_{ik}c + c_{ik},\qquad(1)$$

where r, c are the indices corresponding to the row and column of the pixel j, and $a_{ik}$, $b_{ik}$, $c_{ik}$ are constants for the $W_{ik}$ cell. If we adopt the flat facet model, any pixel j inside the cell $W_{ik}$ is assumed to obey

$$\hat g_{j_{ik}} = c_{ik},\qquad(2)$$

where $c_{ik}$ is constant for the $W_{ik}$ cell. In all facet models, the fit of the facet model to the $W_{ik}$th cell is defined by the following measure of 'goodness of fit':

$$\rho_{ik} = \sum_{j\in W_{ik}} (\hat g_{j_{ik}} - g_j)^2,\qquad(3)$$

which is simply the sum of the residues between the gray values of the fitted surface and the gray values of the pixels. As the gray value of each pixel i is predicted by a number of cells containing pixel i, there are different predictions for pixel i from different cells $W_{ik}$. Haralick selects the prediction value for pixel i from the best-fitting block containing pixel i. The iterated facet filtering is given by

$$g_i^{(t+1)} = \hat g_{ik^*}^{(t)}\quad\text{where } k^*:\ \rho_{ik^*}=\min_k \rho_{ik},\qquad(4)$$

where $g_i^{(t+1)}$ is the gray value of the ith pixel at the (t+1)th iteration and $\hat g_{ik}^{(t)}$ is the estimated value for pixel i using the resolution cell $W_{ik}$ at the tth iteration. This iterative procedure has been proven to converge, and the resultant image has all its pixels obeying the facet assumption. A major drawback of the processed image is that the continuity among neighboring facets is weak; the facets do not grow significantly larger than the size of the resolution cells and thus cannot represent larger objects with accuracy. This can be ameliorated by choosing a larger resolution cell, at the expense of losing fine details of the image. In the next section, a global energy approach to the facet model will be
developed which resolves the resolution versus noise-filtering trade-off of the iterated facet model.
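To make the winner-take-all rule of Eq. (4) concrete, the following sketch implements one iteration of the iterated flat facet filter for K = 3. The reflective border handling is our simplification; the paper does not state how borders are treated.

```python
import numpy as np

def flat_facet_iteration(img, K=3):
    """One iteration of Haralick's iterated flat facet filter (Eqs. (2)-(4))."""
    pad = K - 1
    p = np.pad(img.astype(float), pad, mode="reflect")
    H, W = img.shape
    out = np.empty((H, W), dtype=float)
    for i in range(H):
        for j in range(W):
            best_rho, best_c = np.inf, 0.0
            for di in range(K):          # every K x K cell containing (i, j)
                for dj in range(K):
                    cell = p[i + di:i + di + K, j + dj:j + dj + K]
                    c = cell.mean()                  # flat facet fit, Eq. (2)
                    rho = ((cell - c) ** 2).sum()    # goodness of fit, Eq. (3)
                    if rho < best_rho:
                        best_rho, best_c = rho, c
            out[i, j] = best_c                       # winner-take-all, Eq. (4)
    return out
```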
3. Regularization approaches to the facet model

The regularization approach [8,9] to the image reconstruction problem can be modeled using a cost function $U(u)$,

$$U(u) = v(u) + \lambda h(u,d),\qquad(5)$$

where d is the observed image, u is the unknown image to be estimated, $v(u)$ is the regularization term or prior cost function containing information specified a priori, and $h(u,d)$ is the data term, which incorporates knowledge of the corruption process. A popular choice for the data function h is the quadratic function $h(g,d)=(g-d)^2$. The most challenging part of the problem is the specification of the prior cost function. An intuitive solution is to assume that the image has gray values which vary smoothly and to employ a measure which reflects the degree of smoothness of the solution. The constant $\lambda$ is the regularization constant controlling the trade-off between fidelity of the reconstruction to the data and the smoothness constraint. Classical image models constructed within this type of framework include Geman's line process model [10]. We will use the facet model to construct the regularization function. The reconstructed image $\hat u$ is given by the minimizer

$$\hat u = \arg\min_u \{v(u) + \lambda h(u,d)\}.\qquad(6)$$

The energy for reconstruction is given by

$$U(\hat u) = \sum_{i\in S} \min_k \rho_{ik} + \lambda \sum_{i\in S} (g_i - d_i)^2,\qquad(7)$$

where $\lambda$ is the regularization constant. The iterated facet algorithm can be considered a local minimizer of the above energy function, where for each pixel i the new gray value is predicted by its best-fitting neighboring cells. Compared with Geman's line process model [10], the global facet model does not rely on an extra set of line process variables and is thus computationally simpler. Compared with the constrained restoration approach [11], the facet energy does not need the construction of an extra concave constraint cost function. As the above energy term is non-convex and not differentiable, gradient descent algorithms are not applicable. Stochastic optimization techniques such as simulated annealing [12] can be employed for finding the global minimizer of Eq. (7). In the experimental section, simulated annealing using the Gibbs sampler [10] is employed for finding the solutions. A major drawback of stochastic optimization techniques like simulated
annealing is their high computational cost. For applications where a simple and effective algorithm is desired, an approach based on the weighted least-squares method is proposed.
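For reference, a direct (unoptimized) evaluation of the energy of Eq. (7) for the flat facet model might look as follows; treating each pixel's prior term as the smallest residue among the K×K cells containing it is our reading of Eq. (7).

```python
import numpy as np

def global_facet_energy(u, d, lam, K=3):
    """Energy of Eq. (7) for the flat facet model (direct, unoptimized)."""
    pad = K - 1
    p = np.pad(u.astype(float), pad, mode="reflect")
    H, W = u.shape
    prior = 0.0
    for i in range(H):
        for j in range(W):
            best = np.inf
            for di in range(K):
                for dj in range(K):
                    cell = p[i + di:i + di + K, j + dj:j + dj + K]
                    rho = ((cell - cell.mean()) ** 2).sum()
                    best = min(best, rho)
            prior += best            # min_k rho_ik for pixel (i, j)
    data = ((u - d) ** 2).sum()      # fidelity to the observed image
    return prior + lam * data
```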
4. Iterated weighted least-squares facet estimation

In Haralick's iterated facet model, the estimation of a pixel's gray value depends only on the best-fitting hypothesis cell and ignores the outputs of all other cells containing the pixel. When pixel i is covered by k resolution cells, there are k predicted values for pixel i, given by $\{\hat g_{i1}, \hat g_{i2}, \ldots, \hat g_{ik}\}$, where each prediction has a goodness-of-fit value given by $\{\rho_{i1}, \rho_{i2}, \ldots, \rho_{ik}\}$, respectively. As shown in Eq. (4), the original facet model estimates the gray value of pixel i by choosing the best-fitting estimate and ignoring the others. The error of the estimation can be reduced by taking into account all the estimates involved: estimations from resolution cells with better fits are treated as more important than estimations from resolution cells with lower confidence. This type of estimation is motivated by the generalized least-squares approach [13], where the estimate depends on all the respective variables and their variances.

To study the accuracy of the facet model, consider the following case where the resolution cells are of dimension 3×3 pixels. Consider the two cases where nine resolution cells $W_{i1},\ldots,W_{i9}$ are located in a 5×5 homogeneous region with gray value $\mu_1$, and near the boundary between two regions with gray values $\mu_1$ and $\mu_2$, as shown in Fig. 1. Assuming white Gaussian noise corruption with standard deviation $\sigma$, we can derive the expected estimated gray values and the associated standard deviations of the estimation using a flat facet model. The exact expected value is only obtainable as a limit when an infinite number of samples is available; when a small number of samples is available, we would prefer an estimator with a smaller variance. In the case of the homogeneous region shown in Fig. 1b, the estimated values for the nine resolution cells covering pixel i are given by

$$E(\hat g_{ik}) = \mu_1,\quad k=1,\ldots,9,$$
$$E(\rho_{ik}) = 9\sigma^2,\quad k=1,\ldots,9.\qquad(8)$$

In the case of the homogeneous region, all nine resolution cells predict the correct value $\mu_1$ and the goodness of fit of the estimator is proportional to the noise variance. In the conventional facet model, only the value from one of the resolution cells is selected as the predicted value. With non-trivial noise and a small number of pixels in the resolution cells, the correct estimated value $\mu_1$ is not guaranteed. A linear combination of the estimated values from the nine
Fig. 1. Resolution cells and model gray values of a 5×5 neighborhood. (a) $W_{i1},\ldots,W_{i9}$ indicate the centers of the respective 3×3 resolution cells; (b) model gray values of a homogeneous 5×5 region; (c) model gray values of a 5×5 boundary region.
resolution cells would give a more accurate estimate of $\mu_1$.

In the case of the boundary region shown in Fig. 1c, the estimated values for the nine resolution cells are given by

$$E(\hat g_{ik}) = \mu_1,\quad k=1,\ldots,3,$$
$$E(\hat g_{ik}) = \tfrac{1}{3}(2\mu_1+\mu_2),\quad k=4,\ldots,6,$$
$$E(\hat g_{ik}) = \tfrac{1}{3}(\mu_1+2\mu_2),\quad k=7,\ldots,9,$$
$$E(\rho_{ik}) = 9\sigma^2,\quad k=1,\ldots,3,$$
$$E(\rho_{ik}) = 9\sigma^2 + 2(\mu_1-\mu_2)^2,\quad k=4,\ldots,6,$$
$$E(\rho_{ik}) = 9\sigma^2 + 2(\mu_1-\mu_2)^2,\quad k=7,\ldots,9.\qquad(9)$$

The derivation of the expected goodness of fit is given in Appendix A. In the case of the boundary region, the resolution cells $W_{i1}$, $W_{i2}$, $W_{i3}$ predict the correct value of $\mu_1$, while the resolution cells $W_{i4},\ldots,W_{i9}$ give biased estimates. The goodness-of-fit values of the three resolution cells with correct estimates are the smallest, and are significantly smaller than those of the resolution cells with biased estimates. Thus the goodness-of-fit value can be employed as a weighting factor in combining the various estimates, where a resolution cell with a small $\rho$ should be more significant and one with a large $\rho$ less significant. The increased accuracy resulting from combining more estimators compensates for the small increase in bias associated with using biased estimators. The iterated weighted least-squares (WLS) procedure is given by
$$g_i^{(t+1)} = \sum_{k=1}^{9} w_{ik}\,\hat g_{ik}^{(t)}\quad\text{where}\quad w_{ik} = \frac{\rho_{ik}^{-1}}{\sum_{k=1}^{9}\rho_{ik}^{-1}},\qquad(10)$$

where $g_i^{(t+1)}$ is the gray value of the ith pixel at the (t+1)th iteration. The weights $w_{ik}$ are selected to be proportional to $1/\rho_{ik}$. A major advantage of weighted regression is that it often results in an estimator with smaller variance. A difficulty of weighted regression lies in estimating the variances of the regressors. In the application of facet models, each pixel i is predicted by different cells $W_{ik}$, and each prediction is itself a least-squares fit to the pixel i. The sum of residues of the fitting error is therefore selected as a function for estimating the variance of the fitted value.
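A sketch of one WLS iteration of Eq. (10) for the flat facet model with 3×3 cells follows. The small eps guarding against division by zero in perfectly fitted cells is our addition; the paper does not discuss the ρ = 0 case.

```python
import numpy as np

def wls_flat_facet_iteration(img, K=3, eps=1e-8):
    """One WLS iteration, Eq. (10), for the flat facet model."""
    pad = K - 1
    p = np.pad(img.astype(float), pad, mode="reflect")
    H, W = img.shape
    out = np.empty((H, W), dtype=float)
    for i in range(H):
        for j in range(W):
            num = den = 0.0
            for di in range(K):
                for dj in range(K):
                    cell = p[i + di:i + di + K, j + dj:j + dj + K]
                    c = cell.mean()                            # cell estimate
                    w = 1.0 / (((cell - c) ** 2).sum() + eps)  # w ~ 1/rho
                    num += w * c
                    den += w
            out[i, j] = num / den    # rho^{-1}-weighted average of estimates
    return out
```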
5. Results and discussions

A number of synthetic images have been generated to test the performance of the algorithms. The first test image is generated using the piecewise-constant property. Both rectangles and circles of different gray values and different sizes are included in the image. Circles are included because their edges contain segments of different directions, so blocking artifacts at any specific orientation or size can easily be detected by inspecting the circumference. The reference image is then corrupted with additive Gaussian noise of standard deviation 10.0 or 20.0. The performance criteria for evaluating the algorithms are the mean squared error and the edge mean squared error. The mean squared error represents the performance of the different filtering algorithms. However, this estimate is often insufficient to represent the quality of the filtered image. The edges of the boundaries between regions in the image are especially important in many image analysis applications. Thus an intuitive criterion is to measure the mean squared error of the set of pixels lying adjacent to an edge. The set of edge pixels is shown in black in Fig. 2. In cases where a single criterion is needed, the geometric mean of the mean squared error (m.s.e.) and the edge mean squared error is taken as the
Fig. 2. (a) Synthetic flat image and (b) the set of pixels for evaluating the edge m.s.e.
combined error. This criterion is chosen to represent the scenario where both error criteria have to be considered simultaneously. The noise-corrupted version shown in Fig. 3 has additive Gaussian noise of standard deviation 20.0. Fig. 4 shows the m.s.e. of filtering the test image with Gaussian noise of standard deviation 20.0. The m.s.e. of the flat facet algorithm decreases to its minimum value of 9.80 at the third iteration. The WLS flat facet algorithm
attains its minimum of 7.43 at the fourth iteration. Moreover, the WLS facet algorithm attains a lower m.s.e. at its first iteration than the lowest mean squared error attainable by the iterated flat facet algorithm. Fig. 5 shows the edge m.s.e. of filtering the synthetic flat image corrupted with Gaussian noise of s.d. 20. The edge m.s.e. of the flat facet algorithm falls to its minimum at the first iteration. In the case of severe noise corruption, the conventional facet model fails to obtain a good estimate of the uncorrupted gray values. The edge m.s.e. of the WLS flat facet algorithm is better than that of the flat facet algorithm at all iterations, since the weighted combination of the various resolution cells gives more accurate estimates. Looking at the figures of the filtered images, the WLS facet filtered image has a less grainy texture than the flat
Fig. 4. Mean squared errors over the first seven iterations.
Fig. 3. (a) Flat image contaminated with Gaussian noise of s.d. 20 and (b) its edges.
Fig. 5. Edge mean squared errors over the first seven iterations.
facet algorithm (Figs. 6 and 7). The edge images are generated using the Canny edge algorithm with standard deviation 1. There are significantly fewer false edges with the iterated WLS facet algorithm than with the iterated flat facet algorithm. To compare the two facet methods with conventional low-pass filtering, a Gaussian kernel of width 6×6 is convolved with the corrupted image. As indicated in Fig. 8, the additive white Gaussian noise in the corrupted image is effectively filtered by low-pass filtering; however, the edges in the restored
Fig. 7. (a) Restored flat image using WLS flat facet and (b) its edges.
Fig. 6. (a) Restored flat image using flat facet and (b) its edges.
images are blurred, and a subsequent Canny edge detection reveals significant dislocations of the straight edges when compared with the facet algorithms. A summary of performances is tabulated in Table 1, comparing iterated median filtering, the iterated flat facet and the WLS iterated flat facet for filtering images with noise of standard deviation 10.0 and 20.0. The iterated median filtering employs a window of size 3×3. Each algorithm is
Fig. 8. (a) Restored flat image using Gaussian smoothing and (b) its edges.
applied iteratively to the corrupted images, and the lowest m.s.e. and the iteration at which it is attained are noted for each algorithm. The iterated median filtering has the largest m.s.e. and edge m.s.e. Although the iterated median filter is known for its good edge-retaining ability, its performance is significantly poorer than that of the flat facet algorithm and the WLS flat facet algorithm. The WLS flat facet algorithm has lower m.s.e. and edge m.s.e. than the flat facet algorithm. Although the WLS flat facet algorithm may require more iterations to attain its minimum, its errors are lower than those of the flat facet algorithm at all iterations.

A piecewise sloped image is also generated to test the performance of the sloped facet model and the WLS sloped facet model. Fig. 9 shows the sloped image and the set of points for calculating the edge mean squared error. Note that this image has a constant background and thus contains a mixture of both flat and sloped surfaces. The noise-corrupted version shown in Fig. 10 has additive Gaussian noise of standard deviation 20.0. Table 2 shows the summary results using the combined error in restoring the sloped image. Ten iterations of the iterated median, flat facet, sloped facet, WLS flat facet and WLS sloped facet are applied to two synthetic images with noise s.d. of 10.0 and 20.0, respectively. The lowest attained error is recorded together with the iteration number at which it is achieved. The iterated median filter has the largest combined error. As the iterated flat facet algorithm does not cater for the sloped nature of the image intensities, its error is larger than that of the iterated sloped facet algorithm. The WLS flat and WLS sloped facet filtered images have the lowest errors. The low mean squared error of the WLS flat facet model shows that it has a much higher approximating capability than the original flat facet algorithm under sloping conditions. Figs. 11 and 12 show the filtered images using the sloped facet algorithm and the WLS sloped facet algorithm. The WLS sloped facet filtered image has a less grainy texture than the sloped facet filtered image. To compare the edges extracted, the Canny edge algorithm is applied to the first difference of the images. The WLS sloped facet filtered image has far fewer false edges than the sloped facet filtered image.
Table 1
Summary of performances for filtering of corrupted images: (a1) m.s.e. and (a2) edge m.s.e. of iterated median; (b1) m.s.e. and (b2) edge m.s.e. of iterated flat facet; (c1) m.s.e. and (c2) edge m.s.e. of iterated WLS flat facet. The iteration at which the minimum is attained is given in parentheses.

Noise s.d.   (a1)        (a2)        (b1)       (b2)         (c1)       (c2)
10.0         5.27 (10)   10.27 (2)   4.95 (3)   6.23 (1)     3.83 (3)   5.94 (2)
20.0         7.14 (10)   19.12 (2)   9.80 (3)   10.94 (2)    7.43 (4)   9.88 (5)
Fig. 9. (a) Synthetic sloped image and (b) its edges.
Fig. 13 shows the restoration result of minimizing the global facet energy defined in Eq. (7) using simulated annealing. The flat facet resolution cell is chosen to be of size 2×2. The corrupted image has Gaussian noise of standard deviation 20.0 and is shown in Fig. 3. The noisy image is used as the annealing input. The initial temperature of the annealing is taken as 100.0 to give enough energy to escape from local states. An exponential cooling schedule is employed with decay factor 0.9995, and the annealing is stopped at a temperature of 0.01.
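The cooling schedule just described can be written compactly as below; `gibbs_sweep` is a placeholder for one Gibbs-sampler pass over the image under the energy of Eq. (7), which the paper adopts from [10] but does not spell out.

```python
def anneal(u0, gibbs_sweep, T0=100.0, decay=0.9995, T_stop=0.01):
    """Exponential cooling from T0 down to T_stop, as described in the text."""
    u, T = u0, T0
    while T > T_stop:
        u = gibbs_sweep(u, T)   # resample each pixel at temperature T
        T *= decay
    return u
```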
Fig. 10. (a) Sloped image contaminated with Gaussian noise of s.d. 20 and (b) its edges.
The restored image shows the capability of the global energy approach to successfully restore piecewise objects of different sizes, independently of the size of the resolution cells. The regions are smooth and the edges are sharp.

The facet algorithms are also applied to a range image. The range image has various types of inherent noise, including salt-and-pepper noise and correlated signal-dependent noise. Fig. 14 shows the range image and its edges as detected by the Canny edge algorithm. Ten
Table 2
Summary of performances using the combined error for filtering of corrupted images with (a) iterated median, (b) iterated flat facet, (c) iterated sloped facet, (d) iterated WLS flat facet, (e) iterated WLS sloped facet. The iteration at which the minimum is attained is given in parentheses.

Noise s.d.   (a)         (b)         (c)         (d)        (e)
10.0         10.63 (1)   7.32 (1)    6.92 (1)    6.04 (1)   6.03 (1)
20.0         13.63 (2)   12.06 (2)   10.57 (2)   9.87 (2)   9.67 (3)
Fig. 11. (a) Restored sloped image using sloped facet and (b) its edges.
Fig. 12. (a) Restored sloped image using WLS sloped facet and (b) its edges.
Fig. 13. (a) Restored flat image using global facet energy and (b) its edges.
iterations of the sloped facet algorithm and the WLS sloped facet algorithm are applied to the range image. The larger number of iterations is due to the difficulty of removing correlated signal-dependent noise compared with additive independent noise. In order to better understand the nature of the noise and the restoration results of the two facet algorithms, a horizontal scan-line of range data is taken from the original image and the restored images. Fig. 15 shows the plot of the range data of the
Fig. 14. (a) Range image and (b) its edges.
horizontal scan-line. The original range image consists of flat and sloped surfaces. The noise is correlated and is most significant between 140 and 220. Various spike noises are also present. The sloped facet algorithm removes most of the independent noise, and the restored image is significantly smoother. However, the restored surfaces show small facets with slopes fluctuating relative to their neighbors. This is particularly evident on the right side of the scan-line, where the data is contaminated with a high amount of correlated noise. The WLS facet algorithm
Fig. 16. (a) Restored range image using sloped facet and (b) its edges.

Fig. 17. (a) Restored range image using WLS sloped facet and (b) its edges.
Fig. 15. A scan-line of gray values from (a) the range image, (b) the image filtered by the sloped facet, and (c) the image filtered by the WLS sloped facet.
shows much better performance in estimating the slopes of the image where the data are corrupted with correlated noise. Figs. 16 and 17 show the restored range images and their edges. Both the visual quality of the images and the edges extracted from them show the superiority of the WLS sloped facet algorithm.

To conclude, the weighted least-squares facet algorithm and the global energy method have been developed using the facet model. The WLS facet algorithm improves the estimation of the facet parameters by taking into account the variances of estimations from different resolution cells. The global energy method gives a near-optimal estimation of the ideal image as predicted by the facet model using a regularization framework. Experimental results show the validity of the global energy approach and the superiority of the WLS facet algorithms on both synthetic images and real range images.
6. Summary

Facet models are important tools in computer vision and image processing. In the facet model, the observed digital image is considered a noisy realization of an underlying piecewise continuous intensity surface. Popular forms of the facet model include the flat facet model, the sloped facet model and the quadratic facet model. A global energy approach and a weighted least-squares facet algorithm have been developed using the facet model. The global energy method gives a near-optimal estimation of the ideal image as predicted by the facet model using a regularization framework. The weighted least-squares facet algorithm improves the estimation of the facet parameters by taking into account the variances of estimations from different resolution cells. Experimental results show the validity of the global energy approach and the superiority of the WLS facet algorithms on both synthetic images and real range images.
Acknowledgement

The work described in this paper was supported by grants from the Research Grants Council of the Hong Kong Special Administrative Region (Project No. HKP98/95E) and The Hong Kong Polytechnic University (A/Cs A-P178, G-S732).
Appendix A

The expected goodness of fit for the resolution cells $W_{i4}$, $W_{i5}$, and $W_{i6}$ is identical and can be calculated from the expected variances of the differences between the estimated values and the noise-corrupted sample values in the resolution cells:

$$E(\mathrm{Var}(\hat g_{j_{ik}} - g_{j_{ik}})) = E\bigl(\mathrm{Var}\bigl(\tfrac{1}{3}|\mu_2-\mu_1| + \delta\bigr)\bigr) = \tfrac{1}{9}(\mu_2-\mu_1)^2 + \sigma^2,\quad j=1,\ldots,6,$$
$$E(\mathrm{Var}(\hat g_{j_{ik}} - g_{j_{ik}})) = E\bigl(\mathrm{Var}\bigl(\tfrac{2}{3}|\mu_2-\mu_1| + \delta\bigr)\bigr) = \tfrac{4}{9}(\mu_2-\mu_1)^2 + \sigma^2,\quad j=7,\ldots,9,\qquad(11)$$

where $\delta$ is a Gaussian random variable with standard deviation $\sigma$. The goodness of fit is the total expected variance, summing the expected variances over the pixels $j$, $j=1,\ldots,9$. The expected goodness of fit for the resolution cells $W_{i7}$, $W_{i8}$, and $W_{i9}$ is similarly derived.
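A quick numerical check of Eq. (9) is possible. Following the appendix, take the fitted value of a boundary cell at its expected flat-facet estimate $(2\mu_1+\mu_2)/3$ and average the residue over noisy samples; it should approach $9\sigma^2 + 2(\mu_1-\mu_2)^2$. This sketch is ours, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
mu1, mu2, sigma, trials = 100.0, 120.0, 10.0, 200_000
cell = np.array([mu1] * 6 + [mu2] * 3)        # 6 pixels at mu1, 3 at mu2
fit = (2 * mu1 + mu2) / 3.0                   # expected flat-facet estimate
g = cell + rng.normal(0.0, sigma, size=(trials, 9))
rho = ((g - fit) ** 2).sum(axis=1)            # residue, Eq. (3), per trial
print(rho.mean(), 9 * sigma**2 + 2 * (mu1 - mu2) ** 2)   # both about 1700
```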
References
[1] R.M. Haralick, L.T. Watson, A facet model for image data, Comput. Graphics Image Process. 15 (1981) 113–129.
[2] R.M. Haralick, L.G. Shapiro, Computer and Robot Vision, Addison-Wesley, Reading, MA, 1992.
[3] A. Rangarajan, R. Chellappa, Y.T. Zhou, A model-based approach for filtering and edge detection in noisy images, IEEE Trans. Circuits Systems 37 (1990) 140–144.
[4] K. Schutte, Region growing with planar facets, in: Proceedings of the 8th Scandinavian Conference on Image Analysis, vol. 2, 1993, pp. 719–725.
[5] C.P. Ting, R.M. Haralick, L.G. Shapiro, Shape from shading using facet model, Pattern Recognition 22 (1989) 683–695.
[6] Q. Zheng, R. Chellappa, Estimation of surface topography from stereo SAR images, in: Proceedings of the IEEE ICASSP 1989, vol. 3, 1989, pp. 1614–1617.
[7] J.S. Huang, D.H. Tseng, Statistical theory of edge detection, Comput. Vision Graphics Image Process. 43 (1988) 337–346.
[8] G. Demoment, Image reconstruction and restoration: overview of common estimation structures and problems, IEEE Trans. Acoust. Speech Signal Process. 37 (12) (1989) 2024–2036.
[9] N.B. Karayiannis, A.N. Venetsanopoulos, Regularization theory in image restoration – the stabilizing functional approach, IEEE Trans. Systems Man Cybernet. 7 (1977) 435–441.
[10] S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell. 6 (1984) 721–741.
[11] D. Geman, G. Reynolds, Constrained restoration and the recovery of discontinuities, IEEE Trans. Pattern Anal. Mach. Intell. 14 (3) (1992) 367–383.
[12] S. Kirkpatrick, C.D. Gelatt, M.P. Vecchi, Optimization by simulated annealing, Science 220 (1983) 671–680.
[13] R.I. Jennrich, An Introduction to Computational Statistics, Prentice-Hall, Englewood Cliffs, NJ, 1995.
About the Author*CHUN HUNG LI received his Ph.D. in Electronic Engineering from The Hong Kong Polytechnic University in 1996. He is currently working as a post-doctoral fellow at the Department of Computer Science in the Hong Kong Baptist University. His research interests include stochastic image models, image analysis and pattern recognition. About the Author*PETER K.S. TAM received his B.E., M.E. and Ph.D. degrees in 1971, 1973 and 1976, respectively, all in Electrical Engineering from the University of Newcastle, Australia. From 1967 to 1980, he held a number of industrial and academic positions in Australia. In 1980, he joined The Hong Kong Polytechnic as a senior lecturer. He is now an Associate Professor in the Department of Electronic and Information Engineering, The Hong Kong Polytechnic University. Dr. Tam is a member of the IEEE, and has participated in organising a number of conferences. His research interests include signal processing, automatic control, fuzzy systems and neural networks.
Pattern Recognition 33 (2000) 295–308
Active vision-based control schemes for autonomous navigation tasks

Sridhar R. Kundur^a,*, Daniel Raviv^a,b
^a Robotics Center and Department of Electrical Engineering, Florida Atlantic University, Boca Raton, FL 33431, USA
^b Intelligent Systems Division, National Institute of Standards and Technology (NIST), Bldg. 220, Room B124, Gaithersburg, MD 20899, USA
Received 28 August 1998; accepted 2 March 1999
Abstract

This paper deals with active-vision-based practical control schemes for collision avoidance as well as maintenance of clearance in a-priori unknown textured environments. These control schemes employ a visual motion cue, called the visual threat cue (VTC), as a sensory feedback signal to accomplish the desired tasks. The VTC provides a measure of the relative change in range as well as clearance between a 3D surface and a moving observer. It is a collective measure obtained directly from the raw data of gray-level images, is independent of the type of 3D surface texture, is measured in [time^{-1}] units, and needs no 3D reconstruction. The control schemes are based on a set of If-Then fuzzy rules with almost no knowledge about the vehicle dynamics, speed, heading direction, etc. They were implemented in real time using a 486-based Personal Computer and a camera capable of undergoing 6-DOF motion. © 1999 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Active vision; Visual navigation; Collision avoidance
1. Introduction

The problem of automating vision-based navigation is a challenging one and has drawn the attention of several researchers over the past few years (see for example Refs. [1–18]). When dealing with a camera-based autonomous navigation system, a huge amount of visual data is captured. For vision-based navigation tasks like obstacle avoidance, maintaining safe clearance, etc., relevant visual information needs to be extracted from these visual data and used in a real-time closed-loop control system. Several questions need to be answered, including: (1) What is the relevant visual information to be extracted from a sequence of images? (2) How does one extract this information from a sequence of 2D images? (3) How
q This work was supported in part by a grant from the National Science Foundation, Division of Information, Robotics and Intelligent Systems, Grant # IRI-9115939.
* Corresponding author.
E-mail address: [email protected] (S.R. Kundur)
to generate control commands to the vehicle based on the visual information extracted? This paper provides answers to all three questions, with emphasis on the third one, i.e., generation of control signals for collision avoidance and maintenance of clearance using visual information only. Vision-based autonomous navigation systems consist of a vehicle (such as a car or golf cart), visual sensing devices (camera, frame grabber for digitizing images) and mechanical actuators for braking/steering of the vehicle. Relevant visual information is extracted from an image sequence and serves as input(s) to the feedback controller. The feedback controller generates appropriate signals to the mechanical actuators to brake/steer the vehicle (as shown in Fig. 1). Design of conventional feedback controllers needs a mathematical model of the system (including the vehicle as well as the mechanical actuators). Mathematical models for such systems are usually complex and may be difficult to define in some cases. On the other hand, fuzzy logic control, which is closer in spirit to human thinking, can implement linguistically expressed heuristic control policies directly without any knowledge
about the mathematical model of a complex process. This paper presents two practical control schemes for vision-based autonomous navigation tasks such as collision avoidance and maintenance of clearance. The control schemes are based on a set of If-Then fuzzy rules and need almost no information about the vehicle kinematics, dynamics, speed, heading direction, etc. Also, no a-priori information about the relative distances between the camera and the surfaces is necessary. The main focus of this paper is to present details of the controllers that accomplish the above-mentioned tasks. The input to these controllers is the visual threat cue (VTC), which can be extracted from a sequence of images (see Refs. [24–26] for details on the VTC), and the output is appropriate braking/steering commands to the mobile robot (see Fig. 1). These control schemes were implemented in textured environments and are almost independent of the type of texture in the scene. The approaches can be extended to texture-less environments as well [27].

Fig. 1. Proposed controller.

1.1. Related work in vision-based autonomous navigation

Autonomous intelligent robots play an important role in many applications including industrial automation, space exploration, autonomous driving/flying, handling of hazardous materials, etc. Over the past few decades several researchers have been exploring approaches to build intelligent, goal-driven robotic systems which can interact with the environment autonomously (see for example Refs. [1–18]). In the absence of a-priori information about the environment, such intelligence may be imparted by using external sensors, such as tactile, visual and audio sensors, to sense the environment and interact with it in an intelligent manner. Since our approach is based on visual sensing, we restrict our attention to vision-based intelligent robots only. In the animate world visual information plays a key role in controlling animal behavior in the environment (see for example Refs. [22,23]). Several psychologists have suggested that vision is the primary source of information about the surroundings and is responsible for controlling the visual behavior of humans in the environment (see for example Refs. [19–23]). These observations have motivated many researchers to employ visual information as the primary source of sensory feedback in building intelligent robotic systems. A brief review of related work in vision-based autonomous navigation is given in the following paragraphs.
1.1.1. Vision-based autonomous navigation using a-priori information

Papanikolopoulos and Khosla [1] presented algorithms for real-time visual tracking of arbitrary 3D objects moving at unknown velocity in a plane whose depth information is assumed to be known a-priori. They proposed an optical flow-based approach to compute the vector of discrete displacements at each instant of time. In Ref. [2], they described a vision sensor in the feedback loop within the framework of controlled active vision. This approach requires partial knowledge of the relative distance of the target with respect to the camera, which obviates the need for off-line calibration of the eye-in-hand robotic system. Feddema and Lee [3] proposed an adaptive approach for visual tracking of an a-priori known moving object with a monocular mobile camera. They employed a geometrical model of the camera to determine the linear differential transformation from image features to the camera position and orientation. Computer simulations were provided to verify the proposed algorithms. Reid et al. [4] described an active vision system that performs a surveillance task in a dynamic scene. These ideas were implemented on a special-purpose high-performance robotic head/eye platform. The controller was divided into two parts, namely the high-level control, which has some knowledge about vision and behaviors, and the low-level control, which had information about head kinematics and dynamics via joint angles and velocities from the encoders and the motor torques. Han and Rhee [5] describe a navigation method for a mobile robot that employs a monocular camera and a guide mark. They instruct the robot by means of a path drawn on a monitor screen. The images of the guide mark obtained by the camera provide information regarding the robot's position and heading direction. The robot adjusts its heading direction if any deviation from the specified path is detected. This approach was implemented on a real system with average speeds of 2.5 feet/s and deviations of less than one foot in an indoor environment. Turk et al. [6] described an approach to distinguish road and non-road regions by employing color images. They generate a new trajectory by minimizing a cost function based on the current heading of the vehicle, the curvature of the road scene model, attraction to a goal location, and changes in the road edges. The trajectory is then sent to the pilot module, which controls vehicle motion using lateral position, heading, and velocity error signals. They successfully implemented this approach to drive an autonomous land vehicle (ALV) at speeds up to 20 km/h. The vehicle motion is assumed to be known.
1.1.2. Autonomous navigation using conventional feedback control approaches

Feng and Krogh [7] describe a general approach to local navigation problems for autonomous mobile robots and its application to omnidirectional and conventionally steered wheel-bases. They formulate the problem of driving an autonomous mobile robot as a dynamic feedback control problem in which local feedback information is employed to make steering decisions. A class of satisficing feedback strategies is proposed to generate reasonable collision-free trajectories to the goal by employing robot dynamics and constraints. This approach was tested in simulations. Krotkov and Hoffman [8] present a terrain mapping system that constructs quantitative models of surface geometry for the Ambler, an autonomous walking robot designed to traverse terrains such as those on Mars. It employs a laser range finder to construct elevation maps at arbitrary resolutions. A PI control scheme that employed the elevation error as an error signal was implemented to adjust the elevation values and increase the accuracy of the elevation maps.

1.1.3. Optical flow-based autonomous navigation

In Ref. [9], Olson and Coombs outlined a general approach to vergence control that consisted of a control loop driven by an algorithm that estimates the vergence error. Coombs and Roberts [10] demonstrated the centering behavior of a mobile robot by employing peripheral optical flow. The system employs the maximum flow observed in the left and right peripheral visual fields to indicate obstacle proximity. A steering command to the robot is generated on the basis of the left and right proximities extracted using optical flow information. Santos-Victor et al. [11] presented a qualitative approach to vision-based autonomous navigation on the basis of optical flow. It is based on the use of two cameras mounted on a mobile robot with the optical axes directed in opposite directions such that there is no overlap between the two visual fields. They implemented these schemes on a computer-controlled mobile platform, the TRC Labmate.

1.1.4. Non-optical flow-based autonomous navigation

Romano and Ancona [12] present a visual sensor that obtains information about time-to-crash based on the expansion or contraction of the image area, without any explicit computation of the optic flow field. The information extracted from images was fed to an opto-motor reflex operating at 12.5 Hz. This controller was able to keep a constant distance between a frontal obstacle and itself. The whole approach was implemented and tested on a mobile platform. Joarder and Raviv [13] describe a looming-based algorithm [14] for autonomous obstacle avoidance. Visual looming is extracted from relative temporal variations of
projected area in the image and employed as sensory feedback to accomplish obstacle avoidance. In Ref. [15], they implemented a similar algorithm for obstacle avoidance by measuring looming from relative temporal variations of the edge density in a small window around the fixation point. Both algorithms were implemented on a 6-DOF flight simulator in an indoor environment. In Ref. [16], Broggi presented a vision-based road detection system implemented on a land vehicle called the MOB-LAB. It is assumed that the road is flat and that the complete acquisition parameters (camera optics, position, etc.) are known. The system is capable of detecting road markings on structured roads even in extremely severe shadow conditions. In Ref. [17], Luebbers describes a neural-network-based feature extraction system for an autonomous high-mobility multi-wheeled vehicle (HMMWV) application. A video camera is employed to obtain images of the road, and a neural network is employed to extract visual features from the image sequences. The road-following task was posed as a regulatory control task, and an expert system was used to improve the robustness of the control system. Yakali [18] describes several 2D visual cues for autonomous landing and road-following tasks. Using these visual cues, road-following tasks were successfully tested on a US Army land vehicle (HMMWV) equipped with a video camera in outdoor environments and on a Denning mobile robot in indoor environments. The autonomous landing task was implemented on a 6-DOF flight simulator in an indoor environment.

The above-mentioned references indicate that some autonomous vision-based navigation systems need a-priori information about the environment [1–6]; such information may not be available in some situations. Some approaches employ conventional feedback controllers [7,8], whose design needs mathematical models of the navigation system. Navigation systems are usually complex, and their mathematical models may be difficult to obtain. The reliability of optical flow-based approaches depends upon the reliability of the measurement of optical flow from a sequence of images. Reliable extraction of optical flow may be difficult in some outdoor scenarios owing to variations in lighting, vehicular vibrations, wind, etc. The non-optical flow-based approaches need information about image features such as areas, centroids, edges, texture, etc. These image features usually depend upon the type of texture in the environment, the camera used to capture the image, and camera parameters such as focus, zoom, etc.

This paper describes control schemes for collision avoidance as well as maintenance-of-clearance tasks in a-priori unknown textured environments (it is possible to extend these approaches to texture-less environments as well). These control schemes employ a visual motion cue,
called the visual threat cue (VTC), as a sensory feedback signal to accomplish the desired tasks. The VTC is a collective measure that can be obtained directly from the raw data of gray-level images and is independent of the type of 3D surface texture. It is measured in [time^{-1}] units and needs no 3D reconstruction. The VTC is described in the following section.
2. Overview of the visual threat cue (VTC)

Mathematically, the VTC is defined (for R > R_0) as follows [24–27]:

VTC = -R_0 \frac{dR/dt}{R(R - R_0)},

where R is the range between the observer and a point on the 3D surface, dR/dt is the derivative of R with respect to time, and R_0 is the desired minimum clearance. Note that the units of the VTC are [time^{-1}]. There are imaginary 3D iso-VTC surfaces attached to an observer in motion that move with it [24–27]. A qualitative shape of the iso-VTC surfaces is presented in Fig. 2a. A positive value of the VTC corresponds to the
space in front of the observer, and a negative value corresponds to the region behind the observer. Points that lie on a relatively smaller surface correspond to a relatively larger value of the VTC, indicating a relatively higher threat of collision. The VTC information can be used to demarcate the region around an observer into safe, high-risk, and danger zones (Fig. 2b). Based on this knowledge one can take an appropriate control action to prevent collisions or maintain clearance [28]. A practical method to extract the VTC from a sequence of images of a 3D textured surface obtained by a fixating, fixed-focus monocular camera in motion has been presented in Refs. [24–27]. This approach is independent of the type of 3D surface texture and needs almost no camera calibration. For each image in such a 2D image sequence of a textured surface, a global variable (a measure of dissimilarity) called the image quality measure (IQM) is obtained directly from the raw data of the gray-level images. The VTC is obtained by calculating relative temporal changes in the IQM. The approach by which the VTC is extracted can be seen as a sensory fusion of focus, texture and motion at the raw-data level. The algorithm to extract this cue works better on natural images, including fractal-like images, where more details of the 3D scene become visible as the range shrinks, and it can be implemented in parallel hardware. The VTC can be used to directly maintain clearance in unstructured environments.

2.1. Image quality measure (IQM)

Mathematically, the IQM is defined as follows [24–26]:
IQM = \frac{1}{|D|} \sum_{x=x_i}^{x_f} \sum_{y=y_i}^{y_f} \sum_{p=-L_c}^{L_c} \sum_{q=-L_r}^{L_r} |I(x, y) - I(x+p, y+q)|,
Fig. 2. (a) Visual "eld of VTC. (b) Qualitative demarcation of space into threat zones by the VTC.
where I(x, y) is the intensity at pixel (x, y); x_i and x_f are the initial and final x-coordinates of the window, respectively; y_i and y_f are the initial and final y-coordinates of the window, respectively; L_c and L_r are positive integer constants; and D is a normalization factor defined as D = (2L_c + 1) × (2L_r + 1) × (x_f − x_i) × (y_f − y_i). The IQM is a measure of the dissimilarity of gray-level intensity in the image. The advantages of using this measure are: (1) it gives a global measure of the quality of the image, i.e., one number which characterizes the image dissimilarity; (2) it does not need any preprocessing, i.e., it works directly on the raw gray-level data without any spatio-temporal smoothing or segmentation; (3) it does not need a model of the texture and is suitable for many textures; and (4) it is simple and can be implemented in real time on parallel hardware.
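As a concrete illustration, the following is a minimal Python/NumPy sketch of the IQM computation above. The window bounds and the constants L_c, L_r are illustrative assumptions, and the window is assumed to lie at least L_c (respectively L_r) pixels away from the image border; this is not the authors' implementation.

```python
import numpy as np

def iqm(image, x_i, x_f, y_i, y_f, L_c=2, L_r=2):
    """Image quality measure: normalized sum of absolute gray-level
    differences between each window pixel and its neighborhood.
    Assumes the window stays L_c / L_r pixels inside the image border."""
    I = image.astype(np.float64)
    total = 0.0
    for x in range(x_i, x_f):
        for y in range(y_i, y_f):
            for p in range(-L_c, L_c + 1):
                for q in range(-L_r, L_r + 1):
                    total += abs(I[x, y] - I[x + p, y + q])
    D = (2 * L_c + 1) * (2 * L_r + 1) * (x_f - x_i) * (y_f - y_i)
    return total / D
```

The quadruple loop mirrors the definition directly; a vectorized version using shifted array differences would behave identically but run much faster.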
2.2. Extraction of the VTC from the IQM

Based on experimental results (indoor as well as outdoor) [26], we observed that the relative temporal changes in the IQM behave very similarly to the VTC, i.e.,

\frac{d(IQM)/dt}{IQM} \approx -R_0 \frac{dR/dt}{R(R - R_0)}.

This means that the VTC can be measured using the IQM; the VTC is independent of the magnitude of the IQM. A sample set of four images (out of 71) corresponding to a texture from Brodatz's album [30], as seen by a visually fixating, moving, fixed-focus camera, is shown in Fig. 3a. A comparison of the measured and theoretical values of the VTC is shown in Fig. 3b. Very similar results were reported in Ref. [25] for 12 different textures from the same album [30].

2.3. Qualitative view of {d(IQM)/dt}/{IQM}

As shown in the previous sections, the VTC is defined only in the region beyond a certain desired minimum clearance R_0 and is not defined when the distance between the camera and the surface is less than R_0. Though we restrict ourselves to regions beyond the desired minimum clearance, there might be situations when one is in the region for which the distance between the camera and the surface is less than R_0. Since the VTC is undefined in this region, it cannot be employed when the robot is in this
region. However, the IQM and the relative temporal variations in the IQM ({d(IQM)/dt}/IQM) can still be used, since the IQM is an image measure and is defined irrespective of the distance between the camera and the surface. Note that the VTC is very similar to the relative temporal variations of the IQM (see Fig. 4a–c).

3. Control objectives

Two vision-based control schemes have been implemented on a six-DOF flight simulator using the VTC as a sensory feedback signal. This section describes the desired control tasks, the constraints, and the motivation for the choice of the control schemes employed.

3.1. Control task I: collision avoidance

The objective of this control task is to stop a moving robot in front of an a-priori unknown textured obstacle when the distance between the camera and the obstacle is equal to a certain desired clearance R_0 (see Fig. 5a), employing visual information only.

3.2. Control task II: maintenance of clearance

The objective of this control task is to maintain a constant clearance between an a-priori unknown textured surface and a mobile robot using visual information only (see Fig. 5b).
Fig. 3. (a) Images of D110, d is the relative distance. (b) Comparison of theoretical and measured values of the VTC.
Fig. 5. (a) Control objective I. (b) Control objective II.

Fig. 4. (a) Qualitative behavior of IQM vs. relative range. (b) Qualitative behavior of the relative temporal variations of IQM vs. relative range. (c) Qualitative behavior of the VTC vs. relative range. Note: for R > R_0 the VTC is very similar to the relative temporal variations of the IQM.
3.3. The constraints

The above-mentioned control tasks have to be accomplished under the following constraints:

- The input to the controllers is visual information only.
- No information about the vehicle speed or dynamics is available.
- The controllers have no a-priori information about the distance between the camera and the obstacles or the type of the texture on the obstacles.
- Obstacles must have texture on them.

3.4. Choice of control schemes

In the absence of a model of the vehicle dynamics, conventional control schemes such as PID are difficult to implement. On the other hand, fuzzy logic control, which
consists of a collection of rules, seems more appropriate for the control tasks under the above-mentioned constraints. Research in the area of fuzzy control was initiated by Mamdani's pioneering work [31], which had been motivated by Zadeh's seminal papers on fuzzy algorithms [32] and linguistic analysis [33]. In the past few years several researchers have addressed the use of fuzzy control for various ill-defined processes whose dynamics are difficult to model (see for example Refs. [34–36]). Fuzzy control is closer in spirit to human thinking and can implement linguistically expressed heuristic control policies directly without any knowledge about the dynamics of the complex process. Several collision avoidance schemes based on fuzzy approaches have been suggested for autonomous navigation tasks [37–41]. These approaches require many parameters, such as the range between the camera and the surface, the slant of the surface, the heading angle of the robot, the width of the road, the shape of the road, etc. Usually these control schemes are simulated on a computer without real implementations. The VTC mentioned in Section 2 provides an indication of relative variations in ranges as well as clearances.
In other words, if the VTC values increase, one can say that the vehicle is moving forward, and vice versa. In order to judge whether the vehicle is moving away from or closer to the desired clearance, it may not be necessary to know the mathematical model of the vehicle. Autonomous navigation tasks such as collision avoidance and maintenance of clearance can be accomplished by a set of heuristic rules based on the VTC, without vehicle models. This is the main motivation for choosing fuzzy control schemes among several possible conventional control schemes.
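Concretely, the single feedback signal that drives all the rules below can be approximated from two consecutive frames via the IQM relation of Section 2.2. The following is a minimal sketch reusing the illustrative iqm() helper sketched in Section 2.1; the frame interval dt and the window bounds are assumptions, not values from the paper.

```python
def vtc_estimate(frame_prev, frame_curr, dt, window, L_c=2, L_r=2):
    """Approximate the VTC as the relative temporal change of the IQM,
    {d(IQM)/dt}/IQM, using a backward difference over one frame pair."""
    x_i, x_f, y_i, y_f = window
    iqm_prev = iqm(frame_prev, x_i, x_f, y_i, y_f, L_c, L_r)
    iqm_curr = iqm(frame_curr, x_i, x_f, y_i, y_f, L_c, L_r)
    return ((iqm_curr - iqm_prev) / dt) / iqm_curr
```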
4. Fuzzy approach to task I: collision avoidance

This section describes the fuzzy logic control scheme employed to accomplish task I, namely to stop a moving robot in front of a textured surface when the distance between the surface and the robot equals a desired range R_0. No a-priori information about the obstacle, such as the relative distance or the type of texture, is available to the robot. The design of this controller assumes no mathematical model of the robot and is based on a set of simple IF-THEN rules. A block diagram of the proposed control scheme is shown in Fig. 6. The camera is initially focused to the distance R_0, which is the desired stopping gap between the robot and the textured surface. For ranges R greater than R_0, as the range increases the VTC value decreases, and vice versa. Based on the VTC values, we divide the space in front of the mobile robot into three different regions, as shown in Fig. 7a and b. Region I can be seen as a safe region, and regions II and III can be seen as danger zones. If the VTC value is greater than a certain positive threshold, say VTC_Th, then the textured surface is in the danger zone of the robot. When the measured VTC is smaller than the threshold VTC_Th, the
textured surface is in the safe zone (region I in Fig. 7a). If the measured value of the VTC is greater than the threshold VTC_Th, the textured surface is in the danger zone of the robot. Finally, when the VTC values change from positive to negative, this indicates that the textured surface has entered the desired clearance zone, and the robot has to stop moving forward to avoid a collision with the textured surface. Based on this heuristic information about the behavior of the VTC as a function of the range between the robot and the textured surface, we formulate the following rules for achieving the desired control task. It should be noted that we aim to demonstrate the use of the VTC as sensory feedback information for collision avoidance; this set of rules is not the only possible set of rules to accomplish the desired task, and alternative sets of rules could be formulated for better control.

Rule I: This rule corresponds to the case when the robot is in the safe zone (region I in Fig. 7a). In this zone no control action should be taken, i.e., no change in speed is necessary. The sensing and action corresponding to this region can be expressed in the IF-THEN format as follows:

If the measured VTC value is less than the threshold VTC_Th then take no action.

Rule II: This rule corresponds to region II in Fig. 7a. If the textured surface is in this region, the value of the measured VTC is greater than the threshold, and the robot has to be prepared to stop any time the measured VTC value changes from positive to negative. The condition can be expressed in an IF-THEN format as follows:

If the measured VTC is greater than the threshold VTC_Th then be prepared to stop any time the measured VTC value becomes negative.
Fig. 6. Block diagram of control scheme I.
Fig. 7. (a) Qualitative plot of relative temporal variations of IQM. (b) Fuzzy demarcation of space around the mobile robot for control task I.
Rule III: This rule corresponds to region III in Fig. 7a. In this region the robot is required to stop if it is moving towards the surface. Note that in this region the VTC is negative. This condition can be expressed in the IF-THEN format as follows:
Fig. 8. (a) Control task II. Note: Region A corresponds to the left of the camera and region B corresponds to the right of the camera. (b) Regions of interest in images for control task II.
If the measured VTC is negative then stop.

Rule IV: When the robot is either stationary or moving away from the surface, none of the above conditions is satisfied and hence no control action is taken. This condition can be expressed in IF-THEN format as follows:

IF none of the above situations occur THEN take no action.
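For concreteness, the four rules above can be written as a small crisp decision routine. The following is a sketch only; the threshold VTC_Th and the returned action labels are illustrative assumptions rather than the authors' implementation.

```python
def control_scheme_1(vtc, vtc_th):
    """Crisp action from Rules I-IV of control scheme I.
    vtc: measured VTC for the current frame; vtc_th: positive threshold."""
    if vtc < 0.0:
        return "stop"             # Rule III: surface inside the clearance zone
    if vtc > vtc_th:
        return "prepare_to_stop"  # Rule II: danger zone; stop on sign change
    return "no_action"            # Rules I and IV: safe zone or moving away
```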
5. Fuzzy approach to task II: maintenance of clearance
Fig. 9. Block diagram of the control scheme II.
This section describes the fuzzy logic control scheme employed to accomplish task II, i.e., maintenance of clearance (refer to Fig. 8a and b). A block diagram of the proposed control scheme is shown in Fig. 9. In Fig. 8a the left region (region A) is closer to the camera than the right region (region B). The camera is
initially focused at a desired minimum clearance R_0. When the distance between the camera and the surface is greater than the desired minimum clearance, points located at a greater distance have relatively smaller values of the VTC than those located at a relatively
smaller distance. In other words, the difference between the VTC values of the left window (denoted as region A) and the VTC values of the right window (denoted as region B) can be used to generate appropriate control action to maintain a safe clearance between the textured surface and the mobile robot. The difference in the measured VTC values of the left and the right windows is the only information fed to the controller, and the controller generates appropriate control action to the mobile robot to accomplish the desired task. Based on the heuristic information about the behavior of the VTC as a function of the range between the robot and the textured surface, we formulate the following rules for achieving the desired control task. It should be noted that we aim to demonstrate the use of the VTC as sensory feedback information; this set of rules is not the only possible set of rules to accomplish the desired task, and alternative sets of rules could be formulated for better control. For the sake of simplicity, let the difference between VTCA and VTCB be denoted as Err_AB, i.e., Err_AB = VTCA - VTCB.

Rule I: This rule handles the case when Err_AB is almost equal to zero. In such a situation the motion to the right (see Fig. 8a) is almost zero. This can be expressed in an IF-THEN format as follows:

IF Err_AB is approximately zero THEN motion to the right is approximately zero.

Rule II: This rule handles the case when Err_AB is small. In such a situation the motion to the right is small (see Fig. 8a):

IF Err_AB is positive small THEN motion to the right is small.

Rule III: This rule handles the case when Err_AB is medium. In such a situation the motion to the right is medium (see Fig. 8a):

IF Err_AB is positive medium THEN motion to the right is medium.

Rule IV: This rule handles the case when Err_AB is big. In such a situation the motion to the right is big (see Fig. 8a):

IF Err_AB is positive big THEN motion to the right is big.

Rule V: According to this rule, when region A is within the desired clearance and region B is beyond the desired clearance, the desired control is to move to the right and move backwards (see Fig. 8a):

IF VTCA < 0 and VTCB > 0 THEN motion to the right is big and reverse the current direction of motion.

Rule VI: According to this rule, both region A and region B are within the desired clearance region. The desired control action is to move the robot backwards:

IF VTCA < 0 and VTCB < 0 THEN motion to the right is big and reverse the current direction of motion.

Rule VII: When the textured surface is perpendicular to the direction of motion of the mobile robot, Err_AB is going to be zero irrespective of the distance between the robot and the textured surface. In such a case Rules I-VI will fail and the robot might collide with the textured surface instead of maintaining a safe clearance. This situation may be overcome by the following IF-THEN condition:

IF Err_AB is almost zero and VTCA < 0 and VTCB < 0 THEN move sidewards (either right or left).

Rule VIII: When the robot is either stationary or moving away from the surface, none of the above-mentioned conditions is satisfied. This situation can be expressed in an IF-THEN format as follows:

IF none of the above situations occur THEN make no change in the velocity.

5.1. Membership functions employed

Simple linear membership functions are employed in the fuzzy rule base (as shown in Fig. 10).

Fig. 10. Qualitative membership functions for control scheme II.

5.2. Defuzzification

Defuzzification of the inferred fuzzy control action is necessary in order to produce a crisp control action. Since monotonic membership functions are used, we use Tsukamoto's defuzzification method [37], which is stated as follows:

Z^* = \frac{\sum_{i=1}^{n} a_i y_i}{\sum_{i=1}^{n} a_i},

where Z^* is the defuzzified crisp control command, a_i is the weight corresponding to rule i, y_i is the amount of control action recommended by rule i, and n is the number of rules. The ratio of the shaded area to the area of the triangle is used as the firing strength (see Fig. 11) and is employed as the weight corresponding to that particular rule:

a_i = b_i (2 - b_i);

when b_i equals 1, the shaded area equals the area of the triangle, hence a_i is 1.

Fig. 11. Evaluation of weight of a particular rule.
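To make the defuzzification step concrete, the following is a minimal Python sketch of the weighted averaging described above; the rule outputs y_i and membership grades b_i in the usage line are illustrative values, not taken from the experiments.

```python
def firing_strength(b):
    """Weight a_i = b_i * (2 - b_i): the ratio of the shaded area in
    Fig. 11 to the area of the full membership triangle."""
    return b * (2.0 - b)

def tsukamoto_defuzzify(rule_outputs, rule_grades):
    """Crisp command Z* = sum(a_i * y_i) / sum(a_i)."""
    weights = [firing_strength(b) for b in rule_grades]
    total = sum(weights)
    if total == 0.0:
        return 0.0  # no rule fired: take no action
    return sum(a * y for a, y in zip(weights, rule_outputs)) / total

# Illustrative: three fired rules recommending lateral motions
print(tsukamoto_defuzzify([0.0, 5.0, 10.0], [0.2, 0.7, 0.4]))
```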
6. Implementation details

The control algorithms presented in the previous sections are implemented on a 6-DOF vision-based flight simulator controlled by a 486-based Personal Computer. This section presents implementation details of the control schemes.

6.1. Experimental setup

The system used in the experiments includes:

1. Six-DOF miniature flight simulator
2. CCD video camera
3. Imaging Technology frame grabber
4. 486-based personal computer
5. Photocopies of texture plates D5 and D110 from Brodatz's album [30], pasted on a flat board and employed as textured surfaces.
6.2. Six-DOF miniature flight simulator

An IBM gantry robot has been modified such that all six motor controllers can accept velocity inputs. A monocular camera is attached to the end-effector of the robot. This camera is capable of undergoing six-DOF motion within the workspace of the robot (a block diagram is shown in Fig. 12). Various types of miniature environments (structured as well as unstructured) can be simulated in the workspace by physically placing objects such as toy mountains, trees, etc. A sequence of images is obtained by the camera, and the relevant image processing is done by the image processing hardware/software housed in the 486-based Personal Computer. A single 486-based, 50 MHz Personal Computer is employed to control the robot as well as perform the relevant image processing. Control
commands to the robot are generated on the basis of relevant visual information extracted from the image sequence.

Fig. 12. Block diagram of the simulator and setup.

6.3. Procedure

A CCD video camera is used to obtain images of the 3D textured environments. These images are digitized by an image-processing PC board (ITEX PC-VISION PLUS). A block diagram of the experimental setup is shown in Fig. 12. A textured pattern (D5 or D110 from Ref. [30]; see Fig. 13b) pasted on a flat surface is presented as the obstacle along the path of the robot (see Fig. 13a). For both control schemes the camera is initially focused to the desired minimum clearance (R_0 = 200 mm). Qualitative measures for fuzzy sensing and action (small, medium, big, etc.) are employed, rather than exact speeds.

Fig. 13. (a) Camera mounted on the robot. (b) Textures used in the control experiments.

6.4. Control scheme I

A window of size 51×51 pixels is chosen in the center of the image to evaluate the visual feedback signal VTC. According to the rules presented in the previous section, the crisp control action (either move or stop) is generated (see Fig. 6). Two different speeds were employed in this control scheme (speed2 > speed1).

6.5. Control scheme II

Two windows (left and right), each 50×50 pixels, are opened in the image. In each window the visual parameter VTC is evaluated, and based on the difference between the left and right values an appropriate control signal is generated. This control scheme was tested for four different orientations of the textured surfaces used.

7. Results and analysis

7.1. Control scheme I
Two different speeds were used to test the braking capability of the control algorithm. We observed that the greater the speed of the robot, the greater the error between the desired and actual values of the clearance between the robot and the surface. The results are summarized in Table 1. From Table 1 it can be seen that there is an error between the desired and the actual stopping distance. This error is due to the inertia of the robot and depends mainly upon the speed at which the robot is traversing; in other words, at higher speeds the error is higher and at lower speeds it is lower. This error can be reduced by applying the brakes even before the robot reaches the desired clearance point. The point where braking should start may be determined by employing additional visual motion cues (see Ref. [29]).

7.2. Control scheme II

The lateral and longitudinal components of the heading vector were recorded. The resultant was plotted manually (see Figs. 14–17). Two sets of results using two texture patterns (shown in Fig. 13b) are presented. Each
Table 1
Summary of vision-based collision avoidance results

No. | Texture | Speed  | Desired | Actual | Error
----+---------+--------+---------+--------+-------
1   | D5      | Speed1 | 200 mm  | 180 mm | 20 mm
2   | D5      | Speed2 | 200 mm  | 165 mm | 35 mm
3   | D110    | Speed1 | 200 mm  | 180 mm | 20 mm
4   | D110    | Speed2 | 200 mm  | 165 mm | 35 mm
Fig. 14. Results of control scheme II for D110: Case 1.
Fig. 15. Results of control scheme II for D110: Case 2.
Fig. 16. Results of control scheme II for D5: Case 1.
Fig. 17. Results of control scheme II for D5: Case 2.
texture pattern was tested under two different orientations. All four experiments in this control scheme employed the same rule base. The error between the desired path and the actual path is highly dependent upon the choice of the fuzzy membership functions, the rule base and the defuzzification scheme used. Adding more rules to the existing ones may reduce the error between the desired and actual paths. Employing temporal smoothing of the measured VTC values may also reduce the error.
8. Conclusions and future work

This paper presented implementable active-vision-based real-time closed-loop control schemes for collision avoidance and maintenance-of-clearance tasks. The control schemes are based on a new measure called the VTC that can be extracted directly from the raw gray-level data of monocular images. In other words, the VTC is the relevant visual information used in the control tasks. The VTC needs no optical flow information, segmentation or feature tracking. The control schemes are based on a set of If-Then fuzzy rules and need no information about the robot dynamics, speed, heading direction, etc. From the experimental results, it can be seen that there is some error between the desired trajectory and the actual trajectory. Possible sources of this error include slow computation of the VTC, the absence of temporal smoothing of the measured VTC values, and the choice of rules in the rule base. It is possible to obtain better performance by using temporal smoothing of the measured values of the IQM as well as using high-speed computers (possibly parallel hardware implementations). Currently we are working on extensions of the control schemes mentioned in this paper to unstructured outdoor environments using a golf cart (known as LOOMY) designed and built at Florida Atlantic University. Preliminary results in real outdoor environments are highly encouraging.
References

[1] N.P. Papanikolopoulos, P.K. Khosla, Visual tracking of a moving target by a camera mounted on a robot: a combination of control and vision, IEEE Trans. Robot. Automat. 9 (1) (1993) 14-18.
[2] N.P. Papanikolopoulos, P.K. Khosla, Adaptive robotic visual tracking: theory and experiments, IEEE Trans. Automat. Control 38 (3) (1993) 429-444.
[3] J.T. Feddema, C.S.G. Lee, Adaptive image feature prediction and control for visual tracking with a hand-eye coordinated camera, IEEE Trans. Systems, Man, Cybernet. 20 (5) (1990) 1172-1183.
[4] I.D. Reid, K.J. Bradshaw, P.F. McLauchlan, P.M. Sharkey, D.W. Murray, From saccades to smooth pursuit: real-time gaze control using motion feedback, Proc. IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, Yokohama, Japan, 1993, pp. 1013-1020.
[5] M-H. Han, S-Y. Rhee, Navigation control for a mobile robot, J. Robot. Systems 11 (3) (1994) 169-179.
[6] M.A. Turk, D.G. Morgenthaler, K.D. Gremban, M. Marra, VITS - A vision system for autonomous land vehicle navigation, IEEE Trans. Pattern Anal. Mach. Intell. 10 (3) (1988).
[7] D. Feng, B.H. Krogh, Satisficing feedback strategies for local navigation of autonomous mobile robots, IEEE Trans. Systems, Man Cybernet. 20 (6) (1989).
[8] E. Krotkov, R. Hoffman, Terrain mapping for a walking planetary rover, IEEE Trans. Robot. Automat. 10 (6) (1994) 728-739.
[9] T.J. Olson, D.J. Coombs, Real-time vergence control for binocular robots, Int. J. Comput. Vision 7 (1) (1991) 67-89.
[10] D. Coombs, K. Roberts, Centering behavior using peripheral vision, Proc. Computer Vision and Pattern Recognition, 1993, pp. 440-445.
[11] J. Santos-Victor, G. Sandini, F. Curotto, S. Garibaldi, Divergent stereo for robot navigation: learning from bees, Proc. Computer Vision and Pattern Recognition, 1993, pp. 434-439.
[12] M. Romano, N. Ancona, A real-time reflex for autonomous navigation, Proc. Intelligent Vehicles Symposium, 1993, pp. 50-55.
[13] K. Joarder, D. Raviv, Autonomous obstacle avoidance using visual fixation and looming, Proc. SPIE Vol. 1825, Intelligent Robots and Computer Vision XI, 1992, pp. 733-744.
[14] D. Raviv, A quantitative approach to looming, Technical Report NISTIR 4808, National Institute for Standards and Technology, Gaithersburg, MD 20899, 1992.
[15] K. Joarder, D. Raviv, A new method to calculate looming for autonomous obstacle avoidance, Proc. Computer Vision and Pattern Recognition, 1994, pp. 777-779.
[16] A. Broggi, A massively parallel approach to real-time vision-based road markings detection, Proc. Intelligent Vehicles Symposium, Detroit, 1995, pp. 84-89.
[17] P.G. Luebbers, An artificial neural network architecture for interpolation, function approximation, time series modeling and control applications, Ph.D. Dissertation, Electrical Eng. Dept., Florida Atlantic University, Boca Raton, FL, 1993.
[18] H.H. Yakali, Autonomous landing and road following using 2D visual cues, Ph.D. Dissertation, Electrical Eng. Dept., Florida Atlantic University, Boca Raton, FL, 1994.
[19] J.J. Gibson, The Perception of the Visual World, Houghton Mifflin, Boston, 1950.
[20] J.J. Gibson, The Senses Considered as Perceptual Systems, Houghton Mifflin, Boston, 1954.
[21] J.E. Cutting, Perception with an Eye for Motion, The MIT Press, Cambridge, MA, 1986.
[22] I. Rock, Perception, Scientific American Library, New York, 1984.
[23] D.J. Ingle, M.A. Goodale, R.J.W. Mansfield (Eds.), Analysis of Visual Behavior, The MIT Press, Cambridge, MA, 1982.
[24] S.R. Kundur, D. Raviv, An image-based visual threat cue for autonomous navigation, Proc. IEEE International Symposium on Computer Vision, Coral Gables, FL, 1995, pp. 527-532.
[25] S.R. Kundur, D. Raviv, Novel active vision-based visual threat cue for autonomous navigation tasks, Comput. Vision Image Understanding 73 (2) (1995) 169-182.
[26] S.R. Kundur, D. Raviv, Novel active-vision-based visual threat cue for autonomous navigation tasks, Proc. IEEE Int. Conf. on Computer Vision and Pattern Recognition, San Francisco, 1996, pp. 606-611.
[27] S. Kundur, D. Raviv, E. Kent, An image-based visual-motion-cue for autonomous navigation, Proc. IEEE Int. Conference on Computer Vision and Pattern Recognition, San Juan, Puerto Rico, 17-19 June 1997.
[28] S. Kundur, D. Raviv, Vision-based fuzzy controllers for autonomous navigation tasks, Proc. IEEE Int. Symp. on Intelligent Vehicles '95, Detroit, 1995.
[29] S.R. Kundur, D. Raviv, A vision-based pragmatic strategy for autonomous navigation, Pattern Recognition 31 (31) (1998) 1221-1239.
[30] P. Brodatz, Textures: A Photographic Album for Artists and Designers, Dover Publications, New York, 1966.
[31] E.H. Mamdani, Applications of fuzzy algorithms for control of simple dynamic plant, Proc. IEE 121 (2) (1974) 1585-1588.
[32] L.A. Zadeh, Fuzzy algorithms, Inform. Control 12 (1968) 94-102.
[33] L.A. Zadeh, Outline of a new approach to the analysis of complex systems and decision processes, IEEE Trans. Systems, Man, Cybernet. SMC-3 (1973) 28-44.
[34] C.C. Lee, Fuzzy logic in control systems: fuzzy logic controller - Part I, IEEE Trans. Systems, Man, Cybernet. 20 (2) (1990) 404-418.
[35] C.P. Pappis, E.H. Mamdani, A fuzzy logic controller for a traffic junction, IEEE Trans. Systems, Man, Cybernet. SMC-7 (10) (1977) 707-717.
[36] M. Sugeno, M. Nishida, Fuzzy control of model car, Fuzzy Sets Systems 16 (1985) 103-113.
[37] H.J. Zimmermann, Fuzzy Set Theory and its Applications, Kluwer Academic Publishers, Boston, 1991.
[38] M. Sugeno, M. Nishida, Fuzzy control of model car, Fuzzy Sets Systems 16 (1985) 103-113.
[39] M. Maeda, Y. Maeda, S. Murakami, Fuzzy drive control of an autonomous mobile robot, Fuzzy Sets and Systems 39 (1991) 195-204.
[40] Y. Nagai, N. Enomoto, Fuzzy control of a mobile robot for obstacle avoidance, J. Informat. Sci. 45 (2) (1988) 231-248.
[41] P-S. Lee, L-L. Wang, Collision avoidance by fuzzy logic control for automated guided vehicle navigation, J. Robot. Systems 11 (8) (1994) 743-760.
About the Author*SRIDHAR R. KUNDUR received the B. Tech. degree in Electrical and Electronics Engineering from Jawahar Lal Nehru Technological University, Hyderabad, India, and the M.S. degree in Electrical Engineering from Southern Illinois University, Edwardsville, Illinois, in 1990 and 1992 respectively, and the Ph.D. degree in Electrical Engineering from Florida Atlantic University, Boca Raton, FL, in 1996. He is currently an applications engineer at Sharp Digital Information Products, Inc., Huntington Beach, CA. His research interests are in image processing, computer vision, and vision-based autonomous navigation. About the Author*DANIEL RAVIV received the B.Sc. and M.Sc. degrees in electrical engineering from the Technion in Haifa, Israel, in 1980 and 1982, respectively, and the Ph.D. degree in Electrical Engineering and Applied Physics from Case Western Reserve University, Cleveland, Ohio in 1987. He is currently an associate professor of Electrical Engineering at Florida Atlantic University, Boca Raton Florida and a visiting researcher at the National Institute of Standards and Technology (NIST), Gaithersburg, Maryland. His research interests are in vision-based autonomous navigation, computer vision, and inventive thinking.
Pattern Recognition 33 (2000) 309–315
Similarity normalization for speaker verification by fuzzy fusion

Tuan Pham*, Michael Wagner
Faculty of Information Sciences and Engineering, University of Canberra, ACT 2601, Australia
Received 11 June 1998; accepted 25 January 1999
Abstract

Similarity or likelihood normalization techniques are important for speaker verification systems as they help to alleviate variations in the speech signals. In conventional normalization, the a priori probabilities of the cohort speakers are assumed to be equal. From this standpoint, we apply the theory of fuzzy measures and fuzzy integrals to combine the likelihood values of the cohort speakers, relaxing the assumption of equal a priori probabilities. This approach replaces the conventional normalization term by the fuzzy integral, which acts as a non-linear fusion of the similarity measures of an utterance assigned to the cohort speakers. We illustrate the performance of the proposed approach by testing the speaker verification system with both the conventional and the fuzzy algorithms using the commercial speech corpus TI46. The results in terms of equal error rates show that the speaker verification system using the fuzzy integral is more flexible and more favorable than the conventional normalization method. © 1999 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Speaker verification; Similarity normalization; Fusion; Fuzzy measure; Fuzzy integral
1. Introduction

Speaker verification is one of the challenging areas of speech research and has many applications including telecommunications, security systems, banking transactions, database management, forensic tasks, command and control, and others. Technically, it is one of the two tasks in speaker recognition; that is, a speaker recognition system can be divided into two categories: speaker identification and speaker verification. A speaker identification recognizer tries to assign an unknown speaker to one of the reference speakers based on the closest measure of similarity, whereas a speaker verification recognizer aims to either accept or reject an unknown speaker by verifying an identity claim. Thus, the main point distinguishing these two tasks is the number of decision alternatives. For speaker identification, the decision alternatives are equal to the number
* Corresponding author. Tel.: +612-6201-2394; fax: +612-6201-5231.
E-mail address: [email protected] (T. Pham)
of the speakers. For speaker verification, there are only two alternatives, i.e., either accept or reject the claimed speaker. Different tasks of recognition can serve different purposes. Verification systems are more appropriate for most commercial applications, whereas identification systems are useful for the study of parametric and speech-material modeling. For more details on recent developments in speaker recognition, the reader is referred to Refs. [1-3]. In speaker verification systems, normalization techniques are important as they help to alleviate variations in the speech signals, which are due to noise and different recording and transmission conditions [1]. There are two types of normalization techniques for speaker recognition: parameter and similarity. Some typical works of the parameter type were proposed by Atal [4] and Furui [5], and of the similarity type by Higgin et al. [6] and Matsui and Furui [7]. It has also been reported that most speaker verification systems are based on similarity-domain normalization [8]. In this paper we therefore focus our attention on the verification mode with respect to similarity normalization.
Generally, in most similarity normalization techniques, the likelihood values of the utterance coming from the cohort speakers, whose models are closest to the claimant model, are assumed to be equally likely. In reality, however, this assumption is often not true, as the similarity measures between each cohort speaker and the client speaker may differ. Motivated by this drawback, we introduce a new normalized log-likelihood method using the concept of fuzzy fusion. We relax the assumption of equal likelihood by imposing fuzzy measures of the similarities between the cohort speaker models and the client model. The scoring of the cohort models can then be obtained by the fuzzy integral, which acts as a fusion operator with respect to the fuzzy measures. The rest of this paper is organized as follows. In Section 2, we present the basic formulations of the normalization techniques in the similarity domain. In Section 3, the concepts of fuzzy measure and fuzzy integral are introduced. The fuzzy fusion for scoring the normalized log likelihood is implemented in Section 4. We compare the performance of the conventional and the proposed techniques using a commercial speech database in Section 5. Finally, Section 6 concludes this new application to speaker recognition and suggests possible developments.
2. Similarity-domain normalization

Given an input set of speech feature vectors X = {x_1, x_2, \ldots, x_N}, the verification system has to decide whether X was spoken by the client (for the sake of simplicity, we write the feature vectors simply as x). Based on the similarity domain, this can be seen as a statistical test between H_0: S and H_1: S', where H_0 is the null hypothesis that the claimant is the client S, while H_1 is the alternative hypothesis that the claimant is an impostor S'. The decision according to the Bayesian rule for minimum risk is given by

L(X) = \frac{p(X|S)}{p(X|S')} \begin{cases} > \theta, & X \in H_0, \\ \le \theta, & X \in H_1, \end{cases}   (1)

where \theta is a prescribed threshold. Taking the logarithm, the likelihood ratio of Eq. (1) becomes

\log L(X) = \log p(X|S) - \log p(X|S') \begin{cases} > \log \theta, & X \in H_0, \\ \le \log \theta, & X \in H_1, \end{cases}   (2)

where \log L(X) is also called the normalized log-likelihood score. The normalized log-likelihood value of X given the client model can be determined as

\log p(X|S) = \frac{1}{N} \sum_{n=1}^{N} \log p(x_n|S).   (3)

Two common methods, the geometric mean and the maximum [9], can be used to calculate the normalized log-likelihood score given the background (non-client) models. For a set of background speaker models of size B, S' = {S_1, S_2, \ldots, S_B}, the geometric mean method is defined as

\log p(X|S') = \frac{1}{B} \sum_{b=1}^{B} \log p(X|S_b).   (4)

The maximum method is defined as

\log p(X|S') = \max_{S_b \in S'} \log p(X|S_b),   (5)

where the term \log p(X|S_b) in both Eqs. (4) and (5) can be calculated as in Eq. (3); except for the scale 1/N, it is the log likelihood of an utterance X coming from one of the cohort speakers under the assumption that the a priori probabilities are equal. As the main purpose of this paper is to improve the scoring of the similarity normalization, we simply use the vector quantization (VQ) method to generate the acoustic models. Thus, the log likelihood in terms of the VQ distortion measure between the set of vectors X of the claimed speaker and the codebook of a speaker S can be expressed as

\log p(x_n|S) = -\min_{k} [d(x_n, b_k(S))], \quad k = 1, 2, \ldots, K,   (6)

where x_n \in X, b_k(S) is a codeword of speaker S, K is the codebook size, and d(x_n, b_k(S)) is the Euclidean distance between x_n and b_k(S).
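As a concrete illustration of Eqs. (2)-(6), the following is a minimal NumPy sketch of the VQ-based normalized log-likelihood score. The codebooks are assumed to be precomputed arrays, and all names are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def vq_log_likelihood(X, codebook):
    """Eqs. (3) and (6): mean negative VQ distortion of utterance X
    (an N x d feature matrix) against a speaker codebook (K x d)."""
    dists = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
    return -dists.min(axis=1).mean()

def normalized_score(X, client_cb, background_cbs, method="max"):
    """Eq. (2), normalizing with Eq. (5) (maximum) or Eq. (4) (mean)."""
    bg = np.array([vq_log_likelihood(X, cb) for cb in background_cbs])
    norm = bg.max() if method == "max" else bg.mean()
    return vq_log_likelihood(X, client_cb) - norm
```

A claimant would then be accepted when normalized_score(...) exceeds the prescribed threshold of Eq. (2).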
2. Monotonicity: g(A) ≤ g(B) if A ⊆ B and A, B ∈ B.
3. Continuity: lim_{i→∞} g(A_i) = g(lim_{i→∞} A_i) if A_i ∈ B and {A_i} is monotone (an increasing sequence of measurable sets).

A g_λ-fuzzy measure, also proposed by Sugeno [11], satisfies a further condition known as the λ-rule (λ > -1):

$$g(A \cup B) = g(A) + g(B) + \lambda\, g(A)\, g(B), \qquad A, B \subseteq Y,\ A \cap B = \emptyset.$$

It is noted that when λ = 0, the g_λ-fuzzy measure becomes a probability measure [14]. In general, the value of the constant λ can be determined from the properties of the g_λ-fuzzy measure as follows. Let Y = {y_1, y_2, ..., y_m}. If the fuzzy density of the g_λ-fuzzy measure is defined as a function g_i : y_i ∈ Y → [0, 1] such that g_i = g_λ({y_i}), i = 1, 2, ..., m, then the g_λ-fuzzy measure of a finite set can be obtained as [15]

$$g_\lambda(Y) = \sum_{i=1}^{m} g_i + \lambda \sum_{i_1=1}^{m-1}\ \sum_{i_2=i_1+1}^{m} g_{i_1} g_{i_2} + \cdots + \lambda^{m-1} g_1 g_2 \cdots g_m. \tag{7}$$

Provided that λ ≠ 0, Eq. (7) can be rewritten as
$$g_\lambda(Y) = \frac{1}{\lambda}\left[\prod_{i=1}^{m} (1 + \lambda g_i) - 1\right]. \tag{8}$$

With the boundary condition g(Y) = 1, the constant λ can be determined by solving the equation

$$\lambda + 1 = \prod_{i=1}^{m} (1 + \lambda g_i). \tag{9}$$

It has been proved [16] that for a fixed set of g_i, 0 < g_i < 1, Eq. (9) has a unique root λ > -1, λ ≠ 0. It also follows from Eq. (9) that if the values of g_i are known, then λ can be calculated.
3.2. Fuzzy integral

Let (Y, B, g) be a fuzzy measure space and f : Y → [0, 1] be a B-measurable function. A fuzzy integral over A ⊆ Y of the function f with respect to a fuzzy measure g is defined by

$$\int_A f(y) \circ g(\cdot) = \sup_{\alpha \in [0,1]} \left[\min\big(\alpha,\ g(f_\alpha)\big)\right], \tag{10}$$

where f_α is the α level set of f, f_α = {y : f(y) ≥ α}. The fuzzy integral in Eq. (10) is called the Sugeno integral. When Y = {y_1, y_2, ..., y_m} is a finite set and 0 ≤ f(y_1) ≤ f(y_2) ≤ ... ≤ f(y_m) ≤ 1 (if not, the elements of Y are rearranged to make this relation hold), the Sugeno integral can be computed by

$$\int_A f(y) \circ g(\cdot) = \max_{i=1}^{m} \left[\min\big(f(y_i),\ g(A_i)\big)\right], \tag{11}$$

where A_i = {y_i, y_{i+1}, ..., y_m}, and g(A_i) can be calculated recursively in terms of the g_λ-fuzzy measure as [14]

$$g(A_m) = g_m, \qquad g(A_i) = g_i + g(A_{i+1}) + \lambda\, g_i\, g(A_{i+1}), \qquad 1 \le i < m. \tag{12}$$

It can be seen that the above definition is not a proper extension of the usual Lebesgue integral, which is not recovered when the measure is additive. To overcome this drawback, the so-called Choquet integral was proposed by Murofushi and Sugeno [17]. The Choquet integral of f with respect to a fuzzy measure g is defined as

$$\int_A f(y)\, dg(\cdot) = \sum_{i=1}^{m} \left[f(y_i) - f(y_{i-1})\right] g(A_i), \tag{13}$$

in which f(y_0) = 0.

To help further understand the concepts of fuzzy measures and fuzzy integrals, let us consider a simple example [18,19]. Let Y = {y_1, y_2, y_3}, and suppose the fuzzy densities are g_1 = 0.34, g_2 = 0.32, g_3 = 0.33. Using Eq. (9), we obtain the quadratic equation 0.0359λ² + 0.3266λ - 0.01 = 0. The parameter λ is obtained by taking the unique root λ > -1, which gives λ = 0.0305. Using Eq. (8), we can calculate the fuzzy measures of all the subsets of Y; their values are shown in Table 1. Suppose that f(y_1) = 0.6, f(y_2) = 0.7, and f(y_3) = 0.1. We then need to rearrange the elements of Y, which yields g_1 = 0.33, g_2 = 0.34, and g_3 = 0.32 in order to satisfy f(y_1) = 0.1 < f(y_2) = 0.6 < f(y_3) = 0.7. Using Eq. (11), the Sugeno integral is computed as max[min(0.1, 1), min(0.6, 0.66), min(0.7, 0.32)] = 0.6. Using Eq. (13), the Choquet integral is obtained as (0.1 - 0)(1.0) + (0.6 - 0.1)(0.66) + (0.7 - 0.6)(0.32) = 0.462.
Table 1
g_λ measures on the power set of Y

Subset A            g_λ(A)

∅                   0
{y_1}               0.34
{y_2}               0.32
{y_3}               0.33
{y_1, y_2}          0.6633
{y_2, y_3}          0.6532
{y_1, y_3}          0.6734
{y_1, y_2, y_3}     1
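As a numerical check on the worked example and Table 1, the following sketch solves Eq. (9) for λ, builds the g_λ measures of Eq. (8), and evaluates the Sugeno and Choquet integrals of Eqs. (11) and (13). All names are illustrative assumptions of ours; note that with the unrounded measure g(A_2) = 0.6633 the Choquet value comes out near 0.464 rather than the 0.462 obtained above with 0.66.

import numpy as np

def solve_lambda(densities):
    """Unique root lam > -1, lam != 0 of Eq. (9)."""
    poly = np.array([1.0])
    for g in densities:
        poly = np.polymul(poly, np.array([g, 1.0]))   # multiply by (g*lam + 1)
    poly[-2] -= 1.0                                   # subtract lam ...
    poly[-1] -= 1.0                                   # ... and 1
    roots = np.roots(poly)
    real = roots.real[np.abs(roots.imag) < 1e-9]
    return float([r for r in real if r > -1.0 and abs(r) > 1e-9][0])

def g_measure(tail_densities, lam):
    """Eq. (8) applied to a subset, given the densities of its elements."""
    return (np.prod([1.0 + lam * g for g in tail_densities]) - 1.0) / lam

def sugeno(f, densities, lam):
    """Eq. (11); f sorted ascending, densities reordered to match."""
    return max(min(f[i], g_measure(densities[i:], lam)) for i in range(len(f)))

def choquet(f, densities, lam):
    """Eq. (13); same ordering convention, with f(y_0) = 0."""
    f = [0.0] + list(f)
    return sum((f[i] - f[i - 1]) * g_measure(densities[i - 1:], lam)
               for i in range(1, len(f)))

lam = solve_lambda([0.34, 0.32, 0.33])
print(round(lam, 4))                                  # 0.0305
f_vals, dens = [0.1, 0.6, 0.7], [0.33, 0.34, 0.32]    # rearranged as in the example
print(sugeno(f_vals, dens, lam))                      # 0.6
print(round(choquet(f_vals, dens, lam), 3))           # 0.464 (0.462 with g rounded to 0.66)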
4. Fuzzy-fusion based normalization

As mentioned in the foregoing sections, the conventional similarity normalization methods assume that the a priori probabilities of an utterance, given that it comes from one of the cohort speakers, are equal. We instead use the concept of the fuzzy measure to calculate the grades of similarity, or closeness, between each cohort speaker model and the client model (the fuzzy densities), together with the multi-attribute fuzzy measures built from these densities. The final score for the normalization over the cohort speakers can then be determined by combining these fuzzy measures with the corresponding likelihood values using the Choquet integral. In mathematical terms, the proposed model is

$$\log L(X) = \log p(X|S) - \log F(X|S'), \tag{14}$$

where F(X|S') is the fuzzy integral of the likelihood values of an utterance X coming from the cohort speaker set S' = {S_b : b = 1, 2, ..., B} with respect to the fuzzy measures of speaker similarity. It is defined as

$$F(X|S') = \sum_{b=1}^{B} \left[p(X|S_b) - p(X|S_{b-1})\right] g(Z_b|S), \tag{15}$$
where p(X|S_b) has been previously defined, Z_b = {S_b, S_{b+1}, ..., S_B}, g(Z_b|S) is the fuzzy measure of Z_b given that the true speaker is S, p(X|S_0) = 0, and the relation 0 ≤ p(X|S_1) ≤ p(X|S_2) ≤ ... ≤ p(X|S_B) holds; otherwise the elements of S' need to be rearranged. From the foregoing presentation of the fuzzy measure and the fuzzy integral, it is clear that the key factor in the fuzzy fusion process is the fuzzy density: once the fuzzy densities are determined, the fuzzy measures can be identified, and the fuzzy integral can be evaluated. For the fusion of similarity measures, we take the fuzzy density to be the degree of similarity, or closeness, between the acoustic model of a cohort speaker and that of the client; that is, the greater the value of the fuzzy density, the closer the two models. We therefore define the fuzzy density as

$$g(S_b|S) = 1 - \exp\!\left(-\alpha \lVert \bar{v}_b - \bar{v}_S \rVert^2\right), \tag{16}$$
where α is a positive constant, ‖·‖ is the Euclidean norm, v̄_b is the mean code vector of a cohort speaker S_b, and v̄_S is the mean code vector of the client speaker S. It is reasonable to expect that some acoustic models of a cohort speaker, say S_1, may be more similar to those of the client speaker S than those of another cohort speaker, say S_2, while other acoustic models of S_2 may be more similar to those of S than those of S_1. Since the mean code vectors are generated globally from codebooks covering all the different utterances of the speakers, we therefore introduce the constant α in Eq. (16) for each cohort speaker in order to fine-tune the fuzzy density with respect to the Euclidean distance measure. At present we select the values of α by means of the training data; we discuss this issue further in the experimental section.
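Putting Eqs. (14)-(16) together, the fuzzy-fusion score might be sketched as follows. The helper names, the value of α and the synthetic mean code vectors are assumptions for illustration; λ is obtained from Eq. (9) so that the measure of the whole cohort set equals 1.

import numpy as np

def fuzzy_density(v_cohort, v_client, alpha):
    """Eq. (16): closeness of a cohort model to the client model."""
    return float(1.0 - np.exp(-alpha * np.sum((v_cohort - v_client) ** 2)))

def solve_lambda(densities):
    """Unique root lam > -1, lam != 0 of Eq. (9) (as in the earlier sketch)."""
    poly = np.array([1.0])
    for g in densities:
        poly = np.polymul(poly, np.array([g, 1.0]))
    poly[-2] -= 1.0
    poly[-1] -= 1.0
    roots = np.roots(poly)
    real = roots.real[np.abs(roots.imag) < 1e-9]
    return float([r for r in real if r > -1.0 and abs(r) > 1e-9][0])

def fuzzy_fusion_score(likelihoods, densities, lam):
    """Eq. (15): likelihoods p(X|S_b) sorted ascending, densities reordered to match."""
    B = len(likelihoods)
    g_tail = [0.0] * (B + 1)              # g(Z_b|S) for Z_b = {S_b, ..., S_B}
    for b in range(B - 1, -1, -1):        # tail recursion as in Eq. (12)
        g_tail[b] = densities[b] + g_tail[b + 1] + lam * densities[b] * g_tail[b + 1]
    p = [0.0] + list(likelihoods)
    return sum((p[b] - p[b - 1]) * g_tail[b - 1] for b in range(1, B + 1))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    v_client = rng.normal(size=20)                        # synthetic mean code vectors
    cohort = [rng.normal(size=20) for _ in range(3)]
    dens = [fuzzy_density(v, v_client, alpha=0.05) for v in cohort]
    lam = solve_lambda(dens)
    lik = [0.2, 0.5, 0.4]                                 # toy cohort likelihoods p(X|S_b)
    order = np.argsort(lik)
    print(fuzzy_fusion_score([lik[i] for i in order], [dens[i] for i in order], lam))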
5. Experiments

5.1. Measure of performance

One of the most common performance measures for speaker verification systems is the equal error rate (EER), which applies an a posteriori threshold to make the false acceptance rate equal to the false rejection rate. If the score of an identity claim is above a certain threshold, the claim is verified as coming from the true speaker; otherwise the claim is rejected. If the threshold is set high, there is a risk of rejecting a true speaker; conversely, if the threshold is set low, there is a risk of accepting an impostor. To balance the trade-off between these two situations, the threshold is selected at the level which makes the false acceptance rate and the false rejection rate equal, based on the distributions of client and impostor scores. The EER thus offers an efficient way of measuring the degree of separation between the client and the impostor models: the smaller the EER, the better the performance (a short sketch of this computation is given below, after the database description).

5.2. The database

The commercial TI46 speech data corpus is used for the experiments. The TI46 corpus contains 46 utterances spoken repeatedly by 8 female and 8 male speakers, labeled f1-f8 and m1-m8, respectively. The vocabulary contains a set of 10 computer commands: {enter, erase, go, help, no, rubout, repeat, stop, start, yes}. Each speaker repeated the words 10 times in a single training session, and then again twice in each of 8 testing sessions. The corpus is sampled at 12,500 samples/s and 12 bits/sample. The data were processed in 20.48 ms frames at a frame rate of 125 frames/s. The frames were Hamming windowed and preemphasized with k = 0.9. Forty-six mel-spectral bands of width 110 mel and 20 mel-frequency cepstral coefficients (MFCC) were determined for each frame. In the training session, each speaker's 100 training tokens (10 utterances × 1 training session × 10 repetitions) were used to train the speaker-based VQ codebook, by clustering the speaker's MFCC vectors into codebooks of 32, 64 and 128 codewords using the LBG algorithm [20].
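The EER measure of Section 5.1 can be stated in a few lines of Python; this is our minimal sketch over plain score arrays, not the authors' evaluation code.

import numpy as np

def equal_error_rate(client_scores, impostor_scores):
    """Sweep an a posteriori threshold; return the rate where FAR and FRR cross."""
    thresholds = np.sort(np.concatenate([client_scores, impostor_scores]))
    best_gap, eer = None, None
    for t in thresholds:
        far = float(np.mean(impostor_scores >= t))   # false acceptance rate
        frr = float(np.mean(client_scores < t))      # false rejection rate
        if best_gap is None or abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clients = rng.normal(1.0, 0.5, 200)       # synthetic true-speaker scores
    impostors = rng.normal(-1.0, 0.5, 2000)   # synthetic impostor scores
    print(f"EER = {100 * equal_error_rate(clients, impostors):.2f}%")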
5.3. The results

The verification was tested in the text-dependent mode. Since both the geometric mean and the fuzzy fusion methods operate on the principle of integration and depend on the size of the cohort set, we compare the performance of these two methods. This is a closed-set test, as the cohort speakers in the training are the same as those in the testing. For the purpose of comparison, and due to the limited number of speakers, we select for each claimed speaker a cohort set of three speakers of the same gender whose acoustic models are closest to the claimed model. In the testing mode, each cohort speaker's 160 test tokens (10 utterances × 8 testing sessions × 2 repetitions) are tested against each claimed speaker's 10-word models. To identify the fuzzy densities for the cohort speakers, we select the values of α by means of the training data. The range of α was specified to be from 1 to 50, with a unit step size applied in the incremental trial process. It was observed that using different values of α for different speakers could further reduce the equal error rates; however, as an initial investigation we chose the same value for each gender set, that is, α = 10 for the female cohort set and α = 1 for the male cohort set.
Table 2
Equal error rates (%EER) for the 16 speakers using the geometric mean (GM) and fuzzy fusion (FF) based normalization methods

             GM, codebook size           FF, codebook size
Speaker      32      64      128         32      64      128

f1           4.17    3.01    2.40        1.80    1.19    1.19
f2           5.98    1.19    1.79        1.19    0.60    1.20
f3           9.90    5.66    3.67        7.79    3.70    2.33
f4           0.00    0.00    0.00        0.00    0.00    0.00
f5           1.78    1.78    0.59        1.19    0.60    0.00
f6           6.67    3.01    1.80        2.41    0.59    0.00
f7           7.38    4.32    3.61        6.48    4.00    2.30
f8           12.76   9.73    9.22        10.05   8.22    7.62
m1           3.07    3.05    3.06        3.03    3.03    2.43
m2           4.17    1.28    1.22        3.14    1.22    1.22
m3           7.03    7.00    6.32        6.87    6.85    5.92
m4           10.77   8.28    7.90        8.29    6.89    6.91
m5           2.70    2.44    1.80        1.62    0.63    1.19
m6           8.43    7.44    6.53        7.53    5.47    4.72
m7           7.18    5.88    4.83        6.86    4.88    3.65
m8           1.83    3.01    2.40        1.80    1.21    1.19

Female       6.08    3.66    2.89        3.50    2.48    1.92
Male         5.65    4.80    4.17        4.89    3.86    3.40
Average      5.87    4.23    3.53        4.20    3.17    2.66
Table 2 shows the equal error rates for the 16 speakers with the three codebook sizes of 32, 64 and 128 entries. For the verification of the female speakers, using the fuzzy fusion the average EERs are reduced by (6.08 - 3.50) = 2.58%, (3.66 - 2.48) = 1.18% and (2.89 - 1.92) = 0.97% for the codebook sizes of 32, 64 and 128, respectively. For the male speakers, the average EERs are reduced by (5.65 - 4.89) = 0.76%, (4.80 - 3.86) = 0.94% and (4.17 - 3.40) = 0.77%, respectively. The total average EER reductions over both genders for the three codebook sizes are (5.87 - 4.20) = 1.67%, (4.23 - 3.17) = 1.06% and (3.53 - 2.66) = 0.87%, respectively. These results show that the speaker verification system using fuzzy fusion compares favorably with the geometric mean method.
6. Conclusions

A fusion algorithm based on the fuzzy integral has been proposed and implemented for similarity normalization in speaker verification. The experimental results show that the proposed method outperforms the conventional normalization. The key difference between the two methods is that the assumption of equal a priori probabilities is not necessary for the fuzzy integral-based normalization, owing to the concept of the fuzzy measure. Applications of fuzzy measures and fuzzy integrals have been attracting great attention among researchers in the field of pattern recognition [21-24]. Two useful aspects of fuzzy measures are that the importance and interaction of features are taken into account, and that fuzzy integrals serve as a basis for modeling these representations [25]. For this speaker recognition problem, we interpret the importance of features as the similarity between the acoustic models of cohort and client speakers. There are three kinds of interaction: redundancy, complementarity, and independence. The first means that the scoring of the cohort models does not increase significantly if the joint similarity is not greater than the sum of the individual similarities. The second is the converse: the scoring increases significantly when the joint similarity is greater than the sum of the individual similarities. The last indicates that each similarity measure contributes to the total scoring process. The complexity involved in the proposed method lies in the determination of the fuzzy densities and the computation of the fuzzy integrals, which require more computational effort than the conventional method; however, the difference in running time between the two methods was found to be negligible. One important issue arising here for further investigation is the optimal identification
of the fuzzy densities, which offer flexibility and have a great effect on the fuzzy fusion. At present, the fuzzy densities were determined from a rough estimate of the values of α over a small range of integers. One convenient and promising approach is optimization by genetic algorithms, which are random-search algorithms based on the principles of natural genetics and have attracted great attention as function optimizers. Using genetic algorithms, the fuzzy densities can be identified in such a way that the error on the training data is minimized in the least-squares sense. Some typical data-fusion problems which have been successfully tackled by genetic algorithms can be found in Refs. [26,27].
References

[1] S. Furui, An overview of speaker recognition technology, Proceedings of the Workshop on Automatic Speaker Recognition, Identification and Verification, Martigny, Switzerland, 1994, pp. 1-9.
[2] J.P. Campbell, Speaker recognition: a tutorial, Proc. IEEE 85 (1997) 1437-1462.
[3] G.R. Doddington, Speaker recognition evaluation methodology: an overview and perspective, Proceedings of the Workshop on Speaker Recognition and its Commercial and Forensic Applications (RLA2C), Avignon, France, 1998, pp. 60-66.
[4] B.S. Atal, Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification, J. Acoust. Soc. Am. 55 (1974) 1304-1312.
[5] S. Furui, Cepstral analysis techniques for automatic speaker verification, IEEE Trans. Acoust. Speech Signal Process. 29 (1981) 254-272.
[6] A.L. Higgins, L. Bahler, J. Porter, Speaker verification using randomized phrase prompting, Digital Signal Processing 1 (1991) 89-106.
[7] T. Matsui, S. Furui, Concatenated phoneme models for text-variable speaker recognition, Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Minneapolis, USA, 1993, pp. 391-394.
[8] G. Gravier, G. Chollet, Comparison of normalization techniques for speaker verification, Proceedings of the Workshop on Speaker Recognition and its Commercial and Forensic Applications (RLA2C), Avignon, France, 1998, pp. 97-100.
[9] C.S. Liu, H.C. Wang, C.H. Lee, Speaker verification using normalized log-likelihood score, IEEE Trans. Speech Audio Process. 4 (1996) 56-60.
[10] L.A. Zadeh, Fuzzy sets, Inform. and Control 8 (1965) 338-353.
[11] M. Sugeno, Fuzzy measures and fuzzy integrals: a survey, in: M.M. Gupta, G.N. Saridis, B.R. Gaines (Eds.), Fuzzy Automata and Decision Processes, North-Holland, Amsterdam, 1977, pp. 89-102.
[12] Z. Wang, G.J. Klir, Fuzzy Measure Theory, Plenum Press, New York, 1992.
[13] M. Grabisch, H.T. Nguyen, E.A. Walker, Fundamentals of Uncertainty Calculi with Applications to Fuzzy Inference, Kluwer Academic Publishers, Dordrecht, The Netherlands, 1995.
[14] G. Banon, Distinction between several subsets of fuzzy measures, Fuzzy Sets and Systems 5 (1981) 291-305.
[15] K. Leszczynski, P. Penczek, W. Grochulski, Sugeno's fuzzy measure and fuzzy clustering, Fuzzy Sets and Systems 15 (1985) 147-158.
[16] H. Tahani, J.M. Keller, Information fusion in computer vision using the fuzzy integral, IEEE Trans. Systems Man Cybernet. 20 (1990) 733-741.
[17] T. Murofushi, M. Sugeno, An interpretation of fuzzy measures and the Choquet integral as an integral with respect to a fuzzy measure, Fuzzy Sets and Systems 29 (1989) 201-227.
[18] T.D. Pham, H. Yan, A kriging fuzzy integral, Inform. Sci. 98 (1997) 157-173.
[19] S.B. Cho, On-line handwriting recognition with neural-fuzzy method, Proc. IEEE FUZZ-IEEE/IFES'95, Yokohama, Japan, 1995, pp. 1131-1136.
[20] Y. Linde, A. Buzo, R.M. Gray, An algorithm for vector quantizer design, IEEE Trans. Commun. 28 (1980) 84-95.
[21] J.M. Keller, P. Gader, H. Tahani, J.H. Chiang, M. Mohamed, Advances in fuzzy integration for pattern recognition, Fuzzy Sets and Systems 65 (1994) 273-283.
[22] M. Grabisch, J.M. Nicolas, Classification by fuzzy integral: performance and tests, Fuzzy Sets and Systems 65 (1994) 255-271.
[23] S.B. Cho, J.H. Kim, Combining multiple neural networks by fuzzy integral for robust classification, IEEE Trans. Systems Man Cybernet. 25 (1995) 380-384.
[24] Z. Chi, H. Yan, T. Pham, Fuzzy Algorithms with Applications in Image Processing and Pattern Recognition, World Scientific, Singapore, 1996.
[25] M. Grabisch, The representation of importance and interaction of features by fuzzy measures, Pattern Recognition Lett. 17 (1996) 567-575.
[26] A.L. Buczak, R.E. Uhrig, Information fusion by fuzzy set operations and genetic algorithms, Simulation 65 (1995) 52-66.
[27] T.D. Pham, H. Yan, Fusion of handwritten numeral classifiers based on fuzzy and genetic algorithms, Proceedings of the North American Fuzzy Information Processing Society (NAFIPS'97), New York, USA, 1997, pp. 257-262.
About the Author – TUAN D. PHAM received the B.E. degree (1990) in Civil Engineering from the University of Wollongong and the Ph.D. degree (1995) in Civil Engineering, with a thesis on fuzzy-set modeling in the finite element analysis of engineering problems, from the University of New South Wales. From 1994 to 1995, he was a senior systems analyst with Engineering Computer Services Ltd, and from 1996 to early 1997 he was a post-doctoral fellow with the Laboratory for Imaging Science and Engineering in the Department of Electrical Engineering at the University of Sydney. From 1997 to 1998 he held a research fellow position with the Laboratory for Human-Computer Communication in the Faculty of Information Sciences and Engineering at the University of Canberra, and he is
now a lecturer in the School of Computing in the same Faculty. He is a co-author of 2 monographs, author and co-author of over 40 technical papers published in popular journals and conferences. His main research interests include the applications of computational intelligence and statistical techniques to pattern recognition, particularly in image processing, speech and speaker recognition. Dr. Pham is a member of the IEEE.
About the Author – MICHAEL WAGNER received a Diplomphysiker degree from the University of Munich in 1973 and a Ph.D. in Computer Science from the Australian National University in 1979 with a thesis on learning networks for speaker recognition. Dr. Wagner has been involved in speech and speaker recognition research since then and has held research and teaching positions at the Technical University of Munich, the National University of Singapore, the University of Wollongong, the University of New South Wales and the Australian National University. He was the Foundation President of the Australian Speech Science and Technology Association from 1986 to 1992 and is currently a professor and head of the School of Computing at the University of Canberra. Dr. Michael Wagner is a fellow of IEAust and a member of ASSTA, ESCA and IEEE.
Pattern Recognition 33 (2000) 317-331

Effect of resolution and image quality on combined optical and neural network fingerprint matching

C.L. Wilson*, C.I. Watson, E.G. Paek

Information Technology Laboratory, National Institute of Standards and Technology, Stop 8940, Gaithersburg, MD 20899-8940, USA

Received 20 July 1998; accepted 3 February 1999
Abstract

This paper presents results on direct optical matching, using Fourier transforms and neural networks, for matching fingerprints for authentication. Direct optical correlations and hybrid optical neural network correlations are used in the matching system. The test samples used in the experiments are the fingerprints taken from NIST database SD-9. These images, in both binary and gray-level forms, are stored in a VanderLugt correlator (A. VanderLugt, Signal detection by complex spatial filtering, IEEE Trans. Inform. Theory IT-10 (1964) 139-145). Tests of typical cross-correlations and of autocorrelation sensitivity for both binary and 8-bit gray images are presented. When Fourier transform (FT) correlations are used to generate features that are localized to parts of each fingerprint and combined using a neural network classification network and separate class-by-class matching networks, 90.9% matching accuracy is obtained on a test set of 200,000 image pairs. These results are obtained on images of 512-pixel resolution. The effect of image quality and resolution is tested using 256- and 128-pixel images, which yield accuracies of 89.3 and 88.7%. The 128-pixel images show only ridge flow, have no reliably detectable ridge endings or bifurcations, and are therefore not suitable for minutia matching. This demonstrates that Fourier transform matching and neural networks can be used to match fingerprints whose image quality is too low for minutia-based methods. Since more than 258,000 images were used to test each hybrid system, this is the largest test to date of FT matching for fingerprints. Published by Elsevier Science Ltd.

Keywords: Fingerprints; Matching; Optical correlation; Neural networks; Image quality
1. Introduction

This paper presents data on inked fingerprint images matched with optical and hybrid optical neural network correlators. The matching method is tested on an authentication application. The inked fingerprint images are rolled prints scanned at 20 pixels/mm over a 4 cm × 4 cm area of a fingerprint card. We study matching of the inked fingerprints using global optical correlations [1], and using partial optical correlation features with a system of neural classification and matching networks [2]. Images of three different resolutions are tested to determine the
* Corresponding author. Tel.: +301-975-2080; fax: +301-975-5287. E-mail address: [email protected] (C.L. Wilson)
effect of image resolution and quality on matching accuracy. For images of inked rolled fingerprints, even after core alignment and correction for rotation, optical matching of most prints is successful at matching the original image and rejecting other fingerprints, but it fails on second copies of inked rolled images, because plastic pressure distortions and image size variations are too large to allow global Fourier transform (FT) matching. Detailed computer simulations show that global optical matching uses the fine-grained phase-plane structure of the Fourier transform of the fingerprints to produce strong optical correlations. This fine-grained structure is very sensitive to pressure and plastic distortion effects, which then dominate in correlations of static fingerprints. The fine-grained local variations in fingerprints can be compensated for by calculating optical correlations on
smaller zones of the fingerprints. A training set was derived from disk two, volume 1 of SD-9 and the testing set from disk one, volume 1 of SD-9 [3]. This database contains fingerprints of widely varying quality which are representative of the US population; the fingerprints were taken from an unbiased sample of the fingerprint searches submitted to the FBI. Since all fingerprints on disk one were tested against each other, a total of 258,444 tests were performed in each experiment. This is the largest FT-based matching experiment reported to date. In our experiments, two 4×4 matrices of correlations on zones of the fingerprint are used to produce a total of 32 features. One set of correlations is computed with the local zone grid centered on the core, and one set is computed with the core just above and to the left of the grid center. Features were extracted from the correlation data using the correlation peak height, correlation peak width, and correlation area. These features are combined using two types of neural networks. The first network is used to classify the fingerprints [4-6]; this fingerprint classification network works directly with the fingerprint image. After each fingerprint is classified, class-by-class matching networks are trained for each class. These two networks function in a way similar to the binary decision networks discussed in Ref. [7]. For this particular problem, the network training is strongly dependent on regularization and pruning for accurate generalization [2]. The advantage of the combined optical neural network method is its insensitivity to image resolution and quality. The experiments presented in this paper were done with three different image sizes. Initial results were obtained with image samples of 512×512 pixels, sampled at 20 pixels/mm. These images were downsampled to 256×256 and 128×128 using averaging of the gray levels, to achieve sampling rates of 10 and 5 pixels/mm (a sketch of this downsampling is given below). The full matching test was then performed for three combinations of extracted features and for images of each size. As we discuss in Section 3, analysis of ridge spacing data on the test fingerprints shows that the Nyquist sampling limit of two pixels for each ridge and valley occurs at the 256-pixel level. The 128×128 images were sampled at half the Nyquist level and are of low quality, with few clear ridge endings and/or bifurcations. As we discuss in Section 5, the accuracy of the hybrid matching method decreases with image resolution but remains usable even for 128×128 images. In Section 2 we describe the direct optical correlation experiment. In Section 3 we present an analysis of ridge frequency data and its effect on image quality. In Section 4 we discuss combining optical and neural network methods. In Section 5 we present the results of the hybrid system, and in Section 6 we draw some conclusions about the difference in correlations of real-time and rolled inked fingerprints.
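A minimal sketch of the gray-level-averaging downsampling just mentioned, assuming a NumPy image array; the function name is ours.

import numpy as np

def downsample_by_averaging(img, factor=2):
    """Average non-overlapping factor x factor blocks of gray levels."""
    h, w = img.shape
    return img[:h - h % factor, :w - w % factor] \
              .reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

if __name__ == "__main__":
    img512 = np.arange(512 * 512, dtype=float).reshape(512, 512)
    img256 = downsample_by_averaging(img512)   # 10 pixels/mm
    img128 = downsample_by_averaging(img256)   # 5 pixels/mm
    print(img256.shape, img128.shape)          # (256, 256) (128, 128)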
2. Global optical correlations

In the global optical matching experiment, images from NIST Special Database 9 (SD-9) [3] are core aligned using the method discussed in Ref. [4] and cropped to fit the 640×480 pixel field of the pattern recognition system. Two hundred reference fingerprints and second rollings (inked images taken at a different time) are available for autocorrelation and cross-correlation experiments. When binary fingerprints are used, the method is based on that presented in Ref. [5]. Fig. 1 shows a schematic diagram of the optical pattern recognition system. It is based on the conventional VanderLugt correlator [1]. The target fingerprint image is loaded on a spatial light modulator (SLM) and is Fourier transformed by a lens. The resulting Fourier spectrum is interfered with a reference beam to record a Fourier transform hologram. After recording is finished, if an arbitrary input fingerprint is presented on the SLM, the correlation of the input and the target appears in the correlation output plane. Although the spatial heterodyning technique, often called the joint transform correlator [8], has many advantages for real-time applications [9,10] and was used in most of the recent fingerprint recognition experiments [11-16], the VanderLugt correlator was adopted in this experiment. This is because the VanderLugt correlator does not require a fast SLM with high resolution, and the large space-bandwidth product (SBP) available from holographic recording materials provides a high degree of freedom to accommodate various distorted versions of a target that are simultaneously compared with an input. Also, since the information is recorded in the form of a diffraction pattern (hologram) instead of a direct image, it can be used on a credit card or an ID card for security purposes without need for further encoding. Finally, the VanderLugt correlator is better suited for spatial filtering to increase the signal-to-noise ratio (SNR). The critical positioning tolerance problem of the VanderLugt correlator can be greatly relaxed by using in situ recording materials, such as the thermoplastic plates used in this experiment. In this case, once the system is aligned, new holographic filters can be generated with no fear of misalignment. In the global correlation experiment, fingerprint images are generated from the NIST fingerprint database [3]. In the real-time correlation experiment, images are generated by a live-scan fingerprint scanner (Identicator Technology, Gatekey Plus ver. 4.1).¹

¹ Certain commercial equipment may be identified in order to adequately specify or describe the subject matter of this work. In no case does such identification imply recommendation or endorsement by the National Institute of Standards and Technology, nor does it imply that the equipment identified is necessarily the best available for the purpose.
Fig. 1. Diagram of the optical pattern recognition system.
Fig. 2. Histograms of peak correlations for gray (a) and binary (b) fingerprint images.
An electrically addressable liquid crystal SLM (Kopin, LVGA kit, 14 mm diagonal)¹ is used as an input device. The SLM is mounted on a rotational stage to facilitate precise rotational tolerance measurements. Holographic filters are recorded on a thermoplastic plate (Newport Corp. HC-300)¹ that allows fast non-chemical processing, high diffraction efficiency and high sensitivity. Although the recording process cannot be achieved in real time (it takes close to 1 min), the time-consuming comparison of an input with many other images in a large database can be done very fast once a hologram is made. A 10 mW HeNe laser with an ND 2 filter was used as a light source, so only 0.1 mW was used to see the correlation output, owing to the high light efficiency of the system.
The system is also equipped with real-time in situ monitoring of the input image, its Fourier transform, and the correlation output. These monitoring parts, combined with a frame grabber and other analytic tools, permit real-time quantitative analyses and accurate characterization of every stage of the system operation. The correlator system is capable of shift-invariant pattern recognition over a broad range of input positions and has a high SNR, due to accurate alignment using an interferometer and a microscope. Fig. 2 shows a histogram of peak correlations for gray (a) and binary inputs (b). For each of 20 randomly chosen fingerprints, an individual holographic filter was fabricated and tested against the 200 fingerprints in the NIST database stored in the control computer; each plot therefore involves 4000 correlations. Each peak
correlation value was obtained by taking the maximum value in the correlation plane. In the case of the gray inputs shown in (a), all 20 autocorrelations peak at the maximum value (152). The cross-correlations are distributed in a Gaussian shape with a full-width at half-maximum (FWHM) of around 15 and a maximum at 60. For the binary inputs shown in (b), all autocorrelations peak at the maximum value, as for gray inputs. However, in this case the cross-correlations are significantly reduced to zero, except for the few cases which were found to come from the correct fingerprints of the other rolling. For both gray and binary inputs, the autocorrelations are well separated from the cross-correlations, permitting 100% recognition of correct fingerprints (without considering distortions). The exact mechanism for the significant increase in SNR for binary inputs is not completely understood; however, several previous works [17,18] support the experimental results. Such a high SNR for binary inputs can be efficiently used to make a composite filter that tolerates distortion.
3. Fingerprint image characteristics

In this section we present data on fingerprint ridge pitch and frequency and on the effect of image sampling frequency on image quality. The standard sampling frequency for images of inked fingerprints is 500 pixels/in., or 19.7 pixels/mm, approximately 20 pixels/mm. Live-scan systems designed for law enforcement applications use this sampling rate, but live-scan systems designed for verification applications use lower sampling rates, down to approximately 5 pixels/mm. The constraint that ridge frequency imposes on image quality is important both for minutia matching methods and for Fourier transform methods. For minutia matching methods, the ridge structure of the fingerprint must be sampled with sufficient frequency to allow the ridge and valley structure to be accurately detected; this requires approximately two points for each ridge and two points for each valley, as expected from basic Nyquist sampling theory. In the FT case, the sampling frequency is important because it affects the sensitivity of the correlation to plastic distortion. Near the center of the fingerprint, the ridge and valley positions do not vary much with pressure, but at the edges fingerprint ridges may be displaced by a full ridge width, effectively interchanging the ridge and valley positions. Smaller ridge pitch values for an equally elastic finger will increase this effect. Typical effects of elastic distortion are shown in Fig. 14. In previous optical fingerprint correlation studies [11-16], decreasing image size and sampling frequency has decreased sensitivity to rotational alignment and plastic distortion.
3.1. Ridge pitch variation

In a collection of fingerprints, two kinds of variation of ridge spacing are of interest for matcher evaluation: first, variations in ridge pitch within individual fingerprints, and second, variations in ridge pitch across samples of fingerprints, such as volume 1, disk 1 of NIST database 9. The variations discussed here were measured by performing FTs of each fingerprint. The power spectrum of the FT was sampled over angles from 0 to π at 257 different radial distances, and a histogram of relative power vs. spatial frequency was produced for each fingerprint. The average values of these histograms over each class, for male and female subjects, were also produced. The class sample sizes reflect the natural class occurrence rates; the sample sizes for arches and tented arches are about 1/19 of those for loops and whorls. The variations in ridge spacing for two individual fingerprints are shown in Figs. 3 and 5. Both fingerprints have sharp peaks in their power spectrum in the typical ridge pitch range between 0.4 and 0.6 mm, a minimum ridge spacing of about 0.2 mm, and a maximum ridge spacing of about 0.8 mm. The peak powers of the two prints are near the limits for peak power observed in the Special Database 9 sample. The fingerprint with the smaller ridge pitch has a power distribution skewed toward smaller ridge pitch values (see Fig. 3), and the fingerprint with the larger ridge pitch has a power distribution skewed toward larger ridge pitch (see Fig. 5). The fingerprints measured to obtain the two distributions are shown in Figs. 4 and 6, respectively. Both images have a scale of 19.7 pixels/mm, and the ridge pitch shown in Fig. 4 is, as expected from the distributions, 2/3 of that shown in Fig. 6. Fig. 6 also demonstrates that using FT power to measure ridge pitch is robust enough to work
Fig. 3. Relative power of the FT as a function of ridge pitch for a fingerprint with small pitch.
Fig. 4. Example fingerprint with narrow, 0.4 mm, ridge pitch.
Fig. 5. Relative power of the FT as a function of ridge pitch for a fingerprint with large pitch.
well on a fingerprint with a poor-quality image. Examination of the fingerprint images showed that the larger ridges are near the crease at the bottom of the images and the smaller ridges are near the fingertip; this is true for all of the fingerprints tested. As the number of fingerprints used in the calculation of the power spectra vs. ridge pitch is increased, the distributions become smooth and approach a skewed Gaussian form. This is illustrated in Fig. 7 for males with fingerprints classified as whorls and in Fig. 8 for females with fingerprints classified as whorls. In these distributions, the range of ridge pitches in each distribution, 0.2-1.0 mm, is larger than the variations in maximum ridge pitch between males and females, and larger than the variation in maximum ridge pitch between classes. The distributions for other classes of fingerprints have similar shapes and ranges.
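In code, the ridge-pitch measurement described above might look as follows: FT power is accumulated into radial-frequency bins and expressed against ridge pitch in millimetres. The binning is a simplified stand-in for the paper's 257 radial samples over angles from 0 to π, and the test image is a synthetic sinusoidal "ridge" pattern rather than a real fingerprint.

import numpy as np

def radial_power_spectrum(image, pixels_per_mm=19.7, n_bins=257):
    """Histogram of FT power against ridge pitch (mm per ridge period)."""
    f = np.fft.fftshift(np.fft.fft2(image - image.mean()))
    power = np.abs(f) ** 2
    h, w = image.shape
    y, x = np.indices((h, w))
    r = np.hypot(y - h / 2, x - w / 2)              # radial frequency index (cycles/image)
    bins = np.linspace(0.5, r.max(), n_bins)
    idx = np.digitize(r.ravel(), bins)
    prof = np.bincount(idx, weights=power.ravel(), minlength=n_bins + 1)[1:n_bins]
    freq = bins[:n_bins - 1] / (h / pixels_per_mm)  # cycles per mm
    pitch = 1.0 / np.maximum(freq, 1e-9)            # mm per ridge period
    return pitch, prof / prof.sum()

if __name__ == "__main__":
    n, ppm = 512, 19.7
    yy, xx = np.indices((n, n))
    img = np.sin(2 * np.pi * (xx / ppm) / 0.5)      # synthetic ridges with 0.5 mm pitch
    pitch, rel_power = radial_power_spectrum(img, ppm)
    print(f"dominant pitch ~ {pitch[np.argmax(rel_power)]:.2f} mm")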
Fig. 6. Example fingerprint with wide, 0.6 mm, ridge pitch.
Fig. 7. Relative power of the FT as a function of ridge pitch for male fingerprints classed as whorls.
Fig. 8. Relative power of the FT as a function of ridge pitch for female fingerprints classed as whorls.
Fig. 9. Example fingerprint sampled at 20 pixels/mm.
These measurements show that ridge pitch variations in individual fingerprints and in classes of fingerprints are always larger than variations in most-likely ridge pitch across groups of fingerprints or between sexes.

3.2. Image quality

The effects of sampling frequency on image quality are shown in Figs. 9-11 for a single fingerprint sampled at 20, 10, and 5 pixels/mm. Based on the FT power spectra presented in the previous section, we would expect some part of the fingerprint, usually above the core, to be only just adequately sampled at the 20 pixels/mm rate, while most of the fingerprint would be adequately sampled. In Fig. 9 the region of narrow ridge spacing between 1 and 3 o'clock, above and to the right of the core, is just adequately sampled. When the sampling rate is reduced to 10 pixels/mm, as in Fig. 10, some of the minutiae in this region are difficult to detect because of blurring. Most other sections of the fingerprint have clearly defined ridge structure. From the class FT power distributions, such as Fig. 7, we would expect only a few percent of the ridges to be obscured by this 2-to-1 down-sampling, and this is the case.
When the sampling rate is further reduced to 5 pixels/mm, as in Fig. 11, we see a large reduction in image quality. At 5 pixels/mm we would expect, from Fig. 7, that about 40% of the fingerprint would be sampled at a resolution below the expected Nyquist limit. In Fig. 11 most of the minutia locations are blurred, and ridge locations in the area above and to the right of the core are lost at this level of resolution. From the combined effect of the FT ridge pitch analysis and the examples given above, we conclude that for most fingerprints the effect of sampling at 10 pixels/mm will be small, although some part of many fingerprints will be undersampled. Sampling at 5 pixels/mm will make minutiae detection either uncertain or impossible, and will make detection of ridges difficult in a significant part of most fingerprints.
4. Combined optical and neural system

Direct global correlation of fingerprints for matching has a significant failure rate caused by the elasticity of fingerprints. Two rollings of the same print can vary significantly, as seen by computing their Fourier transforms, because of the stretching variations which occur
Fig. 10. Example fingerprint sampled at 10 pixels/mm.
when rolling a fingerprint. Fig. 12 shows the correlation of two rollings of the same print that have been rotation-and-translation-aligned based on the ridge structure around the core. It is clearly seen that the fingerprints correlate (indicated by the dark gray pixels) around the core, but away from the core the patterns have different amounts of elastic distortion. Since the elastic distortion problem is local, a method of local correlation can be used to lower the average distortion in small subregions of the fingerprint.

4.1. Optical features

A solution to the elastic distortion that occurs in different rollings of the same fingerprint is to partition the images into tiles and compare the data within each tile using FT-based methods. For this work, each image was partitioned into 4×4 tiles twice, so that each tile contained one-sixteenth of the total image area. One partition had the core located in the center of the image, as defined by the fingerprint core, and the second partition had the core shifted away from the center so that the new center was located at the corner of one of the first set of partitions. This double partitioning allowed for overlap of data (specifically, data on the edges of the tiles). Since
the neural network is allowed to prune any data that is not needed, excess overlap in the features can be removed during network training. Figs. 13a and b show the core location for each 4×4 partition. In NIST database 9, two rollings of each fingerprint are present; these fingerprint sets are labeled file prints (f(n)) and search prints (s(n)). After partitioning, each f(n) and s(m) pair is compared by correlating the corresponding tiles (32 tiles) for each print and extracting features from the correlations as inputs to the neural network. The features used are the central correlation peak height, correlation peak area and correlation peak width. Fig. 14 shows two print tiles from a matched pair and the corresponding correlation output. The correlation peak data is extracted by taking a cross section (perpendicular to the fingerprint ridge direction) of the peak at the maximum correlation value. Only the central peak, from the maximum value to the first minimum on each side of the curve, is used for extracting features. The peak height is the difference between the maximum value and the lower of the two minima. The peak width is measured between the two minima, and the peak area is the area under the peak curve between the maximum-to-minimum values. The correlation is computed in the Fourier domain by taking the Fourier transform of the partitions and
Fig. 11. Example fingerprint sampled at 5 pixels/mm.
computing the inverse Fourier transform of their product, using the complex conjugate of the first (Eq. (1)):

$$f(n) \circ s(m) = \mathcal{F}^{-1}\big[\mathcal{F}^{*}[f(n)]\ \mathcal{F}[s(m)]\big]. \tag{1}$$

Each f(n) and s(m) pair yields 32 values for each peak feature (i.e., height, area and width), for n = 1, 2, ..., 900 and m = 1, 2, ..., 900:

peak_features[f(n)_1 ∘ s(m)_1]
peak_features[f(n)_2 ∘ s(m)_2]
  ...
peak_features[f(n)_31 ∘ s(m)_31]
peak_features[f(n)_32 ∘ s(m)_32]

Automated feature detection procedures were applied to NIST Special Database 9, Vol. 1, where disk 2 was used as training data and disk 1 as testing data. For this partitioning technique to be effective, the images need to be rotationally and translationally aligned about the cores of the two fingerprints being compared. This alignment was accomplished over a large set of data using an automated technique. There are three steps in the automated alignment: filter/binarize the image, detect the core location, and determine the alignment.
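A short sketch of Eq. (1) and of the peak-feature extraction follows. The one-dimensional cut through the correlation maximum is a simplified stand-in for the cross-section taken perpendicular to the ridge direction, and all names are ours, not the authors'.

import numpy as np

def tile_correlation(f_tile, s_tile):
    """Eq. (1): circular cross-correlation of two tiles via the FFT."""
    return np.real(np.fft.ifft2(np.conj(np.fft.fft2(f_tile)) * np.fft.fft2(s_tile)))

def peak_features(corr):
    """Peak height, width and area from a 1-D cut through the correlation maximum."""
    iy, ix = np.unravel_index(np.argmax(corr), corr.shape)
    w = corr.shape[1]
    row = np.roll(corr[iy], w // 2 - ix)              # centre the peak in the cut
    c = w // 2
    left = c
    while left > 0 and row[left - 1] < row[left]:     # walk down to the first minimum
        left -= 1
    right = c
    while right < w - 1 and row[right + 1] < row[right]:
        right += 1
    base = min(row[left], row[right])                 # lower of the two minima
    height = row[c] - base
    width = right - left
    area = float(np.sum(row[left:right + 1] - base))
    return height, width, area

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tile = rng.normal(size=(128, 128))
    corr = tile_correlation(tile, np.roll(tile, (3, 5), axis=(0, 1)))  # matched, shifted tile
    print(peak_features(corr))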
Filtering, binarization, and core detection are done using methods previously developed and discussed in detail in Ref. [6]. The only addition is that the binarized fingerprint is median filtered using a 3×3 window to help smooth noise in the ridge data and improve correlation performance. The alignment step uses 128×128 segments centered about the cores of the fingerprints being aligned. The correlation of the segments is computed while rotating the second segment over a range of angles, and the angle which produces the largest correlation is used for rotational alignment. Since two prints can have significant angular displacements, the alignment is done in two stages: stage one uses an angular step size of 1° over a range of ±15°, and stage two a step size of 0.2° over a range of ±1° about the angle determined in the first stage. Since the correlation computed by Eq. (1) is translation independent, translational alignment is accomplished by using the peak correlation location from the second stage of the angular alignment; the amount by which the peak correlation is off-center in the 128×128 segment determines how much the second print needs to be shifted to achieve translational alignment with the first.
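The two-stage rotational alignment and the translation estimate just described might be sketched as follows; scipy.ndimage.rotate (assumed available) stands in for the authors' rotation code, and the angle grids follow the 1° / ±15° and 0.2° / ±1° stages given above.

import numpy as np
from scipy import ndimage

def correlation(a, b):
    """FFT cross-correlation as in Eq. (1)."""
    return np.real(np.fft.ifft2(np.conj(np.fft.fft2(a)) * np.fft.fft2(b)))

def align_rotation(seg_a, seg_b):
    """Two-stage angle search: 1-degree steps over +/-15, then 0.2 over +/-1."""
    def best(angles):
        scores = [correlation(seg_a, ndimage.rotate(seg_b, t, reshape=False)).max()
                  for t in angles]
        return float(angles[int(np.argmax(scores))])
    coarse = best(np.arange(-15.0, 15.0 + 1e-9, 1.0))
    return best(np.arange(coarse - 1.0, coarse + 1.0 + 1e-9, 0.2))

def translation_offset(seg_a, seg_b):
    """Signed row/column shift of seg_b relative to seg_a, from the peak location."""
    c = correlation(seg_a, seg_b)
    iy, ix = np.unravel_index(np.argmax(c), c.shape)
    h, w = c.shape
    return ((iy + h // 2) % h - h // 2, (ix + w // 2) % w - w // 2)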
Fig. 12. Correlation of two rollings of the same print. (Dark gray indicates correlated ridges; white and light gray indicate uncorrelated ridges.)
The feature extraction procedure results in a total of 96 features for each pair of fingerprints compared. In SD-9, each fingerprint has one print in the test set that matches and several thousand which do not. Only those prints which do not match but are of the same class are included in the training set; the previously developed neural network classifier [6] is used for this screening process. The 96 features were used in three different neural networks. The first network used only the maximum correlation values as features (32 features). Because of the partitioning of the prints, a main source of error was non-matching prints having a maximum peak value in the range of matching prints. The difference was that matching prints had very narrow, well-defined peaks while non-matching prints had broader, flatter peaks. This led to the next neural network, in which the correlation peak area was added as an input feature (64 features); significant improvements in the matching error rates were obtained (shown later in the paper). Since the shapes of the peak curves vary, an exact correlation could not be established between peak area, height and width. The final network tested used all 96 features (correlation peak maximum, area, and width) and showed a smaller improvement over the 64-feature network.

Fig. 13. Image partitioning and the corresponding feature numbers.
4.2. Neural network matching
The matching networks discussed in this section were trained using the dynamically modified scaled conjugate gradient method presented in Ref. [2]. In Ref. [2], we demonstrated that performance equal to or better than probabilistic neural networks (PNN) [19] can be achieved with a single three-layer multi-layer perceptron (MLP) by making fundamental changes in the network optimization strategy. These changes are: (1) neuron activation functions are used which reduce the probability of singular Jacobians; (2) successive regularization is used to constrain the volume of the weight space being minimized;
Fig. 14. Original matched pair and corresponding correlation.
(3) Boltzmann pruning is used [20] to constrain the dimension of the weight space; and (4) prior class probabilities are used to normalize all error calculations, so that statistically significant samples of rare but important classes can be included without distortion of the error surface. All four of these changes are made in the inner loop of a conjugate gradient optimization iteration [21] and are intended to simplify the training dynamics of the optimization. In this work we found that the sinusoidal activation (change 1 above) was not useful, but that pruning (change 3) and regularization (change 2) were essential to good generalization.
Since the distribution of the match and do-not-match classes was highly unequal, the effect of the prior weights (change 4) was also very important. The optimal regularization factor for all runs was found to be 0.001, and the optimum pruning temperature was found to be 0.005.
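As a rough illustration of the role of regularization and pruning in the matching networks, the following stand-in trains a small MLP with weight decay and one round of magnitude pruning. The paper's scaled conjugate gradient and Boltzmann pruning are replaced here by plain gradient descent and magnitude pruning for brevity, so this is a sketch of the idea, not the authors' algorithm; only the regularization factor of 0.001 is taken from the text.

import numpy as np

def train_mlp(X, y, hidden=24, reg=0.001, prune_frac=0.5, epochs=2000, lr=0.1):
    """Tiny two-class MLP with weight decay and mid-training magnitude pruning."""
    rng = np.random.default_rng(0)
    n, d = X.shape
    W1 = rng.normal(0, 0.5, size=(d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0, 0.5, size=(hidden, 1)); b2 = np.zeros(1)
    mask1, mask2 = np.ones_like(W1), np.ones_like(W2)
    for epoch in range(epochs):
        h = np.tanh(X @ (W1 * mask1) + b1)                  # hidden layer
        p = 1.0 / (1.0 + np.exp(-(h @ (W2 * mask2) + b2)))  # match probability
        grad_out = (p - y[:, None]) / n                     # cross-entropy gradient
        gW2 = h.T @ grad_out + reg * W2
        gh = grad_out @ (W2 * mask2).T * (1 - h ** 2)
        gW1 = X.T @ gh + reg * W1
        W2 -= lr * gW2; b2 -= lr * grad_out.sum(axis=0)
        W1 -= lr * gW1; b1 -= lr * gh.sum(axis=0)
        if epoch == epochs // 2:                            # prune the smallest weights once
            for W, m in ((W1, mask1), (W2, mask2)):
                thr = np.quantile(np.abs(W[m > 0]), prune_frac)
                m[np.abs(W) < thr] = 0.0
    return W1 * mask1, b1, W2 * mask2, b2

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    X = rng.normal(size=(500, 13))                     # 13 K-L features per pair (synthetic)
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)    # toy match / no-match labels
    W1, b1, W2, b2 = train_mlp(X, y)
    print("non-zero first-layer weights:", int((W1 != 0).sum()))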
5. Hybrid results

Five different experiments were performed. In each experiment, all fingerprints from disk one of volume one of SD-9 were used as the test sample, and all fingerprints from disk two of volume one of SD-9 were used for network training.
A total of 258,444 tests were performed in each experimental test sequence. The experiments were designed to test the effect of different methods of FT peak feature extraction and the effect of image resolution on accuracy. The feature extraction experiments used the three methods described above to obtain features from the local FT data; all images were sampled at 20 pixels/mm. In the first of these experiments, the correlation peak height was used as the feature. In the second experiment, the area under the correlation peak was used as an additional feature. In the third set of experiments, the width of the correlation peak was added to the feature set. For each feature set, the Karhunen-Loève (K-L) transformation was used, as in Ref. [7], to reduce the dimensionality of the feature set. Before K-L transformation, these three experiments had feature vector lengths of 32, 64 and 96 elements. In the second set of experiments, the 96-element feature vectors, including correlation peak height, peak area, and peak width, were extracted for sets of images sampled at 20, 10, and 5 pixels/mm. The first set of images in the resolution experiments was the same as the set of images from the third experiment. Each of the local feature sets discussed above was separated into testing and training samples, both by class and as a global (all-class) set. The training sets were used to calculate global and class-by-class covariance matrices and eigenvectors and to calculate K-L transform [22] features for all of the testing and training sets. The effect of the K-L transform was to reduce the feature set sizes from 32 to 13, from 64 to 58, and from 96 to 58 or 59. When the eigenvectors of the K-L transform were examined for the peak-based 32-element feature set, the primary source of variation was found to be in 12 zones near the center of the two feature grids. The first eigenvalue of each of the transforms was approximately 40 times larger than the 13th, indicating that only about 13 statistically independent features were computed from the training sets. No large differences in K-L transform characteristics were seen between the global and class-by-class data sets. When the eigenvectors of the K-L transform were examined for the combined peak-and-area-based 64-element feature set, most elements, 58 of the 64, made a significant contribution to the sample variance. Increasing the feature set width to 96 elements by adding the correlation peak width did not increase the number of useful eigenvalues; the transformed feature vectors were still 58 elements long. We can therefore conclude that peak width and peak area are highly correlated. The K-L transformed features were used to train neural networks for both global and class-by-class matching for each of the five data sets. The networks were trained using regularization to bound the weight size and pruning to restrict the number of active weights in the network.
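A sketch of the K-L (principal component) reduction just described: features are projected onto the leading eigenvectors of the training covariance, and components whose eigenvalues fall below a fraction of the largest (1/40 here, echoing the eigenvalue ratio quoted above) are dropped. This is our illustration on synthetic data, not the NIST code.

import numpy as np

def kl_transform(train_feats, test_feats, ratio=1.0 / 40.0):
    """Project onto eigenvectors whose eigenvalue exceeds ratio * largest."""
    mean = train_feats.mean(axis=0)
    cov = np.cov(train_feats - mean, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    order = np.argsort(evals)[::-1]                 # descending eigenvalues
    evals, evecs = evals[order], evecs[:, order]
    k = int(np.sum(evals > ratio * evals[0]))       # e.g. 32 -> ~13 as reported above
    basis = evecs[:, :k]
    return (train_feats - mean) @ basis, (test_feats - mean) @ basis

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    scale = np.linspace(3.0, 0.1, 32)               # decaying variance across 32 features
    train = rng.normal(size=(1000, 32)) * scale
    test = rng.normal(size=(200, 32)) * scale
    tr, te = kl_transform(train, test)
    print(tr.shape, te.shape)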
Network size, pruning, and regularization were adjusted empirically to provide reasonable generalization. The criterion used to test generalization accuracy was the comparison of the test and training matching errors.

5.1. Correlation peak features

The basic network was a 13-24-2 network with 386 weights, including bias weights. Twenty-four hidden units were needed to provide adequate coverage of the various combinations of interconnections during pruning. A sigmoidal activation function was used for the hidden nodes. With this network size and these training parameters, a typical functioning network has approximately 150 weights and an accuracy of 62-71%. The results of this process are given in Table 1. These results can be significantly improved by using PCASYS or some other classification method to test only prints of the same class for matching. Assuming the PCASYS accuracy of 99% correct classification at 20% rejects, and a natural distribution of classes, the results given above can be improved to 84.3% matching accuracy. If a perfect classifier were available, the combined accuracy would be 90.3%. This model assumes that each print is classified or rejected by PCASYS: the rejected prints are matched with the All network given in the top line of the table, all other prints are matched by the network selected by their PCASYS class, and all prints misclassified by PCASYS are assumed to be mismatched. The process of calculating the results shown in Table 1 involved training runs in which both the regularization and pruning were systematically varied to determine the correct network size and the appropriate dynamics for training. As discussed in Ref. [20], network size is an indication of the amount of information that can be transferred between the training sample and the network without learning random noise patterns. In Table 1, all of the final networks had a potential weight space size of 386 weights.

Table 1
Results of training and testing for global and class-by-class neural network matching using 13 K-L features. Features were taken from the correlation peak in each subregion. Images contained 512 pixels on each axis. Combining all networks with pattern classification yields 85.28% accuracy. All networks had a 13-24-2 structure with 386 weights

Class         Train   Test   Wts. pruned   Test set size

All           70.2    65.2   285           258,444
Arch          71.8    64.9   229           1681
Left loop     72.1    62.6   240           73,984
Right loop    72.5    71.1   209           68,121
Tented arch   75.4    67.3   247           1089
Whorl         72.0    68.5   275           113,569
Larger networks were found to have poorer testing error than networks of this size. The pruning temperature was varied to produce similar testing and training errors for each class and for the global class. As the table shows, this produced weight reductions of 209 to 285 weights, leaving 101-187 non-zero weights. The small network size and the large pruning ratio for acceptable generalization, with training sets of up to 100,000 samples, show that the noise in the features used in the training is at a level where larger network sizes are not useful, because all of the information needed for generalization is learned by these small networks. All of the pruning experiments required that some small amount of regularization be used to constrain the volume of the weight space [2]. This allows the discriminant surfaces to remain in the part of the training space which is heavily populated by the data. All of these runs were done in the 13-feature K-L space, but numerous test pruning and regularization runs were made in the original 32-feature space. Similar effective weight spaces, about 150 weights, were found in the full 32-feature space. The 13-feature data set was selected for additional testing to save on computation time during training.
5.2. Correlation peak and peak area features

The basic network size was a 58-24-2 network with 722 weights, including bias weights, for the class networks, and 59-48-2 for the all-class network. A sigmoidal activation function was used for the hidden nodes. With this network size, from Table 2 we see that a typical functioning network has approximately 100-300 weights and an accuracy of 74-84% for class networks. The global (All) class network had 298 weights and an accuracy of 76%. This experiment differs from the peak feature experiment in that the combined feature set is only reduced from 64 to 58 features. For some classes, such as left loops, this results in greatly improved accuracy, from 62.2 to 84.6%, and a reduced number of weights, from 146 to 90. For other classes the network does not reduce in size or improve in accuracy; for tented arches the accuracy decreases from 75.4 to 74.1% and the number of active weights increases from 139 to 320. Since the training data has a natural class distribution, the data indicate that the classes with relatively small sample sizes, arch and tented arch, did not improve with more complex features, but the classes with larger training set sizes, loops and whorls, improved by an average of 10%.

5.3. Correlation peak, peak area, and peak width features

The basic network sizes were 57-48-2, 58-48-2 and 59-48-2 networks with 1442, 1466 and 1490 weights, including bias weights, for the various networks. A sigmoidal activation function was used for the hidden nodes. With this network size, from Table 3 we see that a typical functioning network has approximately 250-530 weights and an accuracy of 74-84% for class networks. The global (All) class network had 298 weights and an accuracy of 76%. This experiment differs from the two previous experiments in that the required network size has 48 hidden nodes for all of the networks, and pruning on these networks with the training set sizes used is substantially less effective than it was with peak and area features. The less effective pruning of the network doubles the number of weights from 100-300 to 250-530. This shows that, even after feature correlations are removed by the K-L transform, various complex feature combinations are available that are detected in the network training. As in the previous experiment, the classes with relatively small sample sizes, arch and tented arch, did not improve with more complex features, but the classes with larger
Table 2
Results of testing and training for global and class-by-class neural network matching using 56-58 K-L features. All class networks had 24 hidden nodes while the All network had 48 hidden nodes. Features used were based on peak correlation and area under the correlation peak. Images contained 512 pixels on each axis. Combining all networks with pattern classification yields 89.95% accuracy. Typical class networks had a 58-24-2 structure

Class         Train   Test   Wts. pruned   Total wts.   Test set size
All           78.3    76.0   1144          1442         258,444
Arch          78.2    75.0   587           746          1681
Left loop     79.1    84.6   608           698          73,984
Right loop    80.6    80.0   576           722          68,121
Tented arch   86.8    74.1   402           722          1089
Whorl         79.3    80.7   589           722          113,569

Table 3
Results of testing and training for global and class-by-class neural network matching using 58 K-L features. All networks had 48 hidden nodes. Features used were based on peak correlation, the width of the correlation peak, and area under the correlation peak. Images contained 512 pixels on each axis. Combining all networks with pattern classification yields 90.9% accuracy

Class         Train   Test   Wts. pruned   Total wts.   Test set size
All           78.3    76.0   1144          1442         258,444
Arch          80.1    76.0   1135          1490         1681
Left loop     82.2    84.9   1276          1466         73,984
Right loop    82.6    79.9   1251          1466         68,121
Tented arch   96.0    74.5   963           1490         1089
Whorl         84.8    82.7   1235          1490         113,569
Table 4
Results of testing and training for global and class-by-class neural network matching using 58 K-L features. All networks had 48 hidden nodes. Features used were based on peak correlation, the width of the correlation peak, and area under the correlation peak. Images contained 256 pixels on each axis. Combining all networks with pattern classification yields 89.3% accuracy

Class         Train   Test   Wts. pruned   Total wts.   Test set size
All           78.3    76.0   1144          1442         258,444
Arch          76.2    71.3   1276          1466         1681
Left loop     77.6    81.9   1358          1442         73,984
Right loop    76.9    76.0   1343          1442         68,121
Tented arch   80.1    72.0   1276          1466         1089
Whorl         77.5    79.4   1381          1466         113,569

Table 5
Results of testing and training for global and class-by-class neural network matching using 58 K-L features. All networks had 48 hidden nodes. Features used were based on peak correlation, the width of the correlation peak, and area under the correlation peak. Images contained 128 pixels on each axis. Combining all networks with pattern classification yields 88.66% accuracy

Class         Train   Test   Wts. pruned   Total wts.   Test set size
All           78.3    76.0   1144          1562         258,444
Arch          79.0    77.9   1315          1562         1681
Left loop     79.0    81.7   1384          1514         73,984
Right loop    76.6    77.5   1430          1514         68,121
Tented arch   84.7    76.6   1360          1514         1089
Whorl         77.6    73.8   1433          1514         113,569
Table 6
Accuracy of match for different features and sample resolutions

Feature type         Sample resolution (pixels/mm)   Accuracy (%)
Peak                 20                              85.28
Peak+Area            20                              89.95
Peak+Area+Width      20                              90.9
Peak+Area+Width      10                              89.3
Peak+Area+Width      5                               88.66
5.4. Scan resolution of 10 pixels/mm

In this experiment the image scanning resolution was reduced from 20 to 10 pixels/mm. The basic network size was a 57- or 58-48-2 network with 1442 or 1466 weights, including bias weights, for the various networks. The network size is similar to that of the 20 pixels/mm experiment. A sigmoidal activation function was used for the hidden nodes. With this network size, from Table 4 we see that a typical functioning network has approximately 85-190 weights and an accuracy of 71-81% for the class networks. The global (All) class network had 302 weights and an accuracy of 76%.

The two main effects of lower image resolution are to increase pruning effectiveness and to decrease accuracy. The required network size is still about 1466 weights, but the number of useful weights has decreased by about a factor of two, and the class networks have been trained to a correspondingly lower accuracy. The data needed to generate more complex weight sets is missing in the lower-resolution data.

5.5. Scan resolution of 5 pixels/mm

In this experiment the image scanning resolution was reduced from 10 to 5 pixels/mm. The basic network size was a 62- or 60-48-2 network with 1562 or 1514 weights, including bias weights, for the various networks. A sigmoidal activation function was used for the hidden nodes. With this network size, from Table 5 we see that with these training parameters a typical functioning network has approximately 88-247 weights and an accuracy of 74-81% for the class networks. The global (All) class network had 418 weights and an accuracy of 76%.
Further reduction in image resolution from 10 to 5 pixels/mm has two effects. The K-L transform yields 60-62 features instead of 58-59, which results in a small increase in initial network size. This larger network is then pruned to about the same size as in the 10 pixels/mm case, and the final matching accuracy is reduced by less than 1%.

5.6. Summary

In Table 6 the global matching accuracies of all five matching experiments are compared. The largest improvement in accuracy occurs when the peak area is added to the peak height; this improves matching accuracy from 85 to 89%. Adding the peak width provides another 1% increase in accuracy, but the peak width and area are sufficiently correlated that major improvements are not possible.

Decreasing the image resolution decreases matching accuracy. The reduction from 20 to 10 pixels/mm reduces matching accuracy from 90.9 to 89.3%. The most surprising result is that, by using FT-based features, the 5 pixels/mm images can still be used to train an 88.66% accurate matcher, which is only 0.64% lower than the 10 pixels/mm case. This clearly shows that the FT-based features can
be used for matching on images whose quality is too low to provide clear minutiae.

The various operations in the hybrid matching system have limiting speeds which vary over many orders of magnitude. The optical feature extraction occurs at the speed of light, which allows a correlation peak to be processed in about 10 ns. The neural network used here is small and can combine the features to generate a match on a typical PC at a rate of 1000 matches/s. The speed limitation of the system is in the input and output of the data to the optical system: the SLM requires about 50 ms to form an image, and the output camera is limited to NTSC video rates of 60 frames/s.

6. Conclusions

We have compared optical and combined optical-neural network methods for rolled inked fingerprint image matching. For static inked images, direct global optical correlation of inked images made at different times has very low reliability, although cross-correlations and auto-correlations of the original inked images are good. This difficulty can be accounted for by the plastic deformation of the fingerprint during rolling. Combining zonal optical features with neural networks for classification and matching can yield reliable matching with an accuracy of 90.9%. This result was achieved using a neural classification network described elsewhere [4-6] and three components of the local FT correlation to drive class-by-class matching networks.

The information content analysis of the features, both from the dimension of the K-L transform features and from the generalization error analysis, shows that the information transfer from the training data to the classification network is as high as the noise level of the features will allow for each K-L transform, feature set, and image resolution. The method used to achieve this optimal training is discussed in Ref. [2].

In principle, direct combination of multiple real-time images in a holographic matched filter can allow for greater stored information content in the matching process. This will be the subject of further study.

References

[1] A. VanderLugt, Signal detection by complex spatial filtering, IEEE Trans. Inform. Theory IT-10 (1964) 139-145.
[2] C.L. Wilson, J.L. Blue, O.M. Omidvar, Training dynamics and neural network performance, Neural Networks 10 (5) (1997) 907-923.
[3] C.I. Watson, Mated fingerprint card pairs, Technical Report Special Database 9, MFCP, National Institute of Standards and Technology, February 1993.
[4] C.L. Wilson, G.T. Candela, P.J. Grother, C.I. Watson, R.A. Wilkinson, Massively parallel neural network fingerprint classification system, Technical Report NISTIR 4880, National Institute of Standards and Technology, July 1992.
[5] C.L. Wilson, G.T. Candela, C.I. Watson, Neural-network fingerprint classification, J. Artificial Neural Networks 1 (2) (1994) 203-228.
[6] G.T. Candela, P.J. Grother, C.I. Watson, R.A. Wilkinson, C.L. Wilson, PCASYS - a pattern-level classification automation system for fingerprints, Technical Report NISTIR 5647, National Institute of Standards and Technology, 1995.
[7] C.L. Wilson, P.J. Grother, C.S. Barnes, Binary decision clustering for neural network based optical character recognition, Pattern Recognition 29 (3) (1996) 425-437.
[8] C.S. Weaver, J.W. Goodman, Technique for optically convolving two functions, Appl. Opt. 5 (1966) 1248-1249.
[9] F.T.S. Yu, X.J. Lu, A real-time programmable joint transform correlator, Opt. Commun. 52 (1984) 10-16.
[10] J.L. Horner, Optical processing for security and anticounterfeiting, IEEE LEOS '96 Proceedings, Boston, vol. 1, 18-21 November 1996, pp. 228-229.
[11] Y. Petillot, L. Guibert, J.-L. de Bougrenet de la Tocnaye, Fingerprint recognition using a partially rotation invariant composite filter in a FLC joint transform correlator, Opt. Commun. 126 (1996) 213-219.
[12] J. Rodolfo, H. Rajbenbach, J.-P. Huignard, Performance of a photorefractive joint transform correlator for fingerprint identification, Opt. Engng 34 (1995) 1166-1171.
[13] B. Javidi, J. Wang, Position-invariant two-dimensional image correlation using a one-dimensional space integrating optical processor: application to security verification, Opt. Engng 35 (1996) 2479-2486.
[14] T.J. Grycewicz, B. Javidi, Experimental comparison of binary joint transform correlators used for fingerprint identification, Opt. Engng 35 (1996) 2519-2525.
[15] F.T. Gamble, L.M. Frye, D.R. Grieser, Real-time fingerprint verification system, Appl. Opt. 31 (1992) 652-655.
[16] K.H. Fielding, J.L. Horner, C.K. Makekau, Optical fingerprint identification by binary joint transform correlation, Opt. Engng 30 (1991) 1958-1961.
[17] J.L. Horner, H.O. Bartelt, Two-bit correlation, Appl. Opt. 24 (1985) 2889-2893.
[18] D. Psaltis, E.G. Paek, S.S. Venkatesh, Optical image correlation with a binary spatial light modulator, Opt. Engng 23 (1984) 698-704.
[19] D.F. Specht, Probabilistic neural networks, Neural Networks 3 (1) (1990) 109-118.
[20] O.M. Omidvar, C.L. Wilson, Information content in neural net optimization, J. Connection Sci. 6 (1993) 91-103.
[21] J.L. Blue, P.J. Grother, Training feed forward networks using conjugate gradients, in: Conference on Character Recognition and Digitizer Technologies, vol. 1661, SPIE, San Jose, February 1992, pp. 179-190.
[22] P.J. Grother, Cross validation comparison of NIST OCR databases, in: D.P. D'Amato (Ed.), Conference on Character Recognition Technologies, vol. 1906, SPIE, San Jose, 1993, pp. 296-307.
About the Author - C.L. WILSON has worked in various areas of computer modeling, ranging from semiconductor device simulation, for which he received a DOC gold medal in 1983, and computer-aided design to neural network pattern recognition, at Los Alamos National Laboratory, AT&T Bell Laboratories, and, for the last 19 years, NIST. He is currently the manager of the Visual Image Group in the Information Access and User Interface Division. His current research interests are in the application of statistical pattern recognition, neural network methods and dynamic training methods to image recognition, image compression, optical information processing systems, and standards used to evaluate recognition systems.

About the Author - CRAIG I. WATSON received his B.S. in Electrical Engineering Technology from the University of Pittsburgh, Johnstown, in 1991 and his M.S. in Electrical Engineering from The Johns Hopkins University in 1997. He has worked with the Visual Image Processing Group at the National Institute of Standards and Technology for the past 7 years. His work has included image processing, image compression and automated fingerprint classification. Currently, he is working with holographic storage and pattern matching using optical correlators.

About the Author - EUNG GI PAEK received his B.Sc. degree in Physics from Seoul National University in 1972 and the M.Sc. and Ph.D. degrees in Physics from the Korea Advanced Institute of Science in 1976 and 1979, respectively. From 1979 to 1981, he worked at the Agency for Defense Development in Korea. In April 1982, he joined the California Institute of Technology as a postdoctoral fellow and later as a senior research fellow for a period of five years. In early 1987, he joined the Physical Science and Technology group of Bellcore (Bell Communications Research in Red Bank, NJ) as a permanent member of technical staff and a principal investigator. During his seven years at Bellcore, he investigated various applications of photonic devices, interacting with device and material scientists. He later joined Rockwell Science Center to contribute to the DARPA holographic storage project for a year, until he moved to NIST (National Institute of Standards and Technology in Gaithersburg, MD) in March 1995. Currently he is a physicist in the Information Technology Laboratory of NIST. His current interests include biometrics, volume holographic storage and optical neural networks with various photonic devices such as surface-emitting microlasers. Recently, he has also been actively involved in dense WDM-based optical telecommunications and RF photonics. He is a fellow of both SPIE (International Society for Optical Engineering) and the OSA (Optical Society of America). He also serves the Optical Society of America as a Topical Editor of Optics Letters.
Pattern Recognition 33 (2000) 333-340

Bayes empirical Bayes approach to unsupervised learning of parameters in pattern recognition

Tze Fen Li*
Department of Applied Mathematics, National Chung Hsing University, Kuo-Kuang Road, Taichung 40227, Taiwan, Republic of China

Received 29 April 1998; accepted 10 March 1999
Abstract

In the pattern classification problem, it is known that the Bayes decision rule, which separates k classes, gives a minimum probability of misclassification. In this study, all parameters in each class are unknown. A set of unidentified input patterns is used to establish an empirical Bayes rule, which separates the k classes and which leads to a stochastic approximation procedure for estimating the unknown parameters. This classifier can adapt itself to a better decision rule by making use of unidentified input patterns while the system is in use. The results of a Monte Carlo simulation study with normal distributions are presented to demonstrate the favorable estimation of the unknown parameters for the empirical Bayes rule. The percentages of correct classification are also estimated by the Monte Carlo simulation. © 1999 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Classification; Empirical Bayes; Pattern recognition; Stochastic approximation
1. Introduction

A pattern recognition system in general consists of a feature extractor and a classifier. The function of the feature extractor is to extract or measure the important characteristics of the input patterns. Let x denote the measurement of the significant, characterizing features; this x will be called an observation. The function performed by the classifier is to assign each input pattern to one of several possible pattern classes. The decision is made on the basis of feature measurements supplied by the feature extractor. This general approach has been applied to many research areas: speech and speaker recognition, fingerprint identification, electrocardiogram analysis, radar and sonar signal detection, weather forecasting, medical diagnosis, and the reading of handprinted characters and numerals. Since the measurement x of a pattern may contain variation or
* Tel.: +04-287-3028; fax: +04-287-3028.
noise, a classifier may assign an input pattern to a wrong class. The classification criterion is usually the minimum probability of misclassification. Essentially, there are two different approaches to solving the classification problem. One approach is to find a Bayes decision rule which separates the classes based on the present observation X and minimizes the probability of misclassification [1-3]. To apply this approach, one needs sufficient information about the conditional density function f(x|ω) of X given class ω and the prior probability p(ω) of each class. Otherwise, the conditional density function and the prior probability have to be estimated from a set of past observations (a training set of sample patterns). On the other hand, if very little is known about the statistical properties of the pattern classes, a discriminant function D(X, θ_1, …, θ_m) can be used. A learning procedure and an algorithm are designed to learn the unknown parameters θ_i of the discriminant function. This is the so-called nonparametric procedure. The most common nonparametric procedures are the k-means and the k-nearest-neighbor classifiers [1]. After learning, the function is used to separate the pattern
classes [2,4-7]. For this approach, it is not easy to define the exact functional form and the parameters of the discriminant function which give the minimum probability of misclassification. In this study, the first approach is applied to solving k-class pattern problems: all parameters of the conditional density function f(x|ω) are unknown, where ω denotes one of k classes, and the prior probability of each class is unknown. A set of n unidentified input patterns is used to establish a decision rule, called an empirical Bayes (EB) decision rule, which is used to separate the k classes. After learning the unknown parameters, the EB decision rule brings the probability of misclassification arbitrarily close to that of the Bayes rule as the number of unidentified patterns increases. The problem of learning from unidentified samples (called unsupervised learning, or learning without a teacher) presents both theoretical and practical problems [2]. In fact, without any prior assumption, successful unsupervised learning is indeed unlikely. Our classifier, after unsupervised learning of the unknown parameters, can adapt itself to a better and more accurate decision rule by making use of the unidentified input patterns while the system is in use. The results of a Monte Carlo study with normal distributions are presented to demonstrate the favorable estimation of the unknown parameters for the EB decision rule.
2. Empirical Bayes decision rules for classification

Let X be the present observation, which belongs to one of k classes c_i, i = 1, 2, …, k. Consider the decision problem of determining whether X belongs to c_i. Let f(x|ω) be the conditional density function of X given ω, where ω denotes one of the k classes, and let θ_i, i = 1, 2, …, k, be the prior probability of c_i with Σ_{i=1}^k θ_i = 1. In this study, both the parameters of f(x|ω) and the θ_i are unknown. Let d be a decision rule. A simple loss model is used such that the loss is 1 when d makes a wrong decision and 0 when d makes a correct decision. Let θ = {(θ_1, θ_2, …, θ_k); θ_i > 0, Σ_{i=1}^k θ_i = 1} be the prior probabilities. Let R(θ, d) denote the risk function (the probability of misclassification) of d. Let Γ_i, i = 1, 2, …, k, be the k regions separated by d in the domain of X, i.e., d decides c_i when X ∈ Γ_i. Let ξ_i denote all parameters of the conditional density function in class c_i, i = 1, …, k. Then

$$R(\theta, d) = \sum_{i=1}^{k} \theta_i \int_{\Gamma_i^c} f(x \mid \xi_i)\, dx, \qquad (1)$$

where Γ_i^c is the complement of Γ_i. Let D be the family of all decision rules which separate the k pattern classes. For θ fixed, let the minimum probability of misclassification be denoted by

$$R(\theta) = \inf_{d \in D} R(\theta, d). \qquad (2)$$

A decision rule d_θ which satisfies Eq. (2) is called the Bayes decision rule with respect to the prior probability vector θ = (θ_1, θ_2, …, θ_k) and is given by [7]

$$d_{\theta}(x) = c_i \quad \text{if } \theta_i f(x \mid \xi_i) > \theta_j f(x \mid \xi_j) \text{ for all } j \ne i. \qquad (3)$$

In the EB decision problem [8-10], the past observations (ω_m, X_m), m = 1, 2, …, n, and the present observation (ω, X) are i.i.d., and all X_m are drawn from the same conditional densities, i.e., f(x_m|ω_m) with p(ω_m = c_i) = θ_i. The EB decision problem is to establish a decision rule based on the set of past observations X_n = (X_1, X_2, …, X_n). In a pattern recognition system with unsupervised learning, X_n is a set of unidentified input patterns. The decision rule can be constructed using X_n to select a decision rule t_n(X_n) which determines whether the present observation X belongs to c_i. Let ξ = (ξ_1, …, ξ_k). Then the risk of t_n, conditioned on X_n, is R(θ, t_n(X_n)) ≥ R(θ), and the overall risk of t_n is

$$R_n(\theta, t_n) = \int R(\theta, t_n(x_n)) \prod_{m=1}^{n} p(x_m \mid \theta, \xi) \prod_{m=1}^{n} dx_m, \qquad (4)$$

where p(x_m|θ, ξ) is the marginal density of X_m with respect to the prior distribution of classes, i.e., p(x_m|θ, ξ) = Σ_{i=1}^k θ_i f(x_m|ξ_i). Let

$$S = \{\theta, \xi\} \qquad (5)$$

define a parameter space of the prior probabilities θ_i and the parameters ξ_i representing the ith class, i = 1, …, k. Let Φ be a probability distribution on the parameter space S. An EB decision rule t_n is said to be the Bayes EB rule with respect to Φ if it minimizes

$$\hat R(\Phi, t_n) = \int R_n(\theta, t_n)\, d\Phi(\theta, \xi). \qquad (6)$$
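Rule (3) is a straightforward argmax over prior-weighted densities. A minimal sketch, assuming (as in the simulations of Section 4) normal class-conditional densities; the function names are ours.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def bayes_rule(x, priors, params):
    """Eq. (3): decide the class i maximizing theta_i * f(x | xi_i).
    priors: the theta_i; params: the (mu_i, sigma_i) of each class."""
    scores = [theta * normal_pdf(x, mu, sigma)
              for theta, (mu, sigma) in zip(priors, params)]
    return max(range(len(scores)), key=scores.__getitem__)

# Two classes with the first parameter set used in Section 4:
print(bayes_rule(3.2, [0.4, 0.6], [(2.0, 0.5), (4.0, 1.0)]))
```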
Similar approaches to constructing EB decision rules can be found in the recent literature [8-12]. From Eqs. (1) and (4), Eq. (6) can be written as

$$\hat R(\Phi, t_n) = \sum_{i=1}^{k} \int \int_{\Gamma_{i,n}^c} \left[ \int f(x \mid \xi_i)\, \theta_i \prod_{m=1}^{n} p(x_m \mid \theta, \xi)\, d\Phi(\theta, \xi) \right] dx \prod_{m=1}^{n} dx_m, \qquad (7)$$

where, in the domain of X, the Γ_{i,n}, i = 1, 2, …, k, are k regions separated by t_n(X_n), and hence they depend on the past observations X_n. The Bayes EB rule which minimizes Eq. (7) can be found in the same way as for the Bayes
rule (3) and is given by

$$\hat t_n(x_n)(x) = c_i \quad \text{if } \int f(x \mid \xi_i)\, \theta_i \prod_{m=1}^{n} p(x_m \mid \theta, \xi)\, d\Phi(\theta, \xi) > \int f(x \mid \xi_j)\, \theta_j \prod_{m=1}^{n} p(x_m \mid \theta, \xi)\, d\Phi(\theta, \xi) \quad \text{for all } j \ne i. \qquad (8)$$

In applications, we let the parameters ξ_i, i = 1, …, k, be bounded by finite numbers M_i. Let ρ > 0 and δ > 0. Consider the subset S_1 of the parameter space S defined by

$$S_1 = \left\{ (n_1\rho, n_2\rho, \ldots, n_k\rho, n_{k+1}\delta, n_{k+2}\delta, \ldots, n_{2k}\delta);\ n_i \text{ integers},\ \sum_{i=1}^{k} n_i\rho = 1,\ |n_i\delta| \le M_i,\ i = k+1, \ldots, 2k \right\}, \qquad (9)$$

where (n_1ρ, …, n_kρ) are prior probabilities and (n_{k+1}δ, …, n_{2k}δ) are the parameters of the k classes. Let Φ be a finite discrete distribution on the parameter space S with equal mass on S_1. The boundary for class i relative to another class j, as separated by Eq. (8), can be represented by the equation

$$E[f(x \mid \xi_i)\theta_i \mid x_n] = E[f(x \mid \xi_j)\theta_j \mid x_n], \qquad (10)$$

where E[f(x|ξ_i)θ_i | x_n] is the conditional expectation of f(x|ξ_i)θ_i given X_n = x_n, taken with the conditional probability function of (θ, ξ) given X_n = x_n, which under the equal-mass prior on S_1 is

$$h(\theta, \xi \mid x_n) = \frac{\prod_{m=1}^{n} p(x_m \mid \theta, \xi)}{\sum_{(\theta', \xi') \in S_1} \prod_{m=1}^{n} p(x_m \mid \theta', \xi')}. \qquad (11)$$

Let λ = (θ, ξ) denote a point of S and let λ⁰ denote the true parameter point. Define

$$H(\lambda^0, \lambda) = \int \log p(x \mid \lambda)\, p(x \mid \lambda^0)\, dx. \qquad (12)$$

Then the Kullback-Leibler information number H(λ⁰, λ⁰) − H(λ⁰, λ) ≥ 0, with equality if and only if p(x|λ) = p(x|λ⁰) for all x, i.e., H(λ⁰, λ) has an absolute maximum at λ = λ⁰. Let H(λ⁰, λ̂) = max_{λ∈S_1} H(λ⁰, λ), with λ̂ = (θ̂_1, …, θ̂_k, ξ̂_1, …, ξ̂_k). Since H(λ⁰, λ) is a smooth function of λ, the maximum point λ̂ will be located close to the maximum point λ⁰ in S if the increments ρ and δ are small.
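The Kullback-Leibler inequality just invoked is easy to check numerically. A small Monte Carlo sketch for the two-class normal mixture used in the later simulations; all names and parameter values here are ours and purely illustrative.

```python
import math, random

def log_mixture(x, lam):
    """log p(x | lambda) for a two-class normal mixture lam = (p, mu1, mu2, s1, s2)."""
    p, mu1, mu2, s1, s2 = lam
    f1 = math.exp(-0.5 * ((x - mu1) / s1) ** 2) / (s1 * math.sqrt(2 * math.pi))
    f2 = math.exp(-0.5 * ((x - mu2) / s2) ** 2) / (s2 * math.sqrt(2 * math.pi))
    return math.log(p * f1 + (1 - p) * f2)

def H(lam0, lam, n=100_000, seed=0):
    """Monte Carlo estimate of H(lam0, lam) = E[log p(X | lam)], X ~ p(. | lam0)."""
    rng = random.Random(seed)
    total = 0.0
    p0, mu1, mu2, s1, s2 = lam0
    for _ in range(n):
        x = rng.gauss(mu1, s1) if rng.random() < p0 else rng.gauss(mu2, s2)
        total += log_mixture(x, lam)
    return total / n

lam0 = (0.4, 2.0, 4.0, 0.5, 1.0)       # "true" parameter point
lam = (0.4, 2.2, 3.8, 0.5, 1.0)        # a nearby candidate
print(H(lam0, lam0) - H(lam0, lam))    # nonnegative, as the inequality asserts
```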
Theorem 1. Let λ = (θ, ξ) in S. The conditional probability function h(λ | x_n) given X_n = x_n in Eq. (11) has the following property: for each λ ∈ S_1,

$$\lim_{n \to \infty} h(\lambda \mid x_n) = \begin{cases} 0 & \text{if } \lambda \ne \hat\lambda, \\ 1 & \text{if } \lambda = \hat\lambda, \end{cases} \qquad (13)$$

and hence E[λ | X_n] converges a.s. to λ̂.

Proof. H(λ⁰, λ) has a maximum value at λ̂ on S_1. Let λ ∈ S_1 and λ ≠ λ̂. Consider

$$\frac{1}{n+1} \sum_{m=1}^{n+1} \log \frac{p(x_m \mid \lambda)}{p(x_m \mid \hat\lambda)},$$

which by the strong law of large numbers converges a.s. to H(λ⁰, λ) − H(λ⁰, λ̂) < 0; hence the product Π_{m=1}^{n+1} p(x_m|λ)/p(x_m|λ̂) converges a.s. to 0, and Eq. (13) follows from Eq. (11).

The conditional probability in Eq. (11) can be evaluated recursively, one unidentified pattern at a time:

$$h(\lambda \mid x_{n+1}) = \frac{p(x_{n+1} \mid \lambda)\, h(\lambda \mid x_n)}{\sum_{\lambda' \in S_1} p(x_{n+1} \mid \lambda')\, h(\lambda' \mid x_n)}, \qquad (14)$$

where h(λ | x_n) = 1 if n = 0. Eq. (14) is different from the usual stochastic approximation procedures: it does not have a regression function or an obvious decreasing sequence of coefficients, but it appears to be a weighted product of the estimates calculated from the old patterns and the contribution of the new pattern. From the simulation results in the next section, our approach does show the
effect of diminishing influence of the new patterns, although a decreasing sequence of coefficients is not used in our procedure. Eq. (14) is a stable stochastic approximation procedure which does not produce large computation errors, and hence it is used later for the simulation study. In each step of the evaluation, Eq. (14) multiplies the old conditional probability h(λ | x_n) by a new probability factor based on the new pattern x_{n+1}. The convergence of Eq. (14) is guaranteed by Theorem 1. Let X be an unidentified pattern. The algorithm implementing Eq. (14) can be stated as follows:

1. Set n = 0.
2. Set h(λ | x_n) = 1 for each point λ in S_1 before inputting any unidentified pattern.
3. While not EOF do:
   Begin
4.   Read(X).
5.   Compute the marginal density function p(x | λ') for each point λ' in S_1.
6.   Compute the conditional probability h(λ | x_{n+1}) given x_{n+1} in Eq. (14).
7.   n = n + 1.
   End.

From Theorem 1, the rate of convergence of the estimates E[λ | X_n] to λ̂ depends upon how fast the conditional probabilities h(λ | x_n) converge to 0 if λ ≠ λ̂ and to 1 if λ = λ̂. Hence, it is difficult to discuss the rate of convergence, but usually the speed of convergence depends on the distances among the class means, the values of the class variances, and the increments ρ and δ in S_1. The simulation results in the next section show that all estimates steadily converge to the true values of the parameters, as expected from Eq. (14). The details are discussed in Section 4.
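A compact rendering of steps 1-7 for the two-class normal-mixture case of Example 1 below; the function names are ours, and the grid is purely illustrative and far coarser than the increments used in the paper.

```python
import itertools
import math

def mixture_density(x, lam):
    """p(x | lambda) for a two-class normal mixture lam = (p, mu1, mu2, s1, s2)."""
    p, mu1, mu2, s1, s2 = lam
    f1 = math.exp(-0.5 * ((x - mu1) / s1) ** 2) / (s1 * math.sqrt(2 * math.pi))
    f2 = math.exp(-0.5 * ((x - mu2) / s2) ** 2) / (s2 * math.sqrt(2 * math.pi))
    return p * f1 + (1 - p) * f2

def update(posterior, x):
    """One step of Eq. (14): weight every grid point by the marginal density
    of the new pattern and renormalize."""
    new = {lam: w * mixture_density(x, lam) for lam, w in posterior.items()}
    z = sum(new.values())
    return {lam: w / z for lam, w in new.items()}

def posterior_mean(posterior):
    """E[lambda | x_1, ..., x_n], the estimate tabulated in Section 4."""
    k = len(next(iter(posterior)))
    return tuple(sum(lam[j] * w for lam, w in posterior.items()) for j in range(k))

# A coarse finite grid S1 of candidate points, each with equal mass (step 2):
grid = [(p, m1, m2, 0.5, 1.0)
        for p in (0.2, 0.4, 0.6)
        for m1, m2 in itertools.product((1.5, 2.0, 2.5), (3.5, 4.0, 4.5))]
posterior = {lam: 1.0 / len(grid) for lam in grid}
for x in [1.8, 4.2, 3.9, 2.1, 4.5]:    # unidentified input patterns (steps 3-7)
    posterior = update(posterior, x)
print(posterior_mean(posterior))
```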
Table 1
The estimates from the first pass are shown in (a), the estimates from the second pass in (b), and the percentages of correct classification obtained from the true and the estimated parameters in (c). (The graphs accompanying the tables in the original, omitted here, plot the convergence of the estimates to the true parameters.)

True parameter vector (μ1, μ2, σ1, σ2, p) = (2, 4, 0.5, 1, 0.4)

(a)
No. of mixed samples (n)   μ̂1      μ̂2      σ̂1      σ̂2      p̂
100                        2.007   3.996   0.507   0.910   0.485
200                        2.000   4.000   0.500   0.900   0.496
300                        2.000   4.000   0.500   0.900   0.495
400                        2.000   4.000   0.500   0.900   0.499
500                        2.000   4.000   0.500   0.900   0.499
True parameter             2       4       0.5     1       0.4

(b)
100                        2.155   4.111   0.607   0.851   0.503
200                        2.033   3.856   0.469   0.934   0.396
300                        1.915   3.763   0.451   1.057   0.334
400                        1.952   3.775   0.509   1.053   0.359
500                        2.053   3.967   0.530   0.938   0.413
True parameter             2       4       0.5     1       0.4

(c) The percentage of correct classification
                       k1/10,000   k2/10,000   Total
True parameter         0.932       0.898       0.912
Estimated parameter    0.923       0.895       0.906

True parameter vector (μ1, μ2, σ1, σ2, p) = (1.1, 4.3, 0.8, 1.3, 0.2)

(a)
100                        2.534   4.501   1.521   1.166   0.508
200                        1.599   4.247   0.936   1.208   0.270
300                        1.335   4.158   0.797   1.312   0.199
400                        1.394   4.186   0.802   1.351   0.216
500                        1.469   4.234   0.845   1.209   0.240
True parameter             1.1     4.3     0.8     1.3     0.2

(b)
100                        1.535   4.185   1.066   1.296   0.248
200                        1.443   4.277   0.947   1.126   0.262
300                        1.249   4.261   0.861   1.193   0.234
400                        1.235   4.166   0.852   1.295   0.215
500                        1.188   4.228   0.870   1.210   0.221
True parameter             1.1     4.3     0.8     1.3     0.2

(c) The percentage of correct classification
                       k1/10,000   k2/10,000   Total
True parameter         0.875       0.961       0.944
Estimated parameter    0.848       0.956       0.932
Table 1 (continued)

True parameter vector (μ1, μ2, σ1, σ2, p) = (1.7, 5.3, 2, 1.1, 0.7)

(a)
No. of mixed samples (n)   μ̂1      μ̂2      σ̂1      σ̂2      p̂
100                        1.412   4.319   1.863   1.389   0.548
200                        1.267   3.764   1.618   1.742   0.410
300                        1.759   4.747   2.017   1.406   0.645
400                        1.703   4.744   1.987   1.408   0.645
500                        1.971   4.976   2.088   1.307   0.694
True parameter             1.7     5.3     2       1.1     0.7

(b)
100                        1.906   4.867   2.104   1.100   0.726
200                        1.964   4.899   1.924   1.182   0.754
300                        1.866   5.085   2.022   1.185   0.746
400                        1.760   5.103   2.012   1.199   0.733
500                        1.757   5.153   1.932   1.137   0.718
True parameter             1.7     5.3     2       1.1     0.7

(c) The percentage of correct classification
                       k1/10,000   k2/10,000   Total
True parameter         0.895       0.848       0.881
Estimated parameter    0.897       0.810       0.872

True parameter vector (μ1, μ2, σ1, σ2, p) = (3.1, 4.1, 1.2, 0.5, 0.4)

(a)
100                        2.898   4.000   1.217   0.500   0.346
200                        2.993   4.000   1.059   0.500   0.303
300                        2.999   4.000   1.300   0.500   0.304
400                        3.000   4.000   1.300   0.500   0.304
500                        3.000   4.000   1.300   0.500   0.300
True parameter             3.1     4.1     1.2     0.5     0.4

(b)
100                        2.914   4.101   1.122   0.502   0.382
200                        2.965   4.142   1.004   0.408   0.412
300                        2.796   4.010   1.179   0.570   0.313
400                        2.889   4.002   1.180   0.532   0.358
500                        3.109   4.044   1.180   0.455   0.429
True parameter             3.1     4.1     1.2     0.5     0.4

(c) The percentage of correct classification
                       k1/10,000   k2/10,000   Total
True parameter         0.604       0.939       0.805
Estimated parameter    0.626       0.934       0.802
4. Simulation results

4.1. Parameter estimation

In the simulation study, the recursive formula (14) is used to compute the estimates E[λ | x_n] of the unknown parameters λ = (θ, ξ). In using this recursive formula, we have to save the conditional probability h(λ' | x_n) for each λ' ∈ S_1 in order to compute the conditional probability h(λ | x_{n+1}). Because the memory space of a PC is limited, the recursive formula is used twice: in the first pass, large increments ρ and δ in Eq. (9) are used to obtain rough estimates, and in the second pass, based on the rough estimates, small increments ρ and δ are used to obtain more accurate estimates. In the second pass, the subspace S_1 is defined by only one large increment around the rough estimate. 100 unclassified input patterns are generated from normal distributions, and the number of samples is increased by another 100 until a total of 500 samples is generated. We would like to show how well the estimates converge to the true parameters.
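The two-pass scheme amounts to building a second, finer parameter grid centred on the first-pass estimates. A minimal sketch, with our own names and illustrative values:

```python
import itertools

def refined_grid(rough, steps, half_width=2):
    """Second-pass grid S1: for each parameter j, take points spaced steps[j]
    apart inside one first-pass increment around the rough estimate rough[j]."""
    axes = [[r + k * s for k in range(-half_width, half_width + 1)]
            for r, s in zip(rough, steps)]
    return list(itertools.product(*axes))

# First pass: increments rho = 0.1 (priors) and delta = 1 (means, std devs);
# second pass: rho = 0.02 and delta = 0.2 around the rough estimates.
rough = (0.4, 2.0, 4.0, 0.5, 1.0)               # (p, mu1, mu2, sigma1, sigma2)
grid2 = refined_grid(rough, steps=(0.02, 0.2, 0.2, 0.2, 0.2))
print(len(grid2))                                # 5**5 = 3125 candidate points
```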
4.2. Percentage estimation of correct classification

Although the empirical Bayes decision rule in Eq. (8) could be used to estimate the percentage of correct classification, the integration in Eq. (8) is complicated. Since we obtain estimates of the parameters (θ, ξ), it is easier to use the Bayes rule (3) to estimate the percentage of correct classification. A hit-and-miss method is used to obtain this estimate: in the simulation study, we generate 10,000 patterns in each class based on the estimates of the parameters and use the Bayes rule (3) to find the percentage of correct classification. Obviously, more accurate parameter estimates in each class yield more accurate percentages of correct classification. Higher percentages of correct classification depend upon the distances between the class means and the sizes of the variances. A different approach to estimating correct classification can be found in Ref. [18].

Example 1. Unidentified patterns are generated from two different classes of normal distributions N(μ1, σ1) and N(μ2, σ2) with parameters (μ1, μ2, σ1, σ2, p), where p denotes the prior probability of class 1. For the two-class system, four different sets of parameters (μ1, μ2, σ1, σ2, p) = (2, 4, 0.5, 1, 0.4), (1.1, 4.3, 0.8, 1.3, 0.2), (1.7, 5.3, 2, 1.1, 0.7) and (3.1, 4.1, 1.2, 0.5, 0.4) are used to generate input patterns. In the first pass, we use the increments
Table 2
The estimates from the first pass are shown in (a), the estimates from the second pass in (b), and the percentages of correct classification obtained from the true and the estimated parameters in (c). (The graphs accompanying the tables in the original, omitted here, plot the convergence of the estimates to the true parameters.)

True parameter vector (μ1, μ2, μ3, p1, p2) = (2, 4, 6, 0.4, 0.3)

(a)
No. of mixed samples (n)   μ̂1      μ̂2      μ̂3      p̂1      p̂2
100                        1.900   3.359   4.636   0.250   0.188
200                        1.999   3.969   5.519   0.369   0.325
300                        1.999   3.999   5.945   0.455   0.314
400                        2.000   3.999   6.041   0.495   0.301
500                        2.000   3.999   5.744   0.445   0.301
True parameter             2       4       6       0.4     0.3

(b)
100                        1.988   3.949   5.547   0.405   0.289
200                        2.061   4.033   5.703   0.413   0.320
300                        1.929   4.034   5.987   0.398   0.341
400                        1.906   3.910   5.997   0.390   0.344
500                        1.953   3.922   5.896   0.384   0.329
True parameter             2       4       6       0.4     0.3

(c) The percentage of correct classification
                       k1/10,000   k2/10,000   k3/10,000   Total
True parameter         0.875       0.780       0.649       0.792
Estimated parameter    0.859       0.800       0.676       0.788

True parameter vector (μ1, μ2, μ3, p1, p2) = (2, 3, 4.5, 0.4, 0.3)

(a)
100                        2.067   2.845   3.444   0.251   0.202
200                        2.004   2.998   4.470   0.335   0.429
300                        2.000   2.999   4.907   0.396   0.386
400                        2.000   2.999   4.982   0.485   0.311
500                        2.000   3.000   4.663   0.386   0.345
True parameter             2       3       4.5     0.4     0.3

(b)
100                        2.041   2.932   4.302   0.389   0.335
200                        2.147   2.914   4.445   0.407   0.363
300                        1.969   3.006   4.651   0.397   0.368
400                        1.952   2.928   4.642   0.395   0.363
500                        2.011   2.929   4.456   0.387   0.335
True parameter             2       3       4.5     0.4     0.3

(c) The percentage of correct classification
                       k1/10,000   k2/10,000   k3/10,000   Total
True parameter         0.733       0.574       0.644       0.659
Estimated parameter    0.676       0.634       0.622       0.647
Table 2 (continued)

True parameter vector (μ1, μ2, μ3, p1, p2) = (3.1, 4.1, 7.3, 0.3, 0.4)

(a)
No. of mixed samples (n)   μ̂1      μ̂2      μ̂3      p̂1      p̂2
100                        3.054   4.013   7.013   0.314   0.434
200                        2.997   4.000   7.000   0.298   0.497
300                        2.996   4.000   7.000   0.299   0.428
400                        2.999   4.000   7.000   0.299   0.403
500                        2.999   4.000   7.000   0.299   0.366
True parameter             3.1     4.1     7.3     0.3     0.4

(b)
100                        3.138   4.107   7.005   0.307   0.398
200                        3.127   4.091   7.096   0.288   0.426
300                        2.989   4.095   7.213   0.286   0.420
400                        2.992   4.097   7.209   0.293   0.416
500                        3.015   4.099   7.216   0.283   0.425
True parameter             3.1     4.1     7.3     0.3     0.4

(c) The percentage of correct classification
                       k1/10,000   k2/10,000   k3/10,000   Total
True parameter         0.599       0.920       0.946       0.831
Estimated parameter    0.613       0.930       0.939       0.843

True parameter vector (μ1, μ2, μ3, p1, p2) = (3, 4.5, 6.6, 0.4, 0.3)

(a)
100                        2.991   4.355   6.067   0.391   0.243
200                        2.999   4.079   6.056   0.314   0.300
300                        2.999   4.819   6.818   0.463   0.300
400                        2.999   4.985   6.986   0.497   0.299
500                        3.000   4.869   6.869   0.473   0.299
True parameter             3       4.5     6.6     0.4     0.3

(b)
100                        3.064   4.940   6.468   0.457   0.306
200                        3.146   4.835   6.534   0.452   0.323
300                        3.039   4.767   6.681   0.433   0.332
400                        3.039   4.704   6.688   0.437   0.325
500                        3.062   4.659   6.624   0.423   0.323
True parameter             3       4.5     6.6     0.4     0.3

(c) The percentage of correct classification
                       k1/10,000   k2/10,000   k3/10,000   Total
True parameter         0.879       0.573       0.855       0.780
Estimated parameter    0.883       0.611       0.802       0.775
o"0.1 for prior probabilities and d"1 for means and standard deviations in Eq. (9). The estimates of the parameters are presented in Table 1a. At the second time, we use the increments o"0.02 for prior probabilities and d"0.2 for means and standard deviations. The estimates are presented in Table 1b, where the estimates are closer to true parameters than those in Table 1a. The graphs below the tables are presented to show the performance of the convergence of estimates to true parameters. For estimation of correct classi"cation, 10 000 patterns are generated both from the true parameters and the estimates of parameters. The hit and miss method "nds two percentages of correct classi"cation, which are pre-
in Table 1c. Both percentages come close to each other.

Example 2. In this example, a three-class system is discussed. The class means and the prior probabilities are unknown, but the class variances are known. A total of five unknown parameters are estimated using the recursive formula (14). The four different sets of parameters are (μ1, μ2, μ3, σ1, σ2, σ3, p1, p2) = (2, 4, 6, 1, 0.8, 1.8, 0.4, 0.3), (2, 3, 4.5, 1, 0.8, 1.5, 0.4, 0.3), (3.1, 4.1, 7.3, 1.1, 0.5, 1.3, 0.3, 0.4) and (3, 4.5, 6.6, 0.8, 1, 1, 0.4, 0.3), where μi, i = 1, 2, 3, are the three class means and σi, i = 1, 2, 3, are the three standard deviations; p1 and p2 are the prior probabilities of class 1 and class 2. The recursive formula is used twice as
in Example 1 to obtain accurate estimates, which are shown in Table 2.
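The hit-and-miss computation used in both examples is easy to reproduce. A sketch for the two-class case, with our own function names; applied to a parameter set of Example 1 it estimates the per-class rates reported in the tables as k1/10,000 and k2/10,000.

```python
import math, random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def percent_correct(priors, params, n_per_class=10_000, seed=1):
    """Hit-and-miss estimate: draw n_per_class patterns from each class and
    classify them with the Bayes rule (3) built from (priors, params)."""
    rng = random.Random(seed)
    rates = []
    for i, (mu, sigma) in enumerate(params):
        hits = 0
        for _ in range(n_per_class):
            x = rng.gauss(mu, sigma)
            scores = [t * normal_pdf(x, m, s) for t, (m, s) in zip(priors, params)]
            hits += max(range(len(scores)), key=scores.__getitem__) == i
        rates.append(hits / n_per_class)
    return rates

# First parameter set of Example 1: (mu1, mu2, s1, s2, p) = (2, 4, 0.5, 1, 0.4)
print(percent_correct([0.4, 0.6], [(2.0, 0.5), (4.0, 1.0)]))
```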
5. Conclusion

In this paper, a set of unidentified input patterns is used to establish an empirical Bayes decision rule and a stochastic approximation, which is used to compute the estimates of all unknown parameters. The stochastic approximation can be written in a recursive form (14), which adapts the estimates toward the true parameters by making use of new input patterns. The simulation results show that the recursive formula brings all parameter estimates, and the estimated percentages of correct classification, close to their true values.
Acknowledgements The author is grateful to the editor and referees for their valuable suggestions for the improvement of this paper.
References

[1] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, New York, 1972.
[2] R.L. Kasyap, C.C. Blayton, K.S. Fu, Stochastic approximation, in: J.M. Mendel, K.S. Fu (Eds.), Adaptation, Learning and Pattern Recognition Systems: Theory and Applications, Academic Press, New York, 1970.
[3] T.Y. Young, T.W. Calvert, Classification, Estimation and Pattern Recognition, Elsevier, New York, 1974.
[4] A.G. Barto, P. Anandan, Pattern recognizing stochastic learning automata, IEEE Trans. Systems Man Cybernet. SMC-15 (1985) 360-375.
[5] H. Do-Tu, M. Installe, Learning algorithm for nonparametric solutions to the minimum error classification problem, IEEE Trans. Comput. C-27 (1978) 648-657.
[6] M.A.L. Thathachar, K.R. Ramakrishnan, A cooperative game of a pair of learning automata, Automatica 20 (1984) 797-801.
[7] M.A.L. Thathachar, P.S. Sastry, Learning optimal discriminant functions through a cooperative game of automata, IEEE Trans. Systems Man Cybernet. SMC-17 (1987) 73-85.
[8] J.J. Deely, D.V. Lindley, Bayes empirical Bayes, J. Amer. Statist. Assoc. 76 (1981) 833-841.
[9] M. Ghosh, Estimation of multiple Poisson means: Bayes and empirical Bayes, Statist. Decisions 1 (1983) 183-195.
[10] D.C. Gilliland, J.E. Boyer, H.J. Tsao, Bayes empirical Bayes: finite parameter case, Ann. Statist. 10 (1982) 1277-1282.
[11] G. Meeden, Some admissible empirical Bayes procedures, Ann. Math. Statist. 43 (1972) 96-101.
[12] H. Robbins, An empirical Bayes approach to statistics, Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, University of California Press, 1956, pp. 157-163.
[13] S. Kullback, Information Theory and Statistics, Wiley, New York, 1959.
[14] S.S. Wilks, Mathematical Statistics, Wiley, New York, 1962.
[15] A. Albert, L. Gardner, Stochastic Approximation and Nonlinear Regression, MIT Press, Cambridge, MA, 1967.
[16] R. Duda, P. Hart, Pattern Classification and Scene Analysis, Wiley, New York, 1973.
[17] H. Robbins, S. Monro, A stochastic approximation method, Ann. Math. Statist. 22 (1951) 400-407.
[18] N. Glick, Additive estimators for probabilities of correct classification, Pattern Recognition 10 (1978) 211-222.
About the Author - TZE FEN LI received the B.A. degree in forestry from Chung Hsing University, Taichung, Taiwan, in 1962 and the Ph.D. degree in applied mathematics from North Carolina State University in 1972. Dr. Li then taught statistics at the University of North Carolina at Charlotte for two years and computer science at Rutgers University at Camden for six years. He is at present a professor of applied mathematics at Chung Hsing University. Dr. Li has published papers in statistics, computer science, reliability and signal analysis.
Pattern Recognition 33 (2000) 341-348

A new fast method for computing Legendre moments

H.Z. Shu*, L.M. Luo, W.X. Yu, Y. Fu
Laboratory of Image Science and Technology, Southeast University, 210096 Nanjing, People's Republic of China

Received 26 August 1998
Abstract

This paper presents a new algorithm for fast and accurate computation of Legendre moments. For a binary image, by use of Green's theorem, we transform the surface integral into a simple integration along the boundary. The inter-order relationship of Legendre moments is then investigated; as a result, the moments of higher order can be deduced from those of lower order. Based on this relationship, an iterative method is proposed to calculate the Legendre moments from a polygonal approximation of the boundary. Comparison with known methods shows that our algorithm is almost as efficient as the existing method, but more accurate. © 1999 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Legendre moments; Green's theorem; Fast algorithm; Iterative method; Polygon
1. Introduction

Moment functions have been widely used in image processing and analysis, for example in pattern recognition, object classification, image analysis and edge detection [1-8]. Examples of moment-based feature descriptors include Cartesian geometric moments, rotational moments, complex moments and orthogonal moments. As is well known, moments with an orthogonal basis set (e.g., Legendre and Zernike polynomials) can represent an image with a minimal amount of information redundancy. These orthogonal moments have been successfully used in the field of pattern representation and image analysis [3,9-11]. In recent years, many algorithms have been developed to speed up the calculation of moments by reducing the computational complexity [12-16]. Most of them work for geometric moments; orthogonal moments have been addressed in only a few papers [3,16]. In Ref. [16], Mukundan and Ramakrishnan first used Green's theorem and then proposed a recursive algorithm for computing the Legendre and Zernike polynomials. Their
* Corresponding author. Tel.: +86-25-3794249; fax: +86-25-3792698. E-mail address: [email protected] (H.Z. Shu)
method is efficient, but not accurate, since Mukundan and Ramakrishnan used a trapezoidal integration rule to approximate the integral function for Legendre moments. Liao and Pawlak [3] proposed a more accurate formula, called the alternative extended Simpson's rule, to numerically calculate the integral function for high orders of Legendre moments in each pixel. In this paper, by extending Jiang and Bunke's method [12], we present a new algorithm to calculate Legendre moments.
2. Fast computation of Legendre moments

The (n+m)th-order Legendre moment of an image intensity function f(x, y) is defined as

$$\lambda_{nm} = \frac{(2n+1)(2m+1)}{4} \int_{-1}^{1}\int_{-1}^{1} P_n(x) P_m(y) f(x, y)\, dx\, dy, \qquad (1)$$

where the nth-order Legendre polynomial is given by

$$P_n(x) = \frac{1}{2^n} \sum_{k=0}^{\lfloor n/2 \rfloor} (-1)^k \frac{(2n-2k)!}{k!\,(n-k)!\,(n-2k)!}\, x^{n-2k}, \quad x \in [-1, 1], \qquad (2)$$
or

$$P_n(x) = \sum_{k=0}^{n} C_{nk} x^k, \qquad (3)$$

where the Legendre coefficients C_{nk} are given by

$$C_{nk} = \begin{cases} \dfrac{1}{2^n}\, (-1)^{(n-k)/2}\, \dfrac{(n+k)!}{[(n-k)/2]!\,[(n+k)/2]!\,k!}, & n-k \text{ even}, \\[4pt] 0, & n-k \text{ odd}. \end{cases} \qquad (4)$$

As is well known, the functions P_n(x) form a complete orthogonal basis set inside the unit circle. For fundamental properties of the Legendre polynomials, we refer to Sansone [17]. We mention here two important properties which will be used in the remainder of this paper.

(1) The recursive relation

$$P_{n+1}(x) = \frac{2n+1}{n+1}\, x P_n(x) - \frac{n}{n+1}\, P_{n-1}(x), \quad n \ge 1, \qquad (5)$$

with P_0(x) = 1 and P_1(x) = x.

(2) The integral relation

$$\int P_n(x)\, dx = \frac{1}{2n+1} \left( P_{n+1}(x) - P_{n-1}(x) \right). \qquad (6)$$
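The recurrence (5) is also how the algorithm of Fig. 1 (Section 3) tabulates the polynomial values at the polygon vertices. A minimal sketch (the function name is ours):

```python
def legendre_values(x, kmax):
    """P_0(x), ..., P_kmax(x) via the recurrence (5):
    P_{n+1}(x) = ((2n+1) x P_n(x) - n P_{n-1}(x)) / (n+1)."""
    p = [1.0, x]
    for n in range(1, kmax):
        p.append(((2 * n + 1) * x * p[n] - n * p[n - 1]) / (n + 1))
    return p[:kmax + 1]

print(legendre_values(0.5, 4))   # P_2(0.5) = -0.125, P_4(0.5) = -0.2890625
```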
In many applications, images are in general transformed to binary images, i.e., the intensity function f(x, y) takes the values 0 or 1. In this paper, therefore, only binary images are considered. For a binary image, a digital region R is completely determined by its boundary. Any function of R, such as the Legendre moment λ_nm, can be expressed as a function of the boundary of R by Green's theorem. On the other hand, the boundary of R can generally be approximated by a set of straight-line segments and represented by a polygon. For this reason, our discussion is limited to the case of a polygonal boundary.

2.1. Computation of the Legendre moments of a polygon

Denote by C the boundary of a simply connected region R, and suppose that the curve C is piecewise linear. For a binary image, the surface integral defined by Eq. (1) can be converted into a boundary integral by the following Green's theorem:

$$\iint_R \frac{\partial M(x, y)}{\partial x}\, dx\, dy = \oint_C M(x, y)\, dy, \qquad (7)$$

where the contour integral is taken along C in the counterclockwise direction. Setting

$$\frac{\partial M(x, y)}{\partial x} = P_n(x) P_m(y) \qquad (8)$$

and using Eq. (6), we obtain

$$\lambda_{nm} = \frac{2m+1}{4} \oint_C \left[ P_{n+1}(x) - P_{n-1}(x) \right] P_m(y)\, dy. \qquad (9)$$

Defining

$$k_{nm} = \oint_C P_n(x) P_m(y)\, dy, \qquad (10)$$

we have

$$\lambda_{nm} = \frac{2m+1}{4} \left( k_{(n+1)m} - k_{(n-1)m} \right). \qquad (11)$$

To calculate the boundary integrals k_nm, we use the same notation as adopted by Jiang and Bunke [12]. Denote by (x_i, y_i), i = 1, 2, …, N, the coordinates of the N vertices of the polygon C, and assume (x_{N+1}, y_{N+1}) = (x_1, y_1); then the boundary of the polygon C consists of the N straight-line segments

$$c_i:\ y = a_i x + y_i - a_i x_i, \quad x \in [x_i, x_{i+1}], \quad i = 1, 2, \ldots, N, \qquad (12)$$

where a_i = (y_{i+1} − y_i)/(x_{i+1} − x_i) is the slope of c_i. Denoting by D_i(n, m) the contribution of c_i to the total line integral, we have

$$k_{nm} = \sum_{i=1}^{N} D_i(n, m), \qquad (13)$$

where D_i(n, m) can be expressed as follows by using Eqs. (10) and (12):

$$D_i(n, m) = \int_{c_i} P_n(x) P_m(y)\, dy = \int_{x_i}^{x_{i+1}} P_n(x) P_m(a_i x + y_i - a_i x_i)\, a_i\, dx, \qquad (14)$$

thus

$$\lambda_{nm} = \frac{2m+1}{4} \sum_{i=1}^{N} \left[ D_i(n+1, m) - D_i(n-1, m) \right]. \qquad (15)$$

Before giving an iterative method to compute the boundary integral D_i(n, m), we briefly mention the method proposed by Mukundan and Ramakrishnan [16] and describe a method, based on Jiang and Bunke's algorithm, that was originally proposed for geometric moments.
2.2. Mukundan and Ramakrishnan's method

In Ref. [16], the following relation was used instead of Eq. (6):

$$\int P_n(x)\, dx = \frac{1}{n+1} \left[ x P_n(x) - P_{n-1}(x) \right]. \qquad (16)$$

Insertion of Eq. (16) into Eqs. (7) and (8) yields

$$\lambda_{nm} = \frac{(2n+1)(2m+1)}{4(n+1)} \oint_C \left[ x P_n(x) - P_{n-1}(x) \right] P_m(y)\, dy. \qquad (17)$$

To obtain an approximate value of λ_nm, a trapezoidal integration rule has been proposed by Mukundan and Ramakrishnan. One then obtains from Eq. (17)

$$\lambda_{nm} \approx \frac{(2n+1)(2m+1)}{4(n+1)(M-1)} \sum_{j=1}^{N-1} (T_j + T_{j+1}), \qquad (18)$$

where M×M is the image size, (x_{1j}, x_{2j}, y_j), j = 1, 2, …, N, denote the contour points of C, and

$$T_j = P_m(y_j) \left( x_{2j} P_n(x_{2j}) - P_{n-1}(x_{2j}) - x_{1j} P_n(x_{1j}) + P_{n-1}(x_{1j}) \right). \qquad (19)$$

2.3. A computation method based on Jiang and Bunke's algorithm

As mentioned correctly by Jiang and Bunke [12], their fast algorithm for computing the geometric moments can be applied to calculate other types of moments, such as Legendre, Zernike, rotational and complex moments, since these latter moments can be expressed as combinations of the geometric moments. For this purpose, by inserting Eq. (3) into Eq. (14), we obtain

$$D_i(n, m) = a_i \sum_{j=0}^{n} \sum_{k=0}^{m} C_{nj} C_{mk} \int_{x_i}^{x_{i+1}} x^j (a_i x + y_i - a_i x_i)^k\, dx. \qquad (20)$$

Defining

$$A_i(j, k) = \int_{x_i}^{x_{i+1}} x^j (a_i x + y_i - a_i x_i)^k\, dx, \qquad (21)$$

Jiang and Bunke deduced the following recursive relation:

$$A_i(j, k) = a_i A_i(j+1, k-1) + (y_i - a_i x_i) A_i(j, k-1). \qquad (22)$$

Using this relation, the boundary integrals A_i(j, k) can be obtained by an iterative method, and the values of D_i(n, m) and k_nm can then be deduced from those of A_i(j, k).

2.4. An iterative computation method

Now we propose an iterative algorithm to calculate D_i(n, m). From the recursive relation (5), we have

$$D_i(n, m) = a_i \int_{x_i}^{x_{i+1}} P_n(x) P_m(a_i x + y_i - a_i x_i)\, dx = \frac{a_i}{m} \int_{x_i}^{x_{i+1}} P_n(x) \left[ (2m-1)(a_i x + y_i - a_i x_i) P_{m-1}(a_i x + y_i - a_i x_i) - (m-1) P_{m-2}(a_i x + y_i - a_i x_i) \right] dx$$
$$= \frac{a_i}{m} \left\{ (2m-1)\, a_i \int_{x_i}^{x_{i+1}} x P_n(x) P_{m-1}(a_i x + y_i - a_i x_i)\, dx + (2m-1)(y_i - a_i x_i) \int_{x_i}^{x_{i+1}} P_n(x) P_{m-1}(a_i x + y_i - a_i x_i)\, dx - (m-1) \int_{x_i}^{x_{i+1}} P_n(x) P_{m-2}(a_i x + y_i - a_i x_i)\, dx \right\}. \qquad (23)$$

By using the following relation in Eq. (23):

$$x P_n(x) = \frac{1}{2n+1} \left[ (n+1) P_{n+1}(x) + n P_{n-1}(x) \right],$$

we have

$$D_i(n, m) = \frac{1}{m} \left\{ \frac{2m-1}{2n+1}\, a_i \left[ (n+1) D_i(n+1, m-1) + n D_i(n-1, m-1) \right] + (2m-1)(y_i - a_i x_i) D_i(n, m-1) - (m-1) D_i(n, m-2) \right\}. \qquad (24)$$

The above relation shows that D_i(n, m) can be calculated by recursion. That is, to obtain the values of all D_i(n, m) for 0 ≤ n + m ≤ K, where K denotes the maximum order of Legendre moment, we need only determine those of D_i(n, 0) for 0 ≤ n ≤ K. For computing these latter coefficients, the following formula is used with the help of Eq. (6):

$$D_i(n, 0) = a_i \int_{x_i}^{x_{i+1}} P_n(x)\, dx = \frac{a_i}{2n+1} \left[ P_{n+1}(x_{i+1}) - P_{n+1}(x_i) - P_{n-1}(x_{i+1}) + P_{n-1}(x_i) \right]. \qquad (25)$$
We remark that a segment c_i may be vertical, i.e., c_i may have the form

$$c_i:\ x = x_i \quad \text{for } y \in [y_i, y_{i+1}]. \qquad (26)$$

In this case, D_i(n, m) is calculated in another way. Concretely, we have

$$D_i(n, m) = \int_{c_i} P_n(x) P_m(y)\, dy = \int_{y_i}^{y_{i+1}} P_n(x_i) P_m(y)\, dy = \frac{P_n(x_i)}{2m+1} \left[ P_{m+1}(y_{i+1}) - P_{m-1}(y_{i+1}) - P_{m+1}(y_i) + P_{m-1}(y_i) \right]. \qquad (27)$$

From the above discussion, we can see that, from Eqs. (15) and (24), all Legendre moments λ_nm with 0 ≤ n + m ≤ K can be deduced from the values of D_i(n, 0) for 0 ≤ n ≤ K + 1. So, in our algorithm, we need to calculate the Legendre polynomials of order up to K + 2.
3. Algorithm and computational complexity

As is well known, an iterative method is a good way to implement recursive relations, so we use such a method to calculate the Legendre polynomials and the coefficients D_i(n, m). The algorithm for computing the Legendre polynomial values at the vertices of the polygon C is given in Fig. 1; we note that this strategy has also been utilised by Mukundan and Ramakrishnan [16]. The iterative method to calculate D_i(n, m) and the Legendre moments λ_nm is given in Fig. 2. Note that the convention D_i(−1, n) = D_i(n, −1) = 0 for n ≥ 0 is used in the algorithm. We also give Mukundan and Ramakrishnan's algorithm, in Fig. 3.

To illustrate the effectiveness of our method, we apply it to reconstruct a binary image. Fig. 4 shows the boundary of the original image. After computing the Legendre moments λ_nm by using Eqs. (15) and (24),
Fig. 1. Algorithm for the computation of the Legendre polynomial values.

Fig. 2. Algorithm for computing D_i(n, m) and the Legendre moments.
Fig. 3. Mukundan and Ramakrishnan's algorithm for computing the Legendre moments (see Ref. [16], Fig. 1).

Fig. 5. Reconstructed image of Fig. 4 using our iterative method with the order of moments up to 20 and 40.

Fig. 6. Reconstructed image of Fig. 4 using Mukundan and Ramakrishnan's method with the order of moments up to 20 and 40.
Fig. 4. The boundary of the original image.

the image is reconstructed with the help of the following equation:

$$f(x_i, y_j) \approx \sum_{n=0}^{K} \sum_{m=0}^{n} \lambda_{n-m,m}\, P_{n-m}(x_i) P_m(y_j), \qquad (28)$$

where K denotes the maximum order of Legendre moments we want to calculate. Fig. 5 shows the results of reconstruction using our iterative method with the order of moments up to 20 and 40, respectively. The results using Mukundan and Ramakrishnan's algorithm with the same orders are given in Fig. 6. Note that the boundary of the original image is superposed in both Figs. 5 and 6.

Now we calculate the computational complexity (CC) of each algorithm described in Section 2.

3.1. CC for Mukundan and Ramakrishnan's algorithm

The calculation of the Legendre polynomial values of order up to K at the N corners of the boundary needs (K−1)N additions and 5(K−1)N multiplications. The calculation of the Legendre moments needs

(1/2)(K+1)(K+2)(5N+3) additions,
(1/2)(K+1)(K+2)(3N+6) multiplications.

The total number of additions and multiplications needed for computing all Legendre moments λ_nm with 0 ≤ n + m ≤ K is then

N_a1 = (3/2)(K+1)(K+2) + (1/2)(5K² + 21K + 4)N additions

and

N_m1 = (3/2)(K+1)(K+2) + (1/2)(3K² + 19K − 4)N multiplications.

3.2. CC for the method based on Jiang and Bunke's algorithm

Supposing that all the A_i(j, k) for 0 ≤ j ≤ n, 0 ≤ k ≤ m defined by Eq. (21) and the coefficients C_nj and C_mk are given, the calculation of D_i(n, m) (see Eq. (20)) needs approximately (n/2)(m/2) additions and 2(n/2)(m/2) multiplications.
Then, for 0 ≤ n + m ≤ K and 1 ≤ i ≤ N, the computation of all the Legendre moments needs approximately (1/96)(K+1)²(K+2)²N additions and (1/48)(K+1)²(K+2)²N multiplications.

3.3. CC for the iterative method

For this method, we need to calculate the Legendre polynomials of order up to K+2, so this step needs 3(K+1)N additions and 5(K+1)N multiplications. The calculation of all the D_i(n, m) needs

3N + 4(K+1)N + (7/2)(K+1)(K+2)N additions,
2N + 3(K+1)N + 3(K+1)(K+2)N multiplications.

The calculation of the Legendre moments needs

(K+1)(K+2)(N+1) additions,
(3/2)(K+1)(K+2) multiplications.

The total number of additions and multiplications for calculating all the Legendre moments of order up to K using the proposed iterative method is

N_a2 = (K+1)(K+2) + (1/2)(9K² + 41K + 38)N additions

and

N_m2 = (3/2)(K+1)(K+2) + (3K² + 17K + 16)N multiplications.
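These totals are easy to compare numerically; the sketch below (function names ours) tabulates the three counts and makes visible that the Jiang-Bunke-based counts grow like K^4 N while the other two grow like K^2 N.

```python
def ops_mr(K, N):
    """Additions and multiplications for Mukundan-Ramakrishnan (Section 3.1)."""
    add = 1.5 * (K + 1) * (K + 2) + 0.5 * (5 * K * K + 21 * K + 4) * N
    mul = 1.5 * (K + 1) * (K + 2) + 0.5 * (3 * K * K + 19 * K - 4) * N
    return add, mul

def ops_jb(K, N):
    """Leading-order counts for the Jiang-Bunke-based method (Section 3.2)."""
    return (K + 1) ** 2 * (K + 2) ** 2 * N / 96, (K + 1) ** 2 * (K + 2) ** 2 * N / 48

def ops_iter(K, N):
    """Additions and multiplications for the proposed method (Section 3.3)."""
    add = (K + 1) * (K + 2) + 0.5 * (9 * K * K + 41 * K + 38) * N
    mul = 1.5 * (K + 1) * (K + 2) + (3 * K * K + 17 * K + 16) * N
    return add, mul

for K in (10, 20, 40):
    print(K, ops_mr(K, 100), ops_iter(K, 100), ops_jb(K, 100))
```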
4. Discussion and conclusions

Moment functions are an important tool in pattern recognition, image analysis and edge detection. However, the direct computation of moments is too expensive. To overcome this problem, a number of algorithms have been proposed to reduce the computational complexity [12-16], most of which work for geometric moments. In recent years, other types of moments, such as Legendre, Zernike, rotational and complex moments, have been proposed and successfully applied in many fields of scientific research. It has been demonstrated [5] that orthogonal moments are better than other types of moments with respect to information redundancy. However, few algorithms have been developed [16] for fast computation of the orthogonal moments.

We have proposed in this paper a fast algorithm for the calculation of the Legendre moments. For a binary image, by use of Green's theorem, we have first converted the double integral into a simple integral around the
boundary of a polygon. Then we have investigated the inter-order relationship of Legendre moments, and a recursive relation has been established. Based on this relationship, an iterative algorithm was implemented. Also, we have proposed a method based on Jiang and Bunke's algorithm that has been used to treat geometric moments. The computation complexity of di!erent methods has been analysed. From the result obtained in the previous section, we can observe that our iterative method needs more computation than that proposed by Mukundan and Ramakrishnan in which a trapezoidal integration rule has been used to approximate the boundary integral. So our method is more accurate due to the fact that no numerical integration rule has been used. Obviously, when a more precise formula, e.g. the alternative extended Simpson's rule, is used, a better result can be obtained in comparison with Mukundan and Ramakrishnan's result, but more computations are needed. On the other hand, we can see that the proposed iterative method needs less computation than the method based on Jiang and Bunke's algorithm. The iterative method is particularly e$cient when the high order of moments needs to be calculated. So the method is extremely useful in image analysis. 5. Summary In this paper, we present a new algorithm for fast and accurate computation of Legendre moments. The method can be described as follows. The (n#m)th-order Legendre moment of an image intensity function f (x, y) is de"ned as
$$\lambda_{nm} = \frac{(2n+1)(2m+1)}{4} \int_{-1}^{1}\int_{-1}^{1} P_n(x)\,P_m(y)\,f(x, y)\,dx\,dy, \tag{1a}$$

where $P_n(x)$ is the Legendre polynomial of order $n$. For the purpose of application, only binary images are considered in the present paper. For binary images, a digital region $R$ is completely determined by its boundary $C$, and the boundary $C$ can generally be approximated by a polygon, so we limit ourselves to the case of a polygonal boundary. By using a particular form of Green's theorem and the integral relation
$$\int P_n(x)\,dx = \frac{1}{2n+1}\bigl(P_{n+1}(x) - P_{n-1}(x)\bigr), \tag{2a}$$
we can transform the double integral defined by Eq. (1a) into a single integral along the boundary:

$$\lambda_{nm} = \frac{2m+1}{4} \oint_C \bigl[P_{n+1}(x) - P_{n-1}(x)\bigr]\,P_m(y)\,dy = \frac{2m+1}{4} \sum_{i=1}^{N} \bigl[D_i(n+1, m) - D_i(n-1, m)\bigr]. \tag{3a}$$
Here $N$ is the number of vertices of the polygon $C$, and $D_i(n, m)$ denotes the contribution of edge $c_i$ to the total line integral, defined as
$$D_i(n, m) = \int_{c_i} P_n(x)\,P_m(y)\,dy = \int_{x_i}^{x_{i+1}} P_n(x)\,P_m(a_i x + y_i - a_i x_i)\,a_i\,dx, \tag{4a}$$

where $a_i$ is the slope of $c_i$. Using the relation
$$P_{n+1}(x) = \frac{2n+1}{n+1}\,x\,P_n(x) - \frac{n}{n+1}\,P_{n-1}(x), \quad n \ge 1, \tag{5a}$$
we can deduce from Eq. (4a) that
$$D_i(n, m) = \frac{1}{m}\left\{\frac{2m-1}{2n+1}\,a_i\bigl[(n+1)\,D_i(n+1, m-1) + n\,D_i(n-1, m-1)\bigr] + (2m-1)(y_i - a_i x_i)\,D_i(n, m-1) - (m-1)\,D_i(n, m-2)\right\} \tag{6a}$$
and
$$D_i(n, 0) = a_i \int_{x_i}^{x_{i+1}} P_n(x)\,dx = \frac{a_i}{2n+1}\bigl[P_{n+1}(x_{i+1}) - P_{n+1}(x_i) - P_{n-1}(x_{i+1}) + P_{n-1}(x_i)\bigr]. \tag{7a}$$

The above equations show that $D_i(n, m)$ can be calculated by recursion, so all Legendre moments $\lambda_{nm}$ can be deduced from the values of $D_i(n, 0)$. An iterative method has been implemented for computing the Legendre moments. We have also proposed a method to calculate the Legendre moments based on Jiang and Bunke's algorithm [Pattern Recognition 24 (1991) 801-806], which was originally developed for geometric moments. The computational complexity of the different methods has been analysed. We can observe that our iterative method is almost as efficient as that proposed by Mukundan and Ramakrishnan [Pattern Recognition 28 (9) (1995) 1433-1442], in which a trapezoidal integration rule is used to approximate the boundary integral, but our method is more accurate, since no numerical integration rule is used. On the other hand, the proposed iterative method needs less computation than the method based on Jiang and Bunke's algorithm.
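To make the scheme concrete, the following Python sketch is our own illustration, not the authors' implementation. It seeds $D_i(n, 0)$ with Eq. (7a), fills in higher $m$ with Eq. (6a), and assembles the moments with Eq. (3a). It assumes the polygon lies in $[-1, 1]^2$, is traversed counter-clockwise, and has no vertical edges (Eq. (4a) parametrizes each edge by $x$); with $P_{-1}$ taken constant, $\sum_i D_i(-1, m) = 0$ around the closed contour, so the $n = 0$ back-term in Eq. (3a) can be dropped.

```python
import numpy as np

def legendre_upto(kmax, x):
    """P_0(x), ..., P_kmax(x) via the three-term recurrence (5a)."""
    P = [1.0, x]
    for n in range(1, kmax + 1):
        P.append(((2 * n + 1) * x * P[n] - n * P[n - 1]) / (n + 1))
    return P[:kmax + 1]

def legendre_moments(vertices, K):
    """lambda_nm, 0 <= n + m <= K, for the indicator function of a
    polygon inside [-1, 1]^2 with counter-clockwise vertices and no
    vertical edges (each edge c_i is parametrized by x, Eq. (4a))."""
    N = len(vertices)
    total_D = np.zeros((K + 2, K + 2))   # accumulates sum_i D_i(n, m)
    for i in range(N):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % N]
        a = (y1 - y0) / (x1 - x0)        # slope a_i of edge c_i
        b = y0 - a * x0                  # constant y_i - a_i * x_i
        Pl = legendre_upto(K + 2, x0)    # P_0..P_{K+2} at the edge ends
        Pr = legendre_upto(K + 2, x1)
        D = np.zeros((K + 2, K + 2))
        D[0, 0] = a * (x1 - x0)          # Eq. (7a) with n = 0 (P_0 = 1)
        for n in range(1, K + 2):        # Eq. (7a): seed values D_i(n, 0)
            D[n, 0] = a / (2 * n + 1) * (Pr[n + 1] - Pl[n + 1]
                                         - Pr[n - 1] + Pl[n - 1])
        for m in range(1, K + 2):        # Eq. (6a): recursion over m
            for n in range(K + 2 - m):
                t = (n + 1) * D[n + 1, m - 1]
                if n >= 1:
                    t += n * D[n - 1, m - 1]
                prev2 = D[n, m - 2] if m >= 2 else 0.0
                D[n, m] = ((2 * m - 1) * (a * t / (2 * n + 1)
                                          + b * D[n, m - 1])
                           - (m - 1) * prev2) / m
        total_D += D
    lam = np.zeros((K + 1, K + 1))
    for m in range(K + 1):               # Eq. (3a): assemble the moments
        for n in range(K + 1 - m):
            back = total_D[n - 1, m] if n >= 1 else 0.0
            lam[n, m] = (2 * m + 1) / 4 * (total_D[n + 1, m] - back)
    return lam
```

As a quick check, for the triangle with vertices (0, 0), (1, 0), (0.5, 1) the sketch returns $\lambda_{00} = \text{area}/4 = 0.125$ and $\lambda_{01} = 0.125$, matching direct integration of Eq. (1a).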
The iterative method is particularly efficient when high-order Legendre moments need to be calculated, so the method presented in this paper is extremely useful in image analysis.
References
[1] M.K. Hu, Visual pattern recognition by moment invariants, IRE Trans. Inf. Theory 8 (1) (1962) 179-187.
[2] R.J. Prokop, A.P. Reeves, A survey of moment based techniques for unoccluded object representation, Graph. Models Image Process. (CVGIP) 54 (5) (1992) 438-460.
[3] S.X. Liao, M. Pawlak, On image analysis by moments, IEEE Trans. Pattern Anal. Mach. Intell. 18 (3) (1996) 254-266.
[4] A.P. Reeves, M.L. Akey, O.R. Mitchell, A moment based two-dimensional edge operator, Proc. CVPR (1983) 312-317.
[5] C.H. Teh, R.T. Chin, On image analysis by the method of moments, IEEE Trans. Pattern Anal. Mach. Intell. 10 (1988) 485-513.
[6] S. Ghosal, R. Mehrotra, Orthogonal moment operators for subpixel edge detection, Pattern Recognition 26 (2) (1993) 295-306.
[7] L.M. Luo, C. Hamitouche, J.L. Dillenseger, J.L. Coatrieux, A moment-based three-dimensional edge operator, IEEE Trans. Biomed. Engng. 40 (7) (1993) 693-703.
[8] L.M. Luo, X.H. Xie, X.D. Bao, A modified moment-based edge operator for rectangular pixel image, IEEE Trans. Circ. Sys. Video Tech. 4 (6) (1994) 552-554.
[9] M.R. Teague, Image analysis via the general theory of moments, J. Opt. Soc. Am. 70 (8) (1980) 920-930.
[10] A. Khotanzad, Y.H. Hong, Rotation invariant image recognition using features selected via a systematic method, Pattern Recognition 23 (10) (1990) 1089-1101.
[11] A. Khotanzad, Invariant image recognition by Zernike moments, IEEE Trans. Pattern Anal. Mach. Intell. 12 (5) (1990) 489-497.
[12] X.Y. Jiang, H. Bunke, Simple and fast computation of moments, Pattern Recognition 24 (8) (1991) 801-806.
[13] B.C. Li, J. Shen, Fast computation of moment invariants, Pattern Recognition 24 (8) (1991) 807-813.
[14] W. Philips, A new fast algorithm for moment computation, Pattern Recognition 26 (11) (1993) 1619-1621.
[15] L. Yang, F. Albregtsen, Fast and exact computation of Cartesian geometric moments using discrete Green's theorem, Pattern Recognition 29 (7) (1996) 1061-1073.
[16] R. Mukundan, K.R. Ramakrishnan, Fast computation of Legendre and Zernike moments, Pattern Recognition 28 (9) (1995) 1433-1442.
[17] G. Sansone, Orthogonal Functions, Dover Publications, New York, 1991.
About the Author: HUAZHONG SHU received the B.S. degree in Applied Mathematics from Wuhan University, China, in 1987, and a Ph.D. degree in Numerical Analysis from the University of Rennes, France, in 1992. He was a postdoctoral fellow with the Department of Biology and Medical Engineering, Southeast University, from 1995 to 1997. His recent work concentrates on treatment planning optimization, medical imaging, and pattern recognition.
About the Author: LIMIN LUO obtained his Ph.D. degree in 1986 from the University of Rennes, France. He is now a professor and the chairman of the Department of Biology and Medical Engineering, Southeast University, Nanjing, China. He is the author and coauthor of over 60 papers. His current research interests include medical imaging, image analysis, computer-assisted systems for diagnosis and therapy in medicine, and computer vision. Dr. Luo is a senior member of the IEEE. He is an associate editor of IEEE Eng. Med. Biol. Magazine and Innovation et Technologie en Biologie et Médecine (ITBM).

About the Author: WENXUE YU received his B.S. degree in Machine Engineering from Shandong Engineering College in 1992 and the M.E. degree in Mechanics from Southeast University in 1997. He is now a Ph.D. student at the Department of Biology and Medical Engineering of Southeast University. His current research interests include image processing and analysis, radiosurgery treatment planning and pattern recognition.

About the Author: YAO FU received his B.S. and M.S. degrees in Image Processing from Southeast University, China, in 1994 and 1997, respectively. His research interests are in pattern recognition and image analysis.
Pattern Recognition 33 (2000) 349
Erratum
"Intensity analysis of Boolean models" by W. Weil, Pattern Recognition 32 (9) (1999) 1675-1684.

The set of equations in the second column of p. 1678 was printed incorrectly and should appear as follows:

$$\bar{V}(Z) = 1 - e^{-\bar{V}(Y)}, \qquad \bar{S}(Z) = e^{-\bar{V}(Y)}\,\bar{S}(Y),$$

$$\bar{M}(Z) = e^{-\bar{V}(Y)}\left(\bar{M}(Y) - \frac{\pi^2}{32}\,\bar{S}(Y)^2\right),$$

$$\bar{\chi}(Z) = e^{-\bar{V}(Y)}\left(\gamma - \frac{1}{4\pi}\,\bar{M}(Y)\,\bar{S}(Y) + \frac{\pi}{384}\,\bar{S}(Y)^3\right).$$
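For numerical use, the corrected densities translate directly into code. The sketch below is our own illustration, not part of the erratum: gamma is the intensity of the grain process Y, and V, S, M stand for the barred mean-value densities of Y.

```python
from math import exp, pi

def boolean_model_densities(gamma, V, S, M):
    """Corrected densities of a 3D Boolean model Z, per the erratum:
    V, S, M are the densities Vbar(Y), Sbar(Y), Mbar(Y)."""
    q = exp(-V)                                   # e^{-Vbar(Y)}
    V_Z = 1.0 - q                                 # volume density
    S_Z = q * S                                   # surface density
    M_Z = q * (M - pi**2 / 32.0 * S**2)           # mean-curvature density
    chi_Z = q * (gamma - M * S / (4.0 * pi)
                 + pi / 384.0 * S**3)             # Euler characteristic density
    return V_Z, S_Z, M_Z, chi_Z
```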
PII of original article: S0031-3203(97)00030-8
Pattern Recognition 33 (2000) 350
Erratum
"Shape-based retrieval: a case study with trademark image databases" by A.K. Jain, A. Vailaya, Pattern Recognition 31 (9) (1998) 1369-1390.

The equations for $M_5$ and $M_7$ on p. 1376 were printed incorrectly and should appear as follows:

$$M_5 = (\mu_{30} - 3\mu_{12})(\mu_{30} + \mu_{12})\bigl[(\mu_{30} + \mu_{12})^2 - 3(\mu_{21} + \mu_{03})^2\bigr] + (3\mu_{21} - \mu_{03})(\mu_{21} + \mu_{03})\bigl[3(\mu_{30} + \mu_{12})^2 - (\mu_{21} + \mu_{03})^2\bigr],$$

$$M_7 = (3\mu_{21} - \mu_{03})(\mu_{30} + \mu_{12})\bigl[(\mu_{30} + \mu_{12})^2 - 3(\mu_{21} + \mu_{03})^2\bigr] - (\mu_{30} - 3\mu_{12})(\mu_{21} + \mu_{03})\bigl[3(\mu_{30} + \mu_{12})^2 - (\mu_{21} + \mu_{03})^2\bigr].$$
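The corrected expressions can be checked in code. A minimal sketch of our own, where the argument mu is assumed to map index pairs (p, q) to the central moments used above:

```python
def hu_m5_m7(mu):
    """Corrected moment invariants M5 and M7 from central moments mu[(p, q)]."""
    s1 = mu[3, 0] + mu[1, 2]        # mu_30 + mu_12
    s2 = mu[2, 1] + mu[0, 3]        # mu_21 + mu_03
    d1 = mu[3, 0] - 3 * mu[1, 2]    # mu_30 - 3*mu_12
    d2 = 3 * mu[2, 1] - mu[0, 3]    # 3*mu_21 - mu_03
    m5 = d1 * s1 * (s1**2 - 3 * s2**2) + d2 * s2 * (3 * s1**2 - s2**2)
    m7 = d2 * s1 * (s1**2 - 3 * s2**2) - d1 * s2 * (3 * s1**2 - s2**2)
    return m5, m7
```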
PII of original article: S0031-3203(97)00131-3