This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!
2R' is defined by a
a
l,i
X{ ^i-j-1
l,»+l
e v»;
Vi Vi+i
a
/i,i a/i,t+i
r< G R', 1 < i < k - 1, for any X = [ay] h x f c G V*; define ?r : V** -» V** such that Ol,»
a
l,t+l
G VV
_2/i 2/i+i J Hence for each U $ r D = j j i 2 • • • i t yiV2 there exists a parallel shuffled column context
•••
Vk
G *?/(*),
Vi 2/2 ••• 2/fc J Let dpu*dq G T r , where |f/|r = |a;,|r = p and \D\r = |j/j| r = . It is easy to see that L(G') — L(G). Conversely, if L is generated by the PACST grammar G = (V, B, Ca, Ra, ipc, tpr,Tc,TT) such that Tc = rml*rn, where m,n > Tr — dpu*dq, where p, q > 0, then we define a PEAC grammar G' = (V,B,C',R',
(x) cj>(y), 2
= ip(x) (j)(y),
3
ip (x, y) = (j>{x) ip(y),
rp (x, y) = ip(x) ip{y)
where the Fourier transform of (p(x, y) has the low-pass characteristic, and Fourier transforms of ipl{x,y), ip2(x,y), and ip3{x,y) have the horizontal, vertical, and diagonal high-pass characteristics respectively. 3.1. Scale Dynamic
System
of 2D Images
and Kalman
Filtering
Let X(m + 1) be a N(m + 1) x N(m + 1) matrix which represents an image at the scale level m +1. Its coarser version X(m) and detail images Di(m), Z?2(m), Ds(m) (which represent the difference in information between X(m + 1) and X{m) in the horizontal, vertical and diagonal directions) at the next scale level m are given by X(m) = HmX(m D2(m)=GmX(m
+ l)H*m,
D1(m) = HmX(m
+ l)G*m
+ l)H*m,
D3(m) = GmX(m
+ l)G*m
where Hm and Gm, as defined before, are respectively the coarsening operator and the differencing operator at the scale level m, the superscript * denotes an adjoint operator, and all the matrices X(m), £>i(m), £>2(m) and Dz(m) are of the size N{m) x N(m). The above equations are the 2D wavelet analysis equations which are obtained by the tensor product of two ID wavelet analyses, i.e., the row-processing followed by the column-processing 13,16 . In this case, X{m), Di(m), D2{m) and D3(m) are respectively the scaling coefficient matrix, the horizontalhigh-frequency, vertical-high-frequency and diagonal-high-frequency wavelet coefficient matrices at the scale level m. For an orthonormal wavelet under consideration, the image X(m +1) at the fine scale level m + 1 can be reconstructed from its coarse version X(m) at the scale level m, by adding the detail information Di(m), D2(m) and D3(m) as follows: X(m + 1) = H*mX(m)Hm
+ H^lD1(m)Gm
+ G*mD2{m)Hm + G*mD3(m)Gm
(33)
which is the 2D wavelet synthesis equation. We may rewrite it as X(m + 1) = H*mX(m)Hm
+ LmD(m)Rm
(34)
where Lm = [HZn;G*m\G*Tn],
RTn = [G*m;H'}n\G*m\*,
(35)
334 £»i(m) 0 0 0 D2(m) 0 0 0 D3{m)_
D(m) =
(36)
This can be interpreted as a state equation in which X{m) is composed of state variables, D(m) is the driving signal, and the state transition takes place from the coarse scale level m to the fine scale level m + 1 . For the sake of convenience in analysis, let us again rewrite the 2D wavelet synthesis equation Eq. (34) represented in matrix form into its equivalent vector form. Specifically, we may find an isomorphism (transformation) T of a finite matrix space onto a finite vector space and define the following equivalent variables: x(m)=T w2(m)=T
X(m),
wi(m) = T
Dx{m),
D2(m),
w3(m)=T
D3(m)
(37)
where x(m),wi(m), w2(m) a n d u ^ m ) areiV 2 (m)xl vectors because X(m), £>i(m), D2(m) and D3(m) are N(m) x N(m) matrices. The isomorphism may be given by T
:
X(m;i,j)-+x{m;(i-l)-N(m)+j) th
where X(m;i,j) is the (i,j) entry of the matrix X(m), element of the vector x{m). In that case, x(m + 1) = H^*x(m)
+ G£
(38)
and x(m;k)
dx{m) + G^*d 2 (m) + G£
is the kth
d3{m)
(39)
where Hm and G^ , (k = 1,2,3), are the augmented operators such that Hm x(m), G 1 , di(m), G^ d2{m) and G^ d3{rn) are respectively equivalent t o H*mX{m)Hm, H^lD1(m)Gm, G*mD2(m)Hm and G*mD3(m)Gm in the following sense: Hmx(m)=T
{H*mX{m)Hm),
G)ndl{m)=T
{H*mDx{m)Gm)
G3m d3(m) = T
G*m d2{m) = T (G*mD2(m)Hm),
(40)
(G*mD3(m)Gm).
Due to the orthonormality properties of Hm and Gm (refer to Eq. (14)), the orthonormality of Hm and (?*, are preserved, as shown by the following equations 16 : HmHm
= GkmGkm = J :
for k = 1,2,3
JCGi* = G^lC* = 0 for G^G^*
= 0
k = 1,2,3
for k # I
Hm Hm + G]n Gl + Gl Gl + Gl G*m = I.
(41) (42) (43) (44)
Let us define CZ* = [GC;Gl;GC]
,
d(m) = [d 1 (m) T ;d 2 (m) T ;d 3 (m) T ] :
(45)
The synthesis equation, Eq. (39), then becomes x(m + 1) = Hm x(m) + Gm
d(m).
(46)
335
From the orthonormal property, we have the following identities similar to Eq. (14): HmHm
= GmGm
= J, HmGm
= GmHm
= 0, i / m .ffro + Gm Gm — I. (47)
Suppose that an image is measured at several scales, from the coarsest scale level L to the finest scale level M, and is represented in its equivalent vector form: y(m) = C{m) x(m) + v(m)
(48)
where x(m) = T X(m) and y(m) = T Y{m) represent respectively the original image and its noisy measurement (represented in the vector forms), C{m) is the measurement matrix, and v(m) is the additive noise. From Eqs. (46) and (48), the wavelet-based scale dynamic system of 2D images (in vector form) is formulated, in which x(m) is the state vector, d(m) is the driving signal, and the aug—* mented operator Hm is the state transition matrix. Similar t o Eqs. (16)-(22), the scale recursive Kalman filtering can be applied to this scale dynamic system (The corresponding Eqs. (17) and (20) will be referred to later in Propositions 3 and 4). The least-square estimate X(M\M) of the image X{M) at the finest scale level M is given by the inverse transformation of x(M\M), = T-1
X{M\M)
x(M\M).
(49)
It is noted that the above scale-recursive Kalman filtering is a single filtering process. The inverse of huge N(m)2 x N(m)2 correlation matrices P _ 1 ( m | m — 1), etc., which is required in the Kalman gain equation and the associated Riccati equation at each scale level m, is very difficult to compute. The following section presents an efficient algorithm based on the wavelet packet transform 16,17 . 3.2. Wavelet
Packets
in Kalman
Filtering
Algorithm
for
Images
The 2D wavelet packet transform can be obtained by using the tensor product of two ID wavelet packet transforms. Let us rename the 2D scaling function and three wavelets given in Eq. (31) as Ho(x,y) = (p{x) <j>(y), Mx,y)
=
Hi(x,y) = ip(x)
(50)
and define the wavelet packet basis functions (for n € Z+) as -q,y-i) /x4n+i = 2 ^2^2g(q) q
h(l) /j,n{x -q,y-i)
I
tHn+2 = 2 X]5Z h^ 9W Vn(x -q,y-i) 1
I
»4n+3 = 2 ^2 ] T s ( g ) g(l) fin(x -q,y-i)q
i
(51)
336
Subspaces U"'s spanned by each basis function (in(x, y) can be similarly denned as in Eq. (26) U ? = f {2JHn{Vx - q,2iy -l);k<E Z};
j E Z;n € Z+.
(52)
However, the 2D wavelet packets have a quad-tree structure
u? = u£ x © u^j 1 e u^t 2 © u^+3.
(53)
The approximation of a given image at the scale level m is represented by its scaling coefficient matrix X(m) of the translated basis functions in subspace U ^ . It can be decomposed into four packets (coefficient matrices) in four subspaces TJjrj-i (k = 0,1,2,3) at the next scale level m — 1, and each of which can again be decomposed into four packets in the next coarser scale level. Let Uk (m) denote the fcth packet coefficient matrix of X{m) decomposed down to the scale level j , Uk (m) is in the subspace U*. Rewrite Eq. (33) as follows to synthesize the approximate image X(m+1) at the fine scale level ra+1 in terms of the scaling image X{m) € U ^ and detail images Di(m) in \5%m, (i = 1,2,3), at the scale level m, X(m + 1) = P 0 *(m)X(m)P 0 (m) + P0* (m)£>i(m)Pi(m) + P?(m)D2(m)P0{m)
+ P]*(m)Z>3(m)Pi(m)
(54)
where Po(m) = Hm and Pi(m) = Gm. With an isomorphism T, defined in Eq. (38), which bijectively transforms a finite matrix space onto a finite vector space, we can rewrite the above equation into its equivalent vector form x(m + 1) = T X(m + 1) = T (PZ(m)X(m)P0(m))+T T
(P^^Dr^P^m))
(P{(m)D2(m)P0(m))+T
= Po*x{m) + P7dx{m) + p7d2(m)
+
(P^m^M^iM) + F?'d3(m)
(55)
where Pv{m)=H^
; PvW^Gl
; Pi{m) = 0%, ; H ( m ) = G\
that have been given in Eq. (40). To compute the fcth wavelet packet coefficient matrix Uk (m) (or its equivalent vector u£(m)) at the scale level j , from a scaling image X(m) (or its equivalent vector x(m)) at the scale level m, let us introduce an operator (vk(m))* on x(m) where w^(m) is defined as 771 — 1
= ^ ( r a - i ) ( n » - l ) o P ; f c ( r a _ 2 ) ( m - 2 ) o . . . p ; i k ( . ) ( j ) ; ek(i) G {0,1,2,3}
337
which denotes a sequence of operations from P* tj)(j) to P* („,_!)(?« — 1), and the packet index fc is expressed in terms of £fc(i)'s: m—1
ek (») • 4<*-J>; fc = 0 , . . . , 4 ( m - ^ - 1
k = £
(57)
i=i
so that, for a given k, Sk (i) 's can be determined and the operators PSk (») (i) involved can be identified. For example, for two-level decomposition j — J — m — 2, fc runs from 0 to 15; and for fc = 6 = 2-4° + l-4 1 , we have v^(m) = pT(m-l)oF^(m-2). In particular, let us consider the decomposition down to the coarsest level j = J; for simplicity in notation, let us drop the supscript j on the operator vjt(m), vector Uk(m), and matrix Uk(m), so the wavelet packet decomposition is given by Uk(m) = (vk(m))* oX(m)
(58)
where Uk(m) = T Uk(m) is the equivalent vector representation of the wavelet packet coefficient matrix Uk(m). The feth wavelet packet Uk{m) (at the coarse scale level J, J <m) of X{m) (at the scale level m) is given by Uk(m)=T-1o(Vk(m))*oT
X(m).
(59)
T
Let Rxx(m) = E[x{m)x(m) ] be the autocorrelation matrix of x{m) which is the vector equivalent of the scaling coefficient matrix at the scale level m. The correlation matrix between its fcth and Zth wavelet packets at the scale level J is given by E [uk(m) ui(m)T]
= E [(vk{m))*
x(m)x(m)T
= (fjt (m)) * Rxx(m)
vi(m).
vi(m)] (60)
We know that the wavelet coefficients (at all scales) of wide sense stationary signals or nonstationary signals with stationary increments are also stationary and their correlations decay exponentially fast both across scales and along the time index 10,11 . So, the wavelet packet coefficients Wfc(m)'s are stationary and their packet correlations decay fast across the packet index 16 . In fact, it has been observed that for orthogonal wavelet transforms, the transformed packets are nearly uncorrelated, i.e., the correlation matrix of the wavelet packet transform is almost block-diagonal. This strong decorrelation phenomenon of the wavelet packet transform enables us to model images by using the wavelet packet synthesis equation x m
( ) =
5Z
v
k(m)
u
k{m)
(61)
if fc £ I.
(62)
it=o
with the following constraint (for J < m): E [uk(m) ui{m)T] = 0;
This will be the wavelet-packet-based dynamic s y s t e m model to approximate images, yielding the Kalman filtering algorithm with correlation matrices Q, R and P all block-diagonalized.
338
Due to the orthogonality properties of {Pk(m); k = 0,1,2,3} (see Eqs. (41)(44)), and the definition of Vk{m), it can be shown that the following identities hold true: (vk(m))*vi{m)
= ISki
(63)
vk(m)(vk(m)y
=I
(64)
4(">-J)_l
£ fc=0
where the identity matrix in Eq. (64) has the size of x(m) while the identity matrix in Eq. (63) has the size of the packets. The following four propositions have been shown to be valid 16 . Proposition 1: If a matrix A can be bloch-diagonalized by the operators {vk(m)'s ;fc = 0 , . . . , 4 ( m - J ) - l } , i.e., {vk{m)Y Avi{m) = Ak8kl
(65)
then its inverse A~l can also be block-diagonalized (vkim^A^Viim)
= A^Ski
(66)
and vice versa. We will choose an appropriate orthonormal wavelet such that the wavelet packets C/fc(M)'s of an image matrix X(M) are decorrelated (or nearly decorrelated), i.e. E [uk(M)u,(M)T]
=0;
if k ? I.
(67)
The above equation means that (vk(m))* Qa(m) vi(m) = Qa,k(m)
6
ki ;
a = 1,2,3
(68)
where Qa(m) = E[da(m)da(m)T],
da(m) = T Da(m)
(69)
Qa(jn) is the augmented autocorrelation matrix of the wavelet coefficient matrix Da(m) (at the scale m < M) of X(M). Similarly, we may have (vk(m))* R(m) vi(m) = Rk(m) Ski
(70)
where R(m) is the augmented autocorrelation matrix of the additive noise, represented in vector form, at the scale level m. The measurement matrix C(m) may also satisfy the following equation, (vk(m)y
C(m) v,(m) - Ck(m) 8kl.
(71)
Now, we can transform all the large dimensional matrices and vectors in the original 2D Kalman filtering process (similar to Eq. (16) to (22)) into a set of small size
339
uncorrelated variables as follows: uk{m\m — 1) = (vk(m))*x(m\m uk(m\m)
— 1)
(72)
= (vk(m))*x(m\m)
Pk{m\m - 1) d= {vk{m))*P{m\m
(73) - l)vk{m)
(74)
d
P fc (m|m) = («fc(»n))*P(m|m)t;fc(m)
(75)
f
J/fc(m)H K(m))*y(m)
(76)
-J _1
for k = 0 , 1 , . . . , 4("* ) . The following propositions are essential to develop an efficient 2D Kalman filtering algorithm using orthonormal wavelet packets (for proof, see Ref. 16). Proposition 2: The augmented error covariance matrices P(m\m) and P(m|m—1) can be block-diagonalized by {vk(m); k = 0 , . . . ; 4 ( m - J ) — 1}. That is, (vk(m))* P(m\m - 1) v,(m) = Pk(m\m - 1) Su
(77)
and hence, by Proposition 1, {vk{m)Y p - ^ m l r o ) vi(m) = P ^ M m ) SH.
(78)
Proposition 3: Prom Eq. (20) for 2D images (Section 3.1), we have Uk{m\m — 1) = (vk(m))*
x(m\m — 1) •
—
•
*
= («*(m))* o P 0 ( m - 1 ) _ ( uk(m - l\m - 1) ;
x(m-l\m-l) k €[0,4{m~J~1]) otherwise
(79)
Proposition 4: Tfte Kalman gain matrix K(m) in Eq. (17) for 2D images (Section 3.1) can be block-diagonalized by {vk(in); k = 0 , . . . ; 4 ( m - J ) — l } ; i.e., (vk(m))* K(m) vi(m) = Kk(m) Ski
(80)
where Kk(m) = Pk(m\m - 1) C*k(m) [Ck(m) Pk(m\m - 1) C*k(m) + P f c ( m ) ] _ 1 . 3.3. Wavelet-Packet-Based Kalman Filtering Multiresolution Image Fusion
Algorithm
(81)
for
Based on the above discussions, we can summerize the wavelet-packet-based Kalman filtering algorithm in the following. Given a set of noisy measurements Y(m) of an image X at scales m = L,L + l,...,M, the least squares estimate X(M\M) of X at the finest scale level M can be obtained, in a scale recursive form, from the coarsest scale level L to the finest scale level M. Choose an appropriate orthonormal analyzing wavelet such that the wavelet packets of the image X are nearly uncorrelated. We can take the 2D wavelet packet transform of each measurement
340 Y(m), (m = L,..., M), down to the packets of small size at a coarse scale J (J < L in general), and apply the Kalman filtering separately to each wavelet packet of F(m) from the coarse scale level m = L to the fine scale level m = M recursively. Specifically, for packet index k = 0 , . . . , 4 ( M ~ J ) - 1, and scale level m = L,...,M, yk(m) uk(m\m)
= (vk(m))*oT
Y{m)
(82)
= uk(m\m - 1) + Kk{m) [yk{m) - Ck{m)uk(m\m
Kk{m) = Pk(m\m - l)C*k(m)[Ck(m)Pk(m\m ( Pk{m - l\m - 1) Pk{m\m-
1) = <
Qi,fc-4("— j-i)(rn—
1)
Q2,fc-2-4<"— J-l)(m
— 1)
> <33,fc-3-4C"— J-i){m
- 1)
ke ke fcG ke
- 1)]
(83)
- l)C*k{m) + ifc(m)]- 1 (84)
[o^™-- 7 - 1 )) [4(ro-J-1),2-4(m_J~1)) [2-4(m-J-1),3-4(m-J-1)) [3-4(m-J-1),4(m"J))
(85)
l) + Ct{m)Rk\m)Ck{m)
(86)
uk(m\m - 1) = { "fc(m " 1 | m ~ 1; 5 ^ 6 [°' 4(m ~' 7 " 1) )
(87)
Pk-\m\m)=Pk-\m\m
\ 0
;
otherwise
with the initial conditions (i = 0 , . . . ,4^ L J ) — 1): ut(L\L - 1) = 0 ;
Pi(L\L-l)
Finally, the least squares estimate X(M\M) by
= (Vi(L))* Px(L)Vi(L).
(88)
of X at the finest scale level M is given
^4(M-J)_1
X(M|M) = r -
1
I
^
t; fc (M)u fc (M|M)
(89)
fc=0
where T~x is the inverse transformation of the isomorphism T. Let the size of image X(m) at the scale level m be N(m) x N(m). The standard Kalman filtering for 2D images (Section 3.1) needs to compute the inverse of a huge N(m)2 x N(m)2 matrix in the matrix Riccati equation. Since the computation complexity of the 2D wavelet packet transform is only 0(N(m) log2 N(m)), we can decompose with ease each measurement Y(m) down to the wavelet packets of small size N(J) x N(J) at a coarse scale level J , say, N(J) = 4. The packet-based 2D filtering algorithm transforms each 4 x 4 small size matrix into a 16 x 1 vector equivalent for performing the scale-recursive Kalman filtering, which needs the inverse matrix computation of only a small 16 x 16 matrix. The filtering takes place on each of the (j^fjf)2 small size matrices, which can be implemented in parallel processing. The storage complexity can also be significantly reduced. The filtered small matrices are obtained by transforming the filtered vector equivalents back to the matrix form, from which the reconstruction (inverse transform) is applied to obtain the least squares estimate of the N(m) x N(m) image.
341
Fig. 2. Fusion of two registered, noisy measurements of a pebble image: (a) at 0 dB, (b) at 5 dB but with coarser spatial resolution, and (c) the fused image at 19.4 dB.
4. E x p e r i m e n t s of t h e F i l t e r i n g A l g o r i t h m for I m a g e Fusion We show two examples on image fusion by applying the wavelet-packet-based Kalman filtering. The first example is on a pebble image, and the second example is on fusion of a Landsat TM image and a SPOT image. 4 . 1 . A Simple
Experiment
on Image
Fusion
Consider two noisy measurements of a pebble image as shown in Fig. 2(a) and 2(b): one with signal-to-noise-ratio (SNR) of 0 dB at the same spatial resolution as that of the original image (m = 1), and the other with SNR of 5 dB at the dyadically next coarser resolution (m = 0). The fused image obtained by applying the wavelet-packet-based Kalman filtering algorithm is shown in Fig. 2(c), giving a SNR of 19.4 dB. Daubechies' D4 orthonormal wavelet was used in this experiment. Its eight wavelet packets of three levels are shown earlier in Fig. 1. 4.2. Fusion
of TM and SPOT
Images
In this experiment, we considered fusion of two remote sensing satellite images covering the same region, one image is a segment of a SPOT panchromatic image, and the other is a corresponding Landsat Thematic Mapper (TM) band 1 image. Thematic Mapper produces images in seven spectral bands with high spectral resolution, each contains special textures for discriminating natural objects, but the spatial resolution is low. SPOT images have higher spatial resolution but lower spectral resolution 23,24,30,32 . Before processing the fusion algorithm, the following calibrations were done first: (1) A geometrical transform was performed to register the TM image with respect to the SPOT image. The spatial resolution of the SPOT image is about three times finer than that of the TM image, but we transformed it to be only finer by two. A coarser version, by a factor of 2, was generated as the reference image against
342
(c)
(d)
Fig. 3. Fusion of a TM band 1 image and a SPOT image: (a) the SPOT image (256 x 256); (b-left) the coarsened SPOT image (128 x 128); (b-right) the registered TM band 1 (128 x 128); (c) the amplitude matched SPOT image; and (d) the fused image.
which the TM image was registered but retaining its original spatial resolution. (2) A look-up-table (LUT) between intensity values of the registered TM band 1 image and those of the spatially coarsened SPOT image, which have the same spatial resolution, was constructed such that a high spatial resolution estimate of the TM band 1 image was obtained by transforming the pixel intensity of the original SPOT image into the corresponding TM band 1 amplitude through this look-up-table30. After the above calibrations, we had the registered original TM band 1 image and the amplitude matched (transformed) SPOT image which were used as inputs to our wavelet-packet-based Kalman filtering algorithm for multiresolution image fusion. Fig. 3(a) shows a segment of the SPOT image (256 x 256). Fig. 3(b-left) is a coarsened SPOT image (128 x 128), which was obtained byfilteringthe SPOT image with the Daubechies D4?s low-pass filter (ft-filter) and then down-sampling by 2 in both directions. Fig. 3(b-right) is the TM band 1 image (128 x 128) registered with respect to the coarsened SPOT image. Based on these two images, a look-up-table was constructed through which the original SPOT image was transformed into the amplitude matched SPOT image shown in Fig. 3(c). The latter was regarded as
343 (Solid: TM1. Dulled: TnoSpot)
50
100
150
(a)
200
Hutognun (Solid: TMI, Dtsbed Bucd method)
250
300
50
100
150
200
250
300
(b)
Fig. 4. Histograms of the TM band 1 image (solid) and the transformed S P O T image (dashed) in (a), and of the fused image (dotted) in (b).
equivalent to a high spatial resolution estimate of the TM band 1 image. We assumed that the TM band 1 image and the SPOT image were corrupted by white noises whose respective variances are estimated by computing the variances within the corresponding small regions in the two images, covering a river area with smooth reflections. The estimated signal-to-noise ratios were 8.7 dB and 10.2 dB in the TM band 1 image and the SPOT image respectively. Fig. 3(d) shows the fused image of the registered TM band 1 image (Fig. 3(b-right)) and the amplitude matched SPOT image (Fig. 3(c)), obtained by applying the wavelet-packet-based Kalman filtering algorithm with D4 orthonormal wavelet in 5-level decompositions. The spectral information consistency between the fused image and the registered TM band 1 image was examined by comparing their grey level distributions. Fig. 4(a) shows the grey level distribution of the TM band 1 image (solid) and of the transformed SPOT image (dashed). Fig. 4(b) shows the comparison of the grey level distribution of the fused image (dotted) with that of the TM band 1 image (solid). The fused image, while bringing in information from the high spatial resolution SPOT image, has essentially the same spectral information as that of the original TM band 1 image. 5. Conclusions In a multiresolution setting, the fine scale representation of images can be produced by adding detail informations into a coarse scale representation, leading to a scale dynamic system for images and making the scale recursive Kalman filtering in the wavelet domain applicable for multiresolution image fusion. Choose an orthogonal analyzing wavelet such that the wavelet packet transform of an image is greatly decorrelated across the packet index; this will make autocorrelation matrices to be block-diagonalizable. The packet-based Kalman filtering for multisensory image fusion involves computation of inverse matrices of small sizes, and thus it is computationally efficient and robust, and may be implemented for parallel processing.
344 Acknowledgment T h e a u t h o r s would like t o t h a n k Mr. Jennting Hsu for his help in t h e final phase of t h e manuscript preparation.
References 1. M.S. Grewal. Ralman Filtering : Theory and Practice. Prentice-Hall, 1993. 2. J.W. Woods. Two-Dimensional Kalman Filtering, In T. S. Huang, editor, Two Dimensional Digital Signal Processing, pp. 155-205, New York, Springer-Verlag, 1981. 3. I. Daubechies. Orthonormal Bases of Compactly Supported Wavelets. Comm. in Pure Appl. Math., 41:909-996, 1988. 4. S. Mallat. A Theory for Multiresolution Signal Decomposition : The Wavelet Representation. IEEE Trans, on Patt. Anal, and Mach. Intell, ll(7):674-693, 1989. 5. S. Mallat and W. Hwang. Singularity Detection and Processing with Wavelets. IEEE Trans, on Information Theory, 38(2):617-643, 1992. 6. P. Flandrin. On the Spectrum of Fractional Brownian Motions. IEEE Trans, on Information Theory, 35(1):197-199, 1989. 7. P. Flandrin. Wavelet Analysis and Synthesis of Fractional Brownian Motion. IEEE Trans, on Information Theory, 38:910-917, 1992. 8. G.W. Wornell. A Karhunen-Loeve-Like Expansion for 1/f Processes via Wavelets. IEEE Trans, on Information Theory, 36(9):859-861, 1990. 9. G.W. Wornell and A.V. Oppenheim. Estimation of Fractal Signals from Noisy Measurements Using Wavelets. IEEE Trans, on Signal Processing, 40(3):611-623, 1992. 10. R.W. Dijkerman and R.R. Mazumdar. On the Correlation Structure of the Wavelet Coefficients of Fractional Brownian Motion. IEEE Trans, on Information Theory, 40(5):1609-1612, 1994. 11. R.W. Dijkerman and R.R. Mazumdar. Wavelet Representations of Stochastic Processes and Multiresolution Stochastic Models. IEEE Trans, on Signal Processing, 42(7):1640-1652, 1994. 12. Jun Zhang and Gilbert Walter. A Wavelet-Based KL-Like Expansion for Wide-Sense Stationary Random Processes. IEEE Trans, on Signal Processing, 42(7):1737-1745, 1994. 13. K.C. Chou. A Stochastic Modeling Approach to Multiscale Signal Processing. Ph. D. Diss., Dept. of Elec. Engrg. and Computer Sci., MIT, Cambridge, MA, May 1991. 14. K.C. Chou, A.S. Willsky and A. Benveniste. Multiscale Recursive Estimation, Data Fusion, And Regularization. IEEE Trans, on Automatic Control, 39(3):464-478, 1994. 15. E. Miller and A.S. Willsky. A Multiscale Approach to Sensor Fusion and the Solution of Linear Inverse Problems. Applied and Computational Harmonic Analysis, 2:127-147, 1995. 16. Hsi-Chin Hsin. Wavelet-Based Filtering in Scale Space and Image Fusion. Ph.D. dissertation, Dept. of Elec. Engrg., University of Pittsburgh, Pittsburgh, PA, 1995. 17. H.C. Hsin and C.C. Li. Multiscale 2D Kalman Filtering Based on Wavelet Transform. Proc. IEEE Int. Conf. on Image Processing, Vol. 1, pp. 80-84, 1994. 18. H.C. Hsin and C.C. Li. Wavelet-Based Filtering in Scale Space for Data Fusion. In Proc. SPIE, Vol. 2569, Wavelet Applications in Signal and Image Processing, III, pp. 701-712, 1995. 19. J.Q. Ni, K.L. Ho and K.W. Tse. Model-based Multirate Kalman Filtering Approach for Optimal Two-dimensional Signal Reconstruction from Noisy Subband Systems. Optical Engineering, 37(8):2376-2386, 1998.
345 20. C. Wen and D. Zhou. Multiscale Stochastic System Modeling and Recursive Data Fusion Estimation. Chinese Journal of Electronics, 11(2):192-195, 2002. 21. H. Li, B.S. Manjunath and S.K. Mitra. Multiscale Image Fusion Using the Wavelet Transform. Graphical Models And Image Processing, 57(3):235-245, 1995. 22. Z. Zhang and R.S. Blum. A Categorization of Multiscale-Decomposition-Based Image Fusion Schemes with a Performance Study for a Digital Camera Application. Proceedings of the IEEE, 87(8):1315-1326, 1999. 23. T. Ranchin, L. Wald and M. Mangolini. Efficient Data Fusion using Wavelet Transform: The Case for SPOT Satellite Images. Proc. SPIE, Vol. 2034, Wavelet Applications in Signal and Image Processing, pp. 171-178, 1993. 24. B. Garguet, J. Girel, J.-M. Chassery and G. Pautou. The Use of Multiresolution Analysis and Wavelet Transform for Merging SPOT Panchromatic and Multispectral Image Data. Photogrammetric Engrg. & Remote Sensing, 62(9):1057-1066, 1996. 25. C. K. Chui, An Introduction to Wavelets. Academic Press, Inc., San Diego, CA, 1992. 26. M.V. Wickerhauser. Adapted Wavelet Analysis from Theory to Software. A. K. Peters, Wellesley, MA, 1995. 27. G. Strang and T. Nguyen. Wavelets and Filter Banks, revised edition. WellsleyCambridge Press, Wellsley, MA, 1997. 28. W. Liu and L. Zhou. Image Fusion Using Wavelet Packet Transform. P r o c , 3rd Intern. Conf. on Wavelet Anal. & Appl., Chongqing, China, Vol. 1, pp. 275-280, 2003. 29. M.A. Abidi and R.C. Gonzalez. Data Fusion in Robotics and Machine Intelligence. Academic Press, Inc., San Diego, CA, 1992. 30. P.S. Chavez, S.C. Sides and J. A. Anderson. Gomparison of Three Different Methods to Merge Multiresolution and Multispectral Data: Landsat TM and SPOT Panchromatic. Photogrammetric Engineering & Remote Sensing, 57(3):295-303, 1991. 31. T.A. Wilson, S.K. Rogers and M. Kabrisky. Perceptual-Based Image Fusion for Hyperspectral Data. IEEE Trans, on Geoscience and Remote Sensing, 35(4):1007-1017, 1997. 32. C. Pohl and J.L. Van Genderen. Multisensor Image Fusion in Remote Sensing: Concepts, Methods and Applications. International J. Remote Sensing, 19(5):823-854, 1998. 33. W.P. Segars, B.M.W. Tsui, A.J. Da Silva and L. Shao. CT-PET Image Fusion using the 4D NCAT Phantom with the Purpose of Attenuation Correction. IEEE Nuclear Science Symposium Conference Record, Vol. 3, pp. 1775-1779, 2002. 34. Q. Wang, Y. Shen, Y. Zhang and J.Q. Zhang. A Quantitative Method for Evaluating the Performances of Hyperspectral Image Fusion. IEEE Trans, on Instrumentation and Measurement, 52(4):1041-1047, 2003. 35. Y. Du, P.W. Vachon and J.J. van der Sanden. Satellite Image Fusion with Multiscale Wavelet Analysis for Marine Applications: Preserving Spatial Information and Minizing Artifacts (PSIMA). Can. J. Remote Sensing, 29(l):14-23, 2003.
CHAPTER 3.8 MULTISENSOR FUSION WITH HYPERSPECTRAL IMAGING DATA: DETECTION AND CLASSIFICATION Su May Hsu and Hsiao-hua Burke Massachusetts Institute of Technology Lincoln Laboratory 244 Wood Street, Lexington, Massachusetts 02420-9185 We present two examples that show how fusing data from hyperspectral imaging (HSI) sensors with data from other sensors can enhance overall detection and classification performance. The first example involves fusing HSI data with foliage-penetration synthetic aperture radar (FOPEN SAR) data; the second example involves fusing HSI data with high-resolution panchromatic imaging (HRI) data. The fusion of HSI and SAR data exploits different phenomenology from the two different sensors. The fusion of HSI and HRI combines their superior respective spectral and spatial information. Fusion of HSI and SAR data is accomplished at the feature level. HSI data provide background characterization and material identification; HSI-SAR fusion allows us to reduce false detections and confirm target detection in the SAR image. Fusion of HSI and HRI data is implemented at both data and feature levels, resulting in a combined spatial-spectral analysis that enhances target identification. 1
INTRODUCTION
Hyperspectral imaging sensors collect data that can be represented by a threedimensional image cube. The hyperspectral data cube is resolved in along-track, cross-track and spectral dimensions. The HSI sensor affords fine spectral resolutions (AX, - 1 0 nm), typically in visible to shortwave infrared (SWIR) wavelength region (0.4 - 2.5 um). For each pixel within a hyperspectral image, a continuous spectrum is sampled and can be used to identify materials by their reflectance. One shortcoming of HSI is that it provides no surface penetration. Another shortcoming is that HSI spatial resolutions are usually coarser than those from panchromatic imagery. To overcome these limitations and enhance HSI system performanc, we fuse HSI data with other sensor data. For example, in counter camouflage, concealment and deception (CC&D) applications, HSI data can be used to identify ground cover and surface material and a low-frequency foliage penetration synthetic aperture radar (FOPEN SAR) [1] can determine if any threat objects are under concealment. 347
348
Because FOPEN SAR and HSI sensors exploit different phenomenology, their detection capabilities complement each other. FOPEN SAR typically operates at 20-700 MHz. It penetrates foliage and detects targets under tree canopy, but has significant clutter returns from trees. HSI, on the other hand, is capable of subpixel detection and material identification. Both SAR and HSI systems may suffer substantial false-alarm and missed detection rates because of their respective background clutter but we expect that combining SAR and HSI data will greatly enhance detection and identification performance. Another opportunity for HSI data fusion occurs in surface surveillance of exposed targets. Unlike conventional single-band or multispectral sensors, HSI sensors collect image data in hundreds of contiguous narrow spectral bands with only moderate spatial resolutions. By spatially sharpening a hyperspectral imge with a panchromatic high-resolution image, we can enhance image visualization for the analyst. Such a combination of sensors can be found, for example, on the National Aeronautic and Space Administration's Earth Observing (EO)-l satellite, which was launched in November 2000. This satellite includes an HSI sensor, Hyperion, and a high-resolution imaging (HRI) sensor as part of the Advanced Land Imager. The HRI sensor spatial resolution of 10 m is three times better than that of the HSI sensor. In the next section, we describe an example of HSI-SAR fusion. Because HSI and SAR sensors are distinct and exploit different phenomenology, fusion of HSI and SAR data is established at the level of features such as material composition and terrain type. We then explore HSI-HRI data fusion. HSI sharpening with HRI data is first investigated. A combined spatial-spectral analysis is then discussed to illustrate enhanced target detection and identification. 2
HSI AND FOPEN SAR DATA FUSION
In May 1997 a P-3 ultrawideband (UWB) radar operating from 215 to 730 MHz, and a Hyperspectral Digital Image Collection Experiment (HYDICE) sensor, operating from 0.4 to 2.5 urn in 210 bands [2], collected data at a site in Vicksburg, Mississippi. The SAR data were collected with a 32° depression angle and a ground sample distance (GSD) of 0.23 m x 0.4 m. The HSI data were collected at a 1.5-km altitude with a nadir viewing geometry and GSD of 0.76 m x 1.1 m. These measurements formed part of the Dixie-97 data set that we use here to demonstrate the framework of HSI-SAR data fusion. Figure 1 shows a sketch of the target site and the composite data set, including a hyperspectral data cube and a SAR graysclae image. Targets in the forest background included fabric nets and vehicles. Several fabric nets are placed along the tree line around an open area. One fabric net at the tree line covers a vehicle; all other nets are empty or cover non-radar reflecting decoys.
349
One vehicle is obscured at the lower right corner of the sketch and another vehicle is partially exposed near the top left corner. For the Dixie-97 data, an HSI-SAR fusion strategy was established on the basis of each sensor's detection characteristics. HSI and SAR data were first processed separately for detection and terrain classification, respectively. Then co-registration was performed to allow overlay of the images. Terrain mapping reduced SAR false alarm from trees. Detection of concealed targets under nets was verified and detection of partially exposed targets was further confirmed with material identification by HSI. We provide an analysis of the HSI data first, followed by illustration of SAR-HSI image co-registration and the fusion results. HYDICE HSI
P-3UWBSARImage
Site Sketch ,r\~
&~ 10nm 200 bands
Figure 1. Target site sketch (right), hyperspectral data cube (left), and synthetic aperture radar (SAR) data (middle), part of the Dixie-97 data set. The site sketch shows targets in a forest background of fabric nets and vehicles. Several fabric nets are placed along the tree line around an open area. One fabric net at the tree line covers a vehicle. The hyperspectral data arc displayed as a cube with a red-green-blue composite image on the front face. The SAR data are displayed as a radar-cross-section image in gray scale.
2.1
HSI Data Analysis
Sample spectra of backgrounds of road, grass, trees, and nets are plotted as a function of wavelength in Figure 2. The road spectrum is significantly higher than other spectra in the visible wavelength region from 0.4 to 0.7 pm. The spectra for grass and trees each exhibit a decrease of reflectance at 0.68 pm, followed by a large increase at near infrared, characteristic of vegetation, because the chlorophyll in green plants absorbs the visible light from the sun, and reflects the infrared radiation. In the SWIR wavelength region, spectral signatures of road and nets also differ from the vegetation signatures. Reduction in spectral dimensionality is first applied to the HSI data cube to extract the spectral features that then lead to further analysis. Principal
350
component analysis (PCA) is used to de-correlate data and maximize the information content in a reduced number of features [3]. Figure 3 shows sample principal components calculated from the Dixie97 HSI data. Background classes of open area, trees and roads are apparent in the first and third principal components. Fabric nets appear in strong contrast to the backgrounds in the seventh principal component. We constructed a matched filter from the mean of several pixels extracted from the nets. A matched-filtering algorithm with thresholding was then applied to the HSI data to detect all pixels of fabric nets. Figure 4 shows the fabric-net detection and a color-coded terrain classification map for the Dixie-97 HSI data. Superimposed on the terrain classification map is the HSI fabric-net detection. The map shows background classes for roads, grass, trees and shadow regions; these classes result from an unsupervised data-clustering operation that uses the first five principal components. r—i—'—•—•—•—•—i—;—'—'—'—•—r
Wavelength C"'")
Figure 2. Sample spectra from a forest scene. The road spectrum is significantly higher than other spectra in the visible wavelength region of 0.4 to 0.7 u,m. The spectra for grass and trees each exhibit decreasde of reflectance at 0.68 u.m followed by a large increase at near infrared because green plants use chlorophyll to absorb the visible light from the sun and reflect the infrared radiation. Background (1st component)
Road (3rd component)
Fabric Net (7th component)
Figure 3. Principal components calculated from the Dixie-97 HSI data. Background classes of open area, trees and roads are apparent in the first and third principal components. Fabric nets appear in strong contrast to the backgrounds in the seventh component.
351 Fabric net detection
Background characterization
Figure 4. HSI fabric net detection with a matched-filtering algorithm (left) and terrain classification map (right). The HSI fabric-net detection has been superimposed on the terrain classification map. Roads, grass, trees and shadow regions are shown as separate terrain classes. The background classes result from an unsupervised data-clustering operation that uses the first five principal components.
2.2
Combined FOPEN SAR-HSI Analysis and Fusion
To combine HSI and SAR detection results, we performed co-registration with reference to terrain classes, such as open areas and trees by using scaling, rotation and translation operations. We then fused the data by using the coregistered images according to the process illustrated in Figure 5. The SAR data were first processed with pixel grouping and thresholding. The sample SAR detection at a 6-dBsm threshold is shown in the top left image of Figure 5. The detection of a near vertical line in the top right area of this SAR image corresponds to a power line. The terrain map derived from the HYDICE HSI data with open-area and fabric-net detections is the top middle image of Figure 5. In combining these analyses, we retained SAR detections only from open areas or around fabric nets indicated in HSI data. Detections that coincided with HSI identifications of trees, far-from-open areas, and nets were considered false alarms. In the combined detection result shown in the top right image of Figure 5, a SAR detection of a vehicle under a net appears; other nets are empty with no SAR detections. There are several strong SAR detections at the left side of the open area. A spectral angle mapper (SAM) algorithm was applied to match HSI data in the open area to paints in our spectral library. Three pixels matched well with military gray-tan paint. This match indicated the presence of a vehicle, possibly military, in the open area and thus confirms the SAR detection.
352 P-3 UHF SAR 6 dBsm Thresholded
HYDICE HSl Open area/Fabric net Detection
Combined SAR/HSI Data Vehicle under Net Identified
SAR Chip SAR detection confirmed and material identified
using HSl
Figure 5. SAR detection (top left), a terrain map derived from the Hyperspecctral Digital Imagery Collection Experiment (HYDICE) data with net detections (top middle), and the combined detection result (top right). The SAR data were first processed with pixel grouping and thresholding. The sample SAR detection shows the detection of a vertical line in the top right area that corresponds to a power line. When we combined the analysises, only SAR detections from either open areas or around fabric nets indicated in the HSl data were retained. SAR detections that corresponded to identifications of trees, far-from-open areas or nets on the hyperspectral image were considered false alarms. In the combined SAR-HSI data, a SAR vehicle detection appears under a net at the top-right corner. Other nets detected in the HSl data are considered empty because they have no corresponding SAR detections. There are several strong SAR detections on the left side of the open area. A spectral angle mapper (SAM) algorithm was applied to match HSl data in the area to paints in our spectral library. Three pixels match well with military gray-tan paint, indicating the presence of a vehicle, possibly military; this match confirms the SAR detection.
HSl and HRI DATA Fusion To assess the fusion of HSl and HRI data, we generated a set of co-registered HSl and HRI data with known ground truth. We used the first frame of HYDICE data from Forest Radiance I Run 05 as the "truth" in both spatial and spectral domains; it has 320x320 pixels and 0.8-m pixel resolution. The truth data were processed to generate simulated HSl (4 m/pixel, AX ~ 10 nm) and HRI (0.8 m /pixel, panchromatic) data. The simulated HSl data were generated by a 5x5
353 spatial averaging of the input data followed by a 5x5-to-l under-sampling. The resulting HSI is 64 x 64 pixels in size and 4-m pixel resolution. The simulated HRI data were generated with a 25-band integration, from 0.63 urn to 0.9 um, and Ak ~ 10 nm, resulting in a 320 x 320 single-band image with 0.8-m pixel resolution. Figure 6 shows the flow of data generation for HSI and HRI fusion. In the following section, we first apply sharpening to the HSI data and then conduct a combined spatial-spectral analysis.
Figure 6. Flow of simulated data generation from the HSI truth data for HSI and High Resolution Imaging (HRI) data fusion. The first frame of HYDICE data from Forest Radiance I Run 05 is used as the truth in both spatial and spectral domains; it has 320x320 pixels, 0.8 m per pixel. The truth data are processed to generate the simulated HSI (4m/pixel, AX ~ 10 nm) and HRI (0.8m/pixel, panchromatic) data. The simulated HSI data are generated by a 5x5 spatial averaging of the input data followed by a 5x5-to-l undersampling. The resulting hyperspectral data cube has 25 bands; each band is 64x64 pixels in size and 4 m per pixel. The HRI data are generated with a 25-band integration from 0.63 um to 0.9 urn, AX ~ 10 nm. The result is a 320x320 single-band image, 0.8 m per pixel.
3.1
Reviews and Demonstration of Sharpening Algorithms
A number of image sharpening techniques are often applied to multispectral images for visualization enhancement. This section reviews three of the commonly used sharpening algorithms—pseudo-inverse, color normalization and spatial frequency correction—that are adapted for implementation on HSI data. We show the results of applying these algorithms to the generated HSI and HRI data for a sharpened HSI image. Spectral fidelity is evaluated by comparing the original and sharpened data.
354
3.1.1
Pseudo-inverse Technique [4,5]
Given a high-resolution image co-registered with a hyperspectral image, a system of equations can be established for the reconstruction of a sharpened, or highspatial-resolution, hyperspectral image. The value at a pixel in the highresolution image is the spectral average at the same pixel in the sharpened hyperspectral image. The spectral value at a pixel in the hyperspectral image is the pixel sum of of the sharpened hyperspectral image within the GSD of the hyperspectral image in the same spectral band. If the sharpening ratio—the ratio of GSDs between unsharpened and sharpened images—is integer r and the number of spectral bands is K, then the number of equations for each hyperspectral image pixel is r2 + K, and the number of unknowns for the sharpened hyperspectral image reconstruction is r2 x K. As the sharpening ratio and the number of spectral bands increase, the number of unknowns increases faster than the number of additional equations. The system of equations is generally underdetermined for a unique solution. Pseudo matrix-inversion algorithms with least mean squared (LMS) estimations are applied to obtain a solution for the fusion problem. This method is described in the following equations: E = AT (A A1)"1 w , and AE = w , where E is a (K r2) X 1 array containing K bands of r X r sharpened hyperspectral image pixels, A is a (K + r2) X (K r2) matrix of the system equation, AT (A A1)"1 is the pseudo inverse of A and w is a (K + r2) X 1 array of the HSI and HRI input values. In this approach the high-resolution image needs to be in perfect registration with the hyperspectral image to establish the system of equations. The pseudo-inverse technique uses singular value decomposition to obtain an LMS solution for the system. Performed on a pixel-by-pixel basis, the pseudoinverse technique is very time comsuming. Also, the LMS estimations do not necessarily attain spectral information beyond that originally contained in the HSI data. 3.1.2
Sharpening Color Normalization (SCN) [6]
The color normalization algorithm conventionally used in multispectral imaging is modified for HSI sharpening. The HSI data are first oversampled to the same pixel size as the HRI data. The algorithm multiplies each of the HSI bands by the HRI data and the resulting values are each normalized by the averaged HSI data
355
over the spectral bands covered in the panchromatic range of HRI. This process is defined by the following equation: _ _ . , _ (HSI, x HRI)
where HSI is an HSI band, SCNt is the HSI sharpened by color normalization, and (HSI)p is band-averaged HSI data over the HRI panchromatic wavelength range. This approach is straightforward in merging the spatial contrast of HRI into spectral bands of HSI. The method also requires good HSI-HRI registration. The sharpened HSI data appear visually sharper but do not contain more spectral information than the original HSI. 3.1.3
Spatial-Frequency Correction (SFC)
Some sharpening approaches use wavelet transformations [7] that decompose images into different spatial-frequency scales. In spatial-frequency correction, the HSI data are first oversampled to the same pixel size as the HRI data. For each HSI band, the high-frequency components are replaced with components from the HRI data. The sharpened image is then obtained via inverse transformation of the modified spectra. In practice, two-dimensional (2-D) Fourier transformation can be applied for image spatial-frequency analysis. The sharpening process is described as below: SFd = FFT 1 {FFT (HSI,), FFT(HRI)hi} where SFCj is the sharpened HSI, FFT(HSI,) represents the 2-D spatial frequency components of HSI, and FFT(HRT)hi represents the high-frequency components in the 2-D spatial Fourier transform of the HRI data. The sharpened HSI is the inverse 2-D Fourier transform of FFT(HRI) with the central (low-frequency) portion replaced by FFT(HSI,). In this approach, the spatial shift caussed by imperfect HSI-HRI registration shows up as a phase shift in the frequency spectrum. Some loss of spectral quality is expected in the sharpened data. 3.1.4 Discussion of Performance In the three sharpening approaches reviewed, we emphasize that good spatial registration between HSI and HRI data is required. For our purpose of spectral fidelity evaluation, we have perfect co-registration through the method used to generate the HSI and HRI data. Two measures, spectral angle and spectral
356
distance (Euclidean distance), are used to evaluate the sharpening algorithms. These quantities are calculated as
spectral angle(x!, x , ) = cos" and
x
I
i * * V
2
V
spectral distance(x,, x 2 ) = |x, - x. where xi and x2 are two multiband spectra. A zero angle or zero distance represents a perfect match of the two spectra. The sharpened images are compared to the truth data in terms of spectral angle and spectral distance. The spectral distance normalized by the pixel amplitude in the truth image is calculated for the fraction of spectral difference. The frame-averaged differences for the three sharpening algorithms are listed in Table 1. For comparison, the difference measurements from the unsharpened HSI data are also included in the table. The sharpened 4-m resolution image data are five times oversampled to compare with the truth image data at 0.8-m resolution. The sharpened images improve significantly from the unsharpened HSI data in spectral distance, but show no obvious improvement in spectral angle. Among the sharpened images, color normalization is closer to the truth in spectral angle than the results of pseudo-inverse and frequency-correction. On the other hand, the spatial-frequency correction method achieves results that are closest to the truth in spectral distance. Although the sharpened images generally appear sharper, the impact of this effect is small on target detection and identification. Because the combined HSI-HRI data are severely underdetermined for the reconstruction of sharpened HSI data, spectral features of objects smaller than the original HSI resolution cannot be fully resolved in the the sharpened HSI. Additional spatial and spectral analysis at the feature level is necessary for enhanced target detection and identification. In the subsequent spatial-spectral analysis discussion that follows, HSI data sharpened by color normalization will be used. Table 1. Sharpening-Algorithm Comparison with Respect to the Truth Data Performance Metrics Spectral Angle Spectral Distance
Unsharpened Pseudo Inverse HSI 4.9° 21.6%
8.3° 13.8%
Sharpened Frequency Correction 5.8° 8.2%
Color Normalization 3.8° 10.0%
357
3.2
Spatial-spectral Analysis
Various combinations of spatial and spectral analysis of HSI and HRI images have been attempted [8,9,10]. To demonstrate HSI and HRI fusion for enhanced background characterization as well as target detection and identification, we employ a similar combined spatial-spectral analysis. Figure 7 shows our analysis approach. We first obtain background classification and anomaly detection from HSI data. Applying these results to the sharpened HSI data provides enhanced background classification and target detection, while the HRI data provides target and background boundaries with spatial edge detection. These boundaries, combined with results from the sharpened HSI data, spatially enhance the definition of targets and backgrounds. After we apply spectral-matched filtering, the HSI data further reveal the background and target materials.
HSI
Figure 7. Diagram of a spatial-spectral analysis approach. Background classification and anomaly detection are first obtained from HSI data. Applying the results to the sharpened HSI data provides enhanced background classification and target detection, while the HRI data provide target and background boundaries with spatial edge detection. These edges, combined with results from the sharpened HSI data, spatially enhance the definition of targets and backgrounds. Finally, spetralmatched filtering for target detection is applied to the sharpened HSI data.
358
To illustrate this analysis approach, we used an expanded data set consisting of three major frames of HYDICE data from Forest Radiance I Run 05. The original truth image of 300x960 pixels, 0.8-m pixel resolution and 210 bands was used as input. The degraded low-resolution hyperspectral spectral image is 60x192 pixels in size, 4-m pixel resolution, and 210 bands. The corresponding HRI image was generated with band integration from 0.4 to 0.8 urn, 300x960 pixels, with 0.8-m pixel resolution. A sharpened hyperspectral data cube is obtained from the combined HSI and HRI data by using the colornormalization method. Figure 8 shows a reference HSI image, as well as a background classification and anomaly-detection map derived from the unsharpened (4-m GSD) HSI data. The background classification is the result of an unsupervised data-clustering operation on the first five principal components [11]. Road, ground, vegetation, and shade are delineated. The background map and class statistics are employed in all subsequent processing. Areas in the hyperspectral image that do not match the background classification are considered anomalies; these anaomalies are further processed with spatial and spectral analysis for target detection and identification. Edges detected from the HRI image with the Sobel operator [12] are shown in Figure 9. Overlay of the edges with background classification from sharpened HSI data is also shown in the figure. Background edges in the sharpened HSI data are enhanced with the edges derived from the HRI image. The image on the right of Figure 9 depicts the combined results of spectralmatched filtering and spatial edges over regions containing anomaly detections from HSI. In the far right grayscale image, red and green represent different types of vehicle paints, and the detections are bounded by the edges shown in blue. Two regions of detection located near the top of the scene, however, are not well defined in shape and appear to be large in size. These are inconsistent with vehicle features. Further testing with spectral matched filtering confirms pieces of fabric in these regions, as shown in Figure 10, which contains an enlarged view of the vehicle detections. The vehicle size and orientation can be determined from the bounding edges. Objects are classified as large vehicles (4 m x 8 m) if they are 4 to 7 pixels wide and 8 to 11 pixels long. Likewise, objects are small vehicles (3 m x 6 m), if they are 3 to 5 pixels wide and 6 to 7 pixels long. The colored bar next to each vehicle in the enlarged image depicts the vehicle's size, orientation and paint type.
359
RGB Image
Background classification
Anomaly detection • • "| •
Road Ground Brt.Veg. I Dk.Veg. I Shade
Figure 8. Reference HSl image (left), background classification (middle), and anomaly detection (right) derived from the unsharpened HSl data. The background classification is the result of an unsupervised data-clustering operation on the first five principal components. Road, ground, vegetation, and shade are delineated. The background map and class statistics are employed in all subsequent processing. Areas in the HSl reference image that do not match the background classification are considered anomalies as shown on the right.
360
HRl with edges
Detection in regions cued by HSI
Background &HRI edges • • 11 •
Road Ground Brt.Veg. Dk.Veg.
•
Shade
• Paint 1 • Paint 2 • "Edges Figure 9. HRl image with detected edges (left); background classification from sharpened HSI data overlaid with HRI-derived edges (middle); and target detection on sharpened HSI data (right). Background boundaries in the sharpened HSI are enhanced with the edges derived from the HRl image. The image on the right depicts the combined results of spectral matched filtering and spatial edges over regions with anomaly detections from HSI. Red and green colors each represent different types of vehicle paints; the detections are bounded by the edges shown in blue.
361
. «•—Net 1
et2
•
P aint 1
•
P aint 2
• • Edges | Large (4 mx 8 m) ;
Small(3mx6m)
Figure 10. Fabric and additional object identification. An enlarged view of the vehicle detections is on the right. The vehicle size and orientation can be determined from the bounding edges. Objects are classified as large vehicles (4 m x 8 m) if they are 4 to 7 pixels wide and 8 to 11 pixels long. Likewise, objects are small vehicles (3 m x 6 m), if they are 3 to 5 pixels wide and 6 to 7 pixels long. The colored bar next to each vehicle in the enlarged image depicts the vehicle's size, orientation and paint type.
SUMMARY AND DISCUSSION Relative to previous multispectral sensors, HSI sensors offer superior spectral resolution that allows detailed background characterization and material identification-related applications. These capabilities employed in conjunction with other remote sensing modalities can further enhance the combined system performance. The potential payoff of fusing HSI data with data from other sensors has been illustrated in this chapter. FOPEN SAR's capability of surface penetration compensats HSI's passive surface sensing function. The fact that SAR and HSI sense different phenomena helps in reducing false detections. HSI's other shortcoming is suboptimal spatial resolution due to trade-off with
362
fine spectral resolution. A panchromatic HSI image, typically with much better spatial resolution, provides additional spatial enhancement on target detection and background delineation derived from HSI data. Coordinated data collection is required to demonstrate multisensor fusion performance. The platforms should be relatively stable and capable of georeferencing to allow multisensor co-registration for data fusion. For the purpose of algorithm development and assessment, spatial and spectral ground truth supporting all participating sensors is also required to fully realize the system performance. While there have been some multisensor data collections to date, they were typically designed for a single sensor rather than designed synergistically for multisensors. In general, these data are not optimally suited for data/image fusion demonstration. For the SAR-HSI fusion example we provided, co-registration was accomplished by manual selection of tie points over a small common area of SAR and HSI. Simulated data with perfect co-registration were used for the HSI and HRI fusion example. In this chapter, we demonstrated the framework of HSI fusion with FOPEN SAR and HRI data by using various sources of data. P-3 UWB SAR and HYDICE HSI data from the Dixie-97 data collection over Vicksburg, Mississippi, were used for SAR-HSI data fusion. Targets in the forest background of this data set included fabric nets and vehicles. Principal component analysis on HSI data was shown to allow effective spectral dimension reduction and feature extraction for terrain characterization and fabric net detection. SAR-HSI feature fusion was accomplished with a co-registration of the images by using references to terrain features. The fusion results showed detection of a vehicle under a fabric net and reduction of SAR false alarms due to trees. A case of SAR detection was also confirmed by HSI material matching with military vehicle paint. The Dixie-97 example demonstrates the UHF SAR's capability of penertration through nets. However, the detections come with significant false alarms from trees. Background classification and anomaly detection from HSI data reduce SAR false alarms in target detection, as illustrated in the Dixie-97 example. Additional target processing of UHF SAR data with template matching and HSI material identification will both enhance target detection and reduce false alarms. For HSI-HRI data fusion, sharpening approaches were investigated and implemented on a combined HSI and HRI data set. The sharpened hyperspectral image retained the spectral signatures of the extended area and in general appeared spatially sharper. Spectral features of objects smaller than the original HSI resolution, however, were not fully resolved through sharpening, due to the underdetermined nature of the reconstruction of the sharpened HSI. Thus the utility of sharpening alone is limited. Further analysis was conducted to combine respective high-spectral and high-spatial resolution information from HSI and HRI data; this HSI-HRI fusion demonstrated enhanced background
characterization as well as target detection and identification performance. Anomalies detected from HSI data were used to cue target detection and identification in the combined analysis. Spatial image processing was applied to HRI for edge delineation. As a result of the combined analysis, target size, shape and material were determined.
Acknowledgements
The authors are grateful to SITAC for the HYDICE data. This work was conducted under the ODUSD S&T Hyperspectral Technology Assessment Program (HTAP) at Lincoln Laboratory. This work was sponsored by the Department of Defense under contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and not necessarily endorsed by the United States Government. This article is reprinted from MIT LINCOLN LABORATORY Journal, Volume 14, Number 1, 2003, Special Issue on Spectral Imaging, pp. 145-159, with permission from MIT LINCOLN LABORATORY Journal.
CHAPTER 3.9
INDEPENDENT COMPONENT ANALYSIS OF FUNCTIONAL MAGNETIC RESONANCE IMAGING DATA
V. D. Calhoun and B. Hong
Olin Neuropsychiatry Research Center, Institute of Living, Hartford, CT 06106; Dept. of Psychiatry, Yale University, New Haven, CT 06520; Dept. of Psychiatry, Johns Hopkins University, Baltimore, MD 21205. E-mail: [email protected], [email protected]
Independent component analysis (ICA) has recently demonstrated considerable promise in characterizing fMRI data, primarily due to its intuitive nature and ability for flexible characterization of brain function. As typically applied, spatial brain networks are assumed to be systematically non-overlapping. Often temporal coherence of brain networks is also assumed, although convolutive and other models can be utilized to relax this assumption. ICA has been successfully utilized in a number of exciting fMRI applications including the identification of various signal types (e.g., task-related, transiently task-related, and physiology-related signals) in the spatial or temporal domain, the analysis of multi-subject fMRI data, the incorporation of a priori information, and the analysis of complex-valued fMRI data (which has proved challenging for standard approaches). In this chapter, we 1) introduce ICA and review current algorithms, 2) relate ICA to several well-known pattern recognition techniques, 3) introduce fMRI data and its properties, 4) review the basic motivation for using ICA on fMRI data, and 5) review the current work on ICA of fMRI and the incorporation of prior information.
1 Introduction
What an antithetical mind! - tenderness, roughness - delicacy, coarseness - sentiment, sensuality - soaring and groveling, dirt and deity - all mixed up in that one compound of inspired clay! - Lord Byron
Independent component analysis (ICA) is a statistical method that recovers a set of linearly mixed hidden independent factors (sources or features) from a set of measurements or observed data. Unlike principal component analysis (PCA), which seeks an orthogonal transformation to remove 2nd-order statistical dependency, ICA not only decorrelates the 2nd-order signal statistical moments, but also reduces higher-order statistical dependencies. ICA is a way of finding a linear non-orthogonal transformation which makes the resulting signals or factors as statistically independent of each other as possible.
An intuitive example of ICA can be given by a scatter plot of two independent signals x1 and x2. Figure 1.1a shows the two independent signals (x1, x2) in a scatter plot. Figures 1.1b and 1.1c show the projections for PCA and ICA, respectively, for a linear mixture of x1 and x2. PCA finds the orthogonal vectors u1, u2, but does not find independent vectors. In contrast, ICA is able to find the independent vectors a1, a2 of the linearly mixed signals (x1, x2), and is thus able to restore the original sources.
Figure 1.1: (a) The joint density of two independent signals, (b) PCA projection (u1, u2), (c) ICA projection (a1, a2) (adopted from [1])
A typical ICA model assumes that the source signals are not observable, are statistically independent and non-Gaussian, and are mixed by an unknown but linear process. Consider an observed M-dimensional random vector x = (x1, ..., xM)^T generated by the ICA model:
x = As    (1)
where s = [s1, s2, ..., sN]^T is an N-dimensional vector whose elements are the assumed independent sources and A is an unknown M x N mixing matrix. Typically M >= N, so that A is of full rank. The goal of ICA is to estimate an N x M unmixing matrix W by an iterative approach such that y, defined in Equation (2), is a good approximation to the 'true' sources s:
y = Wx    (2)
ICA is a type of blind source separation, meaning that it can separate independent signals from linear mixtures with virtually no prior knowledge of the signals. A typical example of blind source separation is the cocktail party problem, in which several people are speaking simultaneously in the same room and the problem is to separate the voices of the different speakers using recordings from several microphones in the room [2]. The basic ICA model for blind source separation is shown in Figure 1.2.
Figure 1.2: The basic ICA model for blind source separation
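To make the model x = As, y = Wx concrete, the following minimal sketch mixes two synthetic, independent non-Gaussian sources with an assumed mixing matrix and recovers them with scikit-learn's FastICA. The sources, mixing matrix, and library choice are illustrative assumptions, not part of the original text.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
K = 2000
s1 = np.sign(rng.normal(size=K)) * rng.exponential(size=K)  # super-Gaussian source
s2 = rng.uniform(-1.0, 1.0, size=K)                          # sub-Gaussian source
S = np.vstack([s1, s2])                                      # true sources s
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])                                   # "unknown" mixing matrix A
X = A @ S                                                    # observations x = As

ica = FastICA(n_components=2, random_state=0)
Y = ica.fit_transform(X.T).T     # estimated sources y = Wx (up to permutation/scale)
W = ica.components_              # estimated unmixing matrix W
print(np.round(W @ A, 2))        # approximately a scaled permutation matrix
```

The product W A being close to a scaled permutation matrix is the usual check that the sources were recovered up to the inherent ordering and scaling ambiguities of ICA.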
ICA is an area of active research in fields such as neural networks, advanced statistics, and signal processing. So far, a number of different methods for deriving ICA algorithms have been proposed. Popular methods include those using the Infomax principle [3], cumulant-based methods [4,5], maximum likelihood estimation [6], and mutual information minimization [7,8]. Among these methods, the most typical ICA algorithms are Infomax [3], FastICA [9] and JADE [10]. The original Infomax algorithm for blind separation [3] is suitable for super-Gaussian sources. To overcome this limitation, techniques have been developed for simultaneously separating sub- and super-Gaussian sources [11]. A flexible independent component analysis approach using a generalized Gaussian density model was introduced in [12]. These algorithms typically work well only for symmetric distributions and are less accurate for skewed distributions. Recently, some researchers have contributed extensions of ICA to more general distributions, including non-parametric ICA [13] and kernel independent component analysis [14]. Other ICA models which adaptively vary the nonlinear functions (or activation functions) to better fit the underlying sources have also been proposed [15,16]. The variety of recent approaches to ICA algorithms demonstrates the vitality of current research in this area.
2 ICA and Pattern Recognition
Pattern recognition has a wide array of applications in such diverse areas as face and character recognition, data mining, medical diagnosis, image processing, computer vision, bioinformatics, speech recognition, fraud detection, and stock market prediction. ICA is an unsupervised statistical approach and is used for both feature extraction and classification/clustering problems in pattern recognition.
2.1 ICA Feature Extraction
Feature extraction seeks a transformation mapping from an original data space to a lower-dimensional feature subspace. PCA seeks a set of orthogonal basis vectors in the data space with respect to which the transformed (projected) data are uncorrelated. ICA goes farther by seeking a basis (not necessarily orthogonal) such that the corresponding transformed data are statistically independent. The use of ICA for feature extraction is motivated by a representation whose components (features) are as independent from each other as possible. The feature extraction procedure for ICA involves two standard preprocessing steps. First, the mean of the data is removed from the original data. The second step is to whiten the data such that the transformed data are uncorrelated with unit variance. For simplicity, let x(k) denote the N-dimensional vector of observed mixed input patterns at time k, i.e., x(k) = [x1(k), x2(k), ..., xN(k)]^T. It is a linear mixture of N statistically independent components s(k) = [s1(k), s2(k), ..., sN(k)]^T through the relationship
x(k) = As(k) = Σ_{i=1}^{N} a_i s_i(k)    (3)
where A denotes an N x N unknown matrix and a_i is the i-th column of A. Similar to the PCA projection transformation, Equation (3) views ICA as a projection transformation. That is, the input pattern vectors x(k) are considered to be a linear mixture of statistically independent basis patterns s(k) combined by an unknown mixing matrix A. A is viewed as the ICA projection transformation matrix. In this sense, a_i is viewed as the basis vector and s_i(k) is regarded as the i-th projection component of x(k) onto the basis vector a_i. Feature extraction using ICA aims to estimate A such that each component of s(k) becomes maximally independent of the other components. To proceed, the input patterns x(k) are first whitened by PCA [17]:
z(k) = D^{-1/2} V^T x(k)    (4)
where D and V are given by D = diag[λ1, λ2, ..., λN] and V = [v1, v2, ..., vN]. Here λi is the i-th largest eigenvalue of the covariance matrix E{x(k)x(k)^T} and v_i is the i-th eigenvector, so that D and V satisfy V D V^T = E{x(k)x(k)^T}. Although z(k) is uncorrelated, it is still not independent. The independent components s(k) are then given by
s(k) = Wz(k) = W D^{-1/2} V^T x(k)    (5)
where W is an N x N separation (unmixing) matrix. It cannot be computed directly from the matrices A, D or V; instead, W is typically obtained by iterative approaches that depend on the choice of ICA algorithm. For example, for the Infomax algorithm, W is determined by
ΔW ∝ [W^T]^{-1} + (1 - 2y) x^T    (6)
Δw_0 ∝ 1 - 2y    (7)
where the output vector y = g(Wx + w_0) with nonlinearity g(u) = tanh(u) [3]. After convergence of W, the estimated ICA transformation matrix is constructed as
A = V D^{1/2} W^{-1}    (8)
where each column a_i of A is considered the i-th basis vector and thus used as a feature vector. A block diagram of ICA feature extraction is shown in Figure 2.1.
Figure 2.1: A block diagram of ICA feature extraction (the input pattern vector x(k) is whitened by D^{-1/2} V^T to give z(k), which is separated by W; the resulting ICA feature vectors are the columns a_i of A)
To date there have been several applications of feature extraction using ICA [17-21]. An example of ICA feature extraction is illustrated in Figure 2.2.
Figure 2.2: The ICA feature representation of a natural image. s_i and s_j (i ≠ j) are statistically independent components of the original data (adopted from [17])
2.2 ICA for Classification and Clustering
In pattern classification problems, it is common to come up against the challenges of high-dimensional data (i.e., the curse of dimensionality). High-dimensional data not only bring computational complexity, but also degrade a classifier's performance. To avoid these problems, the most common approach is to use feature extraction or dimensionality reduction algorithms. Several popular methods are widely used for dimension reduction in pattern recognition problems. PCA is a well-known method for reducing dimensionality by extracting components which are uncorrelated with each other. Another approach is linear discriminant analysis [22], which can be used to derive a discriminative transformation making use of second-order statistical information. Feature subset selection [23] is an approach which requires an exhaustive search for feature selection. Support vector machines [24] are a learning approach that does not suffer from the curse of dimensionality; however, the robustness of the probabilistic models is not always guaranteed [25]. ICA, instead, seeks a representation that projects the data into a space where the components have maximal statistical independence. From the viewpoint of pattern recognition, this representation provides high efficiency for pattern classification. There are some advantages of ICA compared with the above methods. First, for image compression, the features extracted by ICA have been shown to be similar to the patterns found in the mammalian primary visual system [26]. Second, the ICA features have statistically independent activities, consistent with most theories of sensory coding that propose models of effective internal representation by redundancy reduction [27]. The motivation for using ICA for clustering is typically based on the fact that the data are linearly transformed by ICA so that the resulting components can be grouped into clusters, such that the components are dependent within a cluster and independent between clusters [28]. ICA has been shown to align well with the context grouping structure in a human sense [28], and thus can be used for clustering. More ICA applications for pattern classification and clustering can be found in [18,19,29,30].
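A small sketch of the dimension-reduction use case just described: ICA components are extracted first and a standard classifier is trained in the reduced space. The data, labels, and choice of logistic regression are placeholders for illustration, not taken from the cited studies.

```python
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(5)
X = rng.laplace(size=(300, 100))                # 300 samples in a 100-dim space
y = (X[:, :3].sum(axis=1) > 0).astype(int)      # synthetic binary labels

# Reduce to 10 ICA components, then classify in the reduced space.
clf = make_pipeline(FastICA(n_components=10, random_state=0),
                    LogisticRegression(max_iter=1000))
print(cross_val_score(clf, X, y, cv=5).mean())
```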
3 Functional MRI
fMRI is a technique that provides the opportunity to study brain function noninvasively and has been a powerful tool in both research and clinical arenas since the early 1990s [31]. The most popular technique utilizes blood oxygenation level dependent
(BOLD) contrast, which is based on the differing magnetic properties of oxygenated (diamagnetic) and deoxygenated (paramagnetic) blood. When brain neurons are activated, there is a resultant localized change in blood flow and oxygenation which causes a change in the MR decay parameter, T2*. These blood flow and oxygenation (vascular or hemodynamic) changes are temporally delayed relative to the neural firing, a confounding factor known as hemodynamic lag. Scientific interest rests primarily with the electrical activity in the neurons, which cannot be directly observed by any variant of the MRI procedure. Since the hemodynamic lag varies in a complex way from tissue to tissue, and because the exact transfer mechanism between the electrical and hemodynamic processes is not known, it is not possible to completely recover the electrical process from the vascular process. Nevertheless, the vascular process remains an informative surrogate for electrical activity. However, the relatively low image contrast-to-noise ratio (CNR) of the BOLD effect, head movement, and undesired physiological sources of variability (cardiac, pulmonary) make detection of the activation-related signal changes difficult. ICA has been shown to be useful for fMRI analysis for several reasons. Spatial ICA finds systematically non-overlapping, temporally coherent brain regions without constraining the temporal domain. The temporal dynamics of many fMRI experiments are difficult to study due to the lack of a well-understood brain-activation model. ICA can reveal inter-subject and inter-event differences in the temporal dynamics. A strength of ICA is its ability to reveal dynamics for which a temporal model is not available [32]. Spatial ICA also works well for fMRI as it is often the case that one is interested in spatially distributed brain networks.
3.1 Data Acquisition
The MRI signal is acquired as a quadrature signal. That is, two orthogonal "detectors" are used to capture the MRI signal [33]. The two outputs from such a system are often put in a complex form, with one output being treated as the real part and the other as the imaginary part. The time-domain data acquired by the spectrometer are, remarkably, equivalent to the spatial-frequency representation of the image data, and so a discrete Fourier transform yields the complex image-space data. It is then common to take the magnitude of this data prior to performing any fMRI analyses. FMRI studies rely upon the detection of small intensity changes over time, often with a contrast-to-noise ratio of less than 1.0. Virtually all fMRI studies analyze the magnitude images from the MRI scanner. A standard approach is to correlate the time-series data with an assumed reference signal [34]. Many generalizations have been proposed, usually involving linear modeling approaches utilizing an estimate of the hemodynamic response [35]. The information contained in the phase images is ignored in such analyses.
3.2 Types of Signal and Noise
There are several types of signals that can be encoded within the hemodynamic signals measured by fMRI. Some of these were identified by McKeown in the first application of ICA to fMRI [36]. In that paper, Infomax [3] was utilized and the separated signals were classified as task-related, transiently task-related, and motion-related.
In general, fMRI data may be grouped into signals of interest and signals not of interest. The signals of interest include task-related, function-related, and transiently task-related. The task-related signal has already been mentioned and is the easiest to model. A reference waveform, based upon the paradigm, is correlated with the data. The responses of the brain to a given task may not be regular, however; for example, the signal may die out before the stimulation is turned off or change over time as repeated stimuli are applied, leading to a transiently task-related signal. It is also conceivable that there are several different types of transiently task-related signals coming from different regions of the brain. The function-related signal manifests as similarities between voxels within a particular functional domain (e.g., the motor cortex on one side of the brain will correlate most highly with voxels in the motor cortex on the opposite side of the brain) [37]. An exciting application of this is identifying synchronous auditory cortex activity [38,39] (see areas corresponding to the top time course in Figure 4.3). Most of these fMRI signals have been examined with ICA and other methods and have been found to be sub-Gaussian in nature (except perhaps the artifacts mentioned in the next section). The signals not of interest include physiology-related, motion-related, and scanner-related signals. Physiology-related signals such as breathing and heart rate tend to come from the brain ventricles (fluid-filled regions of the brain) and areas with large blood vessels present, respectively. Motion-related signals can also be present and tend to be changes across large regions of the image (particularly at the edges of images). An example of a motion-related signal occurs during an experiment in which the subjects are mouthing letters, called the rapid automatized naming task [40]. Figure 3.1 depicts an example occurring during the mouthing that was extracted from orbitofrontal and inferior temporal brain regions using the Infomax algorithm [3]. For comparison, a typical hemodynamic response function is depicted as well. It is clear that the motion is occurring on a time scale too rapid to be related to hemodynamics.
Figure 3.1: Motion-related signal due to mouth movement from inferior temporal and orbitofrontal regions
Finally, there are scanner-related signals that can vary in time (such as scanner drift and system noise) or in space (such as susceptibility and radio-frequency artifacts) [41]. A number of such examples, including slice dropout, motion artifact, and Nyquist ghosting, can be found at http://www.fmrib.ox.ac.uk/~beckmann/homepage/academic/littleshop/.
There are several types of noise one can characterize in an fMRI experiment. First, there is noise due to the magnetic resonance acquisition, which can be discussed as 1) object variability due to quantum thermodynamics and 2) thermal noise. It can be shown that the thermal noise will result in white noise with a constant variance in the image dimension [42]. Additionally, there is noise due to patient movement, brain movement, and physiologic noise (such as heart rate and breathing). It has been suggested that physiologic noise is the dominant factor in fMRI studies [43]. In the ICA model these "noises" are often not explicitly modeled, but rather manifest as separate components (see, e.g., [41,44]).
3.3 Statistical Properties of fMRI Data
Properties such as non-Gaussianity and spatial/temporal independence of sources need to be addressed for the application of ICA to fMRI data. If the "activations" do not have a systematic overlap in time and/or space, then the distributions can be considered independent [45]. The temporal distribution of a task-related waveform is often nearly bimodal (off/on) and thus the algorithm needs to incorporate this fact. Some other basic assumptions of ICA have been considered in [36]. The assumption that components are spatially independent and add linearly was evaluated, and it was concluded that the fMRI signals and noise are non-Gaussian and that the accuracy of the ICA model may vary in specific regions of the brain. For example, cortex-based ICA assumes that cortical data are different from non-cortical data and processes a subset of the data determined by a priori information (see Section 4.6) [46]. The signals of interest in fMRI are typically focal and thus have a sub-Gaussian spatial distribution. However, the artifactual signals will be more varied and potentially super-Gaussian. Many aspects of the fMRI signal are well known and could be incorporated into an ICA analysis. First, local spatial correlation exists in MR images due only to the acquisition process. It is often the case that partial k-space acquisitions involve sampling fewer frequency samples than the desired number of spatial samples. One can use the fact that the matrix of frequency data is Hermitian-symmetric to reconstruct the image using a partially acquired frequency matrix (with the trade-off being a decrease in signal-to-noise ratio). Another well-known method involves sampling the lower frequencies and padding the high frequencies with zero (with the trade-off being a decrease in spatial resolution). This broadens the well-described MRI spatial point spread function in one direction, although it has been suggested that there is a real gain in resolution when zero padding is up to as much as twice the original number of samples [47]. This results in spatial correlation of the MR signal. In addition, spatial correlation is induced by the process being measured. The hemodynamic sources to be estimated have a spatial hemodynamic (vascular) point spread function. This is partially due to the hemodynamics, but is also a function of the pulse sequence and the parameters used. Differing degrees of sensitivity to blood flow and blood oxygenation, as well as differences between low- and high-field magnets, will measure different hemodynamics. The pulse sequence, parameters, and magnetic field strength are considered constant to enable discussion of the hemodynamic point spread function without introducing the complexities of these parameters. There may also be some degree of temporal correlation. Temporal correlation is introduced by: 1) rapid
sampling (a scanner parameter) on the time scale of the magnetic equilibrium constant T1, 2) the temporal hemodynamic (vascular) point spread function (a physiologic variable), and 3) poorly understood temporal autocorrelations in the data [48]. FMRI provides a non-invasive surrogate measure of the brain's electrical activity. It is a diverse technique and research using fMRI is growing at a rapid pace. The richness of fMRI data is only beginning to be understood. We have provided a brief introduction to the fMRI technique and summarized some of the functionally-related brain signals. It is important to understand the properties of these signals when developing methods for analyzing these data.
4 ICA of fMRI
Independent component analysis (ICA) has found a fruitful application in the analysis of functional magnetic resonance imaging (fMRI) data. A principal advantage of this approach is its applicability to cognitive paradigms for which detailed a priori models of brain activity are not available. ICA has been successfully utilized in a number of exciting fMRI applications including the identification of various signal types (e.g., task-related, transiently task-related, and physiology-related signals) in the spatial or temporal domain, the analysis of multi-subject fMRI data, the incorporation of a priori information, and the analysis of complex-valued fMRI data (which has proved challenging for standard approaches).
4.1 Spatial vs. Temporal
Independent component analysis is used in fMRI modeling to understand the spatio-temporal structure of the signal. Let the observed data matrix be X, an N x M matrix (where N is the number of time points and M is the number of voxels). The aim of fMRI component analysis is to factor the data matrix into a product of a set of time courses and a set of spatial patterns. In principal component analysis this is achieved by singular value decomposition, by which the data matrix is written as the outer product of a set of orthogonal (i.e., uncorrelated) time courses and a set of orthogonal spatial patterns. Independent component analysis takes a more general position and aims at decomposing the data matrix into a product of spatial patterns and corresponding time courses, where either the patterns or the time courses are a priori independent. Since the introduction of ICA for fMRI analysis by McKeown et al. [44], the choice of spatial or temporal independence has been controversial. However, the two options are merely two different modeling assumptions. McKeown et al. argued that the sparse, distributed nature of the spatial pattern for typical cognitive activation paradigms would work well with spatial ICA (SICA). Furthermore, since the proto-typical confounds are also sparse and localized, e.g., vascular pulsation (signal localized to larger veins that are moving as a result of cardiac pulsation) or breathing-induced motion (signal localized to strong tissue contrast near discontinuities: "tissue edges"), the Bell-Sejnowski approach with a sparse prior is very well suited for spatial analysis [49]. Stone et al. proposed a method which attempts to maximize both spatial and temporal independence [50]. An interesting combination of spatial and temporal ICA was pursued by Seifritz et al. [38]; they used an initial SICA to reduce the spatial dimensionality of the data by locating a region of interest in which they then subsequently performed
temporal ICA to study in more detail the structure of the non-trivial temporal response in the human auditory cortex.
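The distinction between the two modeling choices can be made concrete with a short sketch: spatial ICA treats voxels as samples and yields independent spatial maps with associated time courses, whereas temporal ICA transposes these roles. The data matrix below is a synthetic placeholder and scikit-learn's FastICA stands in for whichever ICA algorithm is actually used.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(1)
N, M, n_comp = 100, 2000, 5            # time points, voxels, components
X = rng.normal(size=(N, M))            # placeholder for a preprocessed fMRI data matrix

# Spatial ICA: ask for spatially independent maps; the mixing matrix holds
# the associated (unconstrained) time courses.
sica = FastICA(n_components=n_comp, random_state=0)
S_maps = sica.fit_transform(X.T).T     # (n_comp x M) spatial maps
time_courses = sica.mixing_            # (N x n_comp) time courses

# Temporal ICA: ask for temporally independent time courses instead; the
# mixing matrix now holds the associated spatial weights.
tica = FastICA(n_components=n_comp, random_state=0)
T_courses = tica.fit_transform(X)      # (N x n_comp) independent time courses
spatial_weights = tica.mixing_         # (M x n_comp) spatial weights

print(S_maps.shape, time_courses.shape, T_courses.shape, spatial_weights.shape)
```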
4.2 A Synthesis/Analysis Model Framework
The model shown in Figure 4.1 was introduced in [51] and provides a framework for understanding ICA as applied to fMRI data and for introducing the various processing stages in ICA of fMRI data. The model assumes SICA but can be easily modified for temporal processing. A generative model is assumed for the data, including the brain (in a magnet) and the fMRI scanner. Such a model provides a way to monitor the properties of the signals as they propagate through the system and to design the post-processing block, i.e., the analysis stages, in a way that matches well with the properties of the source generation mechanisms. The model is also useful for validating ICA results through simulations and hybrid fMRI data. The data generation block consists of a set of statistically independent (magnetic) hemodynamic source locations in the brain (indicated by s_i(v) at location v for the i-th source). These sources are a function of magnetic tissue properties such as T1, T2, T2*, changes in blood flow, changes in blood oxygenation, etc., that are detectable when the brain is placed in a magnetic field. The sources have weights that specify the contribution of each source to each voxel; these weights are multiplied by each source's hemodynamic time course. Finally, it is assumed that each of the N sources is added together so that a given voxel contains a mixture of the sources, each of which fluctuates according to its weighted hemodynamic time course. The first portion of the data generation block takes place within the brain, in which the sources are mixed by the matrix A. The second portion of the data generation block involves the fMRI scanner. The sources are sampled (B) as a function of scan-specific MR parameters such as the repeat time (TR), echo time (TE), flip angle, slice thickness, pulse sequence, field-of-view, etc. The data processing block consists of a transformation, T(·), representing a number of possible preprocessing stages, including slice phase correction, motion correction, spatial normalization and smoothing. It is common to perform a data reduction stage (C) using PCA or some other approach. The selection of the number of sources is often done manually, but several groups have used information-theoretic methods to do order selection [39,52]. The resultant estimated sources, along with the unmixing matrix, can then be thresholded and presented as fMRI activation images and fMRI time courses, respectively.
Figure 4.1: Model for applying ICA to fMRI data
4.3 Choice of Algorithms and Preprocessing
As mentioned in the previous subsection, ICA of fMRI involves many preprocessing stages, and there are a number of choices both for those and for the ICA algorithms that can be employed. An investigation of how different algorithms and preprocessing stages affect the results has been performed by several groups [51,53]. The selection of algorithm will also depend upon whether one assumes that the sources are sub- or super-Gaussian. In [54], the model described in Section 4.2 was utilized to evaluate different preprocessing stages and ICA algorithms using the Kullback-Leibler (KL) divergence between the estimated and "true" distributions. In the case of real fMRI data, validation is difficult as the source distributions are unknown. However, one can move in this direction by superimposing simulated source(s) upon real fMRI data to create a "hybrid" fMRI experiment (see Figure 4.2). Sources are estimated, extracted (by ranking components by their correlation with the known sources) and compared with the actual sources. While this approach is limited, it is useful in providing a quantitative ICA performance measure. Figure 4.2 shows a thresholded "true" source (a) and its mixing function (b), along with a plot of the "hybrid" fMRI data for a voxel close to the "true" source maximum (c). The contrast-to-noise level is calculated as the ratio of the source amplitude to the standard deviation (over time) of a voxel within the brain. In general, it is noted that certain choices and combinations make a difference in the results. In this work, Infomax outperformed FastICA (in approximation and variability), and PCA outperformed clustering. The best overall combination for this case appears to be Infomax and PCA.
Figure 4.2: (a,b,c) Hybrid-fMRI experiment in which a known source is added to a real fMRI experiment (from [32]). (d) Comparison of algorithms and preprocessing using hybrid data
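The hybrid evaluation just described can be sketched in a few lines: a known spatial source modulated by a known time course is added to surrogate data at a chosen contrast-to-noise ratio, ICA is run, and components are ranked by their spatial correlation with the known source. Everything below (data, block design, component count) is a synthetic assumption for illustration only.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(3)
N, M = 100, 1500                          # time points, voxels

baseline = rng.normal(size=(N, M))        # stand-in for real baseline fMRI data
true_map = np.zeros(M)
true_map[100:140] = 1.0                   # focal "activation" region
paradigm = np.tile([0]*10 + [1]*10, 5).astype(float)   # block design of length N
cnr = 1.0                                 # contrast-to-noise ratio of the added source
hybrid = baseline + cnr * np.outer(paradigm, true_map)

ica = FastICA(n_components=10, random_state=0)
maps = ica.fit_transform(hybrid.T).T      # (10 x M) estimated spatial maps

# Rank components by absolute spatial correlation with the known source.
corrs = [abs(np.corrcoef(m, true_map)[0, 1]) for m in maps]
best = int(np.argmax(corrs))
print(f"best-matching component: {best}, |r| = {corrs[best]:.2f}")
```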
Another approach for comparing algorithms is proposed by Esposito et al. in [53]. Linear correlation and receiver operating characteristics are used to compare temporal and spatial outcomes, respectively. The Infomax approach appeared to be better suited to investigate activation phenomena that are not predictable or adequately modeled by inferential techniques.
4.4 Group ICA
ICA has been successfully utilized to analyze single-subject fMRI data sets, and has recently been extended for multi-subject analysis [39,55-57]. Unlike univariate methods (e.g., regression analysis, Kolmogorov-Smirnov statistics), ICA does not naturally generalize
to a method suitable for drawing inferences about groups of subjects. For example, when using the general linear model, the investigator specifies the regressors of interest, and so drawing inferences about group data comes naturally, since all individuals in the group share the same regressors. In ICA, by contrast, different individuals in the group will have different time courses, and they will be sorted differently, so it is not immediately clear how to draw inferences about group data using ICA. An approach was developed for performing an ICA analysis on a group of subjects [58] which extends the synthesis/analysis model mentioned in Section 4.2. In order to reduce the computational load, data reduction is first performed on each subject's data and then a second, aggregate model-order reduction is performed. Back-reconstruction and statistical comparison of individual maps are performed following the ICA estimation. This approach is implemented in a Matlab toolbox [59]. Group maps for the fMRI ICA analyses are presented in Figure 4.3. The number of components was estimated to be twenty-one by the two information-theoretic criteria employed: the minimum description length and Akaike's information criterion. Thus, the aggregate data were reduced to this dimension and twenty-one components were estimated. Both maps are thresholded at p < 0.001 (t = 4.5, df = 8). Several interesting components were identified within the data. Separate components for primary visual areas in the left and the right visual cortex (depicted in red and blue, respectively) were consistently task-related with respect to the appropriate stimulus. A large region (depicted in green) including occipital areas and extending into parietal areas appeared to be sensitive to changes in the visual stimuli. Additionally, some visual association areas (depicted in white) had time courses which were not task-related. A comparison of group ICA approaches is found in [60].
Figure 4.3: fMRI Group ICA results (from [39])
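A minimal sketch of this temporal-concatenation group ICA pipeline follows: per-subject PCA reduction, a second aggregate reduction, spatial ICA, and a simplified, dual-regression-style back-reconstruction of subject time courses. The data, model orders, and the back-reconstruction step are assumptions for illustration and do not reproduce the exact procedure of [58] or the toolbox [59].

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(2)
n_sub, N, M = 5, 120, 3000          # subjects, time points, voxels
k1, k2 = 20, 10                     # per-subject and aggregate model orders

subjects = [rng.normal(size=(N, M)) for _ in range(n_sub)]   # placeholder data

# Stage 1: reduce each subject's data in the temporal dimension with PCA.
reduced = []
for X in subjects:
    pca = PCA(n_components=k1).fit(X.T)          # samples = voxels, features = time
    reduced.append(pca.transform(X.T).T)         # (k1 x M)

# Stage 2: concatenate across subjects and perform an aggregate reduction.
Y = np.vstack(reduced)                           # (n_sub*k1 x M)
Y2 = PCA(n_components=k2).fit_transform(Y.T).T   # (k2 x M)

# Stage 3: spatial ICA on the aggregate, reduced data.
ica = FastICA(n_components=k2, random_state=0)
group_maps = ica.fit_transform(Y2.T).T           # (k2 x M) group spatial maps

# Stage 4: simplified back-reconstruction of subject-specific time courses
# by projecting each subject's data onto the group maps.
pinv_maps = np.linalg.pinv(group_maps)           # (M x k2)
subject_tcs = [X @ pinv_maps for X in subjects]  # each (N x k2)
print(group_maps.shape, subject_tcs[0].shape)
```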
4.5 Multiple Groups
In many fMRI experiments, it is desirable to directly compare and contrast two different conditions either within or between subject groups. Methods for performing such comparisons have been developed within the framework of the general linear model; however, such comparisons are not intuitive for ICA. Comparisons of two ICA groups can be problematic because the ICA results represent a comparison of two different linear
models with different time courses. A method for performing subtractive and conjunctive comparisons of group ICA data is proposed in [61]. An alternative method is given in [62]. One solution involves extracting components of interest using an a priori spatial or temporal template and quantifying whether the components extracted from the two groups have sufficiently unique time courses relative to the remaining components. An analysis of seven participants performing three paradigms (visual, motor, and visuomotor) can then be compared between paradigms, as seen in Figure 4.4.
Figure 4.4: Visual, motor, and visuomotor data comparisons: (a) Visual - Visuomotor, (b) Motor - Visuomotor
Additional parameterizations are also possible. For example, in this study onset latencies were estimated using a weighted least squares technique [63]. A small but significant latency difference was observed between the onset of visual and motor activation.

Group Onset Latencies   Visual Experiment   Motor Experiment   Difference
Left Cortex             3.46±0.33 s         3.91±0.30 s        0.45±0.18 s (p<0.05, df=7)
Right Cortex            3.19±0.28 s         4.43±0.32 s        1.24±0.53 s (p<0.05, df=7)
Such an approach allows comparisons of both brain activation and time-course parameters across paradigms for the flexible modeling approach, ICA.
4.6 Incorporation of a priori Information
The incorporation of prior information into ICA methods is important as it can provide improved separability and allow selective exploratory analysis. In addition, ICA methods make assumptions about, e.g., the distributional shape of the sources, and thus it is important both to assess the impact of such assumptions and to modify them based upon fMRI data. There have been a number of applications of ICA that have attempted to utilize prior information for fMRI analysis. For example, using a reference function to extract only a single component is proposed in [64]. Stone et al. propose a skewed symmetric nonlinearity (i.e., assume that the source distributions are skewed). This makes sense if one is interested in components that consist largely of either activations or deactivations [65]. Formisano et al. propose performing ICA upon data extracted from the cortex (where the activation is expected to be occurring) using a tessellation model of the brain cortex derived from a high-resolution structural image [46]. Duann et al. examine time-locked temporal structure and propose a visualization approach to evaluate trial-by-trial
variability [66]. Bayesian methods provide a useful way to incorporate prior information into ICA and may prove useful for fMRI analysis [67].
4.6.1 Spatial Prior Information
The successful application of ICA to fMRI signals depends on the validity of the statistical assumptions implicit in the method. In fMRI data the physiological signals of interest occupy only a small percentage of the whole-brain map, whereas the majority of brain regions contain a less significant, homogeneous background of signals of non-interest in the task-related activation maps. While conventional ICA approaches, including fixed nonlinear function-based algorithms and high-order cumulant-based algorithms, work well for highly kurtotic (super-Gaussian) and sub-Gaussian sources, they may provide poor results with fMRI data due to a lack of sensitivity to such specifically skewed distributions and low-kurtosis signals. Hong and Calhoun propose a source density-driven optimal ICA method, aiming to embody physically realistic assumptions and thus achieve optimal separation results [68]. The central idea is to use a two-stage separation process: 1) conventional ICA is used for all channel sources to obtain initial independent source estimates; 2) source estimate-based "optimal" nonlinearities are used for the "refitting" separation of all channel sources. The optimal ICA algorithm is not based on fixed nonlinear functions, but on flexible nonlinearities from density-matched candidates. The performance of ICA can be improved by seeking "matched" nonlinearities for each source and incorporating prior information into the ICA algorithm. Figure 4.5 shows the comparison between optimal ICA and Infomax ICA on the fMRI visual stimulation data of a single subject.
Figure 4.5: Comparison between (a) optimal ICA and (b) Infomax ICA
4.6.2 Temporal Prior Information
It is also useful to impose constraints directly upon the mixing matrix in a spatial ICA fMRI analysis. For example, a component-selective constraint of the ICA model mixing matrix, such that one or more specific components are constrained to be "close" to a paradigm-derived time course, is shown in Figure 4.6. The degree of closeness is specified by the user based upon the amount of confidence placed in the information provided. Results from this approach are shown below for an fMRI experiment with an auditory detection task. The participant was responding to the target with a button press.
Figure 4.6: Comparison of ICA and semi-blind ICA (sbICA) in one participant
The left side of the figure demonstrates the task-related component for an unconstrained ICA analysis. On the right is the constrained (or semi-blind) analysis, showing additional regions including motor cortex towards the top of the figure. The temporal correlation with the paradigm is much higher for the constrained analysis, as expected. These results demonstrate the utility of incorporating mixing matrix constraints in an fMRI analysis.
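A simpler stand-in for the semi-blind idea, shown below, is to run an unconstrained spatial ICA and then select the component whose time course best matches a paradigm-derived reference, rather than constraining the mixing matrix during estimation itself. The data matrix, block design, and component count are synthetic assumptions, and this selection step is not the constrained algorithm described above.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(4)
N, M = 120, 2000
X = rng.normal(size=(N, M))                        # placeholder fMRI data matrix

reference = np.tile([0]*15 + [1]*15, 4).astype(float)   # paradigm-derived time course

ica = FastICA(n_components=12, random_state=0)
maps = ica.fit_transform(X.T).T                    # (12 x M) spatial maps
tcs = ica.mixing_                                  # (N x 12) component time courses

# Select the component whose time course best matches the reference.
r = [abs(np.corrcoef(tcs[:, i], reference)[0, 1]) for i in range(tcs.shape[1])]
task_comp = int(np.argmax(r))
print(f"selected component {task_comp}, |r| with paradigm = {max(r):.2f}")
```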
4.7 Complex Images
Functional magnetic resonance imaging (fMRI) is a technique that produces complex-valued data; however, the vast majority of fMRI analyses utilize only magnitude images, despite the fact that the phase information has a straightforward physiologic interpretation [69]. A number of ICA algorithms have been extended to the complex domain and can be utilized for processing the fMRI data in its native complex domain. The performance of a complex Infomax algorithm that uses an analytic (and hence unbounded) nonlinearity has been compared with traditional complex Infomax approaches that employ bounded (and hence non-analytic) nonlinearities, as well as with a cumulant-based approach [70]. Results from a magnitude-only and a complex-valued analysis are presented in Figure 4.7. The complex-valued approach results in a larger contiguously activated region in all subjects (from [71]). In addition, the phase information is captured by the complex-valued approach (c,d).
Figure 4.7: Results from magnitude-only analysis (a) and complex-valued analysis (b,c,d)
5 Conclusion
The application of ICA to fMRI data has proved to be quite fruitful. However, there is still much work to be done in order to take full advantage of the information contained in the data. Additional prior information about multiple expected sources (both interesting and non-interesting) and their properties (fMRI properties, physiologic recordings, etc.) can be utilized. In addition to incorporating appropriate assumptions (and moving towards a semi-blind source separation), it is important to relax inappropriate assumptions (such as having a fixed temporal delay for each source). One of the strengths of ICA of fMRI is its ability to characterize the high-dimensional data in a concise manner. Continuing to do this and developing ways to mine the unexpected information in fMRI data will provide an exciting future for ICA of fMRI.
6 References
[1] A. Jung, "An introduction to a new data analysis tool: Independent Component Analysis," 2001. [2] A. Hyvarinen, "Survey on Independent Component Analysis," Neural Computing Surveys, vol. 2, pp. 94-128, 1999. [3] A. J. Bell and T. J. Sejnowski, "An Information Maximisation Approach to Blind Separation and Blind Deconvolution," Neural Comput., vol. 7, pp. 1129-1159, 1995.
381 [4] S.Amari and J.F.Cardoso, "Blind Source Separation-Semiparametric Statistical Approach," IEEE Trans. Acous. Speech, andSig. Proc, vol. 45, pp. 2692-2700, 1997. [5] J.P.Nadal and N.Parga, "Redundancy Reduction and Independent Component Analysis: Conditions on Cumulants and Adaptive Approaches," Neural Computation, vol. 9, pp. 14211456, 1997. [6] B.Pearlmutter and L.Parra, Maximum likelihood blind source separation: A context-sensitive generation of ICA. In: Advances in Neural Information Processing Systems, eds. M.C.Mozer, M.I.Jordan, and T.Petsche. 1997,pp. 613-619. [7] H.H.Yang and S.I.Amari, "Adaptive on-Line Learning Algorithms for Blind SeparatiomMaximum Entropy and Minimum Mutual Information," Neural Computation, vol. 9, pp. 1457-1482, 1997. [8] D.Obradovic and G.Deco, "Information Maximization and Independent Component Analysis: Is There a Difference?," Neural Computation, vol. 10, pp. 2085-2101, 1998. [9] A.Hyvarinen and E.Oja, "A Fast Fixed-Point Algorithm for Independent Component Analysis," Neural Comput., vol. 9, pp. 1483-1492, 1997. [10] J.F.Cardoso and A.Souloumiac, "Blind Beamforming for Non Gaussian Signals," IEEProceeding-F, vol. 140, pp. 362-370, 1993. [11] T.W.Lee, et al., "Independent Component Analysis Using an Extended Infomax Algorithm for Mixed Subgaussian and Supergaussian Sources," Neural Comput., vol. 11, pp. 417-441, 1999. [12] S.Choi, et al., "Flexible Independent Component Analysis," Journal of VLSI Signal Processing, vol. 26, pp. 25-38, 2000. [13] R.H.Boscolo, et al., "Non-parametric ICA.," in Proc. Int. Conf. on ICA and BSS, 2001. [14] Bach F. and J.M., "Kernel Independent Component Analysis," Ournal of Machine Learning Research, vol. 3, pp. 1-48, 2002. [15] N.Vlassis and Y.Motomura, "Efficient Source Adaptivity in Independent in Independent Component Analysis," IEEE Trans. Neural Networks, vol. 12, pp. 559-566, 2001. [16] B.Hong, et al, "Identification of Brain Activity in a Visual Stimulation Task - An Adaptive ICA Approach for fMRI Data," in Proc. HBM, 2004. [17] P.O.Hoyer and A.Hyvarinen, "Independent Component Analysis Applied to Feature Extraction From Color and Stereo Images," Computation in Neural Systems, vol. 11, pp. 191210,2000. [18] M.S.Bartlett, et al., "Face Recognition by Independent Component Analysis," IEEE Trans, on Neural Networks, vol. 13, pp. 1450-1464,2002. [19] B.A.Draper, et al, "Recognizing Faces With PC A and ICA," Computer Vision and Image Understanding, vol. 91, pp. 115-137, 2003. [20] A.Hyvarinen, "New Approximations of Differential Entropy for Independent Component Analysis and Projection Pursuit," Advances in Neural Inf. Proc. Sys., vol. 10, pp. 273-279, 1998. [21] M.Kotani, et al., "Application of independent component analysis to feature extraction of speech," 1999. [22] R.O.Duda, Pattern Classification, John Wiley & Sons, Inc., 2001. [23] R.L.Mason, R.F.Gunst, and J.L.Hess, Statistical Design and Analysis of Experiments, With Applications to Engineering and Science, John Wiley & Sons, Inc., 2003. [24] N.Cristianini and J.Shawe_Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, 2000.
382 [25] M.Bressan, et al, "Using an ICA representation of high dimensional data for object recognition and classification," 2001. [26] J.H.van Hateren and S.A.van der, "Independent Component Filters of Natural Images Compared With Simple Cells in Primary Visual Cortex," Proc. R. Soc. Lond. B Biol. Sci., vol. 265, pp. 359-366, 1998. [27] H.L.Borgne, et al., "Classification of images: ICA filters vs. human perception," vol. 2, pp. 251-254, 2003. [28] F.R.Bach and M.Jordan, "Beyond Independent Components: Trees and Clusters," Journal of Machine Learning Research, vol. 4, pp. 1205-1233, 2003. [29] C.Liu and H.Wechsler, "Independent Component Analysis of Gabor Features for Face Recognition," IEEE Trans, on Neural Networks, vol. 14, pp. 919-928, 2003. [30] C.A.Shah, et al., "ICA Mixture Model based Unsupervised Classification of Hyperspectral Imagery," 2002. [31] K.K.Kwong, et al, "Dynamic Magnetic Resonance Imaging of Human Brain Activity During Primary Sensory Stimulation," Proc. Natl. Acad. Sci., vol. 89, pp. 5675-5679, 1992. [32] V.D.Calhoun, etal, "Different Activation Dynamics in Multiple Neural Systems During Simulated Driving," Hum. Brain Map., vol. 16, pp. 158-167, 2002. [33] D.I.Hoult, et al, "Quadrature Detection in the Laboratory Frame," Magn. Res. Med., vol. 1, pp. 339-353, 1984. [34] P.A.Bandettini, et al, "Processing Strategies for Time-Course Data Sets in Functional MRI of the Human Brain," Magn. Res. Med., vol. 30, pp. 161-173, 1993. [35] K.J.Worsley and K.J.Friston, "Analysis of FMRJ Time-Series Revisited—Again," Neurolmage, vol. 2, pp. 173-181, 1995. [36] M.J.McKeown and T.J.Sejnowski, "Independent Component Analysis of FMRI Data: Examining the Assumptions," Hum. Brain Map., vol. 6, pp. 368-372, 1998. [37] B.Biswal, et al, "Functional Connectivity in the Motor Cortex of Resting Human Brain Using Echo-Planar MRI," Magn. Res. Med., vol. 34, pp. 537-541, 1995. [38] E.Seifritz, et al, "Spatiotemporal Pattern of Neural Processing in the Human Auditory Cortex," Science, vol. 297, pp. 1706-1708, 2002. [39] V.D.Calhoun, et al, "A Method for Making Group Inferences From Functional MRI Data Using Independent Component Analysis," Hum. Brain Map., vol. 14, pp. 140-151, 2001. [40] M.B.Denckla, "Performance on Color Tasks in Kindergarten Children," Cortex, vol. 8, pp. 177-190, 1972. [41] C.F.Beckmann, et al, "Artefact Detection in fMRI Data Using Independent Component Analysis," in Neurolmage, vol. 11, no. 5, p. S614, 2000. [42] Y.Wang, MR Image Statistics and Model-Based MR Image Analysis PhD. Dissertation, UMBC, 1995. [43] G.Kruger and G.H.Glover, "Physiological Noise in Oxygenation-Sensitive Magnetic Resonance Imaging," Magn Reson. Med., vol. 46, pp. 631-637, 2001. [44] M.J.McKeown, et al, "Analysis of FMRI Data by Blind Separation Into Independent Spatial Components," Hum. Brain Map., vol. 6, pp. 160-188, 1998. [45] V.D.Calhoun, et al, "Spatial and Temporal Independent Component Analysis of Functional MRI Data Containing a Pair of Task-Related Waveforms," Hum. Brain Map., vol. 13, pp. 4353,2001.
383 [46] E.Formisano, et al, "Cortex-based Independent Component Analysis," in Neurolmage, vol. 13, no. 6, p. S199, 2001. [47] R.T.Constable, et al, "High Quality Zoomed MR Images," J. Comput. Assist. Tomogr., vol. 13, pp. 179-181, 1989. [48] E.Zarahn, et al, "Empirical Analyses of BOLD FMRI Statistics. I. Spatially Unsmoothed Data Collected Under Null-Hypothesis Conditions," Neurolmage, vol. 5, pp. 179-197, 1997. [49] K.S.Petersen, et al, "On the Independent Components of Functional Neuroimages," in Proc. Int. Conf. on ICA andBSS, Helsinki, Finland, pp. 615-620, 2000. [50] J.V.Stone, et al, "Spatial, Temporal, and Spatiotemporal Independent Component Analysis of fMRI Data," in Proc. Leeds Statistical Research Workshop, Leeds, England, 1999. [51] V.D.Calhoun, T.Adali, and G.D.Pearlson, "Independent Components Analysis Applied to FMRI Data: A Generative Model for Validating Results," to appear Journal of VLSI Signal Proc. Systems, 2004. [52] C.F.Beckmann, et al, "Investigating the Intrinsic Dimensionality of fMRI Data for ICA," in Neurolmage, vol. 13, no. 6, p. S76, 2001. [53] F.Esposito, et al, "Spatial Independent Component Analysis of Functional MRI Time-Series: to What Extent Do Results Depend on the Algorithm Used?," Hum. Brain Map., vol. 16, pp. 149-157,2002. [54] V.D.Calhoun, et al, "Independent Component Analysis Facilitates fMRI of an Naturalistic Behavior: Hypothesized Neural Substrates of Simulated Driving," in Proc. ISMRM, Honolulu, HI, 2002. [55] V.D.Calhoun, et al, "FMRI Activation In A Visual-Perception Task: Network Of Areas Detected Using The General Linear Model And Independent Components Analysis," Neurolmage, vol. 14, pp. 1080-1088, 2001. [56] M.Svensen, et al, "ICA of FMRI Group Study Data," Neurolmage, vol. 16, pp. 551-563, 2002. [57] D.G.Leibovici and C.F.Beckmann, "An Introduction to Multiway Methods for Multi-Subject FMRI," FMRIB Technical Report (TR01DL1), 2001. [58] V.D.Calhoun, et al, "A Method for Making Group Inferences Using Independent Component Analysis of Functional MRI Data: Exploring the Visual System," in Neurolmage, vol. 13, no. 6, p. S88,2001. [59] E.Egolf, et al, "Group ICA of fMRI Toolbox (GIFT)," in Proc. HBM, Budapest, Hungary, 2004. [60] V.J.Schmithorst and S.K.Holland, "Comparison of Three Methods for Generating Group Statistical Inferences From Independent Component Analysis of Functional Magnetic Resonance Imaging Data," J. Magn Reson. Imaging, vol. 19, pp. 365-368, 2004. [61] V.D.Calhoun, et al, "A Method for Testing Conjunctive and Subtractive Hypotheses on Group fMRI Data Using Independent Component Analysis," in Proc. ISMRM, Toronto, Canada, 2003. [62] A.S.Lukic, et al, "An ICA Algorithm For Analyzing Multiple Data Sets," in Int. Conf. on Image Processing (ICIP), Rochester, NY, 2002. [63] V.D.Calhoun, et al, "A Weighted-Least Squares Algorithm for Estimation and Visualization of Relative Latencies in Event-Related Functional MRI," Magn. Res. Med., vol. 44, pp. 947954, 2000.
384 [64] W.Lu and J.C.Rajapakse, "ICA With Reference," in Proc. Int. Conf. on ICA and BSS, San Diego, CA, pp. 120-125, 2001. [65] J.V.Stone, et al., "Spatiotemporal Independent Component Analysis of Event-Related FMRI Data Using Skewed Probability Density Functions," Neuroimage., vol. 15, pp. 407-421, 2002. [66] J.R.Duann, et al., "Single-Trial Variability in Event-Related BOLD Signals," Neuroimage., vol. 15, pp. 823-835,2002. [67] R.A.Choudrey and S.J.Roberts, "Flexible Bayesian Independent Component Analysis for Blind Source Separation," in Proc. Int. Conf. on ICA and BSS, San Diego, CA, 2001. [68] B.Hong and V.D.Calhoun, "Source Density Driven Adaptive Independent Component Analysis Approach for fMRI Signal Analysis," in Proc. MLSP, 2004. [69] V.D.Calhoun, et al., "Independent Component Analysis of FMRI Data in the Complex Domain," Magn Reson. Med., vol. 48, pp. 180-192, 2002. [70] V.D.Calhoun and T.Adali, "Complex ICA for fMRI Analysis: Performance of Several Approaches," in Proc. ICASSP, Hong Kong, China, 2003. [71] V.D.Calhoun, et al., "On Complex Infomax Applied to Complex fMRI Data," in Proc. ICASSP, Orlando, FL, 2002.
CHAPTER 4.1
MULTIMODAL EMOTION RECOGNITION
Nicu Sebe, Ira Cohen, and Thomas S. Huang
Faculty of Science, University of Amsterdam, The Netherlands. E-mail: [email protected]
HP Labs, USA. E-mail: [email protected]
Beckman Institute, University of Illinois at Urbana-Champaign, USA. E-mail: [email protected]
Recent technological advances have enabled human users to interact with computers in ways previously unimaginable. Beyond the confines of the keyboard and mouse, new modalities for human-computer interaction such as voice, gesture, and force-feedback are emerging. Despite important advances, one necessary ingredient for natural interaction is still missing: emotions. Emotions play an important role in human-to-human communication and interaction, allowing people to express themselves beyond the verbal domain. The ability to understand human emotions is desirable for the computer in several applications. This chapter explores new ways of human-computer interaction that enable the computer to be more aware of the user's emotional and attentional expressions. We present the basic research in the field and the recent advances in emotion recognition from facial, voice, and physiological signals, where the different modalities are treated independently. We then describe the challenging problem of multimodal emotion recognition and we advocate the use of probabilistic graphical models when fusing the different modalities. We also discuss the difficult issues of obtaining reliable affective data, obtaining ground truth for emotion recognition, and the use of unlabeled data.
1. Introduction
Maybe no movie of modern times has explored the definition of what it means to be human better than Blade Runner. The Tyrell Corporation's motto, "More human than human", serves as the basis for exploring the human experience through true humans and created humans, or Replicants. Replicants are androids that were built to look like humans and to work or fight their wars. In time, they began to acquire emotions (so much like humans) and it became difficult to tell them apart. With emotions, they began to feel oppressed and many of them became dangerous and committed acts of extreme violence to be free. Fortunately, Dr. Eldon Tyrell, the creator of the Replicants, installed a built-in safety feature in these models: a four-year life span.
It is evident from the above story that it is not sufficient for a machine (computer) to look like a human (e.g., have skin, face and facial features, limbs, etc.). Something else is also essential: the ability to acquire and show emotions. Moreover, the machine should learn to recognize faces and to understand emotions to be able to have a human-like interaction with its human counterpart. Machines may never need all of the emotional skills that people need, but they will inevitably require some of these skills to appear intelligent when interacting with people. It is argued that to truly achieve effective human-computer intelligent interaction (HCII), there is a need for the computer to be able to interact naturally with the user, similar to the way human-human interaction takes place. For example, if a machine talks to you but never listens to you, then it is likely to be annoying, analogous to the situation where a human talks to you but never listens. Reeves and Nass 55 have conducted several experiments of classical human-human interaction, taking out one of the humans and putting in a computer. Their conclusion is that for an intelligent interaction, the basic human-human issues should hold. Humans interact with each other mainly through speech, but also through body gestures, to emphasize a certain part of the speech and display of emotions. As a consequence, the new interface technologies are steadily driving toward accommodating information exchanges via the natural sensory modes of sight, sound, and touch. In face-to-face exchange, humans employ these communication paths simultaneously and in combination, using one to complement and enhance another. The exchanged information is largely encapsulated in this natural, multimodal format. Typically, conversational interaction bears a central burden in human communication, with vision, gaze, expression, and manual gesture often contributing critically, as well as frequently embellishing attributes such as emotion, mood, attitude, and attentiveness. But the roles of multiple modalities and their interplay remain to be quantified and scientifically understood. What is needed is a science of human-computer communication that establishes a framework for multimodal "language" and "dialog", much like the framework we have evolved for spoken exchange. In some applications, it may not be necessary for computers to recognize emotions. For example, the computer inside an automatic teller machine or an airplane probably does not need to recognize emotions. However, in applications where computers take on a social role such as an "instructor," "helper," or even "companion," it may enhance their functionality to be able to recognize users' emotions. In her recent book, Picard 52 suggested several applications where it is beneficial for computers to recognize human emotions. For example, knowing the user's emotions, the computer can become a more effective tutor. Synthetic speech with emotions in the voice would sound more pleasing than a monotonous voice. Computer "agents" could learn the user's preferences through the users' emotions. Another application is to help the human users monitor their stress level. In clinical settings, recognizing a person's inability to express certain facial expressions may help diagnose early psychological disorders.
Fig. 1. Multimodal human-computer interaction. (The diagram shows the computer receiving audio input from a microphone, used for speech recognition and vocal affect, and visual input from a video camera, used for facial expression, head movement, gesture recognition, eye contact, and body posture, while responding through an animated agent and synthetic speech.)
Psychologists and engineers alike have tried to analyze facial expressions, vocal emotions, gestures, and physiological signals in an attempt to understand and categorize emotions. This knowledge can be used to teach computers to recognize human emotions from video images acquired from built-in cameras and from speech waveforms gathered from on-board microphones. A natural two-way interaction between the human and the computer through multiple modalities is depicted in Figure 1. In this diagram, one of the inputs to the computer is vision (video), from which gaze, posture, gestures, and facial and lip movements can be extracted. Computers may learn to recognize gestures, postures, facial expressions, eye contact, etc. Likewise, speech and voice (audio) through the microphone may convey linguistic as well as paralinguistic information. On the output side, the computer may appear in the form of an "agent", a computer-animated face or a personified animated character. This "agent" can speak to the human through synthesized speech and display corresponding facial and mouth movements on the screen. Even though they are not explicitly presented in the figure, other modalities such as tactile or physiological signals can also be used in conjunction with the video and audio signals.

The goal of this chapter is to explore new ways of human-computer interaction by enabling the computer to be more aware of the human user's emotional and attentional expressions. In particular, we concentrate on the problem of integrating audiovisual inputs for the detection of the users' facial and vocal emotional expressions and attentive states. By "emotional expression" we mean any outward expression that arises as a response to some stimulus event. These may include typical expressions such as a "smile" to show that one is happy, or to show one likes what one sees.
We begin by describing the basic research into the problems of what emotions are, their importance in human-human interaction, and how emotions are expressed by humans (Section 2). Some of this basic research lays the foundation for automatic emotion recognition by computers, and enables the computer science and engineering community to treat the problem as a pattern recognition one. Next, we review the advances in the area of emotion expression recognition from facial, voice, and physiological signals, where the different modalities are treated independently (Section 3). We then describe research into emotion expression recognition from face and voice signals, advocating the use of probabilistic graphical models when fusing the different modalities for achieving true multimodal emotion expression recognition. We also discuss the difficult problems of gathering good affective data, obtaining ground truth for emotion recognition, and the use of unlabeled data (Section 4). Throughout the chapter we explore and try to provide answers to the following questions:

• What clues are there on the face and in the voice that reveal a person's emotions, preferences, and attentional states?
• How well can we use these clues to train the computer to recognize human emotion from audio and from video?
• Does the use of joint audiovisual input allow for more accurate or efficient emotion recognition than using a single modality?
• In realistic scenarios, can the two modalities be treated separately?
• How should multimodal data with emotional expressions be collected, and how should it be labeled?
• Can we get away with labeling only small amounts of data and use unlabeled data to help in training the models that recognize emotional expressions?
• What data should be collected: spontaneous or posed data?

2. Human Emotion Research

There is a vast body of literature on emotions. Its multifaceted nature prevents a comprehensive review here; we will review only what is essential in supporting this work. Recent discoveries suggest that emotions are intricately linked to other functions such as attention, perception, memory, decision making, and learning. This suggests that it may be beneficial for computers to recognize the human user's emotions and other related cognitive states and expressions. In this chapter, we concentrate on the expressive nature of emotion, especially as expressed in the voice and on the face.

2.1. Affective human-computer interaction
In many important HCI applications such as computer-aided tutoring and learning, it is highly desirable (even mandatory) that the response of the computer takes into
account the emotional or cognitive state of the human user. Emotions are displayed by visual, vocal, and other physiological means. There is a growing amount of evidence showing that emotional skills are part of what is called "intelligence" 58,25. Computers today can recognize much of what is said, and to some extent, who said it. But they are almost completely in the dark when it comes to how things are said, the affective channel of information. This is true not only in speech, but also in visual communications, despite the fact that facial expressions, posture, and gesture communicate some of the most critical information: how people feel. Affective communication explicitly considers how emotions can be recognized and expressed during human-computer interaction. Addressing the problem of affective communication, Bianchi-Berthouze and Lisetti 2 identified three key points to be considered when developing systems that capture affective information: embodiment (experiencing physical reality), dynamics (mapping experience and emotional state with its label), and adaptive interaction (conveying emotive response, responding to a recognized emotional state).

In most cases today, if you take a human-human interaction and replace one of the humans with a computer, then the affective communication vanishes. Furthermore, this is not because people stop communicating affect; certainly we have all seen a person expressing anger at his machine. The problem arises because the computer has no ability to recognize whether the human is pleased, annoyed, interested, or bored. Note that if a human ignored this information, and continued babbling long after we had yawned, we would not consider that person very intelligent. Recognition of emotion is a key component of intelligence 52. Computers are presently affect-impaired.

Furthermore, if you insert a computer (as a channel of communication) between two or more humans, then the affective bandwidth may be greatly reduced. Email may be the most frequently used means of electronic communication, but typically all of the emotional information is lost when our thoughts are converted to the digital media. Research is therefore needed for new ways to communicate affect through computer-mediated environments. Computer-mediated communication today almost always has less affective bandwidth than "being there, face-to-face". The advent of affective wearable computers, which could help amplify affective information as perceived from a person's physiological state, is but one possibility for changing the nature of communication.

2.2. Theories of emotion
There is little agreement about a definition of emotion. Many theories of emotion have been proposed. Some of these could not be verified until recently, when measurements of some physiological signals became available. In general, emotions are short-term, whereas moods are long-term, and temperaments or personalities are very long-term 29. A particular mood may be sustained for several days, and a temperament for months or years. Finally, emotional disorders can be so disabling that people
affected are no longer able to lead normal lives.

Darwin 14 held an ethological view of emotional expressions, arguing that the expressions from infancy and lower life forms exist in adult humans. Following the Origin of Species he wrote The Expression of the Emotions in Man and Animals. According to him, emotional expressions are closely related to survival. Thus in human interactions, these nonverbal expressions are as important as the verbal interaction.

James 28 viewed emotions not as causes but as effects. Situations arise around us which cause changes in physiological signals. According to James, "the bodily changes follow directly the perception of the exciting fact, and that our feeling of the same changes as they occur is the emotion." Carl Lange proposed a similar theory independently at around the same time. This is often referred to as the "James-Lange" theory of emotion. Cannon 5, contrary to James, believed that emotions are first felt, then exhibited outwardly, causing certain behaviors.

Despite the many theories, it is evident that people display these expressions to various degrees. One frequently studied task is the judgment of emotions: how well can human observers tell the emotional expressions of others, in the voice, on the face, etc.? Related questions are: Do these represent their true emotions? Can they be convincingly portrayed? How well can people conceal their emotions? In such tasks, researchers often use two different methods to describe the emotions.

One approach is to label the emotions in discrete categories, i.e., human judges must choose from a prescribed list of word labels, such as joy, fear, love, surprise, sadness, etc. One problem with this approach is that the stimuli may contain blended emotions. Also, the choice of words may be too restrictive, or culturally dependent.

Another way is to have multiple dimensions or scales to describe emotions. Instead of choosing discrete labels, observers can indicate their impression of each stimulus on several continuous scales, for example, pleasant-unpleasant, attention-rejection, simple-complicated, etc. Two common scales are valence and arousal. Valence describes the pleasantness of the stimuli, with positive (or pleasant) on one end, and negative (or unpleasant) on the other. For example, happiness has a positive valence, while disgust has a negative valence. The other dimension is arousal or activation. For example, sadness has low arousal, whereas surprise has a high arousal level. The different emotional labels could be plotted at various positions on a two-dimensional plane spanned by these two axes to construct a 2D emotion model 31. Schlosberg 62 suggested a three-dimensional model in which he had attention-rejection in addition to the above two.

Another interesting topic is how the researchers managed to obtain data for observation. Some people used posers, including professional actors and non-actors. Others attempted to induce emotional reactions by some clever means. For example, Ekman showed a stress-inducing film of nasal surgery in order to get the disgusted look on the viewers' faces. Some experimenters even dumped water on the subjects
or fired blank shots to induce surprise, while others used clumsy technicians who made rude remarks to arouse fear and anger 26. Obviously, some of these are not practical ways of acquiring data. After studying acted and natural expressions, Ekman concluded that expressions can be convincingly portrayed 17.

A legitimate question that should be considered when doing multimodal emotion recognition is how much information the face, as compared to voice, speech, and body movement, provides about emotion. Most experimenters found that the face is more accurately judged, produces higher agreement, or correlates better with judgments based on full audiovisual input than the voice or speech input 38,17. Ekman 17 found that the relative weight given to facial expression, speech, and body cues depends both on the judgment task (e.g., rating the stimulus subject's dominance, sociability, or relaxation) and the conditions in which the behavior occurred (e.g., subjects frankly describing positive reactions to a pleasant film or trying to conceal negative feelings aroused by a stressful film).

The whole question of how much information is conveyed by "separate" channels may inevitably be misleading. There is no evidence that individuals in actual social interaction selectively attend to another person's face, body, voice, or speech, or that the information conveyed by these channels is simply additive. The central mechanisms directing behavior cut across the channels, so that, for example, certain aspects of face, body, voice, and speech are more spontaneous and others are more closely monitored and controlled. It might well be that observers selectively attend not to a particular channel but to a particular type of information (e.g., cues to emotion, deception, or cognitive activity), which may be available within several channels. No investigator has yet explored this possibility or the possibility that different individuals may typically attend to different types of information.
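As a small, purely illustrative aside on the dimensional (valence-arousal) description of emotion discussed above, the sketch below places a few discrete labels at assumed coordinates on the valence-arousal plane and maps a continuous observer rating to the nearest label. The coordinates and the nearest-label rule are made up for illustration; they are not calibrated values from the emotion literature.

```python
# Illustrative valence-arousal ("dimensional") description of emotion.
# Coordinates are rough, made-up positions on a [-1, 1] x [-1, 1] plane.
import math

EMOTION_COORDS = {
    "happiness": (0.8, 0.5),    # pleasant, moderately aroused
    "surprise":  (0.2, 0.9),    # near-neutral valence, high arousal
    "fear":      (-0.7, 0.8),   # unpleasant, high arousal
    "anger":     (-0.8, 0.7),
    "disgust":   (-0.7, 0.2),
    "sadness":   (-0.7, -0.5),  # unpleasant, low arousal
    "neutral":   (0.0, 0.0),
}

def nearest_label(valence, arousal):
    """Map a continuous (valence, arousal) rating to the closest discrete label."""
    return min(EMOTION_COORDS,
               key=lambda e: math.dist((valence, arousal), EMOTION_COORDS[e]))

# An observer rates a stimulus as mildly pleasant and fairly arousing.
print(nearest_label(0.5, 0.6))   # -> "happiness" with these toy coordinates
```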
3. Emotional Expression Recognition for Human-Computer Interaction

The mounting evidence of the importance of emotions in human-human interaction provided the basis for researchers in the engineering and computer science communities to develop automatic ways for computers to recognize emotional expression, as a goal towards achieving human-computer intelligent interaction. The labeling of emotions into different states led most research to use pattern recognition approaches for recognizing emotions, using different modalities as inputs to the emotion recognition models. Next we review some of these works.

3.1. Facial expression recognition studies
Since the early 1970s, Paul Ekman and his colleagues have performed extensive studies of human facial expressions 18 . They found evidence to support universality in facial expressions. These "universal facial expressions" are those representing happiness, sadness, anger, fear, surprise, and disgust. They studied facial expressions
in different cultures, including preliterate cultures, and found much commonality in the expression and recognition of emotions on the face. However, they observed differences in expressions as well, and proposed that facial expressions are governed by "display rules" in different social contexts. For example, Japanese subjects and American subjects showed similar facial expressions while viewing the same stimulus film. However, in the presence of authorities, the Japanese viewers were more reluctant to show their real expressions. Matsumoto 36 reported the discovery of a seventh universal facial expression: contempt. Babies seem to exhibit a wide range of facial expressions without being taught, thus suggesting that these expressions are innate 27.

Ekman and Friesen 19 developed the Facial Action Coding System (FACS) to code facial expressions, where movements on the face are described by a set of action units (AUs). Each AU has some related muscular basis. Each facial expression may be described by a combination of AUs. This system of coding facial expressions is done manually by following a set of prescribed rules. The inputs are still images of facial expressions, often at the peak of the expression. This process is very time-consuming.

Ekman's work inspired many researchers to analyze facial expressions by means of image and video processing. By tracking facial features and measuring the amount of facial movement, they attempt to categorize different facial expressions. Recent work on facial expression analysis and recognition 35,65,32,3,56,20,44,33,42,34,43,12,6,10 has used these "basic expressions" or a subset of them. The recent surveys in the area 21,47,48 provide an in-depth review of much of the research done in automatic facial expression recognition in recent years.

Work in computer-assisted quantification of facial expressions did not start until the 1990s. Mase 35 used optical flow (OF) to recognize facial expressions. He was one of the first to use image processing techniques to recognize facial expressions. Lanitis et al. 32 used a flexible shape and appearance model for image coding, person identification, pose recovery, gender recognition, and facial expression recognition. Black and Yacoob 3 used local parameterized models of image motion to recover non-rigid motion. Once recovered, these parameters are fed to a rule-based classifier to recognize the six basic facial expressions. Yacoob and Davis 68 computed optical flow and used similar rules to classify the six facial expressions. Rosenblum et al. 56 also computed optical flow of regions on the face, then applied a radial basis function network to classify expressions. Essa and Pentland 20 also used an optical flow region-based method to recognize expressions. Otsuka and Ohya 44 first computed optical flow, then computed its 2D Fourier transform coefficients, which were then used as feature vectors for a hidden Markov model (HMM) to classify expressions. The trained system was able to recognize one of the six expressions near real-time (about 10 Hz). Furthermore, they used the tracked motions to control the facial expression of an animated Kabuki system 45. A similar approach, using different features, was used by Lien 33. Nefian and Hayes 42 proposed an embedded
HMM approach for face recognition that uses an efficient set of observation vectors based on the DCT coefficients. Martinez 34 introduced an indexing approach based on the identification of frontal face images under different illumination conditions, facial expressions, and occlusions. A Bayesian approach was used to find the best match between the local observations and the learned local features model, and an HMM was employed to achieve good recognition even when the new conditions did not correspond to the conditions previously encountered during the learning phase. Oliver et al. 43 used lower face tracking to extract mouth shape features and used them as inputs to an HMM-based facial expression recognition system (recognizing neutral, happy, sad, and an open mouth). Chen 6 used a suite of static classifiers to recognize facial expressions, reporting on both person-dependent and person-independent results. Cohen et al. 12 describe classification schemes for facial expression recognition in two types of settings: dynamic and static classification. The static classifiers classify a frame in a video to one of the facial expression categories based on the tracking results of that frame. In this setting, the authors learn the structure of Bayesian network classifiers using as input 12 motion units given by a face tracking system. The authors also use schemes that utilize data that are unlabeled and cheap to obtain, in conjunction with (expensively) labeled data 10,11. For the dynamic setting, they used a multi-level HMM classifier that combines the temporal information and allows not only the classification of a video segment to the corresponding facial expression, as in the previous works on HMM-based classifiers, but also the automatic segmentation of an arbitrarily long sequence into the different expression segments without resorting to heuristic methods of segmentation.

These methods are similar in the general sense that they first extract some features from the images, then these features are fed into a classification system, and the outcome is one of the preselected emotion categories. They differ mainly in the features extracted from the video images or in the processing of video images to classify emotions. The video processing falls into two broad categories. The first is "feature-based," where one tries to detect and track specific features such as the corners of the mouth, eyebrows, etc.; the other approach is "region-based," in which facial motions are measured in certain regions on the face such as the eye/eyebrow and mouth regions. People have used different classification algorithms to categorize these emotions. In Table 1, we compare several facial expression recognition algorithms. In general, these algorithms perform well compared to trained human recognition of about 87% as reported by Bassili 1.

In contrast to the classification methods described above, Ueki et al. 65 extracted AUs and used neural networks (NN) to analyze the emotions, mapping seventeen AUs to two dimensions using an identity mapping network, which showed a resemblance to the 2D psychological emotion models. Later on, Morishima 39 proposed a 3D emotion model in order to deal with transitions between emotions, and claimed correlation to the 3D psychological emotion model 62.
Table 1. Comparisons of facial expression recognition algorithms.

Author              Processing              Classification      Number of     Number of    Performance
                                                                Categories    Subjects
Mase                optical flow            kNN                 4             1            86%
Black & Yacoob      parametric model        rule-based          6             40           92%
Yacoob & Davis      optical flow            rule-based          6             32           95%
Rosenblum et al.    optical flow            neural networks     2             32           88%
Essa & Pentland     optical flow            distance-based      5             8            98%
Otsuka & Ohya       2D FT of optical flow   HMM                 6             4            93%
Lanitis et al.      appearance model        distance-based      7             -            74%
Chen                appearance model        Winnow              6             5            86%
Cohen et al.        appearance model        Bayesian networks   7             5+53         83%
Another interesting point is the problem of commonly confused categories among the six basic expressions. As reported by Ekman, anger and disgust are commonly confused in judgment studies. Also, fear and surprise are commonly confused. The reason these confusions occur is that they share many similar facial actions 19. Surprise is sometimes mistaken for interest, but not the other way around. In the computer recognition studies, some of these confusions are observed 3,68,12.
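The systems compared in Table 1 share the structure described above: facial motion features are extracted from each frame and then fed to a classifier. As a minimal sketch of the HMM-based dynamic variant of this idea (in the spirit of, but not a reproduction of, the HMM classifiers cited above 44,12), one model can be trained per emotion category and a new feature sequence assigned to the category whose model yields the highest likelihood. The hmmlearn dependency, the choice of 12 features per frame, and the synthetic training sequences are assumptions made purely for illustration; a real system would use tracked facial motion measurements.

```python
# Minimal sketch: one Gaussian HMM per emotion, classification by likelihood.
import numpy as np
from hmmlearn.hmm import GaussianHMM

EMOTIONS = ["happiness", "sadness", "anger", "surprise"]
N_FEATURES = 12   # e.g., motion units from a face tracker (assumed)

def fake_sequences(n_seq, seq_len, offset):
    """Stand-in for real tracked feature sequences of one emotion class."""
    rng = np.random.default_rng(0)
    return [rng.normal(offset, 1.0, size=(seq_len, N_FEATURES))
            for _ in range(n_seq)]

# Train one HMM per emotion on that emotion's training sequences.
models = {}
for idx, emo in enumerate(EMOTIONS):
    seqs = fake_sequences(n_seq=20, seq_len=30, offset=idx)
    X, lengths = np.concatenate(seqs), [len(s) for s in seqs]
    m = GaussianHMM(n_components=3, covariance_type="diag", n_iter=50)
    m.fit(X, lengths)
    models[emo] = m

def classify(sequence):
    """Pick the emotion whose HMM assigns the highest log-likelihood."""
    return max(models, key=lambda emo: models[emo].score(sequence))

test_seq = fake_sequences(n_seq=1, seq_len=30, offset=2)[0]  # resembles "anger"
print(classify(test_seq))
```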
3.2. Vocal emotion recognition studies
The vocal aspect of a communicative message carries various kinds of information. If we disregard the manner in which the message was spoken and consider the verbal part (e.g., words) only, we might miss important aspects of the pertinent utterance and we might even completely misunderstand the meaning of the message. Nevertheless, in contrast to spoken language processing, which has recently witnessed significant advances, the processing of emotional speech has not been widely explored.

Starting in the 1930s, quantitative studies of vocal emotions have had a longer history than quantitative studies of facial expressions. Traditional as well as most recent studies of emotional contents in speech 40,9,13,16,30,59,61 have used "prosodic" information, which includes the pitch, duration, and intensity of the utterance 57. Williams and Stevens 66 studied the spectrograms of real emotional speech and compared them with acted speech. They found similarities which suggest the use of acted data. Murray and Arnott 40 reviewed findings on human vocal emotions. They also constructed a synthesis-by-rule system to incorporate emotions in
synthetic speech 41. To date, most works have concentrated on the analysis of human vocal emotions. Some studied human abilities to recognize vocal emotions. These findings are very useful for the present work.

There has been less work on recognizing human vocal emotions by computers than there has been on recognizing facial expressions by machine. Chiu et al. 9 extracted five features from speech and used a multilayered neural network for the classification. For 20 test sentences, they were able to correctly label all three categories. Dellaert et al. 16 used 17 features and compared different classification algorithms and feature selection methods. They achieved 79.5% accuracy with 4 categories and 5 speakers speaking 50 short sentences per category. Petrushin 51 compared human and machine recognition of emotions in speech and achieved similar rates for both (around 65%). In that work, 30 subjects spoke 4 sentences, with each sentence repeated 5 times, once for each emotion category. Scherer 61 performed a large-scale study using 14 professional actors. In this study, he extracted as many as 29 features from the speech. According to Scherer, human ability to recognize emotions from purely vocal stimuli is about 60%. He pointed out that "sadness and anger are best recognized, followed by fear and joy. Disgust is the worst."

Chen 6 proposed a rule-based method for classification of input audio data into one of the following emotion categories: happiness, sadness, fear, anger, surprise, and dislike. The input data contained 2 speakers, one speaking Spanish and the other Sinhala. The choice of these languages was such that the subjective judgments were not influenced by the linguistic content, as the observers did not comprehend either language. Each speaker was asked to speak 6 different sentences for each emotion, and the contents of the sentences were related in most of the cases to one category, while some of them could be applied to two different categories. From the audio signals, pitch, intensity, and pitch contours were estimated as acoustic features, which were then classified using some predefined rules.

Recent studies seem to use the "Ekman six" basic emotions, although others in the past have used many more categories. The reasons for using these basic six categories are often not justified. It is not clear whether there exist "universal" emotional characteristics in the voice for these six categories. Table 2 shows a summary of human vocal affects as reported by Murray and Arnott 40. This table describes mostly qualitative characteristics associated with these emotions. These are listed in relation to the neutral voice.

3.3. Emotion recognition from physiological signals
Emotion consists of more than outward physical expression; it also consists of internal feelings and thoughts, as well as other internal processes of which the person having the emotion may not be aware. Still, these physiological processes can be naturally recognized by people. A stranger shaking your hand can feel its clamminess (related to skin conductivity); a friend leaning next to you may sense your heart pounding, etc.
Table 2. Summary of human vocal affects described relative to neutral speech.

                  Anger              Happiness          Sadness             Fear               Disgust
Speech Rate       slightly faster    faster or slower   slightly slower     much faster        very much slower
Pitch Average     very much higher   much higher        slightly lower      very much higher   very much lower
Pitch Range       much wider         much wider         slightly narrower   much wider         slightly wider
Intensity         higher             higher             lower               normal             lower
Voice Quality     breathy            blaring            resonant            irregular          grumbled
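Purely to make the rule-based use of such prosodic cues concrete (as in the approach of Chen 6 sketched in Section 3.2), the toy classifier below compares an utterance's speech rate, mean pitch, pitch range, and intensity against the speaker's neutral profile and applies a handful of hand-written rules loosely inspired by Table 2. The feature units, thresholds, and rules are invented for illustration; they are neither Chen's actual rules nor a quantitative reading of Murray and Arnott 40, and a real system would estimate pitch and intensity from the speech waveform.

```python
# Toy rule-based vocal emotion classifier driven by deviations from a speaker's
# neutral prosody. All thresholds and rules are illustrative assumptions.

def classify_vocal_emotion(speech_rate, pitch_mean, pitch_range, intensity, neutral):
    """Scalar prosodic features for one utterance; `neutral` holds the same
    four features measured on the speaker's neutral speech."""
    rate_r = speech_rate / neutral["speech_rate"]    # > 1 means faster
    pitch_r = pitch_mean / neutral["pitch_mean"]     # > 1 means higher
    range_r = pitch_range / neutral["pitch_range"]   # > 1 means wider
    loud_r = intensity / neutral["intensity"]        # > 1 means louder

    if rate_r < 0.7 and pitch_r < 0.8:
        return "disgust" if rate_r < 0.5 else "sadness"
    if pitch_r > 1.3 and range_r > 1.3:
        if rate_r > 1.4 and abs(loud_r - 1.0) < 0.1:
            return "fear"
        return "anger" if rate_r <= 1.15 else "happiness"
    return "neutral"

neutral_profile = {"speech_rate": 4.0,   # syllables/second (assumed units)
                   "pitch_mean": 120.0,  # Hz
                   "pitch_range": 40.0,  # Hz
                   "intensity": 60.0}    # dB
# Slightly faster, much higher, much wider, louder than neutral -> anger-like.
print(classify_vocal_emotion(4.3, 170.0, 65.0, 66.0, neutral_profile))
```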
Physiological pattern recognition of emotion has important applications in medicine, entertainment, and human-computer interaction 53. Physiological pattern recognition can potentially help in assessing and quantifying stress, anger, and other emotions that influence health. Affective states of depression, anxiety, and chronic anger have been shown to impede the work of the immune system, making people more vulnerable to infections, and slowing healing from surgery or disease. Changes in physiological signals can also be examined for signs of stress arising while users interact with technology, helping detect where the product causes unnecessary irritation or frustration. This information may help developers to redesign and improve their technology.

One of the big questions in emotion theory is whether distinct physiological patterns accompany each emotion 4. The physiological muscle movements comprising what looks to an outsider to be a facial expression may not always correspond to a real underlying emotional state. This relation between bodily feelings and externally observable expression is still an open research area, with a history of controversy. Historically, James was the major proponent of emotion as an experience of bodily changes, such as perspiring hands or a pounding heart 28. This view was challenged by Cannon 5 and by Schachter 60, who argued that the experience of physiological changes was not sufficient to discriminate emotions. According to Schachter 60, physiological responses such as sweaty hands and a rapid heart beat inform our brain that we are aroused, and then the brain must analyze the situation we are in before it can label the state with an emotion such as fear or love. Since these classic works, there has been a debate about whether or not emotions are accompanied by specific physiological changes other than simply arousal level.

Winton et al. 67 provided some of the first findings showing significant differences in autonomic nervous system signals according to a small number of emotional categories or dimensions, but there was no exploration of automated classification. Fridlund and Izard 22 appear to have been the first to apply pattern recognition (linear discriminants) to classification of emotion from physiological features, attaining rates of 38-51 percent accuracy (via cross-validation) on subject-dependent classification of four different facial expressions (happy, sad, anger, fear) given four facial electromyogram signals. Picard et al. 53 classified physiological patterns for a set of eight emotions (including neutral) by applying pattern recognition techniques
and by focusing on felt emotions of a single subject over sessions spanning many weeks.

4. Multimodal Approach to Emotion Recognition

The studies in facial expression recognition and vocal affect recognition have been done largely independently of each other. The aforementioned works in facial expression recognition used still photographs or video sequences where the subject exhibits only a facial expression without speaking any words. Similarly, the works on vocal emotion detection used only the audio information. There are situations where people speak and exhibit facial expressions at the same time. For example, "he said hello with a smile." Pure facial expression recognizers may fail because the mouth movements may not fit the description of a pure "smile." For computers to be able to recognize emotional expression in practical scenarios, these cases must be handled.

4.1. Related research
Combining audio and visual cues has been studied in recent years for speech recognition 54. It has been shown that in situations when background noise makes the speech waveforms very noisy, cues from the lip movements improve speech recognition accuracy a great deal. In speech recognition, the lip movements and speech sounds are tightly coupled. For emotional expression recognition, the coupling is not so tight. Very little has been done to utilize both modalities for recognizing emotions.

Pelachaud et al. 49 constructed a system that generated animated facial expressions for synthetic speech. However, this work only emphasized the synthesis aspect and not the recognition of emotions.

De Silva and Ng 15 proposed a rule-based method for singular classification of audiovisual input data into one of the six emotion categories: happiness, sadness, fear, anger, surprise, and dislike. Each of their subjects was asked to portray 12 emotion outbursts per category by displaying the related prototypical facial expression while speaking a single English word of his choice. The audio and visual material was processed separately. They used optical flow for detecting the displacement and velocity of some key facial features (e.g., corners of the mouth, inner corners of the eyebrows). From the audio signal, pitch and pitch contours were estimated by using the method proposed by Medan et al. 37. A nearest neighbor method was used to classify the extracted facial features, and an HMM was used to classify the estimated acoustic features into one of the emotion categories. Per subject, the results of the classification were plotted in two graphs, and based upon these graphs the rules for emotion classification of the audiovisual input material were defined.
400
disgust, fear, anger, and surprise. They collected data from five subjects who displayed the 6 basic emotions 6 times by producing the appropriate facial expression right before or after speaking a sentence with the appropriate vocal emotion. Each of these single-emotion sequences started and ended with a neutral expression. Considering the fact that in the recorded data a pure facial expression occurred right before or after the sentence spoken with the appropriate vocal emotion, the authors applied a single-modal classification method in a sequential manner.

To conclude, the most surprising issue regarding the multimodal emotion recognition problem is that although the recent advances in video and audio processing could make the multimodal analysis of human affective state tractable, there have been only a few research efforts which tried to implement a multimodal emotion analyzer. Further, there is no record of a research effort that aims at integrating all nonverbal modalities into a single system for affect-sensitive analysis of human behavior.

4.2. Fusing multimodal information using probabilistic graphical models
A typical issue of multimodal data processing so far is that the multisensory data are processed separately and only combined at the end. Yet this is almost certainly incorrect; people display audio and visual communicative signals in a complementary and redundant manner. Chen et al. 8 have shown this experimentally. In order to accomplish a human-like multimodal analysis of multiple input signals acquired by different sensors, the signals cannot be considered mutually independent and cannot be combined in a context-free manner at the end of the intended analysis; on the contrary, the input data should be processed in a joint feature space and according to a context-dependent model. In practice, however, besides the problems of context sensing and developing context-dependent models for combining multisensory information, one should cope with the size of the required joint feature space, which can suffer from large dimensionality, different feature formats, and timing issues. A potential way to achieve the target tightly coupled multisensory data fusion is to develop context-dependent versions of a suitable method such as the Bayesian inference method proposed by Pan et al. 46.

If we consider the state of the art in audio and visual signal processing, noisy and partial input data should also be expected. A multimodal system should be able to deal with these imperfect data and generate its conclusion so that the certainty associated with it varies in accordance with the input data. A way of achieving this is to consider the time-instance versus time-scale dimension of human nonverbal communicative signals, as suggested by Pantic and Rothkrantz 48. By considering previously observed data (time scale) with respect to the current data carried by functioning observation channels (time instance), a statistical prediction and its probability might be derived about both the information that has been lost due to malfunctioning or inaccuracy of a particular sensor and the currently displayed action/reaction. Probabilistic graphical models, such as hidden Markov models (including their hierarchical variants), Bayesian networks, and dynamic Bayesian networks are very well suited for fusing such different sources of information. These models can handle noisy features, temporal information, and missing values of features, all by probabilistic inference. Hierarchical HMM-based systems 12 have been shown to work well for facial expression recognition. Dynamic Bayesian networks and HMM variants 24 have been shown to fuse various sources of information in recognizing user intent, office activity, and events in video using both audio and visual information 23.
Fig. 2. Bayesian network topology for bimodal emotion expression recognition.
The success of these research efforts has shown that fusing audio and video for the detection of discrete events using probabilistic graphical models is possible. Therefore, we propose the Bayesian network topology for recognizing emotions from audio and facial expressions presented in Figure 2. While the network shown is static, it can be extended to a dynamic Bayesian network in a straightforward manner. The network topology combines the two modalities in a probabilistic manner. The top node is the class variable (recognized emotional expression). It is affected by recognized facial expressions, recognized vocal expressions, recognized keywords that have an affective meaning, and by the context in which the system operates (if that is available). Vocal emotions are recognized from audio features extracted from the person's audio track. Facial expressions are recognized from facial features tracked using video, but the recognition is also affected by a variable that indicates whether the person is speaking or not. Recognizing whether a person is speaking uses both visual cues (mouth motion) and audio features (using techniques similar to those of Garg et al. 24). The parameters of the proposed network can be learned from data, or manually set for some variables. Inferring the human emotional expression can be performed even when some pieces of information are missing, e.g., when the audio is too noisy, or the face tracking loses the face.
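As a small self-contained illustration of why this kind of probabilistic fusion degrades gracefully when a modality drops out, the sketch below builds a tiny discrete network loosely in the spirit of Figure 2: a hidden emotion variable generates a facial-expression observation and a vocal-expression observation, and the face channel also depends on whether the person is speaking. The structure, the three-state emotion variable, and all probability tables are invented for illustration and are not the parameters of the network proposed here; inference is done by brute-force enumeration, simply marginalizing out whatever is unobserved.

```python
# Tiny discrete "emotion fusion" network; inference by enumeration.
# Assumed structure: Emotion -> FaceObs, Emotion -> VoiceObs, Speaking -> FaceObs
# (speech deforms the mouth region, so the face channel is noisier while talking).
from itertools import product

EMOTIONS = ["happy", "angry", "neutral"]
P_EMOTION = {"happy": 0.3, "angry": 0.2, "neutral": 0.5}
P_SPEAKING = {True: 0.5, False: 0.5}

def p_face(face, emotion, speaking):
    """P(FaceObs | Emotion, Speaking): flatter (less reliable) when speaking."""
    correct = 0.6 if speaking else 0.85
    wrong = (1.0 - correct) / (len(EMOTIONS) - 1)
    return correct if face == emotion else wrong

def p_voice(voice, emotion):
    """P(VoiceObs | Emotion)."""
    return 0.7 if voice == emotion else 0.15

def posterior(face=None, voice=None, speaking=None):
    """P(Emotion | evidence); None means that observation is missing."""
    scores = {}
    for emo in EMOTIONS:
        total = 0.0
        # Enumerate (and thereby marginalize) over every unobserved variable.
        for f, v, s in product(EMOTIONS, EMOTIONS, [True, False]):
            if face is not None and f != face:
                continue
            if voice is not None and v != voice:
                continue
            if speaking is not None and s != speaking:
                continue
            total += (P_EMOTION[emo] * P_SPEAKING[s]
                      * p_face(f, emo, s) * p_voice(v, emo))
        scores[emo] = total
    z = sum(scores.values())
    return {e: p / z for e, p in scores.items()}

# Both channels available and agreeing:
print(posterior(face="angry", voice="angry", speaking=False))
# Audio too noisy to use: the voice node is simply left unobserved.
print(posterior(face="angry", voice=None, speaking=True))
```

In this toy example, with both channels observed the posterior concentrates sharply on one emotion; with the voice missing the same inference still runs, only with a flatter (less certain) posterior, which is the kind of graceful degradation described above.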
Another issue which makes the problem of emotional expression recognition even more difficult to solve in the general case is the dependency of a person's behavior on his/her personality, cultural and social vicinity, current mood, and the context in which the observed behavioral cues were encountered. One source of help for these problems is machine learning: rather than using a priori rules to interpret human behavior, we can potentially learn application-, user-, and context-dependent rules by watching the user's behavior in the sensed context 50. This leads to another advantage of probabilistic graphical models: well-known algorithms exist to adapt the models, and it is possible to use prior knowledge when learning new models. For example, a prior model of emotional expression recognition trained on a certain user can be used as a starting point for learning a model for another user, or for the same user in a different context. Though context sensing and the time needed to learn appropriate rules are significant problems in their own right, many benefits could come from such an adaptive affect-sensitive HCI tool.

While fusing multimodal information for emotion recognition is an important issue, there are some other aspects that are of equal importance. Difficult problems are those of obtaining the ground truth of the data and getting data that genuinely correspond to a particular emotional state. Even though there are cases when the data can be easily labeled (e.g., a singular strong emotion is captured, such as an episode of rage), in most cases the ground truth (which emotion was present) is difficult to establish. We discuss these issues in detail in the following sections.
4.3. Collecting multimodal data for emotion recognition
In general, the goal of emotional expression recognition is to detect the emotional state of a person in a natural situation. However, as any photographer can attest, getting a real smile can be challenging. Asking someone to smile often does not create the same picture as an authentic smile. The fundamental reason, of course, is that the subject often does not feel happy, so his smile is artificial and in many subtle ways quite different from a genuine smile. Picard et al. 53 outlined five factors that influence affective data collection:

• Spontaneous versus posed: Is the emotion elicited by a situation or stimulus that is outside the subject's control, or is the subject asked to elicit the emotion?
• Lab setting versus real-world: Does the data recording take place in a lab, or is the emotion recorded in the usual environment of the subject?
• Expression versus feeling: Is the emphasis on external expression or on internal feeling?
• Open recording versus hidden recording: Is the subject aware that he is being recorded?
• Emotion-purpose versus other-purpose: Does the subject know that he is part of an experiment, and that the experiment is about emotion?
Note that these factors are not necessarily independent. The most natural setup would imply that the subject feels the emotion internally (feeling), that the emotion occurs spontaneously (spontaneous), and that the subject is in his usual environment (real-world). Also, the subject should not know that he is being recorded (hidden recording) or that he is part of an experiment (other-purpose). Such data are usually impossible to obtain because of privacy and ethics concerns. As a consequence, several researchers 53,64 have tried to use a setup that resembles the natural setup as closely as possible.

Picard et al. 53 collected data using a posed, close-to-real-world (the subject's comfortable usual workplace), feeling, open-recording, and emotion-purpose methodology. The key factor that made their data unique is that the subject tried to elicit an internal feeling of each emotion. Sebe et al. 64 were more interested in collecting spontaneous emotion data. They created a video kiosk (lab setting) with a hidden camera (hidden recording) which displayed segments from recent movie trailers. This setup had the main advantage that it naturally attracted people to watch and could potentially elicit emotions through different genres of video footage, e.g., horror films for shock, comedy for joy, etc.

The issue of whether to use posed or spontaneous expressions in selecting facial stimuli has been hotly debated 17. Experimentalists and most emotion theorists argue that spontaneous expressions are the only "true" expressions of facial emotion and therefore such stimuli are the only ones of merit. When recording authentic (spontaneous) emotions, several aspects should be considered 64. Not all people express emotion equally well; many individuals have idiosyncratic methods of expressing emotion as a result of personal, familial, or culturally learned display rules. Situations in which authentic emotions are usually recorded (e.g., a lab setting) are often unusual and artificial. If the subject is aware of being photographed or filmed (open recording), his emotional response may not be spontaneous anymore. Even if the subject is unaware of being filmed (hidden recording), the laboratory situation may not encourage natural or usual emotion responses. In interacting with scientists or other authorities (emotion-purpose), subjects will attempt to act in appropriate ways so that emotion expression may be masked or controlled. Additionally, there are only a few universal emotions and only some of these can be ethically stimulated in the laboratory.

On the other hand, posed expressions may be regarded as an alternative, provided that certain safeguards are followed. Increased knowledge about the face, based in large part on observation of spontaneous, naturally occurring facial expressions, has made possible a number of methods of measuring the face. The same situation stands for voice analysis. These measurement techniques can be used to ascertain whether or not emotional behavior has occurred and what emotion is shown in a given instance. Such facial scoring provides a kind of stimulus criterion validity that is important in this area. Additionally, posers can be instructed not to act or pose a specific emotion, but rather to move certain muscles so as to effect the desired emotional expression. In this way, experimental control may be exerted
on the stimuli, and the relationship between the elements of the expression and the responses of observers may be analyzed and used as a guide in item selection.

It should be noted that the distinction between posed and spontaneous behavior is not directly parallel to the distinction between artificial and natural occurrences. Though posing is by definition artificial, spontaneous behavior may or may not be natural 17. Spontaneous behavior is natural when some part of life itself leads to the behavior studied. Spontaneous behavior elicited in the laboratory may be representative of some naturally occurring spontaneous behavior, or conceivably it could be artificial if the eliciting circumstance is unique and not relevant to any known real-life event.

From the above discussion, it is clear that authentic emotion analysis should be performed whenever possible. Posed expressions may be used as an alternative only in restricted cases, and they can mostly be used for benchmarking against authentic expressions.
4.4. Leveraging unlabeled data for emotion recognition
As pointed out in the previous section, collecting emotional expression data is a difficult task. Labeling those data adds an additional challenge, as it is time-consuming, error-prone, and expensive. In addition, an emotion expression recognition system that is deployed in a realistic setting would easily obtain an abundance of emotion expressions, but would not be able to obtain manual labeling of those data: if a computer constantly asks a user for his/her emotion, we can be quite sure that eventually the response would be one of anger or annoyance. Therefore, it would be very beneficial to construct methods that utilize both the scarcely available labeled data and the abundance of unlabeled data, where the labels are the emotional state (or expression) of a user. Again, probabilistic graphical models are ideal candidates for such data, as efficient and convergent algorithms exist for handling missing data in general and unlabeled data in particular.

Cohen et al. 11 have shown that unlabeled data can be used for recognizing facial expressions using Bayesian networks trained with a combination of labeled and unlabeled data. However, they have also shown that care must be taken when attempting such schemes. While in the purely supervised case (with only labeled data) adding more labeled data always improves the performance of the classifier, adding more unlabeled data can be detrimental to performance. As shown by Cohen et al. 11, such detrimental effects occur when the assumed classifier's model does not match the distribution generating the data. They propose an algorithm for stochastically searching the space of Bayesian networks, converging on a classifier which does utilize the unlabeled data positively. To conclude, further research is required to achieve maximum utilization of unlabeled data for the problem of emotion recognition, but it is clear that such methods would provide great benefit.
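To give a flavor of how unlabeled examples can enter training, the sketch below runs a few EM iterations of a Gaussian naive Bayes model: labeled examples keep their labels fixed, while unlabeled examples contribute through their current posterior class probabilities. This is only the generic semi-supervised EM idea, not the structure-search algorithm of Cohen et al. 11, and the synthetic two-dimensional data stand in for real labeled/unlabeled expression features. As noted above, whether the unlabeled data actually help depends on how well the assumed model matches the distribution generating the data.

```python
# Semi-supervised EM for a Gaussian naive Bayes classifier (illustrative sketch).
import numpy as np

rng = np.random.default_rng(1)

def make_data(n, mean):
    return rng.normal(mean, 1.0, size=(n, 2))

# Two "expression" classes; only a handful of examples are labeled.
X_lab = np.vstack([make_data(5, 0.0), make_data(5, 3.0)])
y_lab = np.array([0] * 5 + [1] * 5)
X_unl = np.vstack([make_data(200, 0.0), make_data(200, 3.0)])  # labels unknown

def fit(X_lab, y_lab, X_unl, n_classes=2, n_iter=10):
    # Responsibilities: labeled points are hard-assigned, unlabeled start uniform.
    r_lab = np.eye(n_classes)[y_lab]
    r_unl = np.full((len(X_unl), n_classes), 1.0 / n_classes)
    X = np.vstack([X_lab, X_unl])
    for _ in range(n_iter):
        R = np.vstack([r_lab, r_unl])
        # M-step: responsibility-weighted priors, means, per-feature variances.
        Nk = R.sum(axis=0)
        priors = Nk / Nk.sum()
        means = (R.T @ X) / Nk[:, None]
        var = (R.T @ (X ** 2)) / Nk[:, None] - means ** 2 + 1e-6
        # E-step: recompute class posteriors for the unlabeled points only.
        log_p = np.log(priors) - 0.5 * np.sum(
            np.log(2 * np.pi * var)[None, :, :]
            + (X_unl[:, None, :] - means[None, :, :]) ** 2 / var[None, :, :],
            axis=2)
        log_p -= log_p.max(axis=1, keepdims=True)
        r_unl = np.exp(log_p)
        r_unl /= r_unl.sum(axis=1, keepdims=True)
    return priors, means, var

priors, means, var = fit(X_lab, y_lab, X_unl)
print("Class means estimated from labeled + unlabeled data:")
print(means)
```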
5. Discussion and Conclusion

As remarked by Salovey and Mayer 58 and Goleman 25, emotional skills are an essential part of what is called "intelligence". This is based on recent scientific findings about the role of emotional abilities in human intelligence and on the way human-machine interaction imitates human-human interaction. As a consequence, emotions, largely overlooked in early efforts to develop machine intelligence, are increasingly regarded as an area of important research.

Emotion modulates almost all modes of human communication: facial expression, gestures, posture, tone of voice, choice of words, respiration, skin temperature and clamminess, etc. Emotions can significantly change the message: sometimes it is not what was said that is the most important, but how it was said. Faces tend to be the most visible form of emotion communication, but they are also the most easily controlled in response to different social situations when compared to the voice and other ways of expression. As noted by Picard 52, affect recognition is most likely to be accurate when it combines multiple modalities, information about the user's context, situation, goal, and preferences. A combination of low-level features, high-level reasoning, and natural language processing is likely to provide the best emotion inference. Considering all these aspects, Reeves and Nass 55 and Pentland 50 believe that multimodal context-sensitive human-computer interaction is likely to become the single most widespread research topic of the artificial intelligence research community. Advances in this area could change not only how professionals practice computing, but also how mass consumers interact with the technology.

As we discussed in this chapter, and as pointed out by Pantic and Rothkrantz 48, although there have been significant advances in the fields of video and audio processing, pattern recognition, computer vision, and affective computing, the realization of a robust, multimodal, adaptive, context-sensitive analyzer of human nonverbal affective state is far from being a reality. Currently, researchers have to cope with the lack of a thorough understanding of individual- and context-dependent human behavior and with the need for a better integration of multiple sensors and pertinent modalities according to the model of the human sensory system. Besides these problems there are other social and ethical issues that should be considered. The context-sensitive multimodal system that is supposed to interact with the human should not invade the user's privacy. Computer technology, and especially affect-sensitive monitoring tools, might be regarded as "big brother" tools. As remarked by Schneiderman 63, a large proportion of the population would be terrified by the vision of the universal use of computers in the coming era of ubiquitous computing. Another important aspect is related to teaching the HCI systems our interaction patterns and related behavior and our social and cultural profile. It is obvious that it is inefficient and annoying for the human user to train separately all the HCI systems that will be all around us in the future. One way of dealing with this problem is the incorporation of unlabeled data, as we pointed out in this chapter. Moreover, the system itself should be able to monitor human nonverbal behavior and to adapt to the current user, to his context,
and to the current scenario and environment. By taking all of these aspects into account, we hope to be able to develop in the near future multimodal context-sensitive systems that are smart, perceptually aware, recognize the context in which they act, can adapt to their users, and can understand how they feel and respond appropriately. In some sense, these systems will be the friendly variants of the Replicants from Blade Runner.
References

1. J.N. Bassili. Emotion recognition: The role of facial movement and the relative importance of upper and lower areas of the face. Journal of Personality and Social Psychology, 37(11):2049-2058, 1979.
2. N. Bianchi-Berthouze and C. Lisetti. Modeling multimodal expression of user's affective subjective experience. User Modeling and User-Adapted Interaction, 12:49-84, 2002.
3. M.J. Black and Y. Yacoob. Tracking and recognizing rigid and non-rigid facial motions using local parametric models of image motion. In Proc. International Conf. on Computer Vision, pages 374-381, 1995.
4. J.T. Cacioppo and L.G. Tassinary. Inferring psychological significance from physiological signals. American Psychologist, 45:16-28, 1990.
5. W.B. Cannon. The James-Lange theory of emotion: A critical examination and an alternative theory. American Journal of Psychology, 39:106-124, 1927.
6. L.S. Chen. Joint processing of audio-visual information for the recognition of emotional expressions in human-computer interaction. PhD thesis, University of Illinois at Urbana-Champaign, Dept. of Electrical Engineering, 2000.
7. L.S. Chen and T.S. Huang. Emotional expressions in audiovisual human computer interaction. In Proc. International Conference on Multimedia and Expo (ICME), pages 423-426, 2000.
8. L.S. Chen, H. Tao, T.S. Huang, T. Miyasato, and R. Nakatsu. Emotion recognition from audiovisual information. In Proc. IEEE Workshop on Multimedia Signal Processing, pages 83-88, 1998.
9. C.C. Chiu, Y.L. Chang, and Y.J. Lai. The analysis and recognition of human vocal emotions. In Proc. International Computer Symposium, pages 83-88, 1994.
10. I. Cohen, N. Sebe, F. Cozman, M. Cirelo, and T.S. Huang. Learning Bayesian network classifiers for facial expression recognition using both labeled and unlabeled data. In Proc. Conf. on Computer Vision and Pattern Recognition, volume 1, pages 595-601, 2003.
11. I. Cohen, N. Sebe, F. Cozman, M. Cirelo, and T.S. Huang. Semi-supervised learning of classifiers: Theory, algorithms, and applications to human-computer interaction. IEEE Trans. on Pattern Analysis and Machine Intelligence, to appear, 2004.
12. I. Cohen, N. Sebe, A. Garg, L. Chen, and T.S. Huang. Facial expression recognition from video sequences: Temporal and static modeling. Computer Vision and Image Understanding, 91(1-2):160-187, 2003.
13. R. Cowie and E. Douglas-Cowie. Automatic statistical analysis of the signal and prosodic signs of emotion in speech. In Proc. International Conf. on Spoken Language Processing, pages 1989-1992, 1996.
14. C. Darwin. The Expression of the Emotions in Man and Animals. John Murray, London, 2nd edition, 1890.
15. L.C. De Silva and P.C. Ng. Bimodal emotion recognition. In Proc. Automatic Face and Gesture Recognition, pages 332-335, 2000.
16. F. Dellaert, T. Polzin, and A. Waibel. Recognizing emotion in speech. In Proc. International Conf. on Spoken Language Processing, pages 1970-1973, 1996.
17. P. Ekman, editor. Emotion in the Human Face. Cambridge University Press, New York, NY, 2nd edition, 1982.
18. P. Ekman. Strong evidence for universals in facial expressions: A reply to Russell's mistaken critique. Psychological Bulletin, 115(2):268-287, 1994.
19. P. Ekman and W.V. Friesen. Facial Action Coding System: Investigator's Guide. Consulting Psychologists Press, 1978.
20. I.A. Essa and A.P. Pentland. Coding, analysis, interpretation, and recognition of facial expressions. IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(7):757-763, 1997.
21. B. Fasel and J. Luettin. Automatic facial expression analysis: A survey. Pattern Recognition, 36:259-275, 2003.
22. A. Fridlund and C. Izard. Electromyographic studies of facial expressions of emotions and patterns of emotion. In J. Cacioppo and R. Petty, editors, Social Psychophysiology: A Sourcebook, pages 243-286, 1983.
23. A. Garg, M. Naphade, and T.S. Huang. Modeling video using input/output Markov models with application to multi-modal event detection. In B. Furht and O. Marques, editors, Handbook of Video Databases: Design and Applications, 2003.
24. A. Garg, V. Pavlovic, and J. Rehg. Boosted learning in dynamic Bayesian networks for multimodal speaker detection. Proceedings of the IEEE, 91(9):1355-1369, 2003.
25. D. Goleman. Emotional Intelligence. Bantam Books, 1995.
26. E. Hilgard, R.C. Atkinson, and R.L. Hilgard. Introduction to Psychology. Harcourt Brace Jovanovich, New York, NY, 5th edition, 1971.
27. C.E. Izard. Innate and universal facial expressions: Evidence from developmental and cross-cultural research. Psychological Bulletin, 115(2):288-299, 1994.
28. W. James. The Principles of Psychology. Henry Holt, New York, NY, 1890.
29. J.M. Jenkins, K. Oatley, and N.L. Stein, editors. Human Emotions: A Reader. Blackwell Publishers, Malden, MA, 1998.
30. T. Johnstone. Emotional speech elicited using computer games. In Proc. International Conf. on Spoken Language Processing, pages 1985-1988, 1996.
31. P. Lang. The emotion probe: Studies of motivation and attention. American Psychologist, 50(5):372-385, 1995.
32. A. Lanitis, C.J. Taylor, and T.F. Cootes. A unified approach to coding and interpreting face images. In Proc. International Conf. on Computer Vision, pages 368-373, 1995.
33. J. Lien. Automatic recognition of facial expressions using hidden Markov models and estimation of expression intensity. PhD thesis, Carnegie Mellon University, 1998.
34. A. Martinez. Face image retrieval using HMMs. In IEEE Workshop on Content-based Access of Images and Video Libraries, pages 35-39, 1999.
35. K. Mase. Recognition of facial expression from optical flow. IEICE Trans., E74(10):3474-3483, 1991.
36. D. Matsumoto. Cultural influences on judgments of facial expressions of emotion. In Proc. ATR Symposium on Face and Object Recognition, pages 13-15, 1998.
37. Y. Medan, E. Yair, and D. Chazan. Super resolution pitch determination of speech signals. IEEE Trans. on Signal Processing, 39:40-48, 1991.
38. A. Mehrabian. Communication without words. Psychology Today, 2(4):53-56, 1968.
39. S. Morishima. Emotion model: A criterion for recognition, synthesis and compression of face and emotion. In Proc. Automatic Face and Gesture Recognition, pages 284-289, 1995.
40. I.R. Murray and J.L. Arnott. Toward the simulation of emotion in synthetic speech: A review of the literature of human vocal emotion. Journal of the Acoustic Society of America, 93(2):1097-1108, 1993.
41. I.R. Murray and J.L. Arnott. Synthesizing emotions in speech: Is it time to get excited? In Proc. International Conf. on Spoken Language Processing, pages 1816-1819, 1996.
42. A. Nefian and M. Hayes. Face recognition using an embedded HMM. In IEEE Conf. on Audio and Video-based Biometric Person Authentication, pages 19-24, 1999.
43. N. Oliver, A. Pentland, and F. Berard. LAFTER: A real-time face and lips tracker with facial expression recognition. Pattern Recognition, 33:1369-1382, 2000.
44. T. Otsuka and J. Ohya. Recognizing multiple persons' facial expressions using HMM based on automatic extraction of significant frames from image sequences. In Proc. International Conf. on Image Processing, pages 546-549, 1997.
45. T. Otsuka and J. Ohya. A study of transformation of facial expressions based on expression recognition from temporal image sequences. Technical report, Institute of Electronic, Information, and Communications Engineers (IEICE), 1997.
46. H. Pan, Z.P. Liang, T.J. Anastasio, and T.S. Huang. Exploiting the dependencies in information fusion. In Proc. Conf. on Computer Vision and Pattern Recognition, volume 2, pages 407-412, 1999.
47. M. Pantic and L.J.M. Rothkrantz. Automatic analysis of facial expressions: The state of the art. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(12):1424-1445, 2000.
48. M. Pantic and L.J.M. Rothkrantz. Toward an affect-sensitive multimodal human-computer interaction. Proceedings of the IEEE, 91(9):1370-1390, 2003.
49. C. Pelachaud, N. Badler, and M. Steedman. Generating facial expression for speech. Cognitive Science, 20:1-46, 1996.
50. A. Pentland. Looking at people. Communications of the ACM, 43(3):35-44, 2000.
51. V.A. Petrushin. How well can people and computers recognize emotions in speech? In Proc. AAAI Fall Symposium, pages 141-145, 1998.
52. R.W. Picard. Affective Computing. MIT Press, Cambridge, MA, 1997.
53. R.W. Picard, E. Vyzas, and J. Healey. Toward machine emotional intelligence: Analysis of affective physiological state. IEEE Trans. on Pattern Analysis and Machine Intelligence, 23(10):1175-1191, 2001.
54. G. Potamianos, C. Neti, G. Gravier, A. Garg, and A.W. Senior. Recent advances in the automatic recognition of audiovisual speech. Proceedings of the IEEE, 91(9):1306-1326, 2003.
55. B. Reeves and C. Nass. The Media Equation: How People Treat Computers, Television and New Media Like Real People and Places. Cambridge Univ. Press, 1996.
56. M. Rosenblum, Y. Yacoob, and L.S. Davis. Human expression recognition from motion using a radial basis function network architecture. IEEE Trans. on Neural Networks, 7(5):1121-1138, 1996.
57. Y. Sagisaka, N. Campbell, and N. Higuchi, editors. Computing Prosody. Springer-Verlag, New York, NY, 1997.
58. P. Salovey and J.D. Mayer. Emotional intelligence. Imagination, Cognition, and Personality, 9(3):185-211, 1990.
59. J. Sato and S. Morishima. Emotion modeling in speech production using emotion space. In Proc. IEEE Int. Workshop on Robot and Human Communication, pages 472-477, 1996.
60. S. Schachter. The interaction of cognitive and physiological determinants of emotional state. In L. Berkowitz, editor, Advances in Experimental Psychology, volume 1, pages 49-80, 1964.
61. K.R. Scherer. Adding the affective dimension: A new look in speech analysis and synthesis. In Proc. International Conf. on Spoken Language Processing, pages 1808-1811, 1996.
62. H. Schlosberg. Three dimensions of emotion. Psychological Review, 61:81-88, 1954.
63. B. Schneiderman. Human values and the future of technology: A declaration of responsibility. In B. Schneiderman, editor, Sparks of Innovation in Human-Computer Interaction, 1993.
64. N. Sebe, M.S. Lew, I. Cohen, Y. Sun, T. Gevers, and T.S. Huang. Authentic facial expression analysis. In Automatic Face and Gesture Recognition, pages 517-522, 2004.
65. N. Ueki, S. Morishima, H. Yamada, and H. Harashima. Expression analysis/synthesis system based on emotion space constructed by multilayered neural network. Systems and Computers in Japan, 25(13):95-103, 1994.
66. C.E. Williams and K.N. Stevens. Emotions and speech: Some acoustical correlates. Journal of the Acoustic Society of America, 52(4):1238-1250, 1972.
67. W. Winton, L. Putman, and R. Krauss. Facial and autonomic manifestations of the dimensional structure of the emotion. Journal of Experimental Social Psychology, 20:195-216, 1984.
68. Y. Yacoob and L.S. Davis. Recognizing human facial expressions from long image sequences using optical flow. IEEE Trans. on Pattern Analysis and Machine Intelligence, 18(6):636-642, 1996.
CHAPTER 4.2
GAIT-BASED HUMAN IDENTIFICATION FROM A MONOCULAR VIDEO SEQUENCE
Amit Kale Center for Visualization and Virtual Environments 1, Quality St Suite 800-B KY 40507 USA E-mail: [email protected]
Aravind Sundaresan, Amit K. RoyChowdhury, Department of Electrical Engineering, University of California, Riverside, CA 92521, USA. E-mail: [email protected]
Rama Chellappa, Center for Automation Research, University of Maryland at College Park, MD 20742, USA. E-mail: [email protected]
Human gait is a spatio-temporal phenomenon that characterizes the motion characteristics of an individual. It is possible to detect and measure gait even in low-resolution video. In this chapter, we discuss algorithms for identifying people by their gait from a monocular video sequence. Human identification using gait, similar to text-based speaker identification, involves different individuals performing the same task, and a template-matching approach is suitable for such problems. In situations where the amount of training data is limited, we demonstrate the utility of a simple width feature for gait recognition. By virtue of their deterministic nature, template matching methods have limited noise resilience. In order to deal with noise we introduce a systematic approach to gait recognition by building representations for the structural and dynamic components of gait using exemplars and hidden Markov models (HMMs). The above methods assume that an exact side view of the subject is available in the probe sequence. For the case when the person walks at an arbitrary angle far away from the camera, we present a view invariant gait recognition algorithm based on synthesizing a side view of the person from an arbitrary monocular view.
1. Introduction
Automated person identification is an important component of surveillance. An effective approach to person identification is to reduce it to the problem of identifying physical characteristics of the person. This method of identifying persons based on his/her physiological/behavioral characteristics is called biometrics. Established biometric methods range from fingerprint and hand-geometry techniques to more sophisticated methods based on face recognition and iris identification. Unfortunately, no single biometric is perfect or complete. Fingerprints and hand geometry are reliable but require physical contact. Although signatures based on face and iris are non-intrusive in nature, the applicability of all these methodologies is restricted to very controlled environments. In fact, current technology is capable of recognizing mostly frontal faces. At the time of writing, iris recognition is being attempted at distances of not more than five meters. When person identification is attempted in natural settings such as those arising in surveillance applications, it takes on a new dimension. Biometrics such as fingerprint or iris are no longer applicable. Furthermore, night vision capability (an important component in surveillance) is usually not possible with these biometrics. Even though an IR camera would reveal the presence of people, the facial features are far from discernible in an IR image at large distances. A biometric that can address some of these shortcomings is human 'gait', or the walking style of an individual. The attractiveness of gait as a biometric arises from the fact that it is non-intrusive and can be detected and measured even in low-resolution video. Furthermore, it is harder to disguise than static appearance features such as a face, and it does not require a cooperating subject. Early research on gait primarily involved psychophysical studies of gait, viz. studying the ability of human observers to recognize gait. The belief that humans can distinguish between gait patterns of different individuals is widely held. Intuitively, it is possible to think of the qualities of walk, such as stride length or body swing, that help a perceiver identify an approaching figure even before the face becomes discernible. The earliest and most recognized psychophysical study of human perception of gait was the work of Johansson 1. Small light bulbs were attached to the body joints of a darkly dressed walker. In this way only gait-related cues were available and thus the perception of pure biological motion could be examined. When these point-light displays were static, the random collection of dots was variously interpreted, for instance as star constellations. However, as soon as the figures moved, the points of light were immediately perceived to be a human in motion. Motivated by Johansson's work, Kozlowski and Cutting 2 investigated whether observers could identify the gender of a point-light walker. The demonstration that gender could be extracted from gait provided insight into how observers might discriminate between gait patterns of different individuals. The prospect for observers being able to identify individuals from their gaits was thus encouraging. Cutting and Kozlowski 3 demonstrated that perceivers could reliably recognize themselves and their friends
from dynamic point-light displays. Barclay et al. 4 suggested that individual walking styles might be captured by differences in a basic series of pendular limb motions. Psychophysical evidence that there exists identity information in gait spurred the development of computer vision based algorithms for gait-based human recognition. We attempt to give a summary of some examples below, but the listing is by no means complete. Most of the methods for gait recognition are appearance based. Appearance based methods work reasonably well in the face of inaccurate background segmentation, changes in speed, etc. However, such methods cannot tolerate drastic changes in clothing. Cunado et al. 5 extract a gait signature by fitting the movement of the thighs to an articulated pendulum-like motion model. Huang et al. 6 use optical flow to derive a motion image sequence for a walk cycle, followed by eigenanalysis of the binarized silhouette to derive what are called eigen gaits. Benabdelkader et al. 7 use image self-similarity plots as a gait feature. Tolliver and Collins 8 use a spectral partitioning framework for identifying humans by their shape. Lee and Grimson 9 propose an approach in which several ellipses are fitted to different parts of the binarized silhouette of the person, and the parameters of these ellipses, such as the location of the centroid, eccentricity, etc., are used as a feature to represent the gait of a person. Hayfron-Acquah et al. 10 proposed a method based on analyzing the symmetry of human motion using the Generalised Symmetry Operator. Han and Bhanu 11 proposed a gait-energy image approach for recognition.
2. Feature Selection
An important issue in gait is the extraction of appropriate salient features that will effectively capture the gait characteristics. The features must be reasonably robust to operating conditions and should yield good discriminability across individuals. As mentioned earlier, we assume that the side view of each individual is available. Intuitively, the silhouette appears to be a good feature to look at as it captures the motion of most of the body parts. It also supports night vision capability as it can be derived from IR imagery as well. While extracting this feature we can either use the entire silhouette or use only the outer contour of the silhouette. The choice between these features depends upon the quality of the binarized silhouettes. If the silhouettes are of good quality, the outer contour retains all the information of the silhouette and allows a representation whose dimension is an order of magnitude lower than that of the binarized silhouette. However, for low-quality, low-resolution data, the extraction of the outer contour from the binarized silhouette may not be reliable. In such situations, direct use of the binarized silhouette may be more appropriate. We choose the width of the outer contour of the silhouette as one of our feature vectors. In order to generate the width vectors, background subtraction 12 is first applied to the image sequence and the resulting motion image is binarized into foreground and background pixels. A bounding box is then placed around the part of the motion image that contains the moving person. Given the binarized silhouettes,
Fig. 1. Width vector profile for several gait cycles of two individuals
the left and right boundaries of the body are traced. The width along a given row is simply the difference in the locations of the right-most and left-most boundary pixels in that row. It is easy to see that the norm of the width vector shows a periodic variation. In Figure 1 we show plots of the width profiles of two different individuals for several gait cycles. Since we use only the distance between the left and right extremities of the silhouette, the two halves of the gait cycle are almost indistinguishable. From here on, we refer to half cycles as cycles, for the sake of brevity. In Figure 1, the x-axis denotes the frame index while the y-axis denotes the index of the width vector (the row index). The ith horizontal line in the image shows the variations in the ith element of the width vector as a function of time. A brighter gray-scale indicates a higher value of the width. We observe that within each cycle, there is a systematic temporal progression of the width vector magnitude, viz. the dynamics. A similar observation has been made in 13, where the gait patterns are analyzed as Frieze patterns. For the two width profile plots shown in Figure 1, the differences are quite visible. For instance, by observing the bright patterns in the upper region of the two images we see that the brightness is more pronounced in the first image as compared to the second. This area of the plot corresponds to the swings of the hand. Secondly, note that the brightness gradient (which translates to velocity in the video sequence) in the lower part of the images is more pronounced for Person 1 as compared to Person 2. This part of the plot corresponds to the swings of the extremities of the foot. Additionally, note that the height, as well as the visibility of the neck region, of the two persons are different. It must be pointed out, however, that the differences need not be so pronounced for all individuals. Thus, the width profile contains structural and dynamic information peculiar to each individual. Also, by definition, the width vector is translation-invariant. Hence, the width of the outer contour of the silhouette is indeed a potentially good candidate as a feature.
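A minimal sketch of the width-vector computation described above, assuming the binarized silhouette has already been cropped to the bounding box of the walking person (the function name and array layout are illustrative, not the chapter's actual implementation):

```python
import numpy as np

def width_vector(silhouette):
    """Compute the width vector of a binary silhouette.

    `silhouette` is a 2-D 0/1 array already cropped to the bounding box
    of the person.  For each row, the width is the distance between the
    left-most and right-most foreground pixels (zero for empty rows).
    """
    sil = np.asarray(silhouette) > 0
    rows, cols = sil.shape
    width = np.zeros(rows)
    col_idx = np.arange(cols)
    for r in range(rows):
        fg = col_idx[sil[r]]
        if fg.size:
            width[r] = fg.max() - fg.min()
    return width

# Usage: stack the width vectors of T frames column-wise to obtain the
# M x T "width profile" whose periodic brightness pattern Figure 1 shows.
# profile = np.stack([width_vector(s) for s in silhouette_sequence], axis=1)
```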
3. Gait-Based Human Identification Using Appearance Matching
A gait cycle corresponds to one complete cycle from the rest (standing) position to right-foot-forward, back to rest, to left-foot-forward, and back to rest. The movements within a cycle consist of the motion of the different parts of the body such as the head, hands, legs, etc. The characteristics of an individual are reflected not only in the dynamics and periodicity of a gait cycle but also in the size and shape of that individual. Our aim is to build a model for representation and recognition of individual gait. In a pattern classification problem, the choice of the feature as well as the classifier is important. As discussed earlier, if the gait data is clean, the width of the outer contour of the silhouette of the person can be a good feature for gait recognition. From the temporal width plots, we note that although the width vector changes with time within a gait cycle, there is a high degree of correlation among the width vectors across frames. Most changes occur in the hand and leg regions. Hence, it is reasonable to expect that the gait information in the width vector can be captured with much fewer coefficients. Given the width vectors {W(1), ..., W(N)} for the N frames, W(.) in R^M, we compute the eigenvectors {V(1), ..., V(M)} corresponding to the eigenvalues of the scatter matrix arranged in descending order, and reconstruct the corresponding width vectors using the m (< M) most significant eigenvectors as

W_r(i) = ( sum_{j=1}^{m} w_j V(j) ) + W_bar,

where w_j = <W(i), V(j)> and W_bar is the mean width vector, W_bar = (W(1) + ... + W(N))/N. Figure 2 shows the width vectors reconstructed using two eigenvectors. Temporally-ordered sequences of eigensmoothed width vectors are used for compactly representing the person's gait.
Fig. 2. Effect of eigen decomposition and reconstruction on the width vectors. (a) Overlapped raw width vectors; (b) smoothed width vectors. Notice that the leg region (the bottom half of the figures) contains a significant portion of the dynamics
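The eigen decomposition and reconstruction step above amounts to a standard PCA smoothing of the stacked width vectors. The code below is an assumed sketch of that step (it centers the data before projection, which the chapter's equation leaves implicit):

```python
import numpy as np

def eigensmooth(W, m):
    """Reconstruct width vectors from their m most significant eigenvectors.

    W : array of shape (M, N), N width vectors of dimension M, one per column.
    m : number of leading eigenvectors of the scatter matrix to keep.
    Returns the eigensmoothed width vectors with the same shape as W.
    """
    W = np.asarray(W, dtype=float)
    mean = W.mean(axis=1, keepdims=True)            # mean width vector W_bar
    centered = W - mean
    # Eigenvectors of the scatter matrix are the left singular vectors.
    U, _, _ = np.linalg.svd(centered, full_matrices=False)
    V = U[:, :m]                                    # M x m eigenbasis
    coeffs = V.T @ centered                         # projection coefficients w_j
    return V @ coeffs + mean                        # reconstructed W_r(i)
```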
For a normal walk, gait sequences are repetitive and exhibit nearly periodic behavior. The gait problem is analogous to text-based speaker identification/verification, wherein different individuals utter the same text but differ only in the characteristics of their utterance 14. A template matching approach is suitable for such problems, especially if the amount of training data is limited. Typically, gait cycles taken at different times tend to be unequal in length due to changes in the walking speeds of the individuals. To deal with this issue, dynamic time warping (DTW) is employed for matching gait sequences. The DTW method was originally developed for isolated word recognition 15, and later adapted for text-dependent speaker verification 14. DTW uses an optimum time expansion/compression function to produce a non-linear time normalization, so that it is able to deal with misalignments and unequal sequence lengths of the probe and the reference gait sequences. A distance metric (usually the Euclidean distance), defined as a function of time, is computed between the two feature sets representing the gait data. A decision function is arrived at by integrating the metric over time. Assuming that the first frames of the reference and probe sequences are both indexed as 1 and the last frames of the reference and probe sequences are indexed as X and Y, respectively, the match between the two sets can be represented by a sequence of K points C(1), C(2), ..., C(k), ..., C(K), where C(k) = (x(k), y(k)), x(k) is a frame index of the probe sequence and y(k) is a frame index of the reference sequence. Here, C(k) represents a mapping of the time axis of the probe sequence onto that of the reference sequence. The sequence F = C(1), C(2), ..., C(k), ..., C(K) is called the warping path. The process of time normalization involves computing the cumulative distance subject to endpoint, local continuity and global path constraints 16.
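For concreteness, a bare-bones dynamic time warping routine in the spirit of the matching described above is sketched below; the simple step pattern and the path-length normalization are assumptions, and the endpoint and global path constraints mentioned in the text are omitted for brevity:

```python
import numpy as np

def dtw_distance(probe, reference, dist=lambda a, b: np.linalg.norm(a - b)):
    """Classical dynamic time warping between two feature-vector sequences.

    Returns the cumulative distance along the optimal warping path,
    normalized by the combined sequence length.
    """
    X, Y = len(probe), len(reference)
    D = np.full((X + 1, Y + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, X + 1):
        for j in range(1, Y + 1):
            cost = dist(probe[i - 1], reference[j - 1])
            # local continuity: diagonal, horizontal or vertical step
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[X, Y] / (X + Y)
```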
Table 1. Cumulative match scores for the UMD database using different eigen-features.

Feature \ Rank            1    2    3    4    5
Eigenvector 1            73   75   80   80   84
Eigenvectors 1,2         80   87   90   90   91
Eigenvectors 1,2,3       68   80   84   84   84
Eigenvectors 1,2,3,4     73   77   84   84   84
We experimented with different databases to test our method, including the UMD, CMU and MIT datasets 17. Table 1 shows the gait recognition result for the UMD dataset using different numbers of eigenvectors for reconstruction. Note that by using just the first two eigenvectors an accuracy of 80% is achievable. The other eigenvectors are noisy and, in fact, tend to lower the accuracy. We also considered the USF database, which has been identified as the gait challenge database 18. The database has variations with regard to viewing direction, shoe type and surface type.
(More details about this database can be found at http://figment.csee.usf.edu/GaitBaseline/)
Also, the subjects were asked to carry a briefcase for one testing condition. The USF database has the largest number of individuals among all the databases. It has variations with respect to floor surface (grass (G) or concrete (C)), shoe type (A or B), and camera viewing direction (left (L) or right (R)). The reference for all the experiments was chosen to be (G, A, R). The number of frames corresponding to four half cycles varied from 65 to 90. The different probe sequences for the experiments, along with the cumulative match scores, are given in Table 2 for the baseline algorithm 19 as well as for our method using the eigensmoothed width feature. Note that recognition performance suffers most due to differences in surface characteristics, and least due to differences in viewing angle. An examination of the USF database revealed that the silhouettes provided were noisier than in the previous datasets. We wanted to see what the performance would be when using the binary silhouettes directly as the feature. In this case we used the binary correlation distance in the local distance computation. As can be seen from the last two columns of Table 2, using the binarized silhouettes yields better performance numbers than the width vector in this case.

Table 2. Probe sets and match scores for the USF database using the baseline algorithm and our approach using the width feature and the entire binary silhouette.

                        Baseline          Width Vector       Binary Silhouette
Experiment (Probe)    Rank 1  Rank 5     Rank 1  Rank 5      Rank 1  Rank 5
A (G, A, L)             79      96         79      91          84      97
B (G, B, R)             66      81         67      79          83      91
C (G, B, L)             56      76         30      55          59      79
D (C, A, R)             29      61         17      42          41      64
E (C, B, R)             24      55         15      39          24      53
F (C, A, L)             30      46         16      30          27      51
G (C, B, L)             10      33          9      31          24      38
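The cumulative match scores reported in Tables 1 and 2 can be computed from a probe-by-gallery distance matrix along the following lines (a hedged sketch; the function and identifiers are illustrative, not the chapter's evaluation code):

```python
import numpy as np

def cumulative_match_scores(dist, probe_ids, gallery_ids, max_rank=5):
    """Cumulative match scores (in %) from a probe-by-gallery distance matrix.

    dist[i, j] is the DTW (or other) distance between probe i and gallery j.
    A probe is counted as recognised at rank r if its true identity appears
    among the r gallery entries with the smallest distances.
    """
    dist = np.asarray(dist)
    hits = np.zeros(max_rank)
    for i, pid in enumerate(probe_ids):
        order = np.argsort(dist[i])                     # best matches first
        ranked_ids = [gallery_ids[j] for j in order]
        if pid in ranked_ids:
            r = ranked_ids.index(pid)                   # 0-based rank of true id
            if r < max_rank:
                hits[r:] += 1                           # counts for all ranks >= r+1
    return 100.0 * hits / len(probe_ids)
```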
4. A Framework for Gait-Based Person Identification Using Continuous HMMs
In the previous section we saw the application of a simple template matching approach to gait recognition. A careful analysis of gait reveals that it has two important components. The first is a structural component that captures the physical build of a person, e.g. body dimensions, length of limbs, etc. The second component is the motion kinematics of the body during a gait cycle. A recent paper by Veeraraghavan et al. 20 evaluates the contribution of these factors in vision-based gait recognition. In this section we propose a systematic approach to gait recognition by building representations for the structural and kinematic components of gait. A closer examination of the physical process behind the generation of the gait signature reveals that, during a gait cycle, it is possible to identify certain distinct phases or stances. In Figure 3, we show five frames that we have picked from a gait cycle for two individuals.
Fig. 3. Stances corresponding to the gait cycle of two individuals
In the first stance, the person holds the two feet together. In the second, he is just about to start and his hand is slightly raised. In the third stance, the hands and the feet are separated, while in the fourth, the hands and feet are displaced to a maximum. Finally, in the fifth stance, the person is returning to the rest state. Clearly, every person transits among these successive stances as he/she walks. Although these stances are generic, there exist differences not only in their image appearance, based on the physical build of an individual, but also in the way an individual transits across these stances as he/she walks, which represents the gait kinematics of the individual. A reasonable way to build a structural representation for a person is to pick N exemplars (or stances) E = {e_1, ..., e_N} from the pool of images that will minimize the error in representation of all the images of that person. Given the image sequence for an unknown person Y = {y(1), ..., y(T)}, these exemplars can be directly used for recognition as

ID = argmin_j sum_{t=1}^{T} min_{n in {1,...,N}} d(y(t), e_n^j),

where y(t) represents the image of the unknown person at the t-th time instant, while e_n^j represents the nth exemplar of the jth individual. Note, however, that such a simple discrimination criterion is susceptible to failures not only due to noise but, more importantly, due to the presence of structural similarities among people in the database. To improve discriminability, the kinematics of the data must be exploited. A closer look at the gait cycle reveals that there is a temporal progression in the proximity of the observed silhouette to the different exemplars. Note that at the start of the gait cycle, a frame is closer to the first exemplar than to the other four. As time progresses, the frame will be closer to the second exemplar than to the others, and so on. Underlying the proximity of the silhouette to the exemplars is a probabilistic dependence across the exemplars. This encompasses information about how long a gait cycle persists in a particular exemplar as well as the way in which the gait cycle transits from one exemplar to the other. For two people who are similar in physical build, this kinematic knowledge can be used to improve the recognition performance. Because the transitions are systematic, it is possible to model this probabilistic dependence by a Markov matrix as shown below:

A = [ P(e_i(t) | e_j(t-1)) ]    (1)
for i, j in {1, ..., N}. The matrix A encodes the kinematics in terms of state duration densities and transition probabilities. Often, in a practical situation, only a finite amount of training data is available and modeling can be difficult if the feature dimensionality is high. The dimension of the feature vectors described in the previous section is at least 100. Directly using the feature vectors to estimate the structure of the person and the kinematics of gait is clearly not advisable. We propose two different approaches to model the structure and kinematics of gait, viz. an indirect approach and a direct approach. The choice of nomenclature will become apparent in the following discussion.
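As an illustration of Eq. (1), the Markov matrix can be estimated by counting exemplar-to-exemplar transitions in a sequence of frame labels; the helper below is a hypothetical sketch, not the chapter's code:

```python
import numpy as np

def transition_matrix(state_seq, n_states):
    """Estimate A[i, j] = P(e_i(t) | e_j(t-1)) from a label sequence.

    state_seq : sequence of exemplar indices in {0, ..., n_states-1},
                e.g. the index of the nearest exemplar for each frame.
    """
    counts = np.zeros((n_states, n_states))
    for prev, cur in zip(state_seq[:-1], state_seq[1:]):
        counts[cur, prev] += 1
    col_sums = counts.sum(axis=0, keepdims=True)
    col_sums[col_sums == 0] = 1.0            # avoid division by zero
    return counts / col_sums                 # each column sums to one
```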
4.1. Approach 1: Indirect Approach
In this approach we pick N exemplars (or stances) E = {e_1, ..., e_N} from the pool of images so as to minimize the error in representing all the images of that person (across gait cycles with different starting positions). In order to do this, we divide each gait cycle into N equal segments. We pool the image features corresponding to the ith segment for all the cycles. The centroids (essentially the means) of the features of each part are computed and used as the exemplar for that part. Doing this for all the N segments gives the optimal exemplar set E^j = {e_1^j, ..., e_N^j}. The next issue is how to choose N. In problems like image compression, it is common practice to look at rate-distortion curves to examine the marginal reduction in distortion as the bits per pixel are increased. We found that increasing N beyond 5 or 6 does not lead to a significant drop in distortion. In order to reliably estimate the gait kinematics we propose a novel way of compactly encoding the observations. Let x(t) denote the feature extracted from the image at time t. The distance of x(t) from the corresponding exemplars e_n in E can be computed to build a frame-to-exemplar distance (FED) vector, f(t), which serves as a lower, N-dimensional representation of the image at time t. For instance, for the jth individual we compute the nth element of the FED vector as

[f^{j,j}(t)]_n = d(x^j(t), e_n^j),    (2)
where t in {1, ..., T}, e_n^j denotes the nth exemplar of the jth person and n in {1, ..., N}. Thus, f^{j,j}(t) constitutes an observation vector for person j. Similarly, f^{i,j}(t) represents the observation sequence of person i encoded in terms of the exemplars of person j. Note that for a frame at the start of the gait cycle, the first element of the observation vector will be smaller in magnitude than the remaining elements. As time progresses, the first element will increase in magnitude because the frame moves closer to the second stance. This temporal variation in the FED vector components corresponds precisely to the transition across exemplars. In particular, it is possible to look upon the FED vector sequence f^{j,j}(t) as the observed manifestation of the transition across exemplars (a hidden process). An HMM is appropriate for such a signal. HMMs 21 use a Markov process to model the changing statistical characteristics that are manifested in the actual
observations. For the gait problem, the exemplars can be considered as analogues to the states of the HMM while the FED vector sequence can be considered as the observed process. Since the feature vectors are transformed to the intermediate FED vector representation, we refer to this approach as an indirect approach. In the proposed model for gait, the primary HMM parameters of interest are the number of states, the initial probability vector (pi), the transition probability matrix (A) and the output probability distribution B, which we model as a continuous probability distribution. lambda = (A, B, pi) will be used to compactly represent the HMM. In order to recognize an unknown person, the FED vector sequence f^j(t) is computed for every j in {1, ..., M} using (2). The likelihood that the observation sequence f^j was generated by the HMM corresponding to the jth person can be computed using the forward algorithm 21 as

P_j = log( P(f^j | lambda_j) ).    (3)
We repeat the above procedure for every person in the database, thereby producing P_j, j in {1, ..., M}. If the unknown person is m, P_m would be expected to be the largest among all the P_j's, since the distance between y and the stances of person m will be smaller than that between y and any other person. Also, the pattern of transitions between stances/states for y will be closest to that for person m.
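A compact sketch of the indirect approach's building blocks, exemplar selection as segment centroids and the FED vectors of Eq. (2), is given below (illustrative only; the Euclidean distance and the data layout are assumptions):

```python
import numpy as np

def exemplars_from_cycle(features, n_states):
    """Split one gait cycle (a list of frame feature vectors) into n_states
    equal segments and return the centroid of each segment as its exemplar."""
    segments = np.array_split(np.asarray(features, dtype=float), n_states)
    return [seg.mean(axis=0) for seg in segments]

def fed_sequence(features, exemplars):
    """Frame-to-exemplar-distance vectors f(t) of Eq. (2): row t holds the
    Euclidean distance of frame t to each of the N exemplars."""
    F = np.asarray(features, dtype=float)
    E = np.asarray(exemplars, dtype=float)
    return np.linalg.norm(F[:, None, :] - E[None, :, :], axis=2)
```

The resulting FED sequences are then used as observations for a continuous HMM (e.g. one Gaussian output density per state), and the subject maximising the forward log-likelihood of Eq. (3) is reported.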
4.2. Approach 2: Direct Approach
In this approach we use the feature vector in its entirety to estimate the HMM lambda = (A, B, pi) for each person. Hence we refer to this approach as the direct approach. One of the important issues in training is learning the observation probability B. As discussed before, the reliability of the estimated B depends on the number of training samples available and the dimension of the feature vector. In order to deal with the high dimensionality of the feature vector, we propose an alternative representation for B. As discussed in the previous section, it is possible, during a gait cycle, to identify certain distinct phases or stances. We build a structural representation for a person by picking N exemplars (or stances) from the training sequence, X = {X(1), ..., X(T)}. We now define B in terms of the distance of the feature vector from the exemplars as follows:

b_n(X(t)) = P(X(t) | e_n) = alpha * exp( -beta * D(X(t), e_n) )    (4)
The probability P(X(t) | e_n) is defined as a function of D(X(t), e_n), the distance of the feature vector X(t) from the nth exemplar, e_n. The motivation behind using an exemplar-based model in this manner is that recognition can be based on the distance measure between the observed feature vector and the exemplars. During the training phase, a model is built for every subject in the gallery. Note that B is completely defined by E if alpha and beta are fixed beforehand. An initial estimate of E and A is formed from X, and these estimates are refined iteratively using Expectation-Maximization 22. An initial estimate of an ordered set
of exemplars is obtained from the sequence as described in Section 4.1. A corresponding initial estimate of the transition matrix, A^(0) (with A_{n,n}^(0) = A_{n, n mod N + 1}^(0) = 0.5 and all other entries zero), is also obtained. The initial probabilities pi_n are set equal to 1/N. The iterative refinement of the estimates is performed in two steps. In the first step, a Viterbi evaluation 21 of the sequence is performed using the current values for the exemplars and the transition matrix. We can thus cluster feature vectors according to the most likely state they originated from. Using the current values of the exemplars, E^(i), and the transition matrix, A^(i), Viterbi decoding on the sequence X yields the most probable path Q = {q(1), ..., q(T)}. The exemplar of each state is then updated as the centroid of the feature vectors assigned to it along this path:

e_n^(i+1) = (1 / |T_n|) sum_{t in T_n} X(t),    (5)

where T_n = {t : q(t) = n} is the set of time instants assigned to state n.
Given E^(i+1) and A^(i), we can calculate A^(i+1) using the Baum-Welch algorithm 21. We set pi_n = 1/N at each iteration. Thus we can iteratively refine our estimates of the HMM parameters; it usually takes only a few iterations to obtain an acceptable estimate. Given the feature vector sequence of the unknown person, Y, and the exemplars and HMM model parameters for the different people in the database, the likelihood that the observation sequence was produced by the jth individual in the database is computed using the forward algorithm as

P_j = log( P(Y | lambda_j) ).    (6)
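The exemplar-based observation model of Eq. (4) and the forward-algorithm scoring of Eqs. (3) and (6) can be sketched as follows, with alpha and beta treated as fixed constants; this is an assumed implementation, not the authors' code:

```python
import numpy as np

def logsumexp(a, axis=None):
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def log_obs_prob(x, exemplars, alpha=1.0, beta=1.0):
    """log b_n(x) of Eq. (4): P(x | e_n) = alpha * exp(-beta * D(x, e_n))."""
    d = np.linalg.norm(np.asarray(exemplars) - np.asarray(x), axis=1)
    return np.log(alpha) - beta * d

def forward_log_likelihood(X, exemplars, A, pi, alpha=1.0, beta=1.0):
    """log P(X | lambda) via the forward algorithm, computed in the log domain.

    A[i, j] = P(state i at t | state j at t-1), pi is the initial distribution.
    """
    log_a = np.log(np.asarray(A) + 1e-300)
    log_f = np.log(np.asarray(pi) + 1e-300) + log_obs_prob(X[0], exemplars, alpha, beta)
    for x in X[1:]:
        trans = log_f[None, :] + log_a            # rows: current state, cols: previous
        log_f = logsumexp(trans, axis=1) + log_obs_prob(x, exemplars, alpha, beta)
    return logsumexp(log_f, axis=0)
```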
Note that lambda_j implicitly includes the exemplar set corresponding to person j. The difference between the direct and indirect methods is that in the former the feature vector is directly used as the observation vector for the HMM, whereas in the latter the FED is used as the observation vector. We present the results of both our methods and a comparative analysis on the USF dataset. Different probe sequences for the experiments along with the cumulative match scores are given in Table 3 for the baseline algorithm 19 and for our direct and indirect approaches. The image quality of the USF database is worse than that of the previous two databases in terms of resolution and amount of noise. We experimented with both the width feature and the binarized silhouette for the USF dataset. However, the extraction of the outer contour in this case is not reliable and the width vectors were found to be noisy. In Table 3, we report only the results of our methods using the silhouettes as the image feature. From Table 3 we observe that the direct method is more robust to the presence of noise than the indirect method. We also note that the recognition performance suffers most due to differences in surface and background characteristics, and least due to difference in viewing angle. Results from other research groups using this data can be found in 8 and on the web (http://degas.umiacs.umd.edu/links.html). From
Tables 2 and 3, we note that the HMM approach indeed surpasses the performance of the appearance matching method. Recently, the gallery in the USF database was extended by adding subjects who walked with only one shoe type on grass, which happened to be labelled as Shoe B. Since the shoe-type labeling is arbitrary, they were put in the gallery to increase the gallery size to 122. The results for this case are reported in 23.

Table 3. Probe sets and match scores for the USF database using the baseline algorithm and our indirect and direct approaches.

                        Baseline          Indirect Approach   Direct Approach
Experiment (Probe)    Rank 1  Rank 5     Rank 1  Rank 5       Rank 1  Rank 5
A (G, A, L)             79      96         91     100           99     100
B (G, B, R)             66      81         76      81           89      90
C (G, B, L)             56      76         65      76           78      90
D (C, A, R)             29      61         25      61           35      65
E (C, B, R)             24      55         29      39           29      65
F (C, A, L)             30      46         24      46           18      60
G (C, B, L)             10      33         15      33           24      50
5. View Invariant Gait Recognition
The gait of a person is best reflected when he/she presents a side view (referred to in this chapter as a canonical view) to the camera. Hence, most of the above gait recognition algorithms rely on the availability of side views of the subject. The situation is analogous to face recognition, where it is desirable to have frontal views of the person's face. In realistic scenarios, however, gait recognition algorithms need to work when the person walks at an arbitrary angle to the camera. For a person walking along a non-canonical direction, the appearance based features used for recognition get distorted. To explain this better we consider the width feature discussed earlier. Temporal plots of the width vector for the same person walking in the canonical and non-canonical (theta = 45 degrees) directions are shown in Figures 5(a) and (b), respectively. A simple gait feature, viz. the stride length or the maximal separation of the feet, can be derived from the width plots by measuring the highest intensity in the leg regions (lower halves of the width plot). Clearly the apparent stride length is smaller for the non-canonical view. The second effect that is obvious from the plots is a foreshortening effect as the person walks away from the camera. In order to obtain good gait recognition performance, it is necessary to correct for both of these effects through view synthesis. The most general solution to this problem is to estimate the 3-D model of the person. Features extracted from the 3-D model can then be used to provide the gait model for the person. This requires solving the structure from motion (SfM) or stereo reconstruction problems 24,25, which are known to be hard for articulating objects. In the absence of methods for recovering accurate 3-D models, a simple way to
exploit existing appearance based methods is to synthesize the canonical views of a walking person. In 26, Shakhnarovich et al. compute an image-based visual hull from a set of monocular views, which is then used to render virtual canonical views for tracking and recognition. Gait recognition is achieved by matching a set of image features based on moments extracted from the silhouettes of the synthesized probe video to the gallery. An alternative to synthesizing canonical views is the work of Bobick and Johnson 27. In this work, two sets of activity-specific static and stride parameters are extracted for different individuals. The expected confusion for each set is computed to guide the choice of parameters under different imaging conditions (viz. indoor vs. outdoor, side view vs. angular view, etc.). A cross-view mapping function is used to account for changes in viewing direction. The set of stride parameters (which is smaller than the set of static parameters) is found to exhibit greater resilience to viewing direction. Representation using such a small set of parameters may not give good recognition rates on large databases. In this section we present a view-invariant gait recognition algorithm for the single-camera case. Consider a person walking along a straight line which subtends an angle theta with the image plane (AC in Figure 4). If the distance Z0 of the person from the camera is much larger than the width dZ of the person, then it is reasonable to replace the perspective-projection scaling factor f/(Z0 + dZ) by an average scaling factor f/Z0. In other words, for human identification at a distance, we can approximate the actual 3-D human as a planar object. Assume that we are given a video of a person walking at a fixed angle theta with a translational velocity V = [v_x, 0, v_z]^T (Figure 4). We show that by tracking the direction of motion, alpha, in the video sequence, we can estimate the 3-D angle theta. Using the planarity assumption, knowing the angle theta and the calibration parameters, we can synthesize side views of the sequence of images of an unknown walking person without explicitly computing the 3-D model of the person.
Fig. 4. Imaging Geometry
Tracking. Assuming that we can find the location (x_ref, y_ref) of the person's head at the start of such a segment, we use a sequential Monte Carlo particle filter 28 to track the head of the person to obtain {(x^i(t), y^i(t)), w^i(t)}, where the superscript denotes the index of the particle and w^i(t) denotes the probability weight for the estimate (x^i(t), y^i(t)).
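A generic sequential-importance-resampling particle filter of the kind cited above could look like the following sketch; the likelihood hook, the random-walk motion model and the parameter values are assumptions for illustration only:

```python
import numpy as np

def track_head(likelihood, T, x0, y0, n_particles=200, sigma=3.0, seed=0):
    """Minimal SIR particle filter for the head position (a sketch).

    likelihood(t, pts) -> array of unnormalised weights for candidate image
        locations pts (shape (n, 2)) in frame t; this application-specific
        hook (e.g. a template or colour score) is assumed, not from the text.
    Returns an array of shape (T, 2) with the weighted-mean estimate per frame.
    """
    rng = np.random.default_rng(seed)
    pts = np.tile([x0, y0], (n_particles, 1)).astype(float)
    w = np.full(n_particles, 1.0 / n_particles)
    track = np.zeros((T, 2))
    for t in range(T):
        pts = pts + rng.normal(0.0, sigma, pts.shape)    # random-walk prediction
        w = w * likelihood(t, pts)
        w = w / w.sum()
        track[t] = (w[:, None] * pts).sum(axis=0)        # weighted-mean estimate
        idx = rng.choice(n_particles, n_particles, p=w)  # resample
        pts, w = pts[idx], np.full(n_particles, 1.0 / n_particles)
    return track
```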
Estimation of the 3-D Azimuth Angle. Assume that the motion between two consecutive frames in the video sequence is small. Using the optical-flow-based SfM equations 29 for the constant-velocity model, v_z(t) = v_z (nonzero) and v_x(t) = v_x (nonzero), so that cot(theta(t)) = v_x / v_z. Given the initial position of the tracked point (x_ref, y_ref), it can be shown that

cot(theta) = ( x_ref - y_ref * cot(alpha(x_ref, y_ref)) ) / f.    (7)
Knowing (x_0, y_0), cot(alpha) and theta, f can be computed as part of a calibration procedure.

View Synthesis. Having obtained the angle theta, we synthesize the canonical view. Let Z denote the distance of the object from the image plane. If the dimensions of the object are small compared to Z, then the variation in theta, d(theta), is approximately zero. This essentially corresponds to assuming a planar approximation to the object. Let [X_theta, Y_theta, Z_theta]' denote the coordinates of any point on the person (as shown in Figure 4) who is walking at an angle theta > 0 to the plane passing through the starting point [X_ref, Y_ref, Z_ref]' and parallel to the image plane, which we shall refer to hereafter as the canonical plane. Computing the 3-D coordinates of the synthesized point involves a rotation about the vertical line passing through the starting point followed by a perspective transformation; this yields the equations for [x_0, y_0]' as

x_0 = f ( x_theta cos(theta) + x_ref (1 - cos(theta)) ) / ( f - sin(theta) (x_theta - x_ref) ),
y_0 = f y_theta / ( f - sin(theta) (x_theta - x_ref) ),    (8)

where x = f X / Z and y = f Y / Z. Equation (8) is attractive since it does not involve the 3-D depth; rather, it is a direct transformation of the 2-D image plane coordinates in the non-canonical view into the image plane coordinates in the canonical one. Thus, using the estimated azimuth angle theta, we can obtain a synthetic canonical view using (8). View synthesis corrects both the foreshortening and the distortion of the stride length (see Figures 5(c) and (d)) and improves the gait recognition performance appreciably.
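Equations (7) and (8), as reconstructed above, translate directly into code; the sketch below assumes that image coordinates are expressed in the same units as the focal length f:

```python
import numpy as np

def estimate_cot_theta(x_ref, y_ref, cot_alpha, f):
    """Eq. (7): cot(theta) from the initial head position and the image-plane
    direction of motion alpha measured at that point."""
    return (x_ref - y_ref * cot_alpha) / f

def synthesize_canonical(x_t, y_t, x_ref, theta, f):
    """Eq. (8): map image coordinates of the non-canonical view to the
    synthesized canonical (side) view, without recovering 3-D depth."""
    x_t = np.asarray(x_t, dtype=float)
    y_t = np.asarray(y_t, dtype=float)
    denom = f - np.sin(theta) * (x_t - x_ref)
    x0 = f * (x_t * np.cos(theta) + x_ref * (1.0 - np.cos(theta))) / denom
    y0 = f * y_t / denom
    return x0, y0
```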
5.1. Gait-Based Recognition
We present gait recognition results on two databases. UMD3 database: This consists of 12 people, who walk along straight lines at different values of the azimuth angle theta = 0, 15, 30 and 45 degrees. The image sequences
Fig. 5. Width profile as a function of time for (a) the canonical view (theta = 0); (b) the unnormalized sequence for theta = 45; synthesized views for (c) theta = 30 and (d) theta = 45
corresponding to theta = 0 were used as the gallery while the other sequences were used as probes. The width profile plots for the canonical view and the view synthesized from theta = 45 are shown in Figure 5. As can be seen from these plots, our method has compensated for the foreshortening effect as well as restored the true leg swing. Two consecutive cycles in the canonical view are chosen as the gallery to be compared with two consecutive gait cycles in the probe sequence. The DTW technique is used to match a given probe sequence to the different gallery sequences using binary correlation as a local distance measure, and a similarity matrix S = s(i, j) is obtained, where s(i, j) refers to the similarity between probe i and gallery j. Gait recognition performance for theta = 30 and 45 degrees is shown in Figures 6(a) and (b) using the synthesized and raw images, in terms of a cumulative match characteristic. As noted before, the algorithm results in a broader reproduction of the torso region. The situation can be remedied by assigning a lower weight to the torso region when computing the binary correlation, or by simply ignoring it. We take the latter approach by computing the binary correlation only over the lower half of the boxed image. The result using only the leg region is shown as the dashed lines in Figure 6. It can be seen that the gait recognition result is better than what is obtained by using the entire body. Interestingly, 19 notes that the lower 20% of the silhouette accounts for roughly 90% of the recognition. To boost the gait recognition performance further, certain structural characteristics of the individual that are extracted subsequent to view synthesis, e.g. height, can be fused with the leg dynamics. The height of the probe subject is estimated robustly from the synthesized video as h(i) = median_j h_i(j), j = 1, ..., M, where h_i(j) is the height measured in frame j and M is the length of the probe sequence. We fuse height information together with the leg dynamics by scaling each entry s(i, j) of the similarity matrix by the corresponding height
ratio, max(h_i/h_j, h_j/h_i). The results for this case are shown as the solid line with circles in Figure 6. The fact that the gait recognition results are encouraging up to angles of 45 degrees allows us to hypothesize that it is possible to do reasonable human identification using gait with only two cameras (installed perpendicular to each other).
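The height fusion step amounts to an element-wise rescaling of the similarity matrix; a sketch is given below, treating s(i, j) as a DTW distance so that a height ratio larger than one penalizes probe/gallery pairs whose heights differ (that interpretation is an assumption):

```python
import numpy as np

def fuse_height(S, probe_heights, gallery_heights):
    """Scale each entry s(i, j) by the height ratio max(h_i/h_j, h_j/h_i)."""
    S = np.asarray(S, dtype=float)
    hp = np.asarray(probe_heights, dtype=float)[:, None]
    hg = np.asarray(gallery_heights, dtype=float)[None, :]
    ratio = np.maximum(hp / hg, hg / hp)   # >= 1, equal to 1 for matching heights
    return S * ratio
```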
Fig. 6. Cumulative Match Characteristics for original and synthesized images for (a) theta = 30 and (b) theta = 45 for the UMD3 database

NIST database: This consists of 30 people walking along a Sigma-shaped walking pattern as shown in Figure 7(a). There are two cameras looking at the top horizontal part of the sigma. The camera that is located further away is used in our experiments, since the planar approximation we make is more valid in that case. The segment of the sigma next to the top horizontal part is used as a probe. This segment is at an angle of 33 degrees to the horizontal part. To do gait recognition we employed the fusion of the leg dynamics with the height, since it gave the best performance for the UMD3 database. The gait recognition result is shown in Figure 7(b). As can be seen, the recognition rate is about 60%. One of the reasons for the lower recognition performance in this case is that the image size is rather small. Note, however, that the recognition goes to 100% within 6 ranks.
6. Conclusions and Future Work
In this chapter we investigated the information contained in video sequences of human gait and how to extract and represent that information in ways that facilitate human identification. Human identification using gait, similar to text-based speaker identification, involves different individuals performing the same task, and a template-matching approach is suitable for such problems. In situations where the amount of training data is limited, we showed the utility of a simple feature, viz. the width of the outer contour of the binarized silhouette of the subject and its
Fig. 7. (a) Sigma-shaped walking pattern in the NIST database; (b) gait recognition performance on the NIST database
derivatives for gait recognition in a dynamic time warping framework. By virtue of their deterministic nature, template matching methods have limited noise resilience. To improve robustness, a systematic approach to gait recognition by building representations for the structural and dynamic components of gait using exemplars and hidden Markov models (HMMs) was discussed. Gait can serve as a cue for recognizing people if the database is small. But for large databases, gait information by itself may not be sufficient to recognize an individual. In fact, we must realize that the gait recognition capability of even humans is limited. However, it can be used to narrow down the list of potential matches. In a recent paper 30 we demonstrated the use of gait as a filter to achieve faster human identification by limiting the number of candidates passed to a more accurate face recognition algorithm. Gait can also be used in conjunction with other cues, such as the color of clothing, for short-time verification problems, viz. "was this the same person who walked in front of this camera t minutes ago?". Finally, a view invariant gait recognition algorithm based on synthesizing a side view of a person from an arbitrary monocular view was discussed. The fact that the gait recognition results are encouraging up to angles of 45 degrees allows us to hypothesize that it is possible to do reasonable human identification using gait with only two cameras (installed perpendicular to each other). This could prove to be less restrictive than the visual hull approach, which needs at least 4 cameras. For indoor multiple-camera settings it would also be interesting to study the contribution of 3-D information for gait recognition using multi-camera kinematic models 31,32.
References
1. G. Johansson, "Visual motion perception," Scientific American, vol. 232, pp. 76-88, 1975.
2. L. Kozlowski and J. Cutting, "Recognizing the sex of a walker from a dynamic point display," Perception and Psychophysics, vol. 21, pp. 575-580, 1977.
3. J. Cutting and L. Kozlowski, "Recognizing friends by their walk: gait perception without familiarity cues," Bulletin of the Psychonomic Society, vol. 9, pp. 353-356, 1977.
4. C.D. Barclay, J.E. Cutting, and L.T. Kozlowski, "Temporal and spatial factors in gait perception that influence gender recognition," Perception and Psychophysics, vol. 23, pp. 145-152, 1978.
5. D. Cunado, J.M. Nash, M.S. Nixon, and J.N. Carter, "Gait extraction and description by evidence-gathering," Proc. of the International Conference on Audio and Video Based Biometric Person Authentication, pp. 43-48, 1995.
6. P.S. Huang, C.J. Harris, and M.S. Nixon, "Recognizing humans by gait via parametric canonical space," Artificial Intelligence in Engineering, vol. 13, no. 4, pp. 359-366, October 1999.
7. C. Benabdelkader, R. Cutler, and L.S. Davis, "Motion based recognition of people in eigengait space," Proceedings of the IEEE Conference on Face and Gesture Recognition, pp. 267-272, 2002.
8. D. Tolliver and R. Collins, "Gait shape estimation for identification," Proceedings of AVBPA, pp. 734-742, 2003.
9. L. Lee and W.E.L. Grimson, "Gait analysis for recognition and classification," Proceedings of the IEEE Conference on Face and Gesture Recognition, pp. 155-161, 2002.
10. J. Hayfron-Acquah, M.S. Nixon, and J.N. Carter, "Automatic gait recognition by symmetry analysis," Pattern Recognition Letters, pp. 2175-2183, 2003.
11. J. Han and B. Bhanu, "Statistical feature fusion for gait-based human recognition," Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2004.
12. A. Elgammal, D. Harwood, and L. Davis, "Non-parametric model for background subtraction," FRAME-RATE Workshop, IEEE, 1999.
13. Y. Liu, R.T. Collins, and Y. Tsin, "Gait sequence analysis using Frieze patterns," Proceedings of the European Conference on Computer Vision, vol. 2, pp. 657-671, 2002.
14. S. Furui, "Cepstral analysis technique for automatic speaker verification," IEEE Trans. Acoust., Speech, Signal Processing, vol. 29, no. 2, pp. 254-272, April 1981.
15. H. Sakoe and S. Chiba, "Dynamic programming algorithm optimization for spoken word recognition," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-26, no. 1, pp. 43-49, 1978.
16. J.M. Zacharia, "Text-independent speaker verification using segmental, suprasegmental and source features," M.S. thesis, IIT Madras, Chennai, India, March 2002.
17. A. Kale, N. Cuntoor, B. Yegnanarayana, A.N. Rajagopalan, and R. Chellappa, "Gait-based human identification using appearance matching," in Optical and Digital Techniques for Information Security, B. Javidi, Ed., Springer Verlag, December 2004.
18. P.J. Phillips, S. Sarkar, I. Robledo, P. Grother, and K.W. Bowyer, "The gait identification challenge problem: Data sets and baseline algorithm," Proc. of the International Conference on Pattern Recognition, 2002.
19. P.J. Phillips, S. Sarkar, I. Robledo, P. Grother, and K.W. Bowyer, "Baseline results for the challenge problem of human ID using gait analysis," Proc. of the 5th IEEE Int. Conf. on Automatic Face and Gesture Recognition, 2002.
20. A. Veeraraghavan, A. RoyChowdhury, and R. Chellappa, "Role of shape and kinematics in human movement analysis," Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2004.
21. L.R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257-285, February 1989.
22. A.P. Dempster, N.M. Laird, and D.B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, vol. 39, no. 1, pp. 1-38, 1977.
23. A. Kale, A. Sundaresan, A.N. Rajagopalan, N. Cuntoor, A. RoyChowdhury, V. Kruger, and R. Chellappa, "Identification of humans using gait," IEEE Transactions on Image Processing, July 2004.
24. O.D. Faugeras, Three-Dimensional Computer Vision: A Geometric Viewpoint, MIT Press, 1993.
25. R.I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, 2000.
26. G. Shakhnarovich, L. Lee, and T. Darrell, "Integrated face and gait recognition from multiple views," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, December 2001.
27. A.F. Bobick and A. Johnson, "Gait recognition using static activity-specific parameters," Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 2001.
28. M. Isard and A. Blake, "Contour tracking by stochastic propagation of conditional density," Proceedings of ECCV, no. 1, pp. 343-356, 1996.
29. V. Nalwa, A Guided Tour of Computer Vision, Addison-Wesley, 1993.
30. A. Kale, A. RoyChowdhury, and R. Chellappa, "Fusion of gait and face for human identification," Proc. of the ICASSP, 2004.
31. A. Sundaresan, A. RoyChowdhury, and R. Chellappa, "3D modelling of human motion using kinematic chains and multiple cameras for tracking," Proc. of the Eighth International Symposium on the 3-D Analysis of Human Movement, Tampa, 2004.
32. C. Bregler and J. Malik, "Tracking people with twists and exponential maps," Proc. of IEEE Conference on Computer Vision and Pattern Recognition, 1998.
CHAPTER 4.3 PALMPRINT AUTHENTICATION SYSTEM
David Zhang, Biometrics Research Centre (UGC/CRC), Department of Computing, Hong Kong Polytechnic University, Kowloon, Hong Kong. E-mail: [email protected]
In this chapter, we present a novel biometric authentication system that identifies a person by his/her palmprint. In contrast to existing palmprint systems for criminal applications, the proposed system targets civil applications, which require identifying a person in a large database with high accuracy in real time. The system consists of four major components: the User Interface Module, the Acquisition Module, the Recognition Module and the External Module. More than 7,000 palmprint images have been collected to test the performance of the system. The system can identify 400 palms with a low false acceptance rate, 0.02%, and a high genuine acceptance rate, 98.83%. For verification, the system can operate at a false acceptance rate of 0.017% and a false rejection rate of 0.86%. The execution time for the whole process, including image collection, preprocessing, feature extraction and matching, is less than 1 second.
1. Introduction
Biometrics lies at the heart of today's society. There has been an ever-growing need to automatically authenticate individuals on various occasions in our modern and automated society, for example for information confidentiality, homeland security, and computer security. Traditional knowledge-based or token-based personal identification or verification is so unreliable, inconvenient, and inefficient that it cannot meet the needs of such a fast-paced society. Knowledge-based approaches use "something that you know" to make a personal identification, such as a password or a personal identification number. Token-based approaches use "something that you have" to make a personal identification, such as a passport or an ID card. Since these approaches are not based on any inherent attributes of an individual, they are unable to differentiate between an authorized person and an impostor who fraudulently acquires the "token" or "knowledge" of the authorized person. This is why biometric identification or verification systems have received more attention in recent years.
[Figure 1 block diagram; modules shown: User Interface Module, Acquisition Module (Palmprint Capture, Preprocessing), Recognition Module (Feature Extraction, Template Creation, Matching, Identification), Master Template Database, External Module, Result, Enrollment Path.]
Fig. 1. The breakdown of each module of the palmprint authentication system.
Biometrics involves identifying an individual based on his/her physiological or behavioral characteristics. Many parts of our body and various behaviors carry such information for personal identification. In fact, using biometrics for person authentication is not new; it has been practiced for thousands of years. Numerous research efforts have been devoted to this subject, resulting in the development of various techniques related to signal acquisition, feature extraction, matching and classification. Most importantly, various biometric systems, including fingerprint, iris, hand geometry, voice and face recognition systems, have been deployed for various applications [1]. According to the International Biometric Group (IBG, New York), the market for biometric technologies will nearly double in size this year alone. Among all biometrics, hand-based biometrics, including hand geometry and fingerprint, are the most popular, gaining a 60% market share in 2003 [2]. The proposed palmprint system is also a hand-based biometric technology. Palmprint recognition is concerned with the inner surface of a hand and looks at line patterns and surface shape.
A palm is covered with the same kind of skin as the fingertips and is larger than a fingertip. Therefore, it is quite natural to think of using the palmprint to recognize a person, like fingerprint, hand geometry and hand vein [3-6]. Because of the rich features, including texture, principal lines and wrinkles on palmprints, we believe that they contain enough stable and distinctive information to separate an individual from a large population. We also expect that palmprints are robust to noise because of the large surface area. Several companies, including NEC and PRINTRAK, have developed palmprint systems for criminal applications [8-9]. On the basis of fingerprint technology, their systems exploit high-resolution palmprint images to extract detailed features such as minutiae for matching latent prints. Such an approach is not suitable for developing a palmprint authentication system for civil applications, which requires a fast, accurate and reliable method for personal identification. Based on our previous research work [10], we develop a novel palmprint authentication system to fulfill these requirements. The rest of the chapter is organized as follows. The system framework is described in Section 2. The recognition engine is described in Section 3. Experimental results on verification, identification, robustness, and computation time are provided in Section 4. Finally, conclusions are given in Section 5.
Fig. 2. The palmprint authentication system installed at the Biometric Research Center (UGC/CRC), The Hong Kong Polytechnic University.
2. System Framework
The proposed palmprint authentication system has four major components: User Interface Module, Acquisition Module, Recognition Module and External Module. Fig. 1 shows the breakdown of each module of the palmprint authentication system. Since March 2003, this palmprint authentication system has been installed at the Biometric Research Center, Department of Computing, The Hong Kong Polytechnic University (see Fig. 2). The functions of each component are listed below:
A) User Interface Module provides an interface between the system and users for smooth authentication operation. We designed a flat platen surface for palm acquisition. It is crucial to develop a good user interface so that users are happy to use the device.
B) Acquisition Module is the channel through which palmprints are acquired for further processing.
C) Recognition Module is the key part of our system, which determines whether a user is authenticated. It consists of image preprocessing, feature extraction, template creation, database updating, and matching, and then gives an identification/verification result.
D) External Module receives the signal from the recognition module to allow the requested operations to be performed or to deny them. This module is actually an interfacing component, which may be connected to other hardware or software components. Our system provides an external interface for physical door access control or an employee attendance system.
Since the design philosophy and implementation of the user interface module and the acquisition module have been described in [10], and the external interface is an application-dependent component, we do not discuss them further in this chapter; instead, we concentrate on the recognition module in detail.
3. Recognition Engine
After the palmprint images are captured by the Acquisition Module, they are fed into the recognition engine for palmprint authentication. The recognition engine is the key part of the palmprint authentication system and consists of three stages: image preprocessing, feature extraction, and matching.
3.1. Image Pre-processing
When capturing a palmprint, the position, direction and degree of stretching may vary from time to time. As a result, even palmprints from the same palm can show some rotation and translation, and the sizes of palms also differ from one another. So, the preprocessing algorithm is used to align different palmprints and extract the corresponding central part for feature extraction [7]. In our palmprint system, both the rotation and translation are constrained to some extent by the capture device panel, which locates the palms by several pegs. The preprocessing algorithm then locates the coordinate system of the palmprints quickly by the following five steps:
Step 1: Use a threshold to convert the original grayscale image into a binary image, then use a low-pass filter to smooth the binary image.
Step 2: Trace the boundary of the holes H1 and H2 between the fingers (see Fig. 3).
Step 3: Compute the tangent of the holes H1 and H2. T1 and T2 are the tangent points of H1 and H2, respectively.
Step 4: Align T1 and T2 to determine the Y-axis of the palmprint coordinate system, and make a line passing through the midpoint of the two points (T1 and T2), perpendicular to this Y-axis, to determine the origin of the system.
Step 5: Extract a sub-image of a fixed size based on the coordinate system (see Fig. 3). The sub-image is located at a certain area of the palmprint image for feature extraction.
Fig. 3. The major steps of palmprint preprocessing: H1 and H2 are the boundaries of the holes between the fingers, and T1 and T2 are the tangent points of H1 and H2, respectively. The central box is the sub-image of a palm.
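To make the preprocessing concrete, the sketch below outlines Steps 1-5 in Python with OpenCV. It assumes the tangent points T1 and T2 have already been located on the traced hole boundaries, and the threshold, offset and sub-image size are illustrative placeholders rather than the values used by the authors.

import cv2
import numpy as np

def extract_palm_roi(gray, t1, t2, roi_size=128, offset=50, thresh=40):
    # Step 1: threshold to a binary image, then low-pass (Gaussian) smoothing.
    _, binary = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY)
    binary = cv2.GaussianBlur(binary, (5, 5), 0)

    # Steps 2-3 (boundary tracing and tangent computation) are assumed to have
    # produced the tangent points t1 = (x1, y1) and t2 = (x2, y2) of holes H1, H2.
    t1, t2 = np.float32(t1), np.float32(t2)

    # Step 4: the line T1-T2 defines the Y-axis; the origin lies on the
    # perpendicular through the midpoint of T1 and T2.
    mid = (t1 + t2) / 2.0
    angle = np.degrees(np.arctan2(t2[1] - t1[1], t2[0] - t1[0]))
    center = (float(mid[0]), float(mid[1]))
    # Rotate about the midpoint by (angle - 90) degrees; depending on the
    # y-down image convention the sign may need flipping.
    rot = cv2.getRotationMatrix2D(center, angle - 90.0, 1.0)
    aligned = cv2.warpAffine(gray, rot, gray.shape[::-1])

    # Step 5: crop a fixed-size sub-image at a fixed offset from the midpoint,
    # along the axis perpendicular to T1-T2 (towards the palm centre).
    cx, cy = int(mid[0] + offset), int(mid[1])
    half = roi_size // 2
    return aligned[cy - half:cy + half, cx - half:cx + half]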
3.2. Feature Extraction
The feature extraction technique implemented in the proposed palmprint system is modified from [10], where a single circular zero-DC Gabor filter is applied to the preprocessed palmprint images and the phase information is coded as a feature vector called PalmCode. The modified technique exploits four circular zero-DC Gabor filters with the following general formula:

G_D = (1 / (2πσ²)) exp{ −[(x − x0)² + (y − y0)²] / (2σ²) } [ exp(i2πωx̃) − exp(−2π²ω²σ²) ]    (1)

where x̃ = x cos θ + y sin θ and ỹ = −x sin θ + y cos θ; (x0, y0) is the center of the function in the spatial domain; ω is the frequency of the sinusoidal plane wave along the orientation θ; σ is the standard deviation of the circular Gaussian function; and θ is the direction of the filter. The four Gabor filters share the same parameters σ and ω and differ only in θ; the corresponding values of θ are 0, π/4, π/2 and 3π/4. In the previous approach, only the phase information is exploited and the magnitude information is totally neglected. The proposed method uses the magnitude as a fusion condition to combine the different PalmCodes generated by the four Gabor filters. Mathematically, the implementation has the following steps. 1) The four Gabor filters are applied to the preprocessed palmprint image I, denoted G_j * I, where G_j (j = 1, 2, 3, 4) is the circular zero-DC Gabor filter and "*" represents the convolution operator. 2) The squared magnitude at a sample point is obtained as M_j(x, y) = (G_j(x, y) * I) conj(G_j(x, y) * I), where conj denotes the complex conjugate. 3) According to the fusion rule, k = arg max_j (M_j(x, y)), the phase information at point (x, y) is coded as follows:
h_r = 1   if  Re[G_k * I] ≥ 0,    (2)
h_r = 0   if  Re[G_k * I] < 0,    (3)
h_i = 1   if  Im[G_k * I] ≥ 0,    (4)
h_i = 0   if  Im[G_k * I] < 0.    (5)
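Putting Eqs. (1)-(5) together, the following Python/NumPy sketch illustrates one way the coding step could be implemented; the Gabor construction, parameter values and helper names are assumptions for illustration, not the authors' implementation.

import numpy as np
from scipy.signal import fftconvolve

def circular_gabor(size, omega, sigma, theta):
    # Zero-DC circular Gabor filter of Eq. (1), centred in a size x size window.
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    x_t = x * np.cos(theta) + y * np.sin(theta)
    gauss = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)
    carrier = np.exp(1j * 2 * np.pi * omega * x_t) - np.exp(-2 * np.pi ** 2 * omega ** 2 * sigma ** 2)
    return gauss * carrier

def fusion_code(sub_image, omega=0.1, sigma=5.0, size=17):
    thetas = [0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]
    responses = [fftconvolve(sub_image, circular_gabor(size, omega, sigma, t), mode='same')
                 for t in thetas]                          # step 1: four filtered images
    mags = np.stack([np.abs(r) ** 2 for r in responses])   # step 2: squared magnitudes M_j
    k = np.argmax(mags, axis=0)                            # step 3: fusion rule per pixel
    winner = np.choose(k, responses)                       # response of the winning filter
    h_r = (winner.real >= 0).astype(np.uint8)              # Eqs. (2)-(3)
    h_i = (winner.imag >= 0).astype(np.uint8)              # Eqs. (4)-(5)
    return h_r, h_i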
This coding method is named Fusion Code and is represented by a set of bits. Our experiments show that the Fusion Code is more stable and efficient for palmprint authentication.
3.3. Feature Matching
Feature matching determines the degree of similarity between two templates - the identification template and the master template. In this work, the normalized hamming distance is used to compare two Fusion Codes. It is defined as

D_0 = [ Σ_{i=1}^{N} Σ_{j=1}^{N} ( P_M(i,j) ∩ Q_M(i,j) ∩ ( P_R(i,j) ⊗ Q_R(i,j) ) + P_M(i,j) ∩ Q_M(i,j) ∩ ( P_I(i,j) ⊗ Q_I(i,j) ) ) ] / ( 2 Σ_{i=1}^{N} Σ_{j=1}^{N} P_M(i,j) ∩ Q_M(i,j) )

where P_R (Q_R), P_I (Q_I) and P_M (Q_M) are the real part, imaginary part and mask of the Fusion Code P (Q), respectively; ⊗ and ∩ are the Boolean operators XOR and AND, respectively [10]. The normalized hamming distance ranges between zero and one, where zero represents a perfect match. Because of imperfect preprocessing, one of the Fusion Codes is also translated vertically and horizontally and matched again; the ranges of the vertical and horizontal translations are from -2 to 2. The minimum D_0 value obtained from the translated matchings is taken as the final matching score.
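A compact NumPy sketch of this masked, translated matching might look as follows; the translation bookkeeping via array rolling is one possible simplification (it wraps at the borders), not necessarily the authors' implementation.

import numpy as np

def hamming_distance(p_r, p_i, p_m, q_r, q_i, q_m):
    # Normalized hamming distance: XOR on the real and imaginary bit planes,
    # restricted to pixels where both masks are valid.
    valid = p_m & q_m
    mismatches = (valid & (p_r ^ q_r)).sum() + (valid & (p_i ^ q_i)).sum()
    denom = 2 * valid.sum()
    return mismatches / denom if denom else 1.0

def match_score(p, q, max_shift=2):
    # p and q are (real, imag, mask) triples of equal-sized uint8 arrays.
    # One code is shifted by -2..2 pixels in both directions; the minimum
    # distance over all translations is the final matching score.
    best = 1.0
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = [np.roll(np.roll(a, dy, axis=0), dx, axis=1) for a in q]
            best = min(best, hamming_distance(*p, *shifted))
    return best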
4. Performance Evaluation
4.1. Testing Database
We collected palmprint images from 200 individuals using our palmprint capture device described in [10]. The subjects are mainly students and staff volunteers from The Hong Kong Polytechnic University. In this dataset, 134 people are male, and the age distribution of the subjects is: about 86% are younger than 30, about 3% are older than 50, and about 11% are aged between 30 and 50. In addition, we collected the palmprint images on two separate occasions. On each occasion, the subject was asked to provide about 10 images each of the left palm and the right palm. Therefore, each person provided around 40 images, resulting in a total number of 8,025 images from 400 different palms in our database. In addition, we changed the light source and adjusted the focus of the CCD camera so that the images collected on the first and second occasions could be regarded as being captured by two different palmprint devices. The average time interval between the first and second occasions was 70 days. The size of all the testing
images used in the following experiments is 384 x 284, at 75 dpi, with 256 gray levels.
Fig. 4. Verification test results. (a) and (b) show the genuine and impostor distributions for verification tests with one and three registered images per palm, respectively.
4.2. Experimental Results of Verification
Verification refers to the problem of confirming or denying the identity claim of an individual; it is also considered one-to-one matching. To obtain the verification results, two groups of experiments were carried out separately. In the first experiment, only one palmprint image of each palm is used for registration, while three palmprint images of each palm are used for registration in the second experiment. In the first experiment, each palmprint image is
matched with all other palmprint images in the database. A matching is counted as correct if the two palmprint images are from the same palm, and incorrect otherwise. The total number of matchings is 32,119,735, of which 76,565 are genuine matchings and the rest are incorrect matchings. Fig. 4(a) shows the genuine and impostor distributions estimated from the correct and incorrect matchings. Some thresholds and the corresponding false acceptance rates (FARs) and false rejection rates (FRRs) are listed in Table 1. According to Table 1, using one palmprint image for registration, the proposed system can operate at a low false acceptance rate of 0.096% and a reasonably low false rejection rate of 1.05%.
Table 1. False acceptance rates (FARs) and false rejection rates (FRRs) with different threshold values for the palmprint verification results.
            Registered image = 1        Registered images = 3
Threshold   FAR (%)       FRR (%)       FAR (%)       FRR (%)
0.32        0.000027      8.15          0.000012      5.12
0.34        0.00094       4.02          0.0016        2.18
0.36        0.011         1.94          0.017         0.86
0.38        0.096         1.05          0.15          0.43
0.40        0.68          0.59          1.03          0.19
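As a side illustration of how numbers such as those in Table 1 are obtained, the sketch below computes FAR and FRR at a given threshold from lists of genuine and impostor matching scores; it is generic code, not the authors' evaluation script.

import numpy as np

def far_frr(genuine_scores, impostor_scores, threshold):
    # A matching is accepted when its normalized hamming distance is below the
    # threshold, so FAR counts accepted impostors and FRR counts rejected genuines.
    genuine = np.asarray(genuine_scores)
    impostor = np.asarray(impostor_scores)
    far = 100.0 * np.mean(impostor < threshold)   # % of impostor matchings accepted
    frr = 100.0 * np.mean(genuine >= threshold)   # % of genuine matchings rejected
    return far, frr

# Example: sweep the thresholds used in Table 1.
# for t in (0.32, 0.34, 0.36, 0.38, 0.40):
#     print(t, far_frr(genuine, impostor, t))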
Table 2. FARs and FRRs with different threshold values for the 1-to-400 palmprint identification results.

            Trial = 1                 Trial = 2                 Trial = 3
Threshold   FAR (%)      FRR (%)      FAR (%)      FRR (%)      FAR (%)      FRR (%)
0.320       0.0049       3.69         0.0098       1.80         0.020        1.17
0.325       0.0439       2.93         0.088        1.34         0.131        1.06
0.330       0.15         2.29         0.28         1.02         0.42         0.68
0.335       0.37         1.90         0.68         0.72         0.96         0.48
0.340       0.84         1.51         1.43         0.57         1.93         0.37
0.345       1.45         1.16         2.32         0.42         3.02         0.26
In the second experiment, the testing database is divided into two parts: 1) a registration database and 2) a testing database. Three palmprint images of each palm collected on the first occasion are selected for the registration database. In total, the registration database contains 1,200 palmprint images, and the rest form the testing database. In this verification test, each palmprint image in the testing database is matched with all the palmprint images in the registration database. Therefore, each testing image produces three hamming distances for one registered palm, and we take the minimum of them as the final hamming distance. To achieve statistically reliable results, this test is repeated three times by selecting other palmprint images for the registration database. The total numbers of hamming distances from correct and incorrect matchings are 20,475 and 8,169,525, respectively. Fig. 4(b) shows the genuine and impostor distributions estimated from the correct and incorrect matchings. Some threshold values along with their corresponding false acceptance and false rejection rates are also listed in Table 1. According to Table 1 and Fig. 4, we can conclude that using three templates provides better verification accuracy. In fact, using more palmprint
images of the same palm during registration provides more information to the system, so that it can better tolerate noisy or deformed features. This is also the reason why commercial verification systems require more than one sample for registration.
Fig. 5. The ROC curves for a 1-against-400 identification testing with different number of trials.
4.3. Experimental Results of Identification
Identification is a one-against-many (1-to-N) comparison process. In this experiment, N is set to 400, the total number of different palms in our database. As in the previous verification experiment, the palmprint database is divided into two parts: 1) a registration database and 2) a testing database. The registration database contains 1,200 palmprint images, three images per palm. The testing database has 6,825 palmprint images. Each palmprint image in the testing database is matched against all of the palmprint images in the registration database, so each testing image generates 3 correct and 1,197 incorrect matchings. The minimum hamming distances of the correct and incorrect matchings are regarded as the genuine and impostor identification hamming distances, respectively. This experiment is also called a one-trial test since the user provides only one palmprint image to make one decision. In practice, a biometric system may collect several biometric samples to make one decision. Therefore, in this experiment, we implement one-, two-, and three-trial tests. In the two-trial test, a pair of palmprint images in the testing database belonging to the same palm is matched against all of the palmprint images in the registration database; each pair generates 6 correct and 2,394 incorrect matchings. The minimum hamming distances of the correct and incorrect matchings are again taken as the genuine and impostor identification hamming distances. Similarly, in the three-trial test, the genuine and impostor identification hamming distances are obtained from 9 correct and 3,591 incorrect matchings, respectively. Each test is repeated three
times by selecting other palmprints for the registration database. In each test, the numbers of genuine and impostor identification hamming distances are both 20,475. Fig. 5 shows the ROC curves of the three tests, and Table 2 lists the threshold values along with their corresponding FARs and FRRs. According to Fig. 5 and Table 2, more input palmprints provide more accurate results.
Fig. 6. Three palmprint images with and without rings on the fingers and their corresponding pre-processed sub-images.
4.4. Computation Time
Another key issue for a civilian personal identification system is whether it can run in real time; in other words, the system should run as fast as possible. The proposed method is implemented in C and Assembly language on a PC
with an Intel Pentium IV processor (1.4 GHz) and 128 MB of memory. The execution times for image collection, image preprocessing, feature extraction and matching are listed in Table 3. The total execution time for a 1-against-400 identification, with 3 templates per palm, is less than 1 second, so users do not feel any delay when using our system.
Table 3. Execution time of the palmprint authentication system.
Operations             Execution Time
Image collection       340 ms
Preprocessing          250 ms
Feature extraction     180 ms
Matching               1.3 µs
Fig. 7. Palmprint images with different texts for testing the robustness of the system.
Table 4. The hamming distances between the palmprint images in Fig. 7.

        7(b)    7(c)    7(d)    7(e)    7(f)
7(a)    0.19    0.21    0.27    0.29    0.28
7(b)            0.18    0.27    0.26    0.27
7(c)                    0.27    0.28    0.28
7(d)                            0.23    0.19
7(e)                                    0.19
Fig. 8. Identical twins' palmprints: (a), (b) are their left hands, and (c), (d) are their right hands, respectively.
4.5. Robustness
For a practical biometric system, robustness is another important issue besides accuracy and speed. Here, we present three experiments to illustrate the robustness of our system. The first concerns jewelry such as rings, which may influence the accuracy of some preprocessing algorithms. The second concerns noise on the palmprints, which directly affects the performance of the system. The third is the ability to distinguish the palmprints of identical twins. Fig. 6 shows three palmprint images with and without rings on the fingers and their
corresponding preprocessed sub-images. It shows that the preprocessing algorithm described in Section 3 is not affected by the rings, whereas preprocessing algorithms that rely on such finger information, such as [16], may not be stable under the influence of rings. To verify robustness to noise on the palmprints, Fig. 7(a) provides a clear palmprint image and Figs. 7(b)-(f) show five palmprint images with different texts written on them. Their hamming distances are given in Table 4; all of them are smaller than 0.29. Compared with the impostor hamming distances in Tables 1 and 2, all the hamming distances in Table 4 are relatively small. Fig. 7 and Table 4 illustrate that the proposed palmprint authentication system is very robust to noise on the palmprint. A test on identical twins is regarded as an important test for biometric authentication, but not all biometrics, including face and DNA, can pass this test. However, the palmprints of identical twins contain enough distinctive information to distinguish them. We collected 590 palmprint images from 30 pairs of identical twins' palms. Each of them provided around 10 images of their left palm and 10 images of their right palm. Their ages range between 6 and 45 years. Some samples of identical twins' palmprints are shown in Fig. 8. Based on this database, we match each palmprint in the twin database against his/her identical twin sibling's palmprints to produce impostor matching scores, and match their own samples to obtain the genuine scores. The genuine and impostor distributions are given in Fig. 9. From the figure, we can see that identical twins' palmprints can easily be separated, just like twins' fingerprints [11].
Fig. 9. The genuine and imposter distributions for measuring the similarity of identical twins' palmprints.
5. Conclusions
In this chapter, we have presented a novel biometric system based on the palmprint. The proposed system can accurately identify a person in real time, which is suitable for
various civil applications such as access control. Experimental results show that the proposed system can identify 400 palms with a low false acceptance rate, 0.02%, and a high genuine acceptance rate, 98.83%. For verification, the system can operate at a false acceptance rate of 0.017% and a false rejection rate of 0.86%. The experimental results on accuracy, speed and robustness demonstrate that the palmprint authentication system is comparable with other hand-based biometric systems, such as hand geometry and fingerprint verification systems [12-14], and is practical for real-world applications. The system has been installed at the Biometric Research Center, Department of Computing, The Hong Kong Polytechnic University, for access control since March 2003.

Acknowledgements
This research is partially supported by the UGC/CRC fund from the Hong Kong SAR Government and the central fund from The Hong Kong Polytechnic University. Thanks to my team members, Guangming Lu, Adams Kong, and Michael Wong, for their hard work and unstinting support.

References
1. A. Jain, R. Bolle and S. Pankanti (eds.), Biometrics: Personal Identification in Networked Society, Boston, Mass.: Kluwer Academic Publishers, 1999.
2. International Biometric Group's Biometric Market Report 2003-2007: http://www.biometricgroup.com/reports/public/market_report.html.
3. R. Sanchez-Reillo, C. Sanchez-Avilla and A. Gonzalez-Marcos, "Biometric identification through hand geometry measurements", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 10, pp. 1168-1171, 2000.
4. S.K. Im, H.M. Park, Y.W. Kim, S.C. Han, S.W. Kim and C.H. Kang, "A biometric identification system by extracting hand vein patterns", Journal of the Korean Physical Society, vol. 38, no. 3, pp. 268-272, 2001.
5. A. Jain, L. Hong and R. Bolle, "On-line fingerprint verification", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 4, pp. 302-314, 1997.
6. A.K. Jain, A. Ross and S. Prabhakar, "An introduction to biometric recognition", IEEE Transactions on Circuits and Systems for Video Technology, Special Issue on Image- and Video-Based Biometrics, vol. 14, no. 1, January 2004.
7. G. Lu, D. Zhang and K. Wang, "Palmprint recognition using eigenpalms features", Pattern Recognition Letters, vol. 24, pp. 1473-1477, 2003.
8. NEC automatic palmprint identification system: http://www.nectech.com/afis/download/PalmprintDtsht.q.pdf.
9. Printrak palmprint identification system: http://www.printrakinternational.com/omnitrak.htm.
10. D. Zhang, W.K. Kong, J. You and M. Wong, "On-line palmprint identification", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 9, pp. 1041-1050, 2003.
11. A.K. Jain, S. Prabhakar and S. Pankanti, "On the similarity of identical twin fingerprints", Pattern Recognition, vol. 35, no. 11, pp. 2653-2662, 2002.
12. A.K. Jain, S. Prabhakar, L. Hong and S. Pankanti, "Filterbank-based fingerprint matching", IEEE Transactions on Image Processing, vol. 9, no. 5, pp. 846-859, 2000.
13. A.K. Jain, L. Hong and R. Bolle, "On-line fingerprint verification", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 4, pp. 302-314, 1997.
14. R. Sanchez-Reillo, C. Sanchez-Avilla and A. Gonzalez-Marcos, "Biometric identification through hand geometry measurements", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 10, pp. 1168-1171, 2000.
CHAPTER 4.4
RECONSTRUCTION OF HIGH-RESOLUTION FACIAL IMAGES FOR VISUAL SURVEILLANCE
Jeong-Seon Park and Seong-Whan Lee
Korea University, Anam-dong, Seongbuk-ku, Seoul 136-701, Korea
E-mail: {jspark, swlee}@image.korea.ac.kr
In this chapter we provide a summary of our previous work on the reconstruction of high-resolution facial images for visual surveillance. Specifically, we present our methods for reconstructing a high-resolution facial image from a low-resolution facial image based on example-based learning and iterative error back-projection. In our method, a face is represented by a linear combination of prototypes of shape and texture. With the shape and texture information of the pixels in a given low-resolution facial image, we can estimate optimal coefficients for a linear combination of shape prototypes and texture prototypes by solving a least square minimization. A high-resolution facial image can then be obtained by applying the optimal coefficients to the linear combination of the high-resolution prototypes. Moreover, iterative error back-projection is applied to improve the result of the high-resolution reconstruction. The encouraging results of our methods show that our high-resolution reconstruction methods can be used to improve the performance of face recognition by enhancing the resolution of low-resolution facial images captured in visual surveillance systems.
1. Introduction
There is a growing interest in visual surveillance systems for security areas such as international airports, borders, sports grounds, and safety areas. Various researches on face recognition have been carried out for a long time, but there still exist a number of difficult problems such as estimating facial pose, facial expression variations, resolving object occlusion, changes of lighting conditions, and, in particular, the low-resolution images captured in visual surveillance systems (see Fig. 1). Handling low-resolution images is one of the most difficult and commonly occurring problems in various image processing applications such as the analysis of scientific, medical, astronomical, and weather images, archiving, retrieval and transmission of those images, as well as video surveillance or monitoring. 1
Fig. 1. Example of a low-resolution facial image captured in visual surveillance systems.
Numerous methods have been reported in the area of estimating or reconstructing high-resolution images from a series of low-resolution images or a single low-resolution image. Super-resolution is a typical example of techniques reconstructing a high-resolution image from a series of low-resolution images, 2-4 whereas interpolation produces a large image from only one low-resolution image. We are concerned with building a high-resolution facial image from a low-resolution facial image for visual surveillance. Our task is distinguished from previous works that built high-resolution images mainly from scientific images or image sequences of video data. Our reconstruction method is an example-based, object-class-specific or top-down approach. It is highly tolerant to sensor noise, 5 incompleteness of input images and occlusion by other objects. The example-based approach to interpreting images of deformable objects is now attracting considerable interest among many researchers. 6,7 The motivation for example-based learning lies in its potential for deriving high-level knowledge from a set of prototypical examples. This chapter presents our methods for reconstructing a high-resolution facial image from a low-resolution facial image based on example-based learning and iterative error back-projection. The extended morphable face model 11 is used in example-based learning, and a mathematical procedure for solving the least square minimization (LSM) is applied to the model. Moreover, an iterative error back-projection procedure is applied to improve the result of the high-resolution reconstruction. This chapter is organized as follows. Section 2 gives an overview of our reconstruction method based on the extended morphable face model. In Sec. 3, we explain the procedure of high-resolution shape estimation using example-based learning, the problem of high-resolution reconstruction and the solution of the least square minimization of the reconstruction error. In the next section, we describe our iterative error back-projection procedure, which is composed of estimation, simulation and error compensation. In Sec. 5, experimental results with low-resolution facial
images are provided along with an analysis of these results. Finally, conclusions and future works are discussed in Sec. 6.
2. Overview of High-resolution Reconstruction
In this section, we present an overview of our reconstruction methods using example-based learning based on the extended morphable face model. Suppose that a sufficiently large number of facial images is available for off-line training; then we can represent any input face by a linear combination of a number of facial prototypes. 9 Moreover, if we have a pair of a low-resolution facial image and its corresponding high-resolution image for the same person, we can obtain an approximation to the deformation required for the given low-resolution facial image by using the coefficients of the examples. Then we can obtain a high-resolution facial image by applying the estimated coefficients to the corresponding high-resolution example faces (see Fig. 2). Consequently, our goal is to find an optimal parameter set α which best estimates the high-resolution facial image from a given low-resolution facial image.
Fig. 2. Basic idea of the high-resolution reconstruction based on example-based learning (an input face approximated as a weighted combination of prototypes 1 to M).
2.1. Extended morphable face model
Our reconstruction method is based on the morphable face model introduced by Poggio et al. 8 and developed further by Vetter et al. 9,10 Assuming that the pixel-wise correspondence between facial images has already been established, 9 the 2D shape of a face is coded as the displacement fields from a reference face. The shape of a facial image is represented by a vector S = (d_1^x, d_1^y, ..., d_N^x, d_N^y)^T ∈ R^{2N}, where N is the number of pixels in the facial image and (d_k^x, d_k^y) is the x, y displacement of a pixel that corresponds to a pixel x_k in the reference face, denoted by S(x_k). The
texture is coded as the intensity map of the image which results from mapping the input face onto the reference face. Thus, the shape-normalized (or shape-free) texture is represented as a vector T = (i_1, ..., i_N)^T ∈ R^N, where i_k is the intensity or color of a pixel that corresponds to a pixel x_k among the N pixels in the reference face, denoted by T(x_k). Next, we transform the orthogonal coordinate system by principal component analysis (PCA) into a system defined by the eigenvectors s_p and t_p of the covariance matrices C_S and C_T computed on the set of M faces, where S̄ and T̄ represent the mean shape and the mean texture, respectively. Then, a facial image can be represented by the following equation:

S = S̄ + Σ_{p=1}^{M−1} α_p s_p,    T = T̄ + Σ_{p=1}^{M−1} β_p t_p    (1)

where α, β ∈ R^{M−1}. In order to reconstruct a high-resolution facial image from a low-resolution one, we define an extended morphable face model in which an extended face is composed of a pair of a low-resolution face and a high-resolution one. An extended face is then separated into an extended shape and an extended texture according to the definition of the morphable face model, as shown in Fig. 3.
Fig. 3. Examples of face representations based on the extended morphable face model: (a) morphable face model (reference, shape, texture); (b) extended morphable face model (extended reference, extended input, extended shape S+, extended texture T+).
Then we can define S^+ = (d_1^x, d_1^y, ..., d_L^x, d_L^y, d_{L+1}^x, d_{L+1}^y, ..., d_{L+H}^x, d_{L+H}^y)^T to be an extended shape vector, obtained by simply concatenating the low-resolution shape and the high-resolution shape, where L is the number of pixels in the low-resolution facial image and H is the number of pixels in the high-resolution one. Similarly, let us define T^+ = (i_1, ..., i_L, i_{L+1}, ..., i_{L+H})^T to be an extended texture vector. Then, by applying PCA to both the shape S^+ and the texture T^+, the face image in Eq. (1) can be expressed as follows:

S^+ = S̄^+ + Σ_{p=1}^{M−1} α_p s_p^+,    T^+ = T̄^+ + Σ_{p=1}^{M−1} β_p t_p^+    (2)

where α, β ∈ R^{M−1}.
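To make Eqs. (1)-(2) concrete, the sketch below fits such a linear prototype model with PCA over vectorized shape (or texture) examples; the variable names and the use of scikit-learn are illustrative assumptions, not the authors' code.

import numpy as np
from sklearn.decomposition import PCA

def fit_prototype_model(examples, n_components):
    # examples: (M, D) array, one vectorized extended shape (or texture) per row.
    pca = PCA(n_components=n_components)
    pca.fit(examples)
    mean = pca.mean_              # S-bar^+ (or T-bar^+)
    bases = pca.components_       # rows are the eigenvectors s_p^+ (or t_p^+)
    return mean, bases

def synthesize(mean, bases, alpha):
    # Eq. (2): extended vector = mean + sum_p alpha_p * s_p^+
    return mean + alpha @ bases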
2.2. Reconstruction procedure
Before explaining our example-based reconstruction procedure, we define two types of warping processes, forward and backward warping (see Fig. 4).
Fig. 4. Examples of forward warping and backward warping.
Forward warping warps a texture expressed in the reference face onto the input face by using its shape information; this process results in an input facial image. Backward warping warps an input face onto the reference face by using its shape information; this process results in texture information expressed in the reference shape. The mathematical definition and more details about forward and backward warping can be found in Ref. 9. Our reconstruction procedure consists of 4 steps, starting from a low-resolution facial image and ending with a high-resolution facial image (see Fig. 5). Here the displacement of the pixels in an input low-resolution face which correspond to those in the reference face is known.
Step 1. Obtain the texture of an input low-resolution facial image by backward warping.
Step 2. (a) Estimate a high-resolution shape from the given low-resolution shape. (b) Improve the estimated high-resolution shape by iterative error back-projection.
Step 3. (a) Estimate a high-resolution texture from the low-resolution texture obtained in Step 1. (b) Improve the estimated high-resolution texture by iterative error back-projection.
Step 4. Synthesize a high-resolution facial image by forward warping the improved texture with the improved shape.

Fig. 5. Procedure for reconstructing a high-resolution facial image from a low-resolution one.

Step 1 and Step 4 follow previous studies of morphable face models. 6,9 Steps 2 and 3 are carried out by a similar mathematical procedure, except that the shape at a pixel is a 2D vector while the texture is a 1D (or 3D, for RGB color images) vector. Therefore, we describe only Step 2: (a) estimating high-resolution shape information from low-resolution information using example-based learning, and (b) improving the result of the initial high-resolution estimate by applying the iterative error back-projection procedure.
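For illustration, backward warping with a dense displacement field can be written with standard resampling; the sketch below is a guess at one reasonable implementation (bilinear interpolation via SciPy), not the formulation of Ref. 9.

import numpy as np
from scipy.ndimage import map_coordinates

def backward_warp(image, dx, dy):
    # image: (H, W) input face; dx, dy: per-pixel displacements S(x_k) that map
    # each reference-face pixel to its corresponding input-face position.
    h, w = image.shape
    yy, xx = np.mgrid[0:h, 0:w].astype(float)
    coords = np.stack([yy + dy, xx + dx])        # sampling positions in the input face
    # Bilinear resampling yields the shape-normalized texture on the reference grid.
    return map_coordinates(image, coords, order=1, mode='nearest')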
3. High-resolution Estimation using Example-based Learning
In this section, we present our reconstruction method using example-based learning, the problem of high-resolution reconstruction and the solution of the least square minimization of the reconstruction error.
3.1. Problem definition
Since shape information is available only for the low-resolution facial image, we need an approximation to the deformation required for the low-resolution shape using the coefficients of the bases (see Fig. 2). The goal is to find an optimal set α_p that satisfies

S^+(x_j) = Σ_{p=1}^{M−1} α_p s_p^+(x_j),    j = 1, ..., L,    (3)

where x_j is a pixel in the low-resolution facial image, L is the number of pixels in the low-resolution image, and M − 1 is the number of bases. We assume that the number of observations, L, is larger than the number of unknowns, M − 1. Generally there may not exist a set of α_p that perfectly fits S^+, so the problem is to choose α so as to minimize the reconstruction error. For this, we define an error function E(α), the sum of squared errors, which measures the difference between the known displacements of the pixels in the low-resolution input image and their represented ones:

E(α) = Σ_{j=1}^{L} ( S^+(x_j) − Σ_{p=1}^{M−1} α_p s_p^+(x_j) )²    (4)

where x_1, ..., x_L are pixels in the low-resolution facial image. Then the problem of reconstruction is formulated as finding the α which minimizes the error function:

α* = arg min_α E(α).    (5)

3.2. Solution by least square minimization
The solution to Eqs. (4)-(5) is nothing more than a least square solution. Eq. (3) is equivalent to the following matrix equation:

| s_1^+(x_1)  ...  s_{M−1}^+(x_1) |  | α_1     |     | S^+(x_1) |
|    ...       ...       ...      |  |  ...    |  =  |   ...    |    (6)
| s_1^+(x_L)  ...  s_{M−1}^+(x_L) |  | α_{M−1} |     | S^+(x_L) |

This can be rewritten as

S^+ α = S^+,    (7)

where the L × (M − 1) matrix S^+, the coefficient vector α and the right-hand side S^+ are given by

S^+ = [ s_1^+(x_1) ... s_{M−1}^+(x_1) ; ... ; s_1^+(x_L) ... s_{M−1}^+(x_L) ],   α = (α_1, ..., α_{M−1})^T,   S^+ = (S^+(x_1), ..., S^+(x_L))^T.    (8)
The least square solution to an inconsistent system S^+ α = S^+ of L equations in M − 1 unknowns satisfies S^{+T} S^+ α* = S^{+T} S^+. If the columns of S^+ are linearly independent, then (S^{+T} S^+) is non-singular and has an inverse, so that

α* = (S^{+T} S^+)^{−1} S^{+T} S^+.    (9)

The projection of S^+ onto the column space is therefore Ŝ^+ = S^+ α*. By using Eq. (9), we can obtain the high-resolution shape:

S^+(x_{L+j}) = S̄^+(x_{L+j}) + Σ_{p=1}^{M−1} α*_p s_p^+(x_{L+j}),    j = 1, ..., H,    (10)
where x_{L+1}, ..., x_{L+H} are pixels in the high-resolution facial image, and L and H are the numbers of pixels in the low-resolution and high-resolution facial images, respectively. By using Eq. (10), we obtain the correspondence of the high-resolution shape. Previously we made the assumption that the columns of S^+ are linearly independent; otherwise, Eq. (9) may not be satisfied. If S^+ has dependent columns, the solution α* will not be unique. The optimal solution in this case can be obtained with the pseudoinverse of S^+. 7 But for our purpose of effectively reconstructing a high-resolution shape (or texture) from a low-resolution one, this is unlikely to happen.
4. Improvement by Iterative Error Back-Projection
In this section, we explain the procedure of our iterative error back-projection method for improving the results of the high-resolution estimation. In order to reconstruct a high-resolution facial image from only one low-resolution image, we used an example-based learning or top-down approach, and to improve the accuracy of the high-resolution estimation, we applied an iterative error back-projection procedure to the example-based learning. Numerous kinds of error back-projection have been applied to various problems such as super-resolution. 4 According to our example-based learning method, we approximate a given low-resolution shape with some error, defined by Eq. (4) at the estimated α*. So we can expect that the estimated high-resolution shape also has some error; if we knew the original high-resolution shape, it would be

E(α*) = Σ_{j=1}^{H} ( S^+(x_{L+j}) − Σ_{p=1}^{M−1} α*_p s_p^+(x_{L+j}) )²    (11)

where x_{L+1}, ..., x_{L+H} are pixels in the high-resolution facial image. Our iterative error back-projection is applied to reduce the reconstruction error by iteratively compensating the high-resolution error, which is estimated by a similar error reconstruction from the simulated low-resolution error. The flowchart of the procedure we have designed for iterative error back-projection is outlined in Fig. 6.
Fig. 6. Flowchart of our iterative error back-projection method (estimation, simulation and error compensation stages).

The notations shown in this figure are defined as follows:
L^l : input low-resolution data (shape or texture)
t : iteration index, t = 1, 2, ..., T
H_t : reconstructed high-resolution data at iteration t
L_t^s : low-resolution data simulated by down-sampling the reconstructed high-resolution one at iteration t
D_t : reconstruction error, measured as the Euclidean distance between the input low-resolution data and the simulated low-resolution one at iteration t
T_1 : threshold value to determine whether the current reconstruction is accurate or not
T_2 : threshold value to determine whether the current iteration is convergent or not
L_t^e : reconstruction error data given by the pixel-wise difference between the input low-resolution data and the simulated low-resolution one at iteration t
H_t^e : reconstructed high-resolution error data estimated from the low-resolution error data at iteration t
ω_t : weight value for error compensation at iteration t
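Using these notations, the loop of Fig. 6 together with the least square estimation of Eqs. (6)-(10) can be sketched as below; the down-sampling choice, thresholds and weights are illustrative placeholders, not the authors' settings.

import numpy as np

def reconstruct_lsm(low_vec, bases_low, bases_high, mean_low=0.0, mean_high=0.0):
    # Eqs. (6)-(10): solve the least square problem on the low-resolution rows of
    # the prototype matrix, then evaluate the same coefficients on the
    # high-resolution rows.  bases_low is L x (M-1), bases_high is H x (M-1).
    alpha, *_ = np.linalg.lstsq(bases_low, low_vec - mean_low, rcond=None)
    return mean_high + bases_high @ alpha

def back_project(low_res, bases_low, bases_high, mean_low, mean_high,
                 downsample, T1=1e-3, T2=1e-6, max_iter=10, weight=0.5):
    # Fig. 6 loop; `downsample` maps high-resolution data to the low-resolution
    # grid (e.g. block averaging), and T1, T2 and weight are example values.
    high = reconstruct_lsm(low_res, bases_low, bases_high, mean_low, mean_high)
    prev_dist = np.inf
    for _ in range(max_iter):
        simulated = downsample(high)                 # simulation stage: L_t^s
        dist = np.linalg.norm(low_res - simulated)   # D_t
        if dist < T1 or abs(prev_dist - dist) < T2:  # accurate enough or converged
            break
        err_low = low_res - simulated                # L_t^e
        err_high = reconstruct_lsm(err_low, bases_low, bases_high)  # H_t^e
        high = high + weight * err_high              # error compensation stage
        prev_dist = dist
    return high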
First, in the estimation stage, we estimate the high-resolution data H_1 from the input low-resolution data L^l by using the least square minimization solution described in Sec. 3.2. Second, in the simulation stage, to verify the reconstruction accuracy of our example-based method we simulate a low-resolution data L_t^s from the estimated high-resolution one by down-sampling it, and measure the distance D_t between the input low-resolution data and the simulated one. We assume that if a reconstruction is successful, the reconstruction error (measured distance) will be very small. From this assumption, we determine whether the reconstruction is accurate by comparing the current reconstruction error with one threshold value (T_1), and whether the iteration has converged by comparing the change between the previous and current distances with another threshold value (T_2). If one or both conditions are satisfied, the current result of the reconstruction is taken as the final high-resolution shape; otherwise the error back-projection stage is performed. Third, in the error back-projection stage, we create a low-resolution error shape vector as the pixel-wise difference between the input low-resolution shape and the simulated one, estimate a high-resolution shape error vector from this low-resolution error vector by a procedure similar to the example-based reconstruction, and then compensate the previously estimated high-resolution shape by adding the currently estimated high-resolution shape error. In this stage, we can use a weight value ω_t smaller than 1 in order to prevent divergence of the iterations. The weight can be varied according to the current distance: if the distance is large then the weight is also large. By using this iterative procedure, we try to improve the result of the high-resolution estimation by iteratively compensating the error vector.
5. Experimental Results and Analysis
5.1. Face database
For testing the performance of our reconstruction methods, we used 200 facial images of Caucasian faces rendered from a database of 3D head models recorded with a laser scanner. 9,10 The original images were color images of size 256 x 256 pixels. They were converted to 8-bit gray level and resized to 16 x 16 and 32 x 32 low-resolution facial images by Bicubic interpolation. PCA was applied to a random subset of 100 facial images to construct the bases of the defined face model; the other 100 images were used for testing our algorithm. Next, we use a hierarchical, gradient-based optical flow algorithm to obtain a pixel-wise correspondence. 9 The correspondence is defined between a reference facial image and every image in the database, and is estimated by the local translation between corresponding gray level intensity patches.
5.2. Reconstruction results and analysis
As mentioned before, the 2D shape and texture of facial images are treated separately. Therefore, a high-resolution facial image is reconstructed by synthesizing the estimated (or improved) high-resolution shape and the estimated (or improved) high-resolution texture.
Fig. 7. Examples of 256 x 256 high-resolution facial images reconstructed from 16 x 16 low-resolution facial images: (a) input; (b) Bilinear; (c) Bicubic; (d) previous; (e) proposed; (f) original.

Figures 7 and 8 show examples of 256 x 256 high-resolution facial images reconstructed from 16 x 16 and 32 x 32 low-resolution images, respectively. In these figures, (a) shows the input low-resolution images, and (b) to (e) show the high-resolution images reconstructed by Bilinear interpolation, Bicubic interpolation, the previous method 11 (using only example-based reconstruction) and the proposed method (which enhances (d) by applying iterative error back-projection), respectively. Finally, (f) shows the original high-resolution facial images. As shown in Fig. 7, classifying the input low-resolution faces is almost impossible, even with the use of Bilinear or Bicubic interpolation. On the other hand, the facial images reconstructed by the example-based learning methods, especially those reconstructed by the proposed method, are more similar to the original faces than the others. Similar but better results were obtained from the 32 x 32 low-resolution facial images (see Fig. 8). For quantitative evaluation of the performance of our reconstruction methods, we measured the mean reconstruction errors in shape, texture and facial image with respect to the original high-resolution data.
Fig. 8. Examples of 256 x 256 high-resolution facial images reconstructed from 32 x 32 low-resolution facial images: (a) input; (b) Bilinear; (c) Bicubic; (d) previous; (e) proposed; (f) original.
The horizontal axes of Figs. 9 and 10 represent the input low-resolution images, the two interpolation methods, and our previous and proposed reconstruction methods. Their vertical axes represent the mean displacement error per pixel for the shape vectors and the mean intensity error per pixel (for images using 256 gray levels) for the texture and image vectors, respectively. ErrS_x and ErrS_y in Fig. 9 are the mean displacement errors along the x- and y-axes for shape, respectively, and ErrT and ErrI in Fig. 10 are the mean intensity errors for texture and for the facial image, respectively. As shown in Figs. 9 and 10, the reconstruction errors of the proposed method are smaller than those of the others. From the encouraging results of the proposed method shown in Figs. 7-10, we can expect that our reconstruction method can be used to improve the performance of face recognition systems by reconstructing a high-resolution facial image from the low-resolution facial images captured in visual surveillance systems. In order to verify the effect of the resolution enhancement, we carried out simple face recognition experiments under the following configuration: the original 256 x 256 facial images were registered, and the high-resolution facial images reconstructed from 16 x 16 facial images were used as recognition data. Figure 11 shows the correct recognition rates of the face recognition experiments.
Fig. 9. Comparisons of mean reconstruction errors in shape (mean displacement error per pixel, ErrS_x and ErrS_y) for (a) input, (b) Bilinear, (c) Bicubic, (d) previous and (e) proposed.
Fig. 10. Comparisons of mean reconstruction errors in texture and image (mean intensity error per pixel) for (a) input, (b) Bilinear, (c) Bicubic, (d) previous and (e) proposed.
As we can see, the recognition performance is improved by using the proposed reconstruction methods.
6. Conclusions
In this chapter, we provided efficient methods for reconstructing high-resolution facial images based on example-based learning and iterative error back-projection. Our reconstruction method consists of the following steps: computing the linear coefficients minimizing the error, or difference, between the given shape or texture and the linear combination of the shape or texture prototypes in the low-resolution image, and applying the coefficient estimates to the shape and texture prototypes in the high-resolution facial image, respectively.
Fig. 11. Comparisons of recognition performance (correct recognition rates, %) for (a) input, (b) Bilinear, (c) Bicubic, (d) previous and (e) proposed.
The experimental results are very natural and plausible, like the original high-resolution facial images, when the displacement of the pixels in an input face which correspond to those in the reference face is known. It remains a challenge for researchers to obtain the correspondence between the reference face and a given facial image under low-resolution vision tasks. In order to apply our example-based reconstruction methods to real-environment visual surveillance systems, we must develop an automatic algorithm to overcome the restriction of the prior condition that the displacement of the pixels in an input face which correspond to those in the reference face is known.
Acknowledgments
We would like to thank the Max-Planck-Institute for providing the MPI Face Database.
References
1. B. Tom and A.K. Katsaggelos. Resolution Enhancement of Monochrome and Color Video Using Motion Compensation. IEEE Transactions on Image Processing, Vol. 10, No. 2, pages 278-287, February 2001.
2. S. Baker and T. Kanade. Limits on Super-Resolution and How to Break Them. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 9, pages 1167-1183, September 2002.
3. M. G. Kang and S. Chaudhuri. Super-Resolution Image Reconstruction. IEEE Signal Processing Magazine, Vol. 23, No. 3, pages 19-36, May 2003.
4. F. Dekeyser, P. Perez, and P. Bouthemy. Restoration of Noisy, Blurred, Undersampled Image Sequences Using Parametric Motion Model. In Proceedings of the International Symposium on Image/Video Communications over Fixed and Mobile Networks, ISIVC 2000, Rabat, Morocco, April 2000.
5. P. S. Windyga. Fast Impulsive Noise Removal. IEEE Transactions on Image Processing, Vol. 10, No. 1, pages 173-178, 2001.
6. M. J. Jones, P. Sinha, T. Vetter, and T. Poggio. Top-down Learning of Low-level Vision Tasks [Brief Communication]. Current Biology, Vol. 7, pages 991-994, 1997.
7. B.-W. Hwang and S.-W. Lee. Reconstruction of Partially Damaged Face Images Based on a Morphable Face Model. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 25, No. 3, pages 365-372, 2003.
8. D. Beymer, A. Shashua, and T. Poggio. Example-Based Image Analysis and Synthesis. AI Memo 1431/CBCL Paper 80, Massachusetts Institute of Technology, Cambridge, MA, November 1993.
9. T. Vetter and N. E. Troje. Separation of Texture and Shape in Images of Faces for Image Coding and Synthesis. Journal of the Optical Society of America A, Vol. 14, No. 9, pages 2152-2161, 1997.
10. V. Blanz, S. Romdhani, and T. Vetter. Face Identification across Different Poses and Illuminations with a 3D Morphable Model. In Proceedings of the 5th International Conference on Automatic Face and Gesture Recognition, Washington, D.C., pages 202-207, 2002.
11. J.-S. Park and S.-W. Lee. Resolution Enhancement of Facial Image Based on Top-down Learning. In Proceedings of ACM SIGMM 2003 Workshop on Video Surveillance, Berkeley, CA, pages 59-64, November 2003.
CHAPTER 4.5
OBJECT RECOGNITION WITH DEFORMABLE FEATURE GRAPHS: FACES, HANDS, AND CLUTTERED SCENES
Jochen Triesch Department of Cognitive Science, UC San Diego 9500 Gilman Drive, La Jolla, California, 92093-0515, USA E-mail: [email protected]
Christian Eckes
Fraunhofer IMK NetMedia, Schloss Birlinghoven, D-53754 Sankt Augustin, Germany
E-mail: [email protected]
A fundamental question in invariant object recognition is that of representation. This chapter reviews object representations based on deformable feature graphs that describe particular views of an object as a spatial constellation of image features. These representations are particularly useful in situations of high clutter and partial occlusions. We demonstrate the benefits of these representations in three recognition applications: face analysis, hand gesture recognition, and the interpretation of cluttered scenes composed of multiple partly occluded objects. We conclude by discussing current trends and open challenges.
1. Introduction
1.1. The Problem of Invariant Object Recognition
Object recognition is perhaps the most fundamental problem in computer vision. Humans can recognize objects without effort despite background clutter, partial occlusions, or changes in object location, view point, lighting situation, etc. This capability is called invariant recognition because the recognition process has to be invariant to these and other environmental changes that can dramatically change all the pixels of an image while the depicted object retains its identity (Fig. 1). Despite considerable effort it has not been possible to endow computer vision systems with levels of competence that are anywhere close to human performance. What makes the task so difficult? A part of the answer is that invariant recognition is a computationally very demanding task and today's computers are no match for the processing power of the human visual system. Another part of the answer may be that we have not found the right way of representing objects that facilitates efficient solutions to the invariant recognition problem.
Fig. 1. Left: invariant recognition. Humans can easily recognize a person or an object despite great variability in the appearance due to pose, lighting, occlusion, clutter, facial expression and other factors. Right: segmenting an object from the background can be quite difficult and may require prior knowledge of the object's appearance — in this case the dalmatian dog in R.C. James's famous picture.
This chapter reviews approaches that represent particular views of an object as a deformable constellation of certain image features.
1.2. Recognition and Segmentation
A problem closely related to the invariant recognition problem is the segmentation problem. Segmentation refers to the process of grouping together all pixels in an image that belong to the same object. The way in which researchers view the relation between recognition and segmentation is a topic of substantial controversy. The traditional view is that segmentation should happen independent of and prior to object recognition. Image elements are grouped together solely based on low level properties of individual pixels or small image patches such as brightness, color, motion, stereo disparity, or texture. Once an object has been segmented recognition proceeds by extracting relevant features from the segmented object and subsequent classification. This approach may be called "segmentation, feature extraction, recognition (SFR)". It is the method of choice in many application domains such as industrial inspection, where a high degree of control over the imaging process is possible. In this case, the proper choice of background and lighting as well as a possibly quite restricted set of objects can make segmentation relatively easy. When it comes to vision in more natural and less controlled environments, however, the SFR approach is severely limited. Despite decades of intensive research, it has been impossible to find robust solutions to the segmentation problem that can handle the complexities and ambiguities of everyday visual scenes 1 . The necessary information for correctly grouping all the pixels of an image to the proper objects may simply not be present in the low level pixel properties, as illustrated by R.C. James's famous picture of a dalmatian dog (Fig. 1, right). There are two alternatives to the SFR scheme. The first one simply ignores the segmentation problem altogether and tries to recognize objects directly from features extracted from the entire image. We will refer to this approach as "direct
recognition (DR)". This is the viewpoint we will focus on in this chapter. The second alternative tries to couple segmentation and recognition in a closed feedback loop such that partial or tentative segmentation results aid recognition while provisional recognition results try to improve segmentation. This idea could be called "integrated segmentation and recognition (ISR)". This second approach has hardly been studied in the literature but is likely to become an important topic for computer vision research over the next years. 1.3. Object Representations
for Direct
Recognition
In the following we will contrast two kinds of methods that try to recognize objects on the basis of features extracted directly from the raw unsegmented input image: feature constellation techniques and invariant feature techniques. Our goal is not to provide a comprehensive review but rather to just highlight some important ideas. Feature constellation techniques represent a particular view of an object as a spatial constellation of localized image features. 2-15 During recognition, a similar constellation of image features is searched for in a novel input image, and a goodness-of-fit criterion evaluates the match. These techniques vary in a number of aspects, including the kinds of image features used in the representation and the algorithm for matching. In order to reach invariance with respect to pose changes, appearance changes, articulation, etc. these methods have to do one or several of the following: 1) represent very different views of an object with separate models, 2) model the variability in the appearance of individual features, 3) model variability in the spatial arrangement of the features. Depending on the number of degrees of freedom in the models, the matching procedure can become very expensive to compute. An early example of such a technique was proposed by Fischler and Elschlager, 2 who proposed a method similar to dynamic programming to match object models to the input image. The matching procedure seeks to minimize a cost function that is a sum of two terms: the first punishes dissimilarities of image features to the stored model features while the second penalizes geometric distortions. It has been shown that such a cost function can be derived from a statistical approach for modeling object appearance under a number of assumptions. A widely used feature constellation technique is Elastic Graph Matching 4,6,13 (EGM), which will be discussed in more detail below. There, the image features stored in an object model are typically convolution results of Gabor wavelets 16 of different orientations and spatial scales. Matching is usually done in a heuristic coarse-to-fine strategy to make the optimization more tractable. EGM has been very successful in a number of difficult recognition applications, ranging from face detection and recognition, 4,6,17 over gesture recognition against cluttered backgrounds, 18,13 to the analysis of scenes with partially occluded objects, 19,8 just to name a few. There has also been significant progress in employing probabilistic
frameworks 7 ' 10 ' 12 ' 20 . For example Lanitis et al. achieve compact descriptions of face shape variations using a point distribution model and use discriminant analysis techniques to aid recognition 7 . Nelson and Selinger9'21 have used a manually constructed "perceptual grouping hierarchy" in a 3D object recognition system. At the lowest level, local edge elements are grouped to curve segments. These are in turn grouped to so-called context patches that describe a local arrangement of curve segments. An arrangement of these context patches then describes a particular view of an object. Finally, several such views are grouped together to obtain a full 3D object description. Their approach is very interesting because of the use of a hierarchical object description and it gives very good results for a 3D object recognition task for uniform backgrounds. However, it is not very robust with respect to cluttered backgrounds. Invariant feature techniques. The origins of these approaches 22 ' 23 ' 24 can be traced back to the work of Swain and Ballard 25 on using color histograms for object recognition. In a sense, they represent an extreme stance for reaching invariant recognition. The idea is to extract a number of invariant features, i.e. features whose responses are supposed to stay the same under a number of object transformations like translation, scaling, rotation, and change of aspect. Early work in this direction focused on color histograms 25 and ways to make them more robust with respect to illumination changes 26 ' 27 . More recently, researchers have also used a number of other more complex features. To this end, the responses of a potentially large number of filters of different scales and orientations are applied to the entire image and the responses are then pooled together. Mel's SeeMore system 22 simply adds all the responses of the same filter type for different scales, and orientations applied to the entire image. It evaluates 102 different feature channels, whose responses signal how frequent features of one type are observed in the image regardless of their position, orientation, or scale. Recognition is performed by comparing the 102 dimensional vectors using a nearest neighbor classifier. Similarly, Schiele and Crowely24 use a histogram representation. The responses of filters of one type but different scales, orientations, and positions are all combined into the same histogram bin. Recognition is done by comparing these histograms with statistical techniques. These approaches have given impressive results for recognition on large object data bases. However, they do not perform very well for cluttered scenes. Typically, researchers use data bases with objects on homogeneous backgrounds, and experiments have shown that heavily cluttered backgrounds can have a strong negative impact on performance. The problem is that as soon as there is cluttered background or there are multiple objects in the image, the invariant feature representation will throw all local image features "into one pot" resulting in a super-position catastrophe. For example, the histogram representation is a super-position of contributions from the background and all objects in the scene from which it is no longer possible
to tell the individual sources if several objects can give rise to the same features. There is no way of "grouping" or "binding" together the features stemming from the same object. Several attempts have been made to remedy this problem. The first restricts histogram computation to localized image regions 28,24. This approach works reasonably well if there are few objects in an otherwise empty scene, but it is unclear whether it is sufficient to overcome the challenge of unfamiliar cluttered backgrounds without any prior assumptions about object sizes and shapes. The second is the use of highly specific feature detectors that are only triggered for very few objects 29. The idea is that if several such features are "triggered" this provides very strong evidence for the presence of the object. At the same time, false positives are thought to be avoided through the high specificity of the feature detectors. To date, it remains an open question how feature detectors with the needed specificity and the desired invariance characteristics can be found. A third approach is to introduce semi-local constraints 23 between the invariant features extracted at interest points. This approach is closely related to the feature constellation techniques discussed above. Considering that some feature constellation techniques use somewhat invariant features and that some invariant feature techniques consider spatial constraints among features, it becomes clear that these approaches really lie on a continuum. The features extracted can have varying degrees of invariance to specific transformations 11 (translation, rotation, scale, etc.), and there can be more or less tight constraints regarding their arrangement in the image. Just how much invariance of what kind is optimal for a specific recognition problem, and how proper features can be learned, is an open question. Similarly, the right level of complexity and specificity of features is an important unsolved problem. Inspiration regarding the answers to these difficult questions may come from the study of biological vision systems, which seem to use a hierarchy of feature detectors of increasing complexity and invariance properties 30.

2. Object Recognition with Labeled Feature Graphs

In the remainder of this chapter we will focus on a specific kind of feature constellation technique: Elastic Graph Matching 4 (EGM). EGM is an object recognition architecture that has been successfully applied to object recognition, face and gesture recognition, and the analysis of cluttered scenes. Although motivated by a theory of neural information processing, EGM is similar to other elastic matching approaches 2. In EGM, views of objects are represented as labeled graphs with an underlying two-dimensional topology. The nodes of a graph are labeled with a local image description; the edges of a graph are labeled with a distance vector. Sometimes the graphs are chosen to have a simple grid-like topology, while sometimes nodes are placed on fiducial points (interest points, landmarks) on the object of interest, as illustrated in Fig. 2.
Fig. 2. Representation of faces with labeled graphs. Left image: simple graph with grid-like topology. Right two images: Graphs with flexible topology where nodes are placed on fiducial points on the face. The first image shows a graph manually placed on the face during training. The second image shows the graph matched to a novel input image where the face has a slightly different pose and expression. Each node of the graph is labeled with a local image description, edges between nodes are labeled with distance vectors.
Elastic matching of a model graph to an image means searching for a set of node positions such that 1) the local image description attached to each node matches the image region around the position where the node is placed and 2) the graph is not distorted too much. This is expressed as the optimization of a goodness-of-match function with a term for the similarity of the features and a cost term punishing graph distortions.

2.1. Gabor Wavelet Transform
As a local image description for labeling the nodes of a graph a Gabor jet is usually used. This is a vector of responses of Gabor wavelets (Fig. 3):

\psi_{\mathbf{k}}(\mathbf{x}) = \frac{k^2}{\sigma^2} \exp\!\left(-\frac{k^2 x^2}{2\sigma^2}\right) \left[ \exp\!\left(i\,\mathbf{k}^T\mathbf{x}\right) - \exp\!\left(-\frac{\sigma^2}{2}\right) \right]   (1)
These wavelets represent a plane wave with wave vector k, restricted by a Gaussian envelope function of width σ; x denotes the two-dimensional image location. The responses of several such filters with different sizes and orientations, parameterized by different k, constitute a jet:

\mathbf{k}_{\nu\mu} = k_\nu \left( \cos\alpha_\mu,\; \sin\alpha_\mu \right)^{T}, \quad k_\nu = \frac{k_{\max}}{f^{\nu}}, \quad \alpha_\mu = \frac{\mu\pi}{D}.   (2)

Here, the index ν ∈ {0, ..., L−1} labels the L different spatial scales or frequency levels, and μ ∈ {0, ..., D−1} labels the D different orientations. The value f is the spacing factor between kernels in the frequency domain and k_max is the maximum wave number. A jet is a complex vector composed of the L × D complex filter responses c_j, where the index j is a double index running over different orientations and scales. The c_j are represented as absolute value a_j and phase φ_j: c_j = a_j e^{i φ_j}.
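To make the construction of a jet concrete, the following sketch builds a small bank of Gabor kernels and extracts a jet at a single pixel. The parameter values (k_max = π/2, f = √2, σ = 2π, 5 scales, 8 orientations) are common choices in the EGM literature but are assumptions here, not values prescribed by the text.

```python
import numpy as np

def gabor_kernel(k_vec, sigma, size=32):
    """Complex Gabor kernel psi_k(x) of Eq. (1) on a size x size grid."""
    half = size // 2
    y, x = np.mgrid[-half:half, -half:half].astype(float)
    ksq = k_vec @ k_vec
    envelope = (ksq / sigma**2) * np.exp(-ksq * (x**2 + y**2) / (2 * sigma**2))
    wave = np.exp(1j * (k_vec[0] * x + k_vec[1] * y)) - np.exp(-sigma**2 / 2)
    return envelope * wave

def extract_jet(image, pos, levels=5, orientations=8,
                k_max=np.pi / 2, f=np.sqrt(2), sigma=2 * np.pi):
    """Jet of L*D complex responses at pixel pos (row, col), via direct correlation."""
    jet = []
    r, c = pos
    for nu in range(levels):
        k = k_max / f**nu                                    # Eq. (2): k_nu
        for mu in range(orientations):
            alpha = mu * np.pi / orientations                # Eq. (2): alpha_mu
            kern = gabor_kernel(np.array([k * np.cos(alpha), k * np.sin(alpha)]), sigma)
            h = kern.shape[0] // 2
            patch = image[r - h:r + h, c - h:c + h]          # assumes pos is far from the border
            jet.append(np.sum(patch * np.conj(kern)))        # one complex response c_j
    return np.array(jet)                                     # c_j = a_j * exp(i*phi_j)

# magnitudes a_j and phases phi_j of the jet:
# jet = extract_jet(img, (60, 60)); a, phi = np.abs(jet), np.angle(jet)
```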
Fig. 3. Gabor wavelet transform. From left to right: original image; real (top) and imaginary (bottom) part of a Gabor wavelet (not drawn to scale); real part of the result of convolving the image with the complex wavelet; absolute value of the convolution result. The parameters used were |k| = 1, σ = 2π, α = π/4.
Interestingly, the visual system of primates seems to extract Gabor-like features at its early processing stages 31, which may be related to the statistical structure of natural images 32. The absolute responses of such filters are quite robust to small rotations and translations, which introduces a certain amount of invariance into the representation. Two similarity functions are commonly used for comparing Gabor jets 4,6:

S_{\mathrm{abs}}(J, J') = \frac{\sum_j a_j a'_j}{\sqrt{\sum_j a_j^2 \sum_j a_j'^2}}   (3)
S_abs uses only the magnitudes of the complex filter responses and has the form of a normalized scalar product. S_pha uses the magnitudes and phases of the complex filter responses. Both functions yield similarity values between zero and one. While S_abs(J, J') changes slowly when J' is "moved" across the image, S_pha(J, J') varies very rapidly, because the phases of the filter responses change significantly on a spatial scale corresponding to the wave vector k of strongly responding kernels. This difference between S_abs and S_pha can be understood by comparing the real part of the convolution result with the absolute value of the convolution result in Fig. 3. The absolute response varies much more smoothly.
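A minimal implementation of the magnitude-based similarity of Eq. (3) is shown below; the phase-based variant S_pha is only hinted at in a comment, since its exact normalization is not reproduced in this chapter.

```python
import numpy as np

def s_abs(jet1, jet2):
    """Normalized scalar product of jet magnitudes, Eq. (3); returns a value in [0, 1]."""
    a1, a2 = np.abs(jet1), np.abs(jet2)
    return float(a1 @ a2 / (np.sqrt((a1 @ a1) * (a2 @ a2)) + 1e-12))

# A phase-sensitive similarity would additionally weight each term by
# cos(phi_j - phi'_j), with phi = np.angle(jet); its exact form follows
# the S_pha definition referenced in the text.
```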
2.2. Elastic Matching Process and Recognition
Let us consider the case where we have a set of objects we may want to recognize and have created one model graph for each object. For recognition, i.e. finding the best model for a given input image, all object models are matched sequentially. For each we need to find the optimum of the similarity or goodness-of-fit function. The similarity of a graph G matched to the image I at node positions x_n may be defined
as simply the average of the similarities of the individual jets:

S(G, I) = \frac{1}{N} \sum_{n=1}^{N} S_{\mathrm{abs}}\!\left( J^{n}, J(\mathbf{x}_n) \right)   (5)
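The following sketch turns Eq. (5) into code and adds an optional distortion penalty of the kind mentioned in the next paragraph. It reuses extract_jet and s_abs from the sketches above; the edge representation and the weight lam are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def graph_similarity(model_jets, image, node_positions, extract_jet, s_abs,
                     model_edges=None, lam=0.0):
    """Average jet similarity over all nodes (Eq. 5), optionally penalizing
    deviations of the placed edges from their stored distance vectors
    (model_edges maps a node pair (m, n) to a 2-D vector)."""
    sims = [s_abs(jn, extract_jet(image, pos))
            for jn, pos in zip(model_jets, node_positions)]
    score = float(np.mean(sims))
    if model_edges and lam > 0.0:
        # squared deviation of each placed edge from its stored distance vector
        dist = sum(np.sum((np.asarray(node_positions[n]) - np.asarray(node_positions[m])
                           - np.asarray(vec))**2)
                   for (m, n), vec in model_edges.items())
        score -= lam * dist / max(len(model_edges), 1)
    return score
```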
Often it is useful to introduce an additional cost term to punish geometric distortions of the graph. Also, allowing for different weightings of different nodes in the evaluation of a match can be beneficial. After matching all models to the image, the model obtaining the highest similarity represents the classification result. For object detection, thresholds for all object models must be learned; an object is detected if the matching process yields a score that is above threshold. During matching we need to simultaneously optimize all degrees of freedom of the deformable object model. If object models with many degrees of freedom are allowed (such as shifting across the image, independent scaling in x- and y-direction, rotation in plane, and various non-rigid deformations), then matching time can become an issue, making an exhaustive search over the parameter space unfeasible. To combat the adverse effect of many degrees of freedom in the object model, a coarse-to-fine matching scheme can be used. For example, a strategy may be to first determine the position, orientation, and scale only very coarsely, and then to refine the estimate in subsequent matching steps. In a final step, "diffusion" of individual nodes may be allowed to compensate for small "irregular" deformations of the object in the current input image.

2.3. Bunch Graphs
The bunch graph idea 6 was developed to model variability in feature appearance. The natural variability in the jets of corresponding points in several images of the same object or class of objects (e.g. several left eyes of different persons) is captured by labeling each node of a graph with a set or "bunch" of jets (short: bunch jet) extracted from corresponding points in different example images, rather than only with a single jet. This method can also be used to model variability due to different backgrounds 18. For the matching process, we need a similarity function comparing the set of jets attached to each node of the bunch graph with local image information. The similarity of a bunch jet B^n comprising K jets at a node n to a single jet J(x) taken at a point x in an image is defined to be the maximum of the similarities of the K individual jets:

S^{B}_{\mathrm{abs}}\!\left(B^{n}, J(\mathbf{x})\right) = \max_{k}\left\{ S_{\mathrm{abs}}\!\left(J^{n}(k), J(\mathbf{x})\right),\; k = 1, \ldots, K \right\}.   (6)
S^B_pha is defined analogously. When a graph G with N nodes is matched to an image I at the node positions x_n, its total similarity is given by the average of the node similarities:

S^{B}_{\mathrm{abs}}(G, I) = \frac{1}{N} \sum_{n=1}^{N} S^{B}_{\mathrm{abs}}\!\left(B^{n}, J(\mathbf{x}_n)\right).   (7)
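A direct transcription of Eqs. (6) and (7), assuming the jets and the s_abs function from the earlier sketches; a bunch is simply represented here as a list of prototype jets.

```python
import numpy as np

def s_bunch(bunch, jet, s_abs):
    """Similarity of a bunch of prototype jets to a single jet: the maximum over the bunch (Eq. 6)."""
    return max(s_abs(proto, jet) for proto in bunch)

def bunch_graph_similarity(bunches, image, node_positions, extract_jet, s_abs):
    """Average of the per-node bunch similarities (Eq. 7)."""
    return float(np.mean([s_bunch(b, extract_jet(image, pos), s_abs)
                          for b, pos in zip(bunches, node_positions)]))
```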
Again, we define S^B_pha(G, I) in an analogous manner. Modeling the variability in feature appearance by storing a set of prototypes is obviously not the only plausible method, and a number of other approaches, e.g. based on principal component analysis, can be used as well.

2.4. Application to Face Recognition
Automatic face recognition is an important problem with applications ranging from human computer interaction to surveillance and security. A number of studies have explored the use of EGM for face finding and pose estimation 17, identification 4,6,33,34, facial expression recognition, gender estimation, and more. Figure 2 shows how a face graph may be constructed from a training image so that the graph can then be used to recognize the face in a novel view. While early systems used simple graphs with a grid topology, substantial improvements can be obtained by using graphs whose nodes are placed on specific fiducial points (or landmarks) on the face. The bunch graph technique can be used to represent a generic face model in a single graph. To this end, jets from corresponding points (say, several left eyes) of several training images from different persons are attached to a single node. The underlying assumption is that feature variation, e.g. variability in the appearance of left eyes, is approximately independent of relational variability, i.e., variability in the relative position of left eyes with respect to other features of the face. The great advantage of this method is that it allows us to separate the matching process, where the facial landmarks are found, from the identification process, where features from the found facial landmarks are compared to a gallery of known persons. A number of other extensions have been made to the basic method described so far that are beyond the scope of this chapter 6,33,34.

3. Combining Multiple Feature Types

While early versions of EGM only used a single shape or texture feature type such as Gabor or Mallat filters, it has also been extended to handle multiple feature types 8,13. There are several ways of extending EGM to multiple feature types (Fig. 4). First, the feature types used may differ from node to node. For example, a model graph describing a face could employ edge features at the outline of the face and texture features at the inside. Second, all nodes may be labeled with a combination of different feature types which is identical for all nodes. Third, nodes may be labeled with combinations of feature types which may differ from node to node. Here we focus on the second scheme.

3.1. Compound Jets
In order to allow for multiple feature types at a node we introduce the notion of a compound jet. In a compound jet, several local image descriptions are simply concatenated.
Fig. 4. Combining different feature types in graph based object representations can be done in a number of ways. All nodes may be labeled with just a single (a) or multiple (c) features that are identical for all nodes. Or all nodes may be allowed to be labeled with a single (b) or multiple (d) feature types that are different for all nodes.
When similarities between two compound jets J and J' are computed, the similarities of their corresponding constituents are considered first. These are then combined by computing a weighted average:

S(J, J') = \sum_{F} w_F\, S_F\!\left(J_F, J'_F\right), \qquad \sum_{F} w_F = 1.   (8)
The S_F are similarity functions for comparing jets of a particular feature type F. Figure 5 compares different kinds of features for finding image point correspondences. Combining a number of complementary features into a compound jet can improve the overall performance.
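Eq. (8) amounts to a weighted average over feature types. The sketch below assumes that compound jets are stored as dictionaries keyed by feature-type name; the names and the per-type similarity functions are placeholders for whatever features are actually used.

```python
def compound_similarity(jet_a, jet_b, weights, per_type_sim):
    """Weighted average of per-feature-type similarities (Eq. 8).
    jet_a, jet_b: dicts mapping a feature-type name to a feature vector;
    weights: dict of nonnegative weights summing to one;
    per_type_sim: dict of similarity functions, one per feature type."""
    return sum(weights[t] * per_type_sim[t](jet_a[t], jet_b[t]) for t in weights)

# Example wiring (names are illustrative): Gabor magnitudes compared with s_abs,
# a local colour average compared with some distance-based similarity.
```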
3.2. Compound Bunch Graphs
One can easily extend the bunch graph idea to graphs with compound jets as well. Compound bunch graphs are constructed in the following manner:

• For every node n of a particular graph of a view of an object, the jets of the same feature type F extracted from K different training images are integrated into a bunch jet B^n_F = {J^n_F(1), ..., J^n_F(K)}. With these bunch jets as node labels, one bunch graph is created for every feature type, whose geometry is averaged from the constituent graphs.

• The bunch jets B^n_F of different feature types but from the same node n of the same view are concatenated to a compound bunch jet B^n. With these compound bunch jets as node labels, a compound bunch graph is created, whose geometry is identical to that of the constituent bunch graphs of the different feature types.
Fig. 5. Finding image point correspondences with local descriptions. Combining different feature types in a node description can be beneficial for establishing image point correspondences. a, b: source image and its skin color segmentation. The circle marks a point where a compound jet was extracted. c, d: target image and its skin color segmentation. The goal is to locate the point in c corresponding to the marked location in a. e-f: similarity of the marked location in a to all locations in c when using only Gabor jets, only color features, only Gabor jets computed on the skin color segmentation, or compound jets representing a weighted average of the others. Bright pixels indicate high similarity.
For evaluating the similarity S^B(B^n, J(x)) between a compound bunch jet B^n and a compound jet J(x) extracted at a particular location x, a weighted average of contributions stemming from the different feature types is used:

S^{B}\!\left(B^{n}, J(\mathbf{x})\right) = \sum_{F} w_F\, S^{B}_{F}\!\left(B^{n}_{F}, J_F(\mathbf{x})\right), \qquad \sum_{F} w_F = 1.   (9)
The functions S^B_F compare a bunch jet B^n_F to a simple jet J_F of the same feature type F, in analogy to (6):

S^{B}_{F}\!\left(B^{n}_{F}, J_F(\mathbf{x})\right) = \max_{k}\left\{ S_F\!\left(J^{n}_{F}(k), J_F(\mathbf{x})\right),\; k = 1, \ldots, K \right\}.   (10)
The similarity functions S_F directly compare two jets of a particular feature type F. The similarity of a compound bunch graph \mathcal{G} with node positions x_n to an image I is defined as the average of the similarities of all N nodes, in analogy to the previous section:

S(\mathcal{G}, I) = \frac{1}{N} \sum_{n} S^{B}\!\left(B^{n}, J(\mathbf{x}_n)\right).   (11)
Again, an additional topological cost term can be introduced to punish large geometric distortions, and nodes can be given different weights.
Fig. 6. Person-independent hand posture recognition against complex backgrounds is a challenging problem. Due to differences in hand anatomy and individual performance of a gesture there is substantial variability in the appearance of the hand. In addition, rings, etc. may be present and long sleeves may introduce partial occlusions.
3.3. Application to Hand Gesture Recognition
Gestures are a powerful means of communication among humans. In fact, gesturing is so deeply rooted in our communication that people often continue gesturing when speaking on the telephone. Consequently, the automatic recognition of human gestures is an important topic in human computer interaction. The recognition of hand postures is an important ingredient for building gesture-based interfaces. Although there have been attempts to recognize gestures and sign language without recognizing the specific posture of the gesturing hand it is clear that these approaches are limited. For example, in American Sign Language the same movement can have very different meanings depending on the posture of the gesturing hand. A number of requirements make hand posture recognition a particularly challenging task. First, if a system is intended to be person independent it must cope with shape variability due to different hand anatomy or different performance of the same gesture by different persons. Second, if the system is intended to work in unconstrained environments it must be able to deal with cluttered backgrounds, making bottom-up segmentation of the gesturing hand difficult. Along similar lines, it should not be necessary for subjects to take off rings etc. before interacting with the system but they should be able to just "come as they are". Other desiderata are real-time performance and robustness to varying lighting conditions. Using a deformable feature graph approach such as EGM is a very good starting point for this challenging recognition task because 1) it has an inherent ability to handle geometric deformations, 2) it does not require prior segmentation of the gesturing hand, and 3) it can elegantly handle variability in feature appearance with the bunch graph method. We have used EGM with multiple feature types to address this difficult problem 13 . In particular, we added two kinds of color features to the local im-
Fig. 7. Example of a model graph being matched onto an input image. a: original graph. b: matched graph. c: skin color segmentation.
age descriptions at the graph nodes. The first additional feature was a local average of the color in the vicinity of the node (in HSI color space); the second additional feature was a Gabor jet extracted from a preprocessed version of the image that emphasized skin tones. The image database consisted of more than 1000 color images of twelve hand postures performed by 19 persons against simple and complex backgrounds with varying amounts of skin color (Fig. 6). The images of three subjects signing against uniform light and dark backgrounds formed the training set, giving 6 training images per posture; the remaining images formed the test set. For the images in the training set, we constructed graphs of 15 nodes. The nodes were manually placed at anatomically significant points. Model graphs extracted from the training images were combined into 12 compound bunch graphs (one for every posture) as described above. The matching procedure had 4 steps: coarse positioning of the graph by scanning it across the image in steps of five pixels; estimating rotation in plane (±15 degrees) while refining the position estimate; estimating scale changes (up to 20 percent); and local diffusion of individual nodes to fine-tune the matching. An example of a successful match is given in Fig. 7. The nodes usually find their proper positions during the matching, even if the background is very cluttered or contains large regions of skin color.
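The four-step matching schedule just described can be sketched as follows. The grid spacing, rotation range, and scale range mirror the values in the text, but the score interface is an assumption, and the final node-diffusion step is only indicated.

```python
import numpy as np

def coarse_to_fine_match(score, image_shape, step=5, rot_deg=15, max_scale=0.2):
    """Four-stage schedule: coarse position scan, in-plane rotation, scale,
    then (not shown) local node diffusion. `score(x, y, rot, scale)` is assumed
    to evaluate the compound bunch graph similarity at the given pose."""
    h, w = image_shape
    # 1) coarse positioning on a 5-pixel grid
    _, bx, by = max((score(x, y, 0.0, 1.0), x, y)
                    for y in range(0, h, step) for x in range(0, w, step))
    # 2) in-plane rotation within +/- rot_deg while keeping the position estimate
    _, rot = max((score(bx, by, r, 1.0), r) for r in np.linspace(-rot_deg, rot_deg, 7))
    # 3) scale changes up to max_scale
    _, scale = max((score(bx, by, rot, s), s)
                   for s in np.linspace(1 - max_scale, 1 + max_scale, 9))
    return bx, by, rot, scale   # 4) node diffusion would follow, moving nodes individually
```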
3.4. Evaluation
Recognition runs were performed on a test set of 604 images taken against uniform light or dark backgrounds and 338 images taken against complex backgrounds. The results are summarized in Tab. 1. The weighting factors w_F between the three different feature types were systematically varied in an exhaustive manner, considering all relative weightings of the types 1:0:0, 4:1:0, 3:2:0, 3:1:1, 2:2:1, 2:1:1, and 1:1:1, and system performance on the test set was measured. The best result was obtained for w_Gabor = w_ColorGabor = 25% and w_Color = 50%.
Table 1. Recognition results for gesture recognition (percent correct recognition).

weighting           simple background    complex background
only Gabor          82.6%                70.4%
only Color          39.7%                34.6%
only ColorGabor     88.2%                76.3%
best Mixture        92.9%                85.8%
Since the weighting of cues only introduces 2 free parameters, we decided not to use a separate validation set. It turns out that a proper combination of the three feature types outperforms any of them alone. For example, the error for the best combination found is less than half as big as that for using Gabor features alone. Performance does not depend critically on the precise weighting between the features if all are being used: the recognition rates vary smoothly with the weighting and there is a large plateau of weightings yielding recognition rates higher than any of the feature types alone could account for. If only Gabor and ColorGabor features but not color features are used, performance is still very good, suggesting that the color features are not essential for the system's performance. An analysis of the system's errors reveals that strong shape variations due to differences in performance of a posture, different hand anatomy, strong rotation in depth, or combinations of these account for practically all errors. Interestingly, it was found that a greater flexibility of the graph during the final matching step was not effective in reducing the number of errors. The reason seems to be that the geometric distortions are not of a random nature, which would be well modeled by a diffusion process, but instead show high correlations. This is due to the hand's kinematics, which only allow for certain correlated movements of nodes. A more refined modeling of likely graph distortions thus seems like a good avenue for future research. In its extreme form, this would amount to a full kinematic model of the hand, but this may not always be necessary.

4. Analysis of Cluttered Scenes with Partial Occlusions

The analysis of cluttered scenes such as the one depicted in Fig. 8 is one of the most challenging problems in computer vision. Massive background clutter makes it virtually impossible to properly segment the scene into regions corresponding to the objects in a bottom-up fashion, i.e. without using explicit object models. Furthermore, occlusions can render large portions of an object invisible. The ease and efficiency with which the human visual system solves this problem belies the fundamental computational challenges that this task poses. It turns out that for this kind of task, the use of deformable feature graphs as the object representation is again a very good choice. In particular, the use of local image features allows us to explicitly discover and model partial occlusions by other objects 19,8. In this Section
Fig. 8. Analysis of cluttered scenes. Given a number of object models in the form of deformable feature graphs (left), the goal is to analyze a cluttered scene with partially occluded objects (middle) and to return a description of the identities, locations, and occlusion relations of all objects in the scene (right).
we will outline an extension of the methods discussed so far to stereo image pairs and apply them to the analysis of cluttered scenes of multiple objects occluding one another.

4.1. Extension of EGM to Stereo Image Pairs
Most work on stereo vision has as its goal the estimation of 3-D structure from 2-D images. In contrast, we are interested in improving object recognition through the use of stereo processing 35,36. The essence of stereo vision is solving the correspondence problem, i.e. determining which points in the left and right image correspond to the same physical point. Most stereo algorithms consider fairly simple image features when trying to establish the correspondence relations, but in principle features of arbitrary complexity can be used for this purpose, or even entire objects. The basic idea in our case is that correct object matches in the left and right image are not independent but related through the epipolar geometry of the stereo system. This induces a constraint that can be used to discard false matches if they fall on inconsistent locations in the left and right image (Fig. 9). As a side product, the distance of correct matches can be estimated. If absolute size information about the objects is available, then the distance information must in turn be consistent with the apparent size of the object in the images, an additional constraint that can be used to reject false matches.
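A minimal version of the epipolar consistency test described above, under the simplifying assumption of a rectified stereo pair; the disparity limits and the dictionary-based bookkeeping are illustrative, not part of the original system.

```python
def epipolar_consistent(pos_left, pos_right, max_row_diff=2, min_disp=0, max_disp=120):
    """Consistency test for a rectified stereo pair (an assumption made here for
    brevity): matched object positions must lie on (nearly) the same scan line and
    yield a disparity within a plausible range."""
    (xl, yl), (xr, yr) = pos_left, pos_right
    disparity = xl - xr
    return abs(yl - yr) <= max_row_diff and min_disp <= disparity <= max_disp

def filter_stereo_matches(matches_left, matches_right):
    """Keep only pairs of left/right matches of the same object model that satisfy
    the epipolar constraint; matches_* map a model name to its (x, y) position."""
    return {m: (matches_left[m], matches_right[m])
            for m in matches_left.keys() & matches_right.keys()
            if epipolar_consistent(matches_left[m], matches_right[m])}
```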
4.2. Scene Analysis
Our goal for object recognition in the presence of partial occlusions is to explicitly detect and discount occlusions when evaluating the quality of a match. One idea for doing so works in a front-to-back fashion. Object models are matched to the image and the object models obtaining good matches enter a set of candidates. Of these candidates, the front-most object, i.e. the one closest to the camera is determined
Fig. 9. Simultaneous recognition in stereo image pairs. The exploitation of epipolar constraints can aid recognition in difficult situations with clutter and partial occlusions. a: stereo image pair of a cluttered scene with the partially occluded object "duck". b: Simultaneous matching of the object model duck in both images subject to epipolar constraints manages to correctly locate the object despite the partial occlusion. c: In contrast, independent matching in the left and right image is more prone to errors. In this case, it fails for the right image.
on the basis of stereo disparity, match similarity, and possibly other cues. The front-most candidate is then accepted and removed from the candidate list. Since this object is the closest candidate, it cannot be occluded by any objects other than those recognized in previous iterations. However, it may itself be occluding some objects. To take this into account, the image area covered by the accepted object is labeled as occluded. Now, the matches for the other objects are re-evaluated by discounting any nodes that fall into the occluded area. This may lead to additional objects entering the set of candidates. The algorithm then repeats the selection of the front-most candidate until no more candidates remain.
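The front-to-back candidate selection can be sketched as a loop of the following form; the data structures, the rescoring interface, and the acceptance threshold are assumptions made for illustration only.

```python
def analyze_scene(candidates, node_positions, rescore, min_score=0.0):
    """Front-to-back occlusion reasoning as described above. `candidates` maps a model
    name to (distance, score); `node_positions[model]` lists its matched node pixels
    as (x, y) tuples; `rescore(model, occluded)` re-evaluates a match while ignoring
    nodes that fall into the occluded set."""
    accepted, occluded = [], set()
    remaining = dict(candidates)
    while remaining:
        # pick the candidate closest to the camera among those still plausible
        front = min(remaining, key=lambda m: remaining[m][0])
        accepted.append(front)
        occluded |= set(node_positions[front])   # its image area now counts as occluded
        del remaining[front]
        # re-evaluate the rest, discounting nodes inside the occluded area
        remaining = {m: (d, rescore(m, occluded)) for m, (d, _) in remaining.items()}
        remaining = {m: ds for m, ds in remaining.items() if ds[1] > min_score}
    return accepted
```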
4.3. Evaluation
To test the method we have recorded a data base of stereo image pairs of cluttered scenes composed of 18 different objects in front of uniform and complex backgrounds. There is considerable variation in object shape and size, but there are also objects with nearly identical shapes differing only in color and surface texture. The training images comprise one stereo image pair of each object in front of a uniform blue cloth background. The test images comprise scenes composed of 2-5 objects in front of one simple background (80 scenes) and four complex backgrounds (160 scenes). In systematic experiments evaluating the above scene analysis methods we found that the extension to recognition in stereo image pairs and the use of multiple feature types in the form of compound jets can lead to quite dramatic performance
Fig. 10. Left: collage of objects. Right: example scenes with simple (a,b) and complex (c-f) backgrounds. Note the variation in size for object "pig".
improvements. In particular, for a fixed level of false positives, a version of the system using both extensions reached roughly twice as many correctly recognized objects as a monocular version of the system employing Gabor features only. An important lesson from these experiments is that the representation of objects as constellations of localized features allows to explicitly discount partial occlusions — a property that is quite appealing. Representations based on "larger" features covering big parts of the object 15 appear less suited for this task because they are more prone to be corrupted by partial occlusions. It may be possible to develop hierarchical architectures that integrate features at different scales 9,21 and combine the advantages of more localized and more global features. Progress in the area of scene analysis has been comparatively slow and we feel that a reason for this is that researchers have usually tried to decompose the problem in the wrong way. A specific problem may have been a frequent insistence on treating segmentation and recognition as separate problems, to be solved sequentially. We believe that this approach is inherently problematic and this work is to be seen as an initial step towards finding an integrated solution for both problems. In the method described above, which is an extension to the work by Wiskott 19 , segmentation is purely model driven, i.e. if an object has been recognized, the corresponding region of the image is labeled as a segment. This need not be the case and current work explores ways of also integrating bottom-up segmentation cues.
5. Discussion

This chapter discussed object recognition approaches based on deformable feature graphs, a special case of what we have called direct recognition approaches, which try to recognize objects on the basis of features directly extracted from unsegmented images. Deformable feature graphs represent particular views of objects as spatial constellations of certain image features. Such an object representation has a number of benefits. First, it allows recognition without prior segmentation of the object of interest. Second, robustness to small variations in appearance tends to be very good
if features are chosen properly, i.e. there tends to be good generalization. Third, depending on the features used and the type of recognition scheme employed, it can be quite fast, lending itself to real-time implementations with current off-the-shelf hardware. Fourth, it allows for explicitly modeling and discounting partial occlusions. Fifth, the methods described here have already proved their potential in very different applications, ranging from face detection and recognition, over hand gesture recognition, to the analysis of cluttered scenes of partly occluded objects. Overall, these benefits ensure continued interest in these methods. However, there are many fundamental challenges still lying ahead. First, it is desirable to automate the learning process as much as possible. In the gesture recognition example it was necessary to manually label the anatomical landmarks in the training images. How can object models be built without precise feature correspondences but only a coarse localization of the object in the training images? Even better would be a system for which only a label (object is present/not present) has to be specified for each image 37,38. Finally, it should even be possible to discover object categories without any explicit supervision. Human infants seem to be able to spontaneously form categories of visual objects without explicit instruction or supervision. A machine vision system that simply observes its visual environment and learns to cluster objects into the right categories in a completely unsupervised fashion would represent a quantum leap for the field. Second, how can large object data bases be searched efficiently? Methods that engage in elaborate matching procedures for thousands of objects in a sequential fashion seem rather inappropriate. An idea that could lead to benefits in recognition speed is to share partial object descriptions among several models and to search for objects in a bottom-up fashion, where the detection of individual features triggers a more elaborate analysis for all matching candidate objects. In this context, how can partial object descriptions be learned that lend themselves to efficient sharing between objects while still allowing effective discrimination? There may be large benefits in employing hierarchical object representations using features at several levels of complexity. Fairly unspecific lower-level features can be shared between object models most effectively, while more specific higher-level features give the necessary discriminative power for recognition. What are good frameworks for learning such representations? Third, how can recognition and segmentation be integrated within a closed feedback loop to allow for efficient recognition in the presence of clutter and occlusions? These questions are likely to remain of great interest for years to come. Some inspiration may come from the study of biological vision systems, which have mastered object recognition better than any machine vision system that man has ever created.

Acknowledgments

This work was supported by the National Science Foundation under grants NSF-0208451 and NSF-0233200. We thank Christoph von der Malsburg for comments
on parts of this chapter.
References

1. J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. PAMI, 22(8):888-905, 2000.
2. M. A. Fischler and R. A. Elschlager. The representation and matching of pictorial structures. IEEE Trans. Computers, 22(1):67-92, 1973.
3. A. Yuille and A. Blake. Deformable templates. In A. Blake and A. Yuille, editors, Active Vision, pages 21-38. MIT Press, 1992.
4. M. Lades, J. C. Vorbrüggen, J. Buhmann, J. Lange, C. von der Malsburg, R. P. Würtz, and W. Konen. Distortion invariant object recognition in the dynamic link architecture. IEEE Trans. Computers, 42:300-311, 1993.
5. R. Brunelli and T. Poggio. Face recognition: Features versus templates. IEEE Trans. PAMI, 15(10):1042-1052, 1993.
6. L. Wiskott, J.-M. Fellous, N. Krüger, and C. von der Malsburg. Face recognition by elastic graph matching. IEEE Trans. PAMI, 19(7), 1997.
7. A. Lanitis, C. J. Taylor, and T. F. Cootes. Automatic interpretation and coding of face images using flexible models. IEEE Trans. PAMI, 19(7):743-746, 1997.
8. J. Triesch and C. Eckes. Object recognition with multiple feature types. In L. Niklasson, M. Boden, and T. Ziemke, editors, Proceedings ICANN 98, pages 233-238. Springer, 1998.
9. R. C. Nelson and A. Selinger. Large-scale tests of a keyed, appearance-based 3-D object recognition system. Vision Research, 38(15-16):2469-2488, 1998.
10. M. C. Burl, M. Weber, and P. Perona. A probabilistic approach to object recognition using local photometry and global geometry. In ECCV'98, pages 628-641, 1998.
11. D. G. Lowe. Object recognition from local scale-invariant features. In Proc. of the International Conference on Computer Vision (ICCV), Corfu, pages 1150-1157, 1999.
12. M. Zobel, A. Gebhard, D. Paulus, J. Denzler, and H. Niemann. Robust facial feature localization by coupled features. In Proc. 4th Intl. Conf. on Automatic Face and Gesture Recognition, pages 2-7, 2000.
13. J. Triesch and C. von der Malsburg. A system for person-independent hand posture recognition against complex backgrounds. IEEE Trans. PAMI, 23(12):1449-1453, 2001.
14. P. Viola and M. J. Jones. Robust real-time object detection. Int. J. of Computer Vision, 57:137-154, 2001.
15. S. Belongie, J. Malik, and J. Puzicha. Shape matching and object recognition using shape contexts. IEEE Trans. PAMI, 24(4):509-522, 2002.
16. D. Gabor. Theory of communication. J. Inst. Elec. Eng. (London), 93:429-457, 1946.
17. N. Krüger, M. Pötzsch, and C. von der Malsburg. Determination of face position and pose with a learned representation based on labeled graphs. Image and Vision Computing, 15:665-673, 1997.
18. J. Triesch and C. von der Malsburg. Robust classification of hand postures against complex backgrounds. In Proc. 2nd Intl. Conf. on Automatic Face and Gesture Recognition, pages 170-175. IEEE Computer Society Press, 1996.
19. L. Wiskott and C. von der Malsburg. A neural system for the recognition of partially occluded objects in cluttered scenes. Int. J. of Pattern Recognition and Artificial Intelligence, 7(4):935-948, 1993.
20. P. F. Felzenszwalb and D. P. Huttenlocher. Efficient matching of pictorial structures. In CVPR 2000, pages 66-73, 2000.
21. A. Selinger and R. C. Nelson. A perceptual grouping hierarchy for appearance-based 3D object recognition. Computer Vision and Image Understanding, 76(1):83-92, 1999.
22. B. W. Mel. SEEMORE: Combining color, shape, and texture. Neural Computation, 9(4):777-804, 1997.
23. C. Schmid and R. Mohr. Local greyvalue invariants for image retrieval. IEEE Trans. PAMI, 15(5):530-535, 1997.
24. B. Schiele and J. L. Crowley. Recognition without correspondence using multidimensional receptive field histograms. Int. J. of Computer Vision, 36(1):31-50, 2000.
25. M. Swain and D. H. Ballard. Color indexing. Int. J. of Computer Vision, 7(1):11-32, 1991.
26. B. V. Funt and G. D. Finlayson. Color constant color indexing. IEEE Trans. PAMI, 17(5):522-529, 1995.
27. D. Slater and G. Healey. The illumination-invariant recognition of 3D objects using color invariants. IEEE Trans. PAMI, 18(2):206-210, 1996.
28. F. Ennesser and G. Medioni. Finding Waldo, or focus of attention using local color information. IEEE Trans. PAMI, 17(8):805-809, August 1995.
29. B. W. Mel and J. Fiser. Minimizing binding errors using learned conjunctive features. Neural Computation, 12(4):731-762, 2000.
30. T. J. Palmeri and I. Gauthier. Visual object understanding. Nature Reviews Neuroscience, 5:291-302, 2004.
31. J. P. Jones and L. A. Palmer. An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. J. Neurophysiol., 56(6):1233-1258, 1987.
32. B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607-609, 1996.
33. K. Okada, J. Steffens, T. Maurer, H. Hong, E. Elagin, H. Neven, and C. von der Malsburg. The Bochum/USC face recognition system and how it fared in the FERET Phase III test. In H. Wechsler, P. J. Phillips, V. Bruce, F. Fogelman Soulié, and T. S. Huang, editors, Face Recognition: From Theory to Applications, pages 186-205. Springer-Verlag, 1998.
34. K. Okada and C. von der Malsburg. Pose-invariant face recognition with parametric linear subspaces. In Proceedings of the Fifth International Conference on Automatic Face and Gesture Recognition, pages 64-69, 2002.
35. J. Triesch. Vision-Based Robotic Gesture Recognition. PhD thesis, Institut für Neuroinformatik, Ruhr-Universität Bochum, Germany, 1999. Also published by Shaker Verlag, Aachen, Germany, ISBN 3-8265-6257-7.
36. E. Kefalea. Flexible object recognition for a grasping robot. PhD thesis, Institut für Neuroinformatik, Ruhr-Universität Bochum, Germany, 2001.
37. M. Weber, M. Welling, and P. Perona. Towards automatic discovery of object categories. In Proc. IEEE Comp. Soc. Conf. on Computer Vision and Pattern Recognition, CVPR 2000, 2000.
38. H. S. Loos and C. von der Malsburg. 1-click learning of object models for recognition. In H. H. Bülthoff, S.-W. Lee, T. Poggio, and C. Wallraven, editors, Biologically Motivated Computer Vision (BMCV 2002), volume 2525 of Lecture Notes in Computer Science, pages 377-386. Springer Verlag, 2002.
CHAPTER 4.6

Hierarchical Classification and Feature Reduction for Fast Face Detection
Bernd Heisele
Honda Research Institute USA, Inc., Boston, USA
E-mail: [email protected]
Thomas Serre, Sam Prentice, Tomaso Poggio
Center for Biological and Computational Learning at M.I.T., Cambridge, USA
E-mail: [email protected], [email protected], [email protected]
We present a two-step method to speed-up object detection systems in computer vision that use Support Vector Machines (SVMs) as classifiers. In the first step we build a hierarchy of classifiers. On the bottom level a simple and fast linear classifier analyzes the whole image and rejects large parts of the background. On the top level, a slower but more accurate classifier performs the final detection. We propose a new method for automatically building and training a hierarchy of classifiers. In the second step we apply feature reduction to the top level classifier by choosing relevant image features according to a measure derived from statistical learning theory. Experiments with a face detection system show that combining feature reduction with hierarchical classification leads to a speed-up by a factor of 335 with similar classification performance.
1. Introduction

A major task in visual scene analysis is to detect objects in images. This is commonly done by shifting a search window over an input image and by categorizing the object in the window with a classifier. The main problem in categorization is the large range of possible variations within a class of objects. The system must generalize not only across different viewing and illumination conditions but also across different exemplars of a class, such as faces of different people for face detection. This requires complex, computationally expensive classifiers. Further contributing to the computational load of the task is the large amount of input data that has to be processed. A real-time vision system has to deal with data streams in the range
of several MBytes/sec. Speeding-up classification is therefore of major concern when developing systems for real-world applications. In the following we investigate two methods for speed-ups: hierarchical classification and feature reduction. In 4 we presented a system for detecting frontal and near-frontal views of faces in still gray images. The system achieved high detection accuracy by classifying 19x19 gray patterns using a non-linear SVM. Searching an image for faces at different scales took several minutes on a PC, far too long for most real-world applications. Experiments with faster classifiers (linear SVMs) gave significantly lower recognition rates. To speed-up the system without losing in classification performance one can exploit the following two characteristics common to most vision-based detection tasks: First, the vast majority of the analyzed patterns in an image belongs to the background class. For example, the ratio of non-face to face patterns in the tests in 4 was about 50,000 to 1. Second, many of the background patterns can be easily distinguished from the objects. Based on these two task-specific characteristics it is sensible to apply a hierarchy of classifiers. Fast classifiers remove large parts of the background on the bottom and middle levels of the hierarchy and a more accurate but slower classifier performs the final detection on the top level. This idea is related to the well known coarse-to-fine template matching 7 ' 2 ' 3 . In 8 hierarchical classification is used to speed-up a face detection system. A candidate selection neural network with increased robustness against translation is added as a first layer to an existing face detection neural network. We present a method for building and training a hierarchy of SVM classifiers given a set of classifiers which operate at different image resolutions. The iterative algorithm starts with the topmost classifier at the highest image resolution and adds a new lower layer. At each iteration, the hierarchies are retrained bottom up and a speed test is performed on a validation set of non-face images to choose the hierarchy with the least number of computations. Another possibility of speeding-up classification is to reduce the number of features by selecting a subset of relevant features. Feature reduction is a more generic tool than the above described hierarchical classification and can be applied to any classification problem. There are basically two types of feature selection methods in the literature: filter and wrapper methods *. Filter methods are preprocessing steps performed independently of the classification algorithm or its error criteria. Wrapper methods attempt to search through the space of feature subsets using the criterion of the classification algorithm to select the optimal feature subset 5 . We present a new wrapper method to reduce the dimensions of both input and feature space of a non-linear SVM. For our final detection system we combine feature selection with hierarchical classification by putting a non-linear SVM with feature selection on top of the hierarchy of linear SVMs. A similar idea of combining feature selection with hierarchical classification has been recently proposed in 12 for frontal face detection. They use AdaBoost to train a cascade of linear classifiers and to select features from an over complete set of Haar wavelet features. In contrast to
our approach, however, the complexity of the classifiers in the final hierarchy is only controlled by the number of features and not by the class of decision functions (i.e. the class of kernel functions). The outline of the paper is as follows: In Section 2 we give a brief overview of SVM classification. In Section 3 we describe how to build and train the hierarchical system. Sections 4 and 5 describe the feature selection methods for the input and the feature space of the SVM, respectively. In Section 6 we present experimental results of the hierarchical system with feature reduction. The paper is concluded in Section 7.

2. Background on Support Vector Machines

2.1. Theory
An SVM 11 performs pattern recognition for a two-class problem by determining the separating hyperplane that has maximum distance to the closest points of the training set. These closest points are called support vectors. To perform a non-linear separation in the input space, a non-linear transformation Φ(·) maps the data points x of the input space R^n into a high-dimensional space, called feature space R^p (p > n). The mapping Φ(·) is represented in the SVM classifier by a kernel function K(·,·) which defines an inner product in R^p. Given ℓ examples {(x_i, y_i)}_{i=1}^{ℓ}, the decision function of the SVM is linear in the feature space and can be written as:

f(\mathbf{x}) = \mathbf{w} \cdot \Phi(\mathbf{x}) + b = \sum_{i=1}^{\ell} \alpha_i^{0}\, y_i\, K(\mathbf{x}_i, \mathbf{x}) + b.   (1)
The optimal hyperplane is the one with the maximal distance in the space R^p to the closest points Φ(x_i) of the training data. Determining that hyperplane leads to maximizing the following functional with respect to α:

W^2(\alpha) = 2\sum_{i=1}^{\ell} \alpha_i - \sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)   (2)
under the constraints \sum_{i=1}^{\ell} \alpha_i y_i = 0 and C \ge \alpha_i \ge 0, i = 1, \ldots, \ell. An upper bound on the expected error probability EP_err of an SVM classifier is given by 11:

EP_{\mathrm{err}} \le \frac{1}{\ell}\, E\!\left\{ R^2\, W^2(\alpha^0) \right\}   (3)
where M = 1/\|\mathbf{w}(\alpha^0)\| is the distance between the support vectors and the separating hyperplane and R is the radius of the smallest sphere including all points Φ(x_1), ..., Φ(x_ℓ) of the training data in the feature space. In the following, we will use this bound on the expected error probability to rank and select features.

2.2. Computational Issues
The only non-linear kernel investigated in this paper is a second-degree polynomial kernel K(x, y) = (1 + x·y)^2, which has been successfully applied to various
object detection tasks 6,4. Eq. (1) shows two ways of computing the decision function. When using the kernel representation on the right-hand side of Eq. (1), the number of multiplications required to calculate the decision function for a second-degree polynomial kernel is:

M_{k,\mathrm{poly2}} = (n + 2) \cdot s,   (4)
where n is the dimension of the input space and s is the number of support vectors. This number is independent of the dimensionality of the feature space. It depends on the number of support vectors, which grows linearly with the size ℓ of the training data 11. On the other hand, the computation of the decision function in the feature space is independent of the size of the training set; it only depends on the dimensionality p of the feature space. For the second-degree polynomial kernel the feature space R^p has dimension p = \frac{(n+3)n}{2} and is given by \mathbf{x}^* = (\sqrt{2}x_1, \ldots, \sqrt{2}x_n, x_1^2, \ldots, x_n^2, \sqrt{2}x_1 x_2, \ldots, \sqrt{2}x_{n-1} x_n). Thus the number of multiplications required for projecting the input vector into the feature space and for computing the decision function is:

M_{f,\mathrm{poly2}} = \frac{(n+1)n}{2} + \frac{(n+3)n}{2} = (n + 2) \cdot n   (5)
From Eq. (4) and (5) we see that the computation for an SVM with second-degree polynomial is more efficiently done in the feature space if the number of support vectors is bigger than n. This was always the case in our experiments; the number of support vectors was between 2 and 6 times larger than n. That is why we investigated not only methods for reducing the number of input features but also methods for feature reduction in the feature space.
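The two ways of evaluating Eq. (1) for the second-degree polynomial kernel can be contrasted directly in code. The sketch below drops the constant component of the kernel expansion in the explicit feature map, which is harmless because the coefficients α_i^0 y_i sum to zero under the constraint accompanying Eq. (2); the interfaces are assumptions for illustration.

```python
import numpy as np

def decision_kernel(x, support_vectors, coeffs, b):
    """f(x) via the kernel form of Eq. (1) with K(u, v) = (1 + u.v)^2:
    roughly (n + 2) * s multiplications (Eq. 4)."""
    return float(coeffs @ (1.0 + support_vectors @ x) ** 2 + b)

def poly2_feature_map(x):
    """Explicit map into the (n^2 + 3n)/2-dimensional feature space of the
    second-degree polynomial kernel; the constant term is omitted because
    sum(coeffs) = sum(alpha_i * y_i) = 0."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    quad = [x[i] * x[j] * (1.0 if i == j else np.sqrt(2.0))
            for i in range(n) for j in range(i, n)]
    return np.concatenate([np.sqrt(2.0) * x, np.array(quad)])

def decision_feature_space(x, w_feat, b):
    """f(x) = w* . x* + b, independent of the number of support vectors (Eq. 5)."""
    return float(w_feat @ poly2_feature_map(x) + b)

# w_feat can be precomputed once from the support vectors:
# w_feat = sum(c * poly2_feature_map(sv) for c, sv in zip(coeffs, support_vectors))
```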
3. Hierarchy of classifiers

3.1. System Overview
In most object detection problems the majority of analyzed image patches belong to the background class. Only a small percentage of them look similar to objects and require a highly accurate classifier to avoid false classifications. It is sensible to apply a hierarchy of classifiers where the complexity of the classifier increases with each level. By propagating only those patterns that were not classified as background, we quickly decrease the amount of data to process. The main issues in designing such a classification hierarchy are how to choose the input features to the classifiers, how to select the number of levels, and finally how to train the classifiers. We used pixel values as inputs to the classifiers and reduced the number of features from top to bottom by decreasing the image resolution, similar to coarse-to-fine matching approaches. An example of a hierarchical system with five layers is shown in Figure 1.
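The data flow through such a hierarchy can be sketched as a simple rejection cascade; the layer interface (a downscaling function, a linear score, and a threshold per layer) is an assumption made for illustration.

```python
def classify_with_hierarchy(patches, layers, final_classifier):
    """Pass image patches through a cascade of increasingly expensive classifiers.
    `layers` is a list of (downscale, linear_score, threshold) triples ordered from
    the coarsest to the finest resolution; patterns rejected as background at any
    layer never reach the later, slower classifiers."""
    surviving = list(patches)
    for downscale, linear_score, threshold in layers:
        surviving = [p for p in surviving if linear_score(downscale(p)) >= threshold]
    # only the few remaining candidates reach the expensive non-linear classifier
    return [p for p in surviving if final_classifier(p)]
```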
Fig. 1. Example of a hierarchical system with five levels: Starting with the classifier at the lowest resolution patterns which have been classified as background are successively removed from the image going up the hierarchy. The final non-linear classifier processes the image at maximum resolution.
3.2. Building the Hierarchy
In the following we describe an algorithm for automatically determining the architecture of the hierarchy. We begin with a set of classifiers which operate at different resolutions and are each trained over the entire training set. In our experiments the resolution of the linear SVM classifiers ranged from 3x3 to 19x19. Given this set, the goal is to find the best hierarchy with respect to speed and recognition performance. The algorithm builds the hierarchy in an iterative top-down fashion, starting with the topmost classifier and adding a new layer at each iteration. It consists of three steps:

a) Adding a new layer. We add a classifier to the hierarchy which operates at a lower resolution than the current bottom classifier. For example, if the current hierarchy consists of a 19x19 and an 11x11 classifier, we add classifiers with resolutions ranging from 9x9 down to 3x3, resulting in 7 new hierarchies.

b) Retraining. We perform bottom-up training of the new hierarchies. We train the bottom classifier on the whole training set and then shift its hyperplane such that a fixed percentage of positive examples (set to 99% in the experiments) is on one side. Then we remove all negative examples which lie on the other side of the hyperplane to generate a new
training set for the classifier on the next higher level. This is continued for each layer till we reach the top of the hierarchy. c) Speed test. We set the thresholds of all classifiers in the hierarchies such that a fixed percentage of the faces in the training set b is correctly classified. Then we perform a speed test of the hierarchies on the validation set of non-faces and choose the hierarchy with the least number of computations. The iterative process is continued until the lowest resolution is reached or adding a new layer does not increase the speed of the system. In our experiments we used a set of linear SVM classifiers with resolutions ranging from 3x3 to 19x19 and a non-linear SVM at resolution 19x19 which was chosen to be our top level classifier. The non-linear SVM was not included into the bottom-up learning procedure described above. For a realtime system this computationally expensive classifier can only be applied to a very small percentage of the input patterns. Since our bottom-up training method reflects the runtime data flow, the few training examples that remain would not have been sufficient to train an accurate classifier. The training set for the linear classifiers consisted of 9,662 face images and 33,045 non-face images at resolution 19x19. The validation set consisted of 73,089 randomly collected non-face images at resolution 19x19. We applied histogram equalization as proposed in 10 to decrease the variations caused by changes in illumination. To train the classifiers at lower resolutions we downscaled the training images to the proper size. Applying the above described training procedure resulted in a five level hierarchy with the 19x19 non-linear SVM on top, followed by a 19x19, 11x11, 4x4, and finally 3x3 linear SVM classifier. The number of examples removed in the bottom-up training procedure for the final hierarchy is given in Table 1. Table 1. Number of negative training examples used for training in bottom-up training procedures.
Bottom-up:   3x3: 33,045  ->  4x4: 23,598  ->  11x11: 18,149  ->  19x19: 10,568
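A rough sketch of the bottom-up retraining step b) using scikit-learn's LinearSVC is given below; the downscale helper and the bookkeeping are assumptions, and the thresholds are shifted so that 99% of the faces are accepted, as in the text.

```python
import numpy as np
from sklearn.svm import LinearSVC

def retrain_bottom_up(resolutions, faces, non_faces, downscale, keep_positive=0.99):
    """Bottom-up retraining used when a new (coarser) bottom layer is added.
    `downscale(X, r)` is an assumed helper that resizes flattened patches to
    resolution r; each layer only passes on the negatives it fails to reject."""
    layers, negatives = [], np.asarray(non_faces)
    for r in resolutions:                                    # e.g. [3, 4, 11, 19], coarse to fine
        X = np.vstack([downscale(faces, r), downscale(negatives, r)])
        y = np.concatenate([np.ones(len(faces)), -np.ones(len(negatives))])
        clf = LinearSVC().fit(X, y)
        scores_pos = clf.decision_function(downscale(faces, r))
        thr = np.quantile(scores_pos, 1.0 - keep_positive)   # accept keep_positive of the faces
        keep = clf.decision_function(downscale(negatives, r)) >= thr
        negatives = negatives[keep]                          # survivors train the next layer
        layers.append((r, clf, thr))
    return layers
```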
As mentioned before, the topmost classifier was trained independently. We used the same training set (2,429 faces and 4,548 non-face patterns) as in 4 to allow for comparisons with the results reported there. In Figure 2 we show the ROC curves of the single second-degree polynomial classifier 4 and our hierarchical system on the CMU set 1 c. We selected the thresholds of the hierarchical classifiers in layers one to four according to the individual ROC curves on the training set d. The hierarchy performs better than the single classifier for recognition rates below 75%. Above
that, the single SVM classifier is superior, indicating that some of the difficult face patterns in the test set do not reach the last layer of the hierarchy. Figure 3 shows the data flow through the hierarchy and the time spent in each layer. The hierarchical system is about 260 times faster than the system with the single second-degree polynomial SVM.
[Figure 2 plot. Training: 2,429 faces / 4,548 non-faces. Test: CMU set 1, 118 images, 479 faces, 56,774,966 windows. x-axis: false positives / number of windows (0.00001 to 0.0001).]
Fig. 2. ROC curves of the hierarchical system and the original system with a single second-degree polynomial SVM for the CMU set 1.
[Figure 3 diagram: percentage of total computing time spent at each layer (11.1%, 17.3%, 47.9%, 13.8%, 9.9%) and percentage of the original patterns remaining after each layer, from the whole image (100%) through the linear classifiers up to the 19x19 linear and 19x19 polynomial classifiers; 0.005% of the windows are finally labeled as faces and 99.995% as background.]
Fig. 3. Data flow and computing time for the hierarchical classifier tested on the CMU set 1.
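The data flow of Fig. 3 amounts to a rejection cascade at runtime. The sketch below illustrates the idea under assumed interfaces (a per-resolution feature vector per window and per-level thresholds); it is not the actual system.

```python
# Sketch of the runtime cascade: a window is passed through the linear
# levels in order of increasing resolution and is rejected as background as
# soon as one level scores below its threshold; only the few survivors
# reach the 19x19 polynomial SVM.
def classify_window(window_patches, levels, top_classifier, top_resolution=19):
    """window_patches: dict mapping resolution -> feature vector;
    levels: list of (resolution, linear_clf, threshold), bottom level first."""
    for resolution, clf, thr in levels:
        if clf.decision_function([window_patches[resolution]])[0] < thr:
            return False                     # rejected early: background
    # the expensive non-linear top classifier sees only the survivors
    return top_classifier.decision_function(
        [window_patches[top_resolution]])[0] >= 0
```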
In layers one to four most of the computation time (~90%) is used for feature extraction. Further optimizing the classifiers would not lead to a significantly faster system. In the last layer about 95% of the computing time is spent on the classification leaving much room for speed-ups using feature reduction methods. In the following sections we explore methods for feature reduction and apply them to the non-linear SVM at the top level of the hierarchy.
488
4. Dimension Reduction in the Input Space

4.1. Ranking Features in the Input Space

In 13 a gradient descent method is proposed to rank the input features by minimizing the bound on the expectation of the leave-one-out error of the classifier. The algorithm showed superior performance compared to other feature selection methods (filter methods based on the Fisher score, Pearson correlation coefficients and the Kolmogorov-Smirnov test) for various classification tasks (face detection, person detection, cancer morphology classification). The basic idea is to re-scale the n-dimensional input space by an n x n diagonal matrix \sigma such that the bound R^2 W^2 is minimized. The new mapping function is then \Phi_\sigma(x) = \Phi(\sigma \cdot x) and the kernel function is K_\sigma(x, y) = K(\sigma \cdot x, \sigma \cdot y) = (\Phi_\sigma(x) \cdot \Phi_\sigma(y)). The decision function given in Eq. (1) becomes:

f(x, \sigma) = w \cdot \Phi_\sigma(x) + b = \sum_{i=1}^{\ell} \alpha_i^0 y_i K_\sigma(x_i, x) + b    (6)
The maximization problem of Eq. (2) is now given by:

W^2(\alpha, \sigma) = \max_\alpha \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K_\sigma(x_i, x_j)    (7)

subject to the constraints \sum_{i=1}^{\ell} \alpha_i y_i = 0 and C \ge \alpha_i \ge 0, i = 1, ..., \ell. The radius around the data is computed by solving the following maximization problem:

R^2(\beta, \sigma) = \max_\beta \sum_i \beta_i K_\sigma(x_i, x_i) - \sum_{i,j} \beta_i \beta_j K_\sigma(x_i, x_j)    (8)

subject to \sum_i \beta_i = 1, \beta_i \ge 0, i = 1, ..., \ell. We solve for \alpha, \beta, and \sigma using an iterative procedure: we initialize \sigma as a vector of ones and then solve Eqs. (7) and (8) for the margin and radius, respectively. Using the values for \alpha and \beta and the bound in Eq. (3) we compute \sigma by minimizing W^2(\alpha, \sigma) R^2(\beta, \sigma) using a gradient descent procedure. We then start a new iteration of the algorithm using the new value of \sigma.
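A rough sketch of this iterative procedure follows. It is not the authors' implementation: scikit-learn's SVC stands in for the training of Eq. (7), the radius of Eq. (8) is replaced by the constant upper bound 2p used later in Section 4.2 (so only W^2 drives the update), and the gradient is approximated by finite differences.

```python
# Sketch of the scaling-factor ranking of Eqs. (6)-(8); kernel settings,
# learning rate and the finite-difference gradient are illustrative choices.
import numpy as np
from sklearn.svm import SVC

def w2(X, y, sigma, C=10.0):
    """Dual objective W^2(alpha, sigma) of Eq. (7) at the optimum."""
    Xs = X * sigma                          # rescale input space by diag(sigma)
    clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=C).fit(Xs, y)
    sv = Xs[clf.support_]
    ay = clf.dual_coef_.ravel()             # alpha_i * y_i
    K = (1.0 + sv @ sv.T) ** 2              # K_sigma on the support vectors
    return np.abs(ay).sum() - 0.5 * ay @ K @ ay

def rank_features(X, y, n_iter=10, lr=0.1, eps=1e-2):
    """Update sigma by (approximate) gradient descent on the bound and rank
    the input features by the magnitude of their scaling factors."""
    sigma = np.ones(X.shape[1])
    for _ in range(n_iter):
        base = w2(X, y, sigma)
        grad = np.empty_like(sigma)
        for k in range(len(sigma)):         # crude finite-difference gradient
            s = sigma.copy(); s[k] += eps
            grad[k] = (w2(X, y, s) - base) / eps
        sigma = np.clip(sigma - lr * grad, 0.0, None)
    return np.argsort(-sigma), sigma        # feature indices, best first
```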
[Figure 4 plots a) and b). Training: 2,429 faces / 4,548 non-faces. Test: reduced CMU set, 118 images, 472 faces. x-axis: false positives / number of windows.]
Fig. 4. ROC curves for a) gray features and b) PCA gray features with the 60, 80 and 100 ranked features.
4.2. Selecting Features in the Input Space
In Section 4.1 we ranked the features according to their scaling factors \sigma. To determine how many of the ranked features to keep, we estimated the bound on the expected error

E P_{err} \le \frac{1}{\ell} E\{ R^2 W^2(\alpha^0) \}    (9)

To simplify the computation of our algorithm and to avoid solving a quadratic optimization problem in order to compute the radius R, we approximated R^2 by 2p, where p is the dimension of the feature space R^p. (Since all data were previously normalized to lie between 0 and 1 in R^n, the points lie within a p-dimensional cube of side \sqrt{2} in R^p and the smallest sphere including all the data points has radius at most \sqrt{2p}.) For a second-degree polynomial kernel of type (1 + x \cdot y)^2 we get:

E P_{err} \le \frac{1}{\ell} 2p \, E\{ W^2(\alpha^0) \} \le \frac{1}{\ell} n^*(n^* + 3) E\{ W^2(\alpha^0) \}    (10)
where n^* is the number of selected features (as we used a second-degree polynomial SVM, the dimension of the feature space is p = n^*(n^* + 3)/2). The estimated bound on the expected error is shown in Fig. 5. We had no training error for more than 22 selected features. The bound drops over the first 30 features, stays about the same between 30 and 60 features, and then increases steadily. This bound is considered to be a loose bound on the expected error. To check whether it is of practical use for selecting the number of features we performed tests on the CMU set 1. In Fig. 6 we compare the ROC curves obtained for different numbers of selected features. The results show that using more than 60 features does not improve the performance of the system. This observation coincides with the run of the curve in Fig. 5. However, the error on
[Figure 5 plot: estimation of the bound on the generalization error for PCA gray features versus the number of selected features.]
Fig. 5. Approximation of the estimated bound on the expected error versus the number of principal components. The values on the y-axis are not normalized by the number of training samples.
the test set does not change significantly for more than 70 features although the estimated bound on the expected error shown in Fig. 5 increases. An explanation for this could be that the bound gets looser with an increasing number of features because of our approximation of R by the dimensionality of the feature space.
[Figure 6 plot. Training: 2,429 faces / 4,548 non-faces. Test: CMU set 1, 118 images, 479 faces, 56,774,966 windows. Curves: 30, 60 and 100 PCA gray features and the complete set of 283 gray features; x-axis: false positives / number of windows.]
Fig. 6. ROC curves for different numbers of PCA gray features.
5. Feature Reduction in the Feature Space

In the previous section we described how to reduce the number of features in the input space. Now we consider the problem of reducing the number of features in
the feature space. We use a new method based on the contribution of the features from the feature space to the decision function f(x) of the SVM:

f(x) = w \cdot \Phi(x) + b = \sum_{i=1}^{\ell} \alpha_i y_i K(x_i, x) + b    (11)
with w = (w_1, ..., w_p). For a second-degree polynomial kernel with K(x, y) = (1 + x \cdot y)^2, the feature space R^p with dimension p = n(n + 3)/2 is given by x^* = (\sqrt{2} x_1, ..., \sqrt{2} x_n, x_1^2, ..., x_n^2, \sqrt{2} x_1 x_2, ..., \sqrt{2} x_{n-1} x_n). The contribution of a feature x^*_k to the decision function in Eq. (11) depends on w_k. A straightforward way to order the features is by ranking |w_k|. Alternatively, we weighted w by the support vectors to account for different distributions of the features in the training data. The features were then ordered by ranking |w_k \sum_i y_i x^*_{i,k}|, where x^*_{i,k} denotes the k-th component of support vector i in feature space R^p. For both methods we first trained an SVM with a second-degree polynomial kernel on 60 PCA gray features of the input space, which corresponds to 1,891 features in R^p. We then calculated

E(S) = \sum_i |f(x_i) - f_S(x_i)|    (12)
for all support vectors, where f_S(x) is the decision function using the S first features according to their ranking. Note that in contrast to the previously described selection of features in the input space, this method does not require retraining SVMs for different feature sets. The results in Fig. 7 show that ranking by the weighted components of w leads to a faster convergence of E(S) from Eq. (12) towards 0. Fig. 8 shows the ROC curves for 500 and 1,000 features. As a reference we added the ROC curve for a second-degree SVM trained on the original 283 gray features, which corresponds to 283(283 + 3)/2 = 40,469 components in the feature space. By combining both methods of feature reduction we could reduce the dimensionality by a factor of about 40 without loss in performance.

6. Experiments

In the final experiments we combined feature reduction with hierarchical classification. In Figure 9 we compare the ROC curves of the hierarchical system and the single SVM classifier with a second-degree polynomial kernel. The hierarchy performs better than the single classifier up to a recognition rate of 80%. For higher recognition rates, the single SVM classifier performs slightly better because some of the more difficult face patterns in the test set do not reach the last layer of the hierarchy. The average computing time on a Pentium IV with 1.8 GHz for an image of size 320 x 240 is given in Table 2. We sped up the detection by a factor of 335 compared to the original system, achieving a processing speed of 4 frames/sec. The face detector proposed in 12 is about 4 times faster at runtime. However, it requires several weeks for training while our system can be trained in a day.
[Figure 7 plot: partial sums for support vectors; curves for ranking by the components of w and by the weighted components of w; x-axis: number of selected features.]
Fig. 7. Classifying support vectors with a reduced number of features. The x-axis shows the number of features, the y-axis is the mean absolute difference between the output of the SVM using all features and the same SVM using the S first features only. The features were ranked according to the components and the weighted components of the normal vector of the separating hyperplane.
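The ranking by the (weighted) components of w and the partial sums E(S) plotted in Fig. 7 can be sketched as follows, using an explicit second-degree polynomial feature map; the scikit-learn interface and the 100-feature step size are assumptions, not the authors' code.

```python
# Sketch of the feature-space ranking of Section 5 and of E(S) (Eq. (12)),
# evaluated on the support vectors of a (1 + x.y)^2 polynomial SVM.
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

def poly2_map(X):
    """Explicit map of (1 + x.y)^2: (sqrt(2)x_i, x_i^2, sqrt(2)x_i x_j, i<j).
    The constant component is omitted; it does not affect f on the support
    vectors because sum_i alpha_i y_i = 0."""
    n = X.shape[1]
    cross = np.column_stack([np.sqrt(2) * X[:, i] * X[:, j]
                             for i, j in combinations(range(n), 2)])
    return np.hstack([np.sqrt(2) * X, X ** 2, cross])

def rank_and_partial_sums(X, y, C=10.0, step=100):
    clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=C).fit(X, y)
    Phi = poly2_map(X[clf.support_])
    ay = clf.dual_coef_.ravel()               # alpha_i * y_i
    w = ay @ Phi                              # w = sum_i alpha_i y_i Phi(x_i)
    # weighted ranking |w_k * sum_i y_i x*_{i,k}|; sign(ay) recovers y_i
    order = np.argsort(-np.abs(w * (np.sign(ay) @ Phi)))
    f_full = Phi @ w + clf.intercept_[0]      # f(x_i) on the support vectors
    E = []
    for S in range(step, len(w) + 1, step):   # mean |f - f_S|, cf. Eq. (12)
        keep = order[:S]
        E.append(np.mean(np.abs(f_full - Phi[:, keep] @ w[keep]
                                - clf.intercept_[0])))
    return order, E
```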
[Figure 8 plot. Training: 2,429 faces / 4,548 non-faces. Test: CMU set 1, 118 images, 479 faces, 56,774,966 windows. x-axis: false positives / number of windows.]
Fig. 8. ROC curves for different dimensions of the feature space.
In Figure 10 we show the results for an example image from CMU set 1. The original image is shown in Figure 10 a); Figures 10 b) to e) show the outputs of the SVM classifiers at levels two to five, respectively. Dark pixels represent high output values of the SVM, indicating the presence of a face; white areas have been removed by previous classifiers in the hierarchy. In Figure 10 f) we show the final detection results as computed by the top level second-degree polynomial SVM classifier. Some more example images from CMU set 1 are shown in Figure 11.
Table 2. Average computing time per 320 x 240 image for various face detection systems. Each image was processed at five different scales to detect faces at resolutions between 30 x 30 and 70 x 70 pixels.

System                                      Time / Image [ms]   Speed-up Factor
Single SVM                                  86,875              -
Single SVM with Feature Reduction           23,383              3.7
Hierarchy of SVMs                           391                 222.2
Hierarchy of SVMs with Feature Reduction    259                 335.4
[Figure 9 plot. Training: 2,429 faces / 4,548 non-faces. Test: CMU set 1, 118 images, 479 faces, 56,774,966 windows. x-axis: false positives / number of windows.]
Fig. 9. ROC curves of the hierarchical system and the original system with a single second-degree polynomial SVM for the CMU set 1.
7. Conclusion and Future Work
In this paper we presented speed-up methods for a face detection system based on hierarchical classification and feature reduction. To quickly remove large background parts of an image we arranged five SVM classifiers with increasing computational complexity in a hierarchical structure. We proposed an iterative algorithm for automatically building and training a hierarchical system of classifiers. To further speed up the detection system we applied feature reduction to the non-linear SVM at the top level. This was accomplished by ranking and then selecting PCA gray features according to a classification criterion derived from learning theory. Applying these methods to face detection, we removed 99% of the features without loss in classification performance. Experiments show that the combination of feature selection and hierarchical classification results in a speed-up factor of 335 while maintaining classification accuracy. In the current system image preprocessing and feature extraction account for 90% of the runtime. In future work we will explore alternative feature extraction and preprocessing techniques to further speed up the detection.
Fig. 10. Detections at each level of the hierarchy, a) Original Image, b) to e) Outputs of layers two to five. Dark pixels represent high outputs of the classifier, white areas have been removed by previous layers, f) Final detection result computed by layer six.
Fig. 11. Face detection with the hierarchical system.
Acknowledgements

This article is reprinted from Pattern Recognition, Vol. 36, No. 9, B. Heisele, T. Serre, S. Prentice, T. Poggio, "Hierarchical classification and feature reduction for fast face detection with support vector machines", pp. 2007-2017, 2003, with permission from Elsevier.
References

1. Blum, A., Langley, P., 1997. Selection of relevant features and examples in machine learning. Artificial Intelligence 10, 245-271.
2. Burt, P. J., 1988. Smart sensing within a pyramid vision machine. Proc. of the IEEE 76(8), 1006-1015.
3. Edwards, J., Murase, H., 1997. Appearance matching of occluded objects using coarse-to-fine adaptive masks. Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, 533-539.
4. Heisele, B., Poggio, T., Pontil, M., 2000. Face detection in still gray images. A.I. Memo 1687, Center for Biological and Computational Learning, MIT, Cambridge, MA.
5. Kohavi, R., 1995. Wrappers for feature subset selection. Artificial Intelligence, special issue on relevance 97, 273-324.
6. Oren, M., Papageorgiou, C., Sinha, P., Osuna, E., Poggio, T., 1997. Pedestrian detection using wavelet templates. IEEE Conference on Computer Vision and Pattern Recognition, San Juan, 193-199.
7. Rosenfeld, A., Vanderbrug, G. J., 1977. Coarse-fine template matching. IEEE Transactions on Systems, Man and Cybernetics 2, 104-107.
8. Rowley, H. A., 1999. Neural network-based face detection. Ph.D. thesis, CMU, School of Computer Science, Pittsburgh.
9. Rowley, H. A., Baluja, S., Kanade, T., 1997. Rotation invariant neural network-based face detection. Computer Science Technical Report CMU-CS-97-201, CMU, Pittsburgh.
10. Sung, K.-K., 1996. Learning and example selection for object and pattern recognition. Ph.D. thesis, MIT, Artificial Intelligence Laboratory and Center for Biological and Computational Learning, Cambridge, MA.
11. Vapnik, V., 1998. Statistical learning theory. John Wiley and Sons, New York.
12. Viola, P., Jones, M., 2001. Rapid object detection using a boosted cascade of simple features. Proc. of IEEE Conference on Computer Vision and Pattern Recognition.
13. Weston, J., Mukherjee, S., Chapelle, O., Pontil, M., Poggio, T., Vapnik, V., 2001. Feature selection for support vector machines. Advances in Neural Information Processing Systems 13.
CHAPTER 5.1 TRACKING AND CLASSIFYING MOVING OBJECTS USING SINGLE OR MULTIPLE CAMERAS
Quming Zhou and J. K. Aggarwal
Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX 78712, USA

This chapter presents methods for tracking moving objects using single or multiple cameras. Objects are classified into three categories: single person, people group or vehicle. For the case of the single camera, the proposed method integrates spatial position, shape and color information to track object blobs. Further, it focuses on establishing a correspondence between objects and templates as the objects come into view. For the case of multiple cameras, we fuse data from individual cameras using an Extended Kalman Filter (EKF) to resolve object occlusion. Our results show that integrating simple features makes the tracking effective and that the EKF improves the tracking accuracy when long-term or temporary occlusion occurs. Successful results for PETS2001 and other image sequences are presented.
1. Introduction

The efficient tracking and classification of multiple moving objects is a challenging and important task in computer vision. It has applications in surveillance, video communication and human-computer interaction. Recently, a significant number of tracking systems have been proposed. Some address low-level feature tracking while others deal with high-level description, such as recognition, event detection and trajectory interpretation. The success of high-level description relies on accurate detection and tracking of moving objects and on the relationship of their trajectories to the background. There is considerable diversity among the trackers proposed by various researchers. Haritaoglu, Harwood and Davis [15] classified the feature trackers into several categories according to their sensor types (single or multiple camera, color or gray-scale) and their functionality (tracking single or multiple objects, handling occlusion). Cai and Aggarwal [6] proposed a tracker to establish feature correspondence
between consecutive frames under different views. They tracked coarse human models from sequences of synchronized monocular grayscale images in multiple camera coordinates. Stauffer and Grimson [30] developed a vision system extending the traditional tracking tasks, which dealt only with low-level feature tracking, to sequence classification and unusual event detection. Their system passively observed moving objects and learned patterns of action from those observations. Another event detection method applied to video surveillance was proposed in [31]. The most widely used cues in object tracking are spatial position, shape, color, intensity and motion. Many uncontrollable factors such as lighting, weather, unexpected intruders or occlusion may affect the efficiency of tracking when these cues are used in an outdoor environment. One solution is to combine two or more cues. Another solution is to use multiple cameras. We shall consider both of these options in this chapter. In the following, we discuss additional research relevant to the two issues of feature integration and multiple cameras.

1.1 Feature Integration

Multi-feature integration has been exploited extensively in feature-based tracking applications. Shi and Tomasi [29] classified a tracked feature as reliable or unreliable according to the residual of the match between the associated image region in the first and current frames. A feature should be abandoned when the residual grows too large. This work [29] was extended by Tommasini et al. [34], who introduced a simple, efficient, model-free outlier rejection rule for rejecting spurious features. Kaneko and Hori [18] proposed a feature selection method based on the upper bound of the average template matching error. The selection criterion is directly related to the reliability of tracking. More complex integration methods using statistical models have been presented in [40, 24]. Triesch and Malsburg [35] presented a system for integrating features in a self-organized manner. In their system, all features agreed on a result, and each feature adapted to the result agreed upon. The self-organized adaptation led to suppression and recalibration of discordant features, i.e., features that did not agree with the overall result. Experiments showed that the system was robust to sudden changes in the environment as long as the changes disrupted only a minority of the currently used features at the same time, although all features might be affected in the long run. Cai and Aggarwal [6] used a Bayesian classifier to find the most likely match of the subject in the next frame. Multiple features were modeled by a joint Gaussian function. The most likely match was the minimum value among the candidates with a joint Mahalanobis distance less
than a certain threshold. Rigoll et al. [25] combined two stochastic modeling techniques. Pseudo-2D Hidden Markov Models were used to capture the shape of a person within an image frame. A Kalman filtering algorithm was applied to the output of the first model to track the person by estimating a bounding box trajectory indicating the location of the person within the entire video sequence. Both algorithms cooperated in an optimal way.

1.2 Multi-camera Tracking

Single camera tracking is hampered by the camera's limited field of view. Occlusion is always a challenge for single camera trackers, while multi-camera trackers can utilize the different views to obtain more robust tracking, especially for wider fields of view or occluded objects. Multi-camera tracking is a correspondence problem between objects seen from different views at the same time or with a fixed time latency. It implies that all views of the same object should be given the same label. Khan and Shah [19] presented a system based on the field of view lines of the camera to establish equivalence between views of the same object in different cameras. These lines were used to resolve any ambiguity between multiple tracks. The authors claimed that no calibration was required. Cai and Aggarwal [5] used only relative calibration between cameras to track objects over a long distance. Dockstader and Tekalp [8] introduced a distributed, real-time computing platform to improve feature-based tracking in the presence of articulation and occlusion. Their contribution was to perform both spatial and temporal data integration within a unified framework of 3D position tracking to provide increased robustness to temporal feature point occlusion. A full calibration was employed to build the 3D model. Lee [20] recovered the full 3D relative positions of the cameras and the domain plane of activity in the scene using only the tracks of moving objects. For a more thorough discussion of modeling, tracking and recognition for human motion analysis, we refer the reader to Aggarwal and Cai [1], Gavrila [13] and Collins et al. [7].

1.3 Object Classification

Despite the significant amount of literature on video surveillance, little work has been done on the task of classifying objects as single person, people group, or vehicle. Tan, Sullivan and Baker [33] developed an algorithm to localize and recognize vehicles in traffic scenes. Lipton, Fujiyoshi and Patil [21] classified
moving targets from a video stream into human, vehicle or background regions. They used a dispersedness value as a classification metric based on the assumption that humans were, in general, smaller than vehicles and had more complex shapes. This assumption became somewhat tenuous in cases where the humans were closer than the vehicles to the camera, where humans grouped together, or when vehicles were occluded. Foresti [11] described a visual surveillance system to classify moving objects into car, motorcycle, van, lorry or person. In his system, object classification was based on a statistical morphology operator, the spectrum, which was used to index large image databases containing multiple views of different objects. Since the spectrum was invariant to object translation, rotation, and scale variations, a simple distance measure was applied as a matching function to individuate the correct object class in the database. In the classification system designed by Stauffer and Grimson [30], when given one or more observations of a particular object, the system classified it into a set of classes such that the same type of object tended to be put in the same class. The number of classes was not predetermined but depended on the input sequences. Their classification method used on-line vector quantization to generate a codebook of prototype representations. Then, the automatically tracked sequences were used to define a co-occurrence matrix over the prototypes in the codebook. Following that, the co-occurrence data were used to probabilistically break apart the prototypes in the codebook into a binary tree representation. The final result was a hierarchical classifier that could classify any individual representation.

1.4 Overview of this Chapter

In this chapter, we address the issues of tracking and classifying moving objects from video sequences acquired by stationary cameras. Our tracking objective is to establish a correspondence between the image structures of consecutive frames over time to form persistent object trajectories. Our processing includes: 1) automatically determining the initial positions of moving objects, 2) extracting feature information from all moving objects within view, 3) tracking detected objects by features, 4) classifying the moving objects into three categories: single person, people group or vehicle, and 5) fusing data by an Extended Kalman Filter (EKF) to resolve object occlusion. Our tracking system integrates spatial position, shape and color. This integration makes the tracker insensitive to background changes, motion interruption and object orientation. We detect motion and background variation to segment the moving object blobs,
and then compare the similarity of the object blobs with different templates, thereby tracking the objects. We use simple thresholds instead of comparing the Mahalanobis distances between the templates and objects. In multi-camera tracking, we focus on developing a methodology for tracking multiple objects in the views of two different fixed cameras, whose transformation matrix and focal length are estimated from five coplanar control points. The single camera tracking uses 2D image coordinates, whereas in multicamera tracking we transform to real world coordinates. We apply an Extended Kalman Filter to fuse the separate observations from the two cameras into a position and velocity in real world coordinates. This enables our tracker to switch from one camera's observation to the other when the target object is occluded from view. We initialize tracking by solving a constrained linear least squares problem. The constraint is that a moving object in the real world always has some height. This helps us to remove the shadow on the ground plane, whose height is zero. Recently, the Kalman filter has been used in multisensor data fusion [12, 26]. The Kalman filter can be applied to each sensor's data separately [22, 38] or it can be applied simultaneously to data from multiple sensors, combined through additional logic [9, 27]. Our underlying assumption for using the Kalman filter to fuse data is that there is a mathematical relationship between the target object's image positions in two synchronized cameras [4]. Furthermore, measurements from two synchronized cameras provide enough information to estimate the state variables of the system, the position and velocity in real world coordinates. After obtaining an accurate description of the observed object, we classify the objects and update the templates, taking into account any occlusion. Our chapter presents two robust classification metrics to classify the target object into single person, people group or vehicle, namely the variance of motion direction and the variance of compactness. These two metrics are independent of the target object size and orientation and the camera used. The classification allows our tracker to know what kinds of objects are moving in the scene and to detect when two people or more come together to form a group, or separate from each other, dissolving a group. The remainder of this chapter is organized as follows. Section 2 describes the algorithm for moving object segmentation in an outdoor environment. Section 3 focuses on extracting features from the moving blobs. Section 4 develops a hierarchical approach to compare the blobs with the templates in order of position, shape and color. Classification metrics are derived in Section 5. Section 6 provides experimental results to demonstrate the accuracy and discusses the problems of tracking using a single camera. The associated classification results
are also presented in this section. Section 7 applies an EKF to take advantage of multiple cameras. Experimental results using the EKF are given in Section 8. Finally, Section 9 presents conclusions. Fig. 1(a)-(b) shows the architecture of our proposed systems using a single camera and multiple cameras, respectively.
[Figure 1(b) block diagram: Single Camera Module I and Single Camera Module II produce centroid projections in Camera I and Camera II; after tracking initialization, an Extended Kalman Filter fuses the centroids into a real-world position, yielding the tracked objects.]
Figure 1. (a) Architecture of the proposed system using a single camera, (b) Architecture of the proposed system using multiple cameras
2. Moving Object Detection

The first step in tracking objects is to segment the objects from the background. Two common methods are motion segmentation and background subtraction. Motion segmentation is basically a threshold of the difference between the current image and the subsequent image, based on the assumption
that the background does not change over successive frames [17, 37]. This method is easy and fast in many applications, but some problems appear when tracking multiple objects or when object motion stops. Hence, we turn to background subtraction at the expense of updating the background [14, 10]. A pixel-wise median filter with a frame length of L is employed to build the background under the assumption that a moving object would not stay at the same position for more than 0.5L frames. L is typically of the order of twenty. If the object were to stay still for more than 0.5L frames, it would be incorporated into the background. In general, I(x, y, t) will denote the intensity at a point (x, y) at time t. This notation is shortened to I_k(m, n) to denote the intensity at pixel (m, n) at frame k. The background model \hat{I}_k(m, n) for pixel (m, n) with an intensity of I_k(m, n) at frame k using a median filter is

\hat{I}_k(m, n) = median( I_{k-0.5L}(m, n), I_{k-0.5L+1}(m, n), ..., I_{k+0.5L}(m, n) )    (1)
which retains the stationary pixels in the background. Here I_k(m, n) represents the observed intensity and \hat{I}_k(m, n) denotes the model for the background. A median filter can build the background even when moving objects exist in the scene, but it usually requires a large amount of memory to save L frames at a time. So it is applied only when we detect a large new blob that lasts for \eta_h frames; \eta_h = 5 in our examples. Background subtraction is performed in color and in edge density. We subtract the foreground from the background in each of the RGB color channels and then take the maximum absolute value of these three differences as the difference value Diff_c in color space:

Diff_c(m, n) = max{ |R_f - R_b|, |G_f - G_b|, |B_f - B_b| }    (2)
where R_b, G_b and B_b are the background color channels and R_f, G_f and B_f are the foreground color channels. We do a similar subtraction using edge density values, Dens, instead of color values, and thus obtain the difference value Diff_e in edge density. The edge density Dens is defined as

Dens(m, n) = (2w + 1)^{-2} \sum_{(x, y) \in W(m, n)} ( |\partial I / \partial x| + |\partial I / \partial y| )    (3)

where 2w + 1 is the size of an averaging window filter W(m, n) centered at (m, n). The binary foreground pixels F(m, n) are computed as
F(m, n) = 1 if Diff_c(m, n) > Th_1 OR Diff_e(m, n) > Th_2, and F(m, n) = 0 otherwise,    (4)
where Th_1 and Th_2 are fixed thresholds determined empirically. Our experiments show that color and texture are complementary when the target object is small in size relative to the entire image.
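A compact sketch of this detection step is given below. It mirrors Eqs. (1)-(4) but is not the authors' implementation: the OpenCV operators, the Sobel approximation of the derivatives, and the threshold values are assumptions.

```python
# Median-filter background model plus colour and edge-density differencing.
import numpy as np
import cv2

def background_median(frames):
    """Eq. (1): pixel-wise median over an L-frame buffer."""
    return np.median(np.stack(frames, axis=0), axis=0).astype(np.uint8)

def edge_density(gray, w=2):
    """Eq. (3): average of |dI/dx| + |dI/dy| over a (2w+1)x(2w+1) window."""
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    return cv2.blur(np.abs(gx) + np.abs(gy), (2 * w + 1, 2 * w + 1))

def foreground_mask(frame_bgr, bg_bgr, th1=30, th2=40, w=2):
    """Eqs. (2) and (4): max per-channel colour difference OR'ed with the
    edge-density difference, then cleaned by a morphological 'close'."""
    diff_c = np.abs(frame_bgr.astype(np.int16)
                    - bg_bgr.astype(np.int16)).max(axis=2)
    dens_f = edge_density(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY), w)
    dens_b = edge_density(cv2.cvtColor(bg_bgr, cv2.COLOR_BGR2GRAY), w)
    F = ((diff_c > th1) | (np.abs(dens_f - dens_b) > th2)).astype(np.uint8)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    return cv2.morphologyEx(F, cv2.MORPH_CLOSE, kernel)
```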
Figure 2. (a) Original picture, (b) Background extracted by the median filter, (c) Moving blobs
The resulting foreground image contains noise due to clutter in the background. The 'close' binary morphological operator is applied to remove noise. Now, we segment the foreground F into several isolated blobs by an eight-connected algorithm. We assume that each initial blob contains a moving object, which may be a person, a vehicle or a people group. Fig. 2 (a)-(c) shows an example of the above segmentation. The extracted blobs represent the moving objects in the image sequences.

3. Feature Extraction

We extract four types of features for each moving blob. These features are generally used in object tracking, except for the variance of motion direction, which is used for classification purposes. The centroid of a blob (M, N) tells us the spatial position of the blob. It is the average position of all pixels in the blob, where (m, n) is the position of a given pixel in the blob at a given time. Since an object will not move far from its last position from one frame to the next, its centroid provides us with a strong and useful feature for tracking objects. Shape features provide us with shape-based information about the objects. An object's shape generally does not change much between consecutive frames. We select four shape features: length and width, area, compactness and orientation. The object orientation is defined as the angle between the major axis and the horizontal axis. An object with a more complex shape is more likely to change its compactness than an object with a simple shape, assuming that the same segmentation method is used. We define these features by:
• Length and width (L, W):

L = max(m) - min(m),  W = max(n) - min(n)    (5)

• Area:

A = \sum_{(m, n) \in R} F(m, n)    (6)

where F(m, n) is the binary foreground value of a pixel in the blob R.

• Compactness:

C = 4\pi \cdot Area / Perimeter^2    (7)

where Perimeter is the number of all boundary pixels, including the boundaries of inside holes.

• Orientation:

\theta = \frac{1}{2} \arctan\left( \frac{2\mu_{11}}{\mu_{20} - \mu_{02}} \right)    (8)

where \mu_{pq} = \sum (m - M)^p (n - N)^q for p, q = 0, 1, 2.
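The shape features of Eqs. (5)-(8) can be computed from a binary blob mask as in the sketch below; the 4-neighbour boundary test used for the perimeter and the use of arctan2 instead of arctan are implementation choices of the sketch.

```python
# Centroid and shape features of a (non-empty) binary blob, pure NumPy.
import numpy as np

def shape_features(mask):
    """mask: 2-D boolean array, True on blob pixels."""
    m, n = np.nonzero(mask)                    # pixel coordinates in the blob
    M, N = m.mean(), n.mean()                  # centroid
    L = m.max() - m.min()                      # length, Eq. (5)
    W = n.max() - n.min()                      # width
    A = int(mask.sum())                        # area, Eq. (6)
    # perimeter: blob pixels with at least one non-blob 4-neighbour
    padded = np.pad(mask, 1)
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:])
    P = int((mask & ~interior).sum())
    C = 4 * np.pi * A / P ** 2                 # compactness, Eq. (7)
    mu = lambda p, q: ((m - M) ** p * (n - N) ** q).sum()
    theta = 0.5 * np.arctan2(2 * mu(1, 1), mu(2, 0) - mu(0, 2))  # Eq. (8)
    return M, N, L, W, A, C, theta
```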
The use of color as a feature allows us to track objects when shape information is not reliable. Color is independent of the object size and orientation and especially useful when we can detect only partial objects. However, color is likely to change with the lighting. Most previous research [39, 23, 16] has concentrated on how to stabilize color in some color space, e.g. HSV, normalized RGB and YIQ. In our work, we use principal component analysis (PCA) to extract the color feature of the object. PCA is also called eigenspace decomposition and is applied to reduce the number of dimensions of the working space. It projects d-dimensional data onto a lower-dimensional subspace in a way that is optimal under a least square error criterion. We choose the axis that maximizes the variation of the data as our principal axis. This axis is the eigenvector corresponding to the maximum eigenvalue. We define the color similarity measure between objects and templates to be the transformation between their principal axes. All pixel RGB color values are collected from each blob. The three color channels of a pixel are denoted by a 1x3 vector rgb = [R G B]. The average color of all \eta pixels in the blob is given as

u_{rgb} = \frac{1}{\eta} \sum_{i=1}^{\eta} rgb_i    (9)
Then, a new vector is defined as \tilde{rgb}_i = rgb_i - u_{rgb}. By concatenating all \eta pixels, an \eta x 3 matrix RGB is formed to express the red, green and blue components of the full data set as RGB = [\tilde{rgb}_1; \tilde{rgb}_2; ...; \tilde{rgb}_\eta]. Next, a 3x3 covariance matrix \Sigma is calculated by

\Sigma = \frac{1}{\eta} RGB^T RGB    (10)
and the eigenvector and eigenvalue matrices are computed from \Sigma \Phi = \Lambda \Phi. We take the eigenvector \Phi_1 corresponding to the largest eigenvalue \lambda_1 as the principal axis. If two blobs have similar color, their principal axes should be similar. After we extract the above features for each object blob, the feature vector R_k^l = [M, N, L, W, A, C, \theta, \Phi_1] will represent the blob. Our tracking is based on the extracted feature vector instead of the blob itself.

4. Object Tracking

Our tracking process compares the feature vector R_k^l with all templates T_{k-1}^i (i = 1, 2, ..., T). If a match is found, then the template is updated for the next match through an adaptive filter of the form

T_k^i = (1 - \beta) T_{k-1}^i + \beta R_k^l    (11)
where the learning parameter \beta determines how fast the template is updated. If no match is found for several successive frames, then a new candidate template T_{T+1,k} will be created. A template will be deleted from the candidate template list if it is not matched with any object for L successive frames, the length of the median filter. A hierarchical matching process is performed in the order of centroid, shape and color. As soon as a match occurs, the matching process terminates for the given blob.

Centroid matching. In general, an object's next position will be in the neighborhood of its current position. We calculate the distances from a target object to all templates and sort the distances from the minimum to the maximum, namely d_min, d_2, ..., d_max. Three thresholds, Th_3, Th_4 and Th_5, are used in our centroid matching. An object is likely to be a new object if its d_min > Th_3. A match occurs for the template at distance d_min if d_min < Th_4. Considering the case when occlusion occurs, we add a third constraint to avoid possible mismatching: the second minimum distance must satisfy d_2 > d_min + Th_5. This constraint prevents mismatching if there are two or more
objects at a distance less than the threshold Th_4 from the template. If the centroid matching procedure fails to match the target object to any template, then the next matching procedure, shape or color matching, is necessary. To reduce the possibility of a false match, we compare the target object's shape or color only with the templates at the first four minimum distances from the object.

Shape matching. We compute the distance from the shape feature vector [A, C, \theta] to the template as

Dis = (A / A_{temp} - 1)^2 + (C / C_{temp} - 1)^2 + (\theta / \theta_{temp} - 1)^2    (12)
where A_{temp}, C_{temp} and \theta_{temp} are the area, compactness and orientation of the template, respectively. The object is assigned to the template that yields the least distance if the distance is less than a predetermined threshold. To achieve a good measure of dissimilarity using distance, we normalize the shape features prior to calculating the distance.

Color matching. We use a similar function to compare the angles between the principal axes of the color of the object and the template. The normalized inner product

S(\Phi_R, \Phi_T) = \frac{\Phi_R \cdot \Phi_T}{\|\Phi_R\| \, \|\Phi_T\|}    (13)

is used as the similarity function. The value of this metric is larger when the blob axis \Phi_R and the template axis \Phi_T are similar. This measurement is the cosine of the angle between the two principal axes.
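A condensed sketch of the hierarchical matching follows; the template data structure, the threshold values, and the absolute value applied to the cosine (to cope with the sign ambiguity of eigenvectors) are assumptions of the sketch, not specifications from the text.

```python
# Centroid matching first, then shape (Eq. (12)), then colour (Eq. (13)).
import numpy as np

def match_blob(blob, templates, th3=50.0, th4=20.0, th5=10.0,
               th_shape=0.5, th_color=0.9):
    """blob/templates: dicts with 'centroid', 'A', 'C', 'theta', 'phi'."""
    d = np.array([np.linalg.norm(np.subtract(blob["centroid"], t["centroid"]))
                  for t in templates])
    order = np.argsort(d)
    if d[order[0]] > th3:
        return None                            # probably a new object
    # centroid match: nearest template, unambiguously closer than the rest
    if d[order[0]] < th4 and (len(d) == 1 or d[order[1]] > d[order[0]] + th5):
        return order[0]
    cand = order[:4]                           # only the four nearest templates
    dis = np.array([(blob["A"] / templates[i]["A"] - 1) ** 2 +
                    (blob["C"] / templates[i]["C"] - 1) ** 2 +
                    (blob["theta"] / templates[i]["theta"] - 1) ** 2
                    for i in cand])            # shape distance, Eq. (12)
    if dis.min() < th_shape:
        return cand[dis.argmin()]
    sim = np.array([abs(np.dot(blob["phi"], templates[i]["phi"])) /
                    (np.linalg.norm(blob["phi"]) *
                     np.linalg.norm(templates[i]["phi"]))
                    for i in cand])            # colour cosine, Eq. (13)
    if sim.max() > th_color:
        return cand[sim.argmax()]
    return None
```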
5. Object Classification

The motion feature is an efficient way to track humans and vehicles [3, 28]. Vehicles are more consistent in their motion because they are rigid objects, whereas humans shift some parts of their bodies backwards when they move forward to maintain balance. So the variance of motion direction is employed to measure motion consistency. We derive this feature from optical flow. Optical flow is widely used to estimate motion based on the assumption that the change in image brightness in the sequence is due only to motion. This assumption leads to the brightness constraint equation

\frac{\partial I}{\partial t}(x, y, t) + \frac{\partial I}{\partial x}(x, y, t) \frac{dx}{dt} + \frac{\partial I}{\partial y}(x, y, t) \frac{dy}{dt} = 0    (14)

where I(x, y, t) is the image intensity as a function of space (x, y) and time t. This equation does not determine the flow field since it places only one restriction on the two parameters dx/dt and dy/dt. In order to obtain a unique solution, an additional smoothness constraint is imposed as

\arg\min \left\{ \left( \frac{dx}{dt} \right)^2 + \left( \frac{dy}{dt} \right)^2 \right\}    (15)
By Lagrange optimization, the solutions are

v_{x,t} = - \frac{ (\partial I / \partial t)(\partial I / \partial x) }{ (\partial I / \partial x)^2 + (\partial I / \partial y)^2 }  and  v_{y,t} = - \frac{ (\partial I / \partial t)(\partial I / \partial y) }{ (\partial I / \partial x)^2 + (\partial I / \partial y)^2 }    (16)

where v_{x,t} and v_{y,t} are the velocities along the x axis and the y axis at time t, respectively. The weighted velocity sums over three successive frames, with weights a_{-1}, a_0 and a_1, are
V_{x,t} = a_{-1} v_{x,t-1} + a_0 v_{x,t} + a_1 v_{x,t+1}    (17)

V_{y,t} = a_{-1} v_{y,t-1} + a_0 v_{y,t} + a_1 v_{y,t+1}    (18)
Then, a Gaussian filter G is applied, yielding

V^*_{x,t} = V_{x,t} \otimes G  and  V^*_{y,t} = V_{y,t} \otimes G    (19)
where the symbol \otimes is a convolution operator. We define the motion direction as

\delta = \arctan 2( V^*_{y,t}, V^*_{x,t} )    (20)
where arctan2 is the four-quadrant arctangent function. We calculate \delta for each pixel in an object blob and compute the variance \sigma_\delta^2 of all \delta's in the blob. Our experiments show that \sigma_\delta^2 is an efficient metric to distinguish a single person. The variance of motion direction \sigma_\delta^2 alone cannot discriminate a people group from a vehicle. We therefore also take the variance of compactness \sigma_c^2 into consideration, based on the observation that the shape of a people group tends to change dramatically, which is measured by the variance of compactness over frames, denoted as
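A minimal sketch of the motion-direction variance \sigma_\delta^2 built from Eqs. (14)-(20) is given below; the central-difference temporal derivative, the weights (0.25, 0.5, 0.25) and the Gaussian kernel size are assumed values, and plain (non-circular) variance of the angles is used, as in the text.

```python
# Per-pixel normal flow, temporal weighting, Gaussian smoothing, and the
# variance of the motion direction inside a blob.
import numpy as np
import cv2

def normal_flow(prev, curr, nxt):
    Ix = cv2.Sobel(curr, cv2.CV_32F, 1, 0)
    Iy = cv2.Sobel(curr, cv2.CV_32F, 0, 1)
    It = (nxt.astype(np.float32) - prev.astype(np.float32)) / 2.0
    denom = Ix ** 2 + Iy ** 2 + 1e-6                 # Eq. (16)
    return -It * Ix / denom, -It * Iy / denom

def direction_variance(frames, mask, a=(0.25, 0.5, 0.25), ksize=5):
    """frames: five consecutive grayscale frames; mask: blob pixels of the
    middle frame (a 2-D boolean array)."""
    vx, vy = [], []
    for t in (1, 2, 3):                              # flow at t-1, t, t+1
        fx, fy = normal_flow(frames[t - 1], frames[t], frames[t + 1])
        vx.append(fx); vy.append(fy)
    Vx = a[0] * vx[0] + a[1] * vx[1] + a[2] * vx[2]  # Eq. (17)
    Vy = a[0] * vy[0] + a[1] * vy[1] + a[2] * vy[2]  # Eq. (18)
    Vx = cv2.GaussianBlur(Vx, (ksize, ksize), 0)     # Eq. (19)
    Vy = cv2.GaussianBlur(Vy, (ksize, ksize), 0)
    delta = np.arctan2(Vy, Vx)                       # Eq. (20)
    return float(np.var(delta[mask]))                # sigma_delta^2 over blob
```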