PROCEEDINGS IWISP '96 4-7 November 1996 Manchester, U.K.
PROCEEDINGS IWISP '96 4-7 November 1996, Manchester, United Kingdom Third International Workshop on Image and Signal Processing on the Theme of Advances in Computational Intelligence
Edited by
B.G. MERTZIOS Automatic Control Lab., Dept. of Electrical & Comp. Engineering, Democritus University of Thrace, GR-67 100 Xanthi, GREECE
P. LIATSIS Control Systems Centre, Dept. of Electrical Engineering & Electronics, UMIST, Sackville Street, P.O. Box 88, Manchester M60 1QD, United Kingdom
ELSEVIER AMSTERDAM - LAUSANNE - NEW YORK - OXFORD - SHANNON - TOKYO
1996
ELSEVIER SCIENCE B.V. Sara Burgerhartstraat 25 P.O. Box 211, 1000 AE Amsterdam, The Netherlands
ISBN: 0 444 82587 8
© 1996 Elsevier Science B.V. All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior written permission of the publisher, Elsevier Science B.V., Copyright & Permissions Department, P.O. Box 521, 1000 AM Amsterdam, The Netherlands. Special regulations for readers in the U.S.A.- This publication has been registered with the Copyright Clearance Center Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923. Information can be obtained from the CCC about conditions under which photocopies of parts of this publication may be made in the U.S.A. All other copyright questions, including photocopying outside of the U.S.A., should be referred to the copyright owner, Elsevier Science B.V., unless otherwise specified. No responsibility is assumed by the publisher for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein. This book is printed on acid-free paper. Printed in The Netherlands.
Preface

The papers that are included in this volume have been presented at the 3rd International Workshop on Image/Signal Processing (IWISP): Advances in Computational Intelligence, which was held at UMIST, Manchester, UK on 4-7 November, 1996. The 3rd IWISP was organised by the Control Systems Centre, UMIST in association with IEEE Region 8 and co-sponsored by the Institute of Electrical Engineers, the Institute of Measurement and Control, the IEEE Signal Processing Society and the Control Technology Transfer Network, under the General Chairmanship of Prof. Peter E. Wellstead and the Programme Chairmanship of Prof. Basil G. Mertzios.
Evidently, a Workshop cannot cover the intensively developed area of Image and Signal Processing. The trend of the 3rd IWISP is emphasized by its theme: 'Advances in Computational Intelligence', referring to computational efficiency and complexity in Image and Signal Processing. In particular, the Workshop focuses on the most modern and critical aspects of Image and Signal Processing and their related areas that have a significant impact on our society. Specifically, the articles presented in the 3rd IWISP may be categorized in the following four major parts:
- Coding and Compression (image coding, image subband, wavelet coding and representation, video coding, motion estimation and multimedia);
- Image Processing and Pattern Recognition (image analysis, edge detection, segmentation, image enhancement and restoration, adaptive systems, colour processing, pattern and object recognition and classification);
- Fast Processing Techniques (computational methods, VLSI DSP architectures);
- Theory and Applications (identification and modelling, multirate filter banks, wavelets in image and signal processing, biomedical and industrial applications).

The proposals
from each category were then reviewed by the members of the International Programme Committee and numerous other reviewers. We are sincerely grateful to the reviewers and to the volunteers who acted as invited session organisers and helped us to attract high quality contributions. In the review process, about three fifths of the submitted papers were accepted. The final programme consisted of 24 oral sessions, giving a total of 152 high quality papers. The authors of the papers presented in IWISP-96 form an exceptionally interesting and wide international group coming from the five continents and representing the following 33 countries:
Argentina, Armenia, Australia, Belgium, Brazil, Canada, China, Croatia, Czech Republic, Finland, France, Germany, Greece, Hong Kong, India, Iran, Israel, Italy, Japan, Korea, Mexico, The Netherlands, Poland, Russia, Slovakia, Slovenia, Spain, Sweden, Taiwan, Turkey, UK, USA and Yugoslavia.

The first and second IWISP have been held in Budapest under the chairmanship of Prof. Kalman Fazekas.
The transition of the 3rd IWISP to Manchester signifies a true internationalisation and strengthens the interdisciplinary cross-fertilisation of theory and applications, and the strong interest guarantees a successful future. The next Workshops will be organised by an International Steering Committee and will focus on the areas of Signal Processing and Systems, where there is a great potential. Amongst others, typical cases include lossless techniques, multiresolution analysis and wavelets, adaptive systems and filters, linear prediction and orthogonal systems, model order and data reduction, 2D control systems, learning theory and applications, computational complexity and non-linear dynamics.

Acknowledgements and appreciation are due to all the contributors who submitted their proposals for review to IWISP'96. Needless to say, we could not have such a high quality technical programme without their contributions. We also wish to sincerely thank the members of the International Programme Committee, the reviewers and all those that helped in the organisation of the Workshop.
Basil G. Mertzios
Panos Liatsis
IWISP '96 ORGANIZING COMMITTEE
P.E. Wellstead, UMIST, UK (General Chair)
M. Domanski, TU Poznan, Poland (Tutorials Chair)
K. Fazekas, TU Budapest, Hungary (Financial Chair)
P. Liatsis, UMIST, UK (Proceedings/Publicity Chair)
B.G. Mertzios, Democritus Univ. of Thrace, Greece (Program Chair)
INTERNATIONAL PROGRAMME COMMITTEE
I. Antoniou, Solvay Inst., Belgium
J. Biemond, TU Delft, The Netherlands
Z. Bojkovic, Belgrade Univ., Yugoslavia
I. Boutalis, Democritus Univ. of Thrace, Greece
M. Brady, Univ. of Oxford, UK
V. Cappellini, Florence Univ., Italy
G. Caragiannis, NTUA, Greece
A.C. Constantinides, Imperial College, UK
T. Cooklev, Univ. of Toronto, Canada
J. Cornelis, Vrije Universiteit Brussel, Belgium
A. Davies, King's College London, UK
I. Erenyi, KFKI Research Inst., Hungary
G. Fettweis, Ruhr Univ. Bochum, Germany
M. Ghanbari, Univ. of Essex, UK
S. van Huffel, KU Leuven, Belgium
G. Istefanopoulos, Bosporous Univ., Turkey
V.V. Ivanov, JINR, Russia
M. Karny, UTIA, Academy of Sciences, Czech Republic
T. Kida, Tokyo Inst. of Technology, Japan
J. Kittler, Univ. of Surrey, UK
S. Kollias, NTUA, Greece
M. Kunt, University of Lausanne, Switzerland
C.L. Nikias, Univ. of Southern California, USA
T. Nossek, TU Munchen, Germany
D. van Ormondt, TU Delft, The Netherlands
K.K. Parhi, Univ. of Minnesota, USA
M. Petrou, Univ. of Surrey, UK
D.T. Pham, Univ. of Wales Cardiff, UK
M. Sablatash, McMaster Univ., Canada
D.G. Sampson, Democritus Univ. of Thrace, Greece
W. Schempp, Siegen Univ., Germany
M. Strintzis, Aristotle Univ. of Thessaloniki, Greece
J. Turan, TU Kosice, Slovak Republic
G.J. Vachtsevanos, Georgia Inst. of Tech., USA
A. Venetsanopoulos, Toronto Univ., Canada
Contents
Session A: Image Coding I: Vector Quantisation, Fractal and Segmented Coding Joint optimization of multidimensional SOFM codebooks with QAM modulations for vector quantized image transmission O. Aitsab, R. Pyndiah and B. Solaiman Visual vector quantization for image compression based on Laplacian pyramid structure Z. He, G. Qiu and S. Chen Kohonen's self-organizing feature maps with variable learning rate: Application to image compression A. Cziho, B. Solaiman, G. Cazuguel, C. Roux and I. Lovany
11
An efficient training algorithm design for general competitive neural networks J. Jian and D. Butler
15
Architecture design for polynomial approximation coding of image compression C.-Y. Lu and K.-A. Wen
19
Application of shape recognition to fractal based image compression S. Morgan and A. Bouridane
23
Chrominance vector quantization for coding of images and video at very low bitrates M. Bartkowiak, M. Domanski and P. Gerken
27
Region-of-interest based compression of magnetic resonance imaging data N.G. Panagiotidis and S.D. Kollias
31
Scalable parallel vector quantization for image coding applications D.G. Sampson, A. Cuhadar and A.C. Downton
37
Session B: Wavelets in Image/Signal Processing Real time image compression methods incorporating wavelet transforms D.T. Morris and M.D. Edwards
43
Custom wavelet packet image compression design M.V. Wickerhauser
47
Two-dimensional directional wavelets in image processing J.-P. Antoine
53
The importance of the phase of the symmetric Daubechies wavelets representation of signals J.-M. Lina
61
Contrast enhancement in images using the 2D continuous Wavelet transform J.-P. Antoine and P. Vandergheynst
65
Wavelets and differential-dilation equations T. Cooklev, G. Berbecel and A.N. Venetsanopoulos
69
Wavelets in high resolution radar imaging and clinical magnetic resonance imaging W. Schempp
73
Wavelet transform based information extraction from 1-D and 2-D signals A. Dabrowski
81
Invited Session C: General techniques and algorithms Computational methods and tools for simulation and analysis of complex processes V.V. Ivanov
89
Rare events selection on a background of dominated processes applying multilayer perceptron V.V. Ivanov and P.V. Zrelov
97
Cellular automaton and elastic neural network application for event reconstruction in high energy physics I. Kisel, E. Konotopskaya and V. Kovalenko
101
Recognition of tracks detected by drift tubes in a magnetic field S.A. Baginyan and G.A. Ososkov
105
Session D: Adaptive Systems I: Identification and Modeling A unified connective representation for linear and nonlinear discrete-time system identification J. Fantini
111
Predicting a chaotic time series using a dynamical recurrent neural network R. Teran, J-P. Draye and D. Pavisic
115
A new neural network structure for modelling non-linear dynamical systems A. Hussain, J.J. Soraghan, T.S. Durrani and D.C. Campell
119
A neural network for moving light display trajectory prediction H.M. Lakany and G.M. Hayes
123
Recognizing flow pattern of gas/liquid two-component flow using fuzzy logical neural network P. Lihui, Z. Baofen, Y. Danya and X. Zhijie
127
Adaptive algorithm to solve the mixture problem with a neural networks methodology A.M. Perez, P. Martinez, J. Moreno, A. Silva and P.L. Aguilar
133
Process trend analysis and fuzzy reasoning in fermentation control S. Kivikunnas, K. Ibatici and E. Juusso
137
Higher order cumulant maximisation using non-linear Hebbian and anti-Hebbian learning for adaptive blind separation of source signals M. Girolami and C. Fyfe
141
Session E: Pattern/Object Recognition A robot vision system for object recognition and work piece location W. Min, D. Qizhi and W. Jun
147
Recognition of objects and their direction of moving based on sequence of two-dimensional frames B. Potochik and D. Zazula
151
Innovative techniques for the recognition of faces based on multiresolution analysis and morphological filtering A. Doulamis, N. Tsapatsoulis and S. Kollias
155
Partial curve identification in 2-D space and its application to robot assembly F.-H. Yao, G.-F. Shao, A. Tamaki and K. Kato
161
A fast active contour algorithm for object tracking in complex background C.L. Lam and S.Y. Yuen
165
The 2-point combinatorial probabilistic Hough transform for circle detection J.Y. Goulermas and P. Liatsis
169
Modified rapid transform features in information symbols recognition system J. Turan, K. Fazekas, L. Ovsenik and M. Kovesi
173
Image data processing in flying object velocity optoelectronic measuring device J. Mikulec and V. Ricny
177
Session F: Texture Analysis Rotation invariant texture classification schemes using GMRFs and Wavelets R. Porter and N. Canagarajah
183
A new method for describing texture D.T. Pham and B. Cetiner
187
Texture discrimination for quality control using wavelet and neural network techniques D.A. Karras, S.A. Karkanis and B.G. Mertzios
191
A region oriented CFAR approach to the detection of extensive targets in textured images C. Alberola-Lopez, J.R. Casar-Corredera and J. Ruiz-Alzola
195
Generating stable structure of a color texture image using scale-space analysis with nonuniform Gaussian kernels S. Morita and M. Tanaka
199
Session G: Image Coding II: Transform, Subband and Wavelet Coding Approximation of bidimensional Karhunen-Loeve expansions by means of monodimensional Karhunen-Loeve expansions, applied to Image Compression N. Balossino and D. Cavagnino
205
Blockness distortion evaluation in block-coded pictures M. Cireddu, F.G.B. De Natale, D.D. Giusto and P. Pes
209
A new distortion measure for the assessment of decoded images adapted to human perception F. Bock, H. Walter and M. Wilde
215
Image compression with interpolation and coding of interpolation errors J. Yi and F. Arp
219
Matrix to vector transformation for image processing D. Ait-Boudaoud
223
A speech coding algorithm based on wavelet transform X. Wu, Y. Li and H. Chen
227
Automatic determination of region importance and JPEG codec reflecting human sense R. Hayasaka, J. Zhao, Y. Shimazu, K. Ohta and Y. Matsushita
231
Directional image coding on wavelet transform domain D.W. Kang
235
Session H: Video Coding I: MPEG An universal MPEG decoder with scalable picture size R. Prabhakar and W. Li
241
The influence of impairments from digital compression of video signal on perceived picture quality S. Bauer, B. Zovko-Cihlar and M. Grgic
245
On scalable coding of image sequences L. Erwan
249
Image transmission problems between IP and ATM networks V.S. Mkrttchian, A.V. Eranosian and H.L. Karamyan
253
A scalable video coding scheme based on adaptive infield/inframe DCT and adaptive frame interpolation M. Asada and K. Sawada
257
Rate conversion of compressed video for matching bandwidth constraints in ATM networks* P. Assuncao and M. Ghanbari
Session I: Image Subband, Wavelet Coding and Representation Unified image compression using reversible and fast biorthogonal wavelet transforms H. Kim and C.C. Li
263
Subband image coding using adaptive fuzzy quantization step controller P. Planinsic, F. Jurkovic, Z. Cucej and D. Donlagic
267
EZW algorithm using visual weighting in the decomposition and DPCM L. Lecornu and C. Jedrzejek
271
Efficient 3-D subband coding of color video M. Domanski and R. Swierczynski
277
Adaptive wavelet packet image coding with zerotree structure T. Otake, K. Fukuda and A. Kawanaka
281
Efficiency of the image morphological pyramid decomposition D. Sandic and D. Milovanovic
285
Optimal vector pyramidal decompositions for the coding of multichannel images D. Tzovaras and M.G. Strintzis
289
* Due to unavoidable circumstances this paper has been placed at the end of the book on page 701.
Session J: Segmentation Multilingual character segmentation using matching rate K.-A. Moon, S.-Y. Chi, J.-W. Park and W.-G. Oh
295
Architecture of an object-based tracking system using colour segmentation R. Garcia-Campos, J. Battle and R. Bischoff
299
Segmentation of retinal images guided by the wavelet transform T. Morris and Z. Newell
303
An adaptive fuzzy clustering algorithm for image segmentation Y.A. Tolias and S.M. Panas
307
Hy2: A hybrid segmentation method F. Marino and G. Mastronardi
311
Session K: Image Enhancement/Restoration Efficient computation of the 2-dimensional RGB vector median filter S.J. Sangwine and A.J. Bardos
317
Image restoration for millimeter wave images by Hopfield neural network K. Yuasa, H. SawN, K. Watabe, K. Mizuno and M. Yoneyama
321
Image restoration of medical diffraction tomography using filtered MEM K. Hamamoto, T. Shiina and T. Nishimura
325
Directionally adaptive image restoration X. Neyt, M. Acheroy and I. Lemanhieu
329
Optimal matching of images at low photon level M. Guillaume, T. Amoroux and P. Refregier
333
A method for controlling the enhancement of image features by unsharp masking filters E. Cernadas, L. Gomez, A. Casas, P.G. Rodriguez and R.G. Carrion
337
Image noise reduction based on local classification and iterated conditional models K. Haris, S.N. Efstratiadis, N. Maglaveras and C. Pappas
341
Session L: Adaptive Systems II: Classification A neural approach to invariant character recognition I.M. Spiliotis, P. Liatsis, B.G. Mertzios and Y.P. Goulermas
347
Image segmentation based on boundary constraint neural network F. Kurugollu, S. Birecik, M. Sezgin and B. Sankur
353
A high performance neural multiclassifier system for generic pattern recognition applications D. Mitzias and B.G. Mertzios
357
Application of a neural network for multifont Farsi character recognition using fuzzified pseudo-Zernike moments M. Namazi and K. Faez
361
Integrating LANDSAT and SPOT images to improve landcover classification accuracy A. Chiuderi
365
Classification of bottle rims using neural networks-an LMS approach C. Teoh and J.B. Levy
369
INVITED SESSION M: Wavelets and Filter Banks in Communications Data compression, data fusion and Kalman filtering in wavelet transform Q. Jin, K.M. Wong, Z.M. Luo and E. Bosse
377
Performance of wavelet packet division multiplexing in timing errors and flat fading channels J. Wu, K.M. Wong and Q. Jin
381
Time-varying wavelet-packet division multiplexing T.N. Davidson and K.M. Wong
385
Co-channel interference mitigation in the time-scale domain: the CIMTS algorithm S. Heidari and C.L. Nikias
389
Design and performance of DS/SS signals defined by arbitrary orthonormal functions W.W. Jones and J.C. Dill
393
COFDM, MC-CDMA and wavelet-based MC-CDMA K. Chang and X. Lin
397
Signal denoising through multifractality W. Kinsner and A. Langi
405
Application of multirate filter bank to the co-existence problem of DS-CDMA and TDMA systems S. Hara, T. Matsuda and N. Morinaga
409
Session N: Edge Detection Multiscale edges detection by wavelet transform for model of face recognition F. Yang, M. Paindavoine and H. Abdi
415
Edge detection by rank functional approximation of grey levels J.P. Asselin de Beauville, D. Bi and F.Z. Kettaf
419
Fuzzy logic edge detection algorithm S. Murtovaara, E. Juuso and R. Sutinen
423
Topological edge finding M. Mertens, H. Sahli and J. Cornelis
427
Session O: Video Coding II: Motion Estimation Automatic parallelization of full 2D block matching for real-time motion compensation and mapping into special purpose architectures N. Koziris, G. Papakonstantinou and P. Tsanakas
433
New search region prediction method for motion estimation D.H. Ryu, C.R. Kim, T.W. Choi and J.C. Kim
439
Motion estimation by direct minimisation of the energy function of the Hopfield neural network L. Cieplinski and C. Jedrzejek
443
A modified MAP-MRF motion-based segmentation algorithm for image sequence coding D. Gatica-Perez, F. Garcia-Ugalde and V. Garcia-Garduno
447
Unsupervised motion segmentation of image sequences using adaptive filtering O. Pichler, A. Teuner and B.J. Hosticka
451
Development of a motion compensated coding system for an enhanced wide screen TV T. Hamada and S. Matsumoto
455
Session P: Biomedical Applications Brain evoked potentials mapping using the diffuse interpolation D. Bouattoura, P. Gaillard, P. Villon and F. Langevin
461
Computer-aided diagnosis: detection of masses on digital mammograms A.J. Mendez, P.G. Tahoces, M.J. Lado, M. Souto and J.J. Vidal
465
Model order determination of ECG beats using rational function approximations J.S. Paul, V. Jagadeesh Kumar and M.R.S. Reddy
469
Computation of the ejection rate of the ventricle from echocardiographic image sequences A. Teuner, O. Pichler and B.J. Hosticka
475
Contour detection of the left ventricle in echocardiographic images S.G. dos Santos, F. Bortolozzi and J. Facon
479
Identification of a stochastic system involving neuroelectric signals A.G. Rigas
483
Invited Session Q: Signal Processing Theory and Applications Design of m-band linear phase FIR filter banks with high attenuation in stop bands T. Kida and Y. Kida
489
Robustness of filter banks F.N. Kouboulis, M.G. Scarpetis and B.G. Mertzios
493
Design and learning algorithm of neural networks for pattern recognition H. Takahashi and M. Nakajima
497
Statistical comparison of minimum cross entropy spectral estimators R.C. Papademetriou
501
Generalized optimum approximation minimizing various measure of error at the same time T. Kida
507
Determination of optimal coefficients of high-order error feedback upon Chebyshev criteria A. Djebbari, Al. Djebbari, M.F. Belbachir and J.M. Rouvaen
511
Invited Session R: VLSI DSP Architectures Dynamic codelength reduction for VLIW instruction set architectures in digital signal processors M. Weiss and G. Fettweis
517
Implementation aspects of FIR filtering in a wavelet compression scheme G. Lafruit, B. Vanhoof, J. Bormans, M. Engels and I. Bolsens
521
Recursive approximate realisation of image transforms with orthonormal rotations G.J. Hekstra, E.F. Deprettere, M. Monari and R. Heusdens
525
Radix distributed arithmetic: algorithms and architectures M.K. Ibrahim
531
Order-configurable programmable power-efficient FIR filters C. Xu, C.-Y. Wang and K.K. Parhi
535
Session S: Video Coding III: Multimedia On speech compression standards in multimedia videoconferencing: Implementation aspects M. Markovic and Z. Bojkovic
541
Multimedia communication graphical user interface design principles for the teleeducation J. Turan, K. Fazekas, L. Ovsenik and M. Kovesi
545
Image and video compression for multimedia applications D.G. Sampson, E. da Silva and M. Ghanbari
549
A multilayer image coding and browsing system G. Qiu
553
Switched segmented image coding-JPEG schemes for progressive image transmission C.A. Christopoulos, A.N. Skodras, W. Philips and J. Cornelis
557
Low bit rate coding of image sequences using regions of interest and neural networks N. Doulamis, A. Tsiodras, A. Doulamis and S. Kollias
561
Session T: Image Analysis I Iterated function systems for still image processing J.-L. Dugelay, E. Polidori and S. Roche
569
Sensing Surface Discontinuities via Coloured Spots C.J. Davis and M.S. Nixon
573
Image analysis and synthesis by learning from examples S.G. Brunetta and N. Ancona
577
A stabilized multiscale zero-crossing image representation for image processing tasks at the level of the early vision S. Watanabe, T. Komatsu and T. Saito
581
Finding geometric and structural information from 2D image frames R. Jaitly and D.A. Fraser
585
Detection of small changes in intensity on images corrupted by signal-dependent noise by using the wavelet transform Y. Chitti and P. Gogan
589
Deterioration detection in a sequence of large images O. Buisson, B. Besserer, S. Boukir and L. Joyeux
593
Invited Session U: Color Processing Segmentation of multi-spectral images based on the physics of reflection N. Kroupnova
599
Using color correlation to improve restoration of colour images D. Keren, A. Gotlib and H. Hel-Or
603
Colour eigenfaces G.D. Finlayson, J. Dueck, B.V. Funt and M.S. Drew
607
Colour quantification for industrial inspection M. Petrou and C. Boukouvalas
611
Colour object recognition using phase correlation of log-polar transformed Fourier spectra A.L. Thornton and S.J. Sangwine
615
SIIAC: Interpretation system of aerial color images S. Mouhoub, M. Lamure and N. Nicoloyannis
619
Session V: Industrial Applications Nodular quantification in metallurgy using image processing V.L. Ballarin, E. Moler, F. Pessana, S. Torres and M. Gonzalez
625
Image processing in the measurement of trash content and grades in cotton B.D. Farah
629
Automated visual inspection based on fermat number transform J. Harrington and A. Bouridane
633
Segmentation of birch wood board images D.T. Pham and R.J. Alcock
637
Techniques for classifying sugar crystallization images based on spectral analysis and the use of neural networks E.S. Gonzalez-Palenzuela and P.I. Vega-Cruz
641
Large-scale tomographic sensing system to study mixing phenomena M. Wang, R. Mann, F.J. Dickin and T. Dyakowski
647
Session W: Image Analysis lI Structural indexing of infra-red images using statistical histogram comparison B. Huet and E. Hancock
653
A model-based approach for the detection of airport transportation networks in sequences of aerial images D. Sarantis and C.S. Xydeas
657
Context driven matching in structural pattern recognition S. Gautama and J.P.E D'Haeyer
661
An efficient box-counting fractal dimension approach for experimental image variation characterization A. Conci and C.F.J. Campos
665
An identification tool to build physical models for virtual reality J. Louchet and L. Jiang
669
Cue based camera calibration and its application to digital moving image production Y. Nakazawa, T. Komatsu and T. Saito
673
Session X: Signal Processing II A novel approach to phoneme recognition using speech image (spectrogram) M. Ahmadi, N.J. Bailey and B.S. Hoyle
679
Modified NLMS algorithms for acoustic echo cancellation M. Medvecky
683
Matrix polynomial computations using the reconfigurable systolic torus T.H. Kaskalis and K.G. Margaritis
687
Real-time connected component labelling on one-dimensional array processors based on content-addressable memory: optimisation and implementation E. Mozef, S. Weber, J. Jabar and E. Tisserand
691
A 2-D window processor for modular image processing applications and its VLSI implementation P. Tzionas, C. Mizas and A. Thanailakis
695
Session A: IMAGE CODING I: VECTOR QUANTISATION, FRACTAL AND SEGMENTED CODING
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
Joint optimization of multi-dimensional SOFM codebooks with QAM modulations for vector quantized image transmission
O. AITSAB*, R. PYNDIAH* & B. SOLAIMAN**
TELECOM BRETAGNE, B.P. 832, 29285 Brest Cedex, France. (Tel: (33) 98 00 10 70, Fax: (33) 98 00 10 98)
*Dept. S.C., **Dept. I.T.I. Email: omar.aitsab@enst-bretagne.fr
Abstract Traditionally, source coding and channel modulation characteristics are optimized separately. Source coding reduces the redundancy in an input signal (information compression), while the modulation adapts the information to the transmission channel characteristics in order to be noise resistant. In this paper, the internal structure of the source coding scheme (a self organized feature map, vector quantizer) is trained in conjunction with a QAM modulation type, in order to increase the tolerance of transmission error effects. Results obtained using the standard Lenna image are extremely encouraging.
I - Introduction

The requirements of digital transmission systems are now becoming so severe that it is no longer possible to optimize different functions in the system independently. Today, most transmission systems use the concept of coded-modulation [1] (TCM) which leads to a better spectral efficiency through the global optimization of channel coding and modulation. On the other hand, powerful source coding techniques are used to increase the number of sources transmitted in a given frequency bandwidth. However, the quality of the transmitted sources using these source coding techniques usually depends on the channel bit error rate. To go one step further, one would expect the subjective quality of the transmitted sources (image or speech) to remain acceptable even at a very low channel signal to noise ratio, as in an analogue transmission system. In this paper, the joint optimization of image coding (using vector quantization) and modulation is considered in order to minimize the effect of transmission errors on the subjective quality of the received/reconstructed images.
II - Image source coding

Recently, vector quantization (VQ) has emerged as an effective tool for image compression (source coding) [2]. In VQ, a data vector X (or a sub-image) to be encoded is represented as one of a finite set of M symbols. Associated with each symbol "i" is a reference vector (sub-image) "Ci" called a codeword. The complete set of M codewords is called the codebook. The codebook C = {Ci, i=1,2,...,M} is usually obtained through a training process using a large set of training data that is statistically representative of the data encountered in practice. In this study, the determination of the codebook is conducted using the Self Organizing Feature Map (SOFM) proposed by T. Kohonen [3]. This model builds up a mapping from the N-dimensional vector space of real numbers R^N to a two-dimensional array "S" of cells. Each cell is given a virtual position in R^N. This position (given by the synaptic weights connecting this cell to the input vector) is in fact the codeword.
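As a sketch of the VQ operations just described (an illustration, not the paper's code; function names are my own), encoding is a nearest-codeword search and decoding is a table lookup:

```python
import numpy as np

def nearest_codeword(x, codebook):
    """Index of the codeword (row of `codebook`) closest to x in
    Euclidean distance; this is the VQ encoding rule."""
    return int(np.argmin(np.linalg.norm(codebook - x, axis=1)))

def vq_encode(blocks, codebook):
    """Represent each data vector (sub-image) by a codeword index."""
    return [nearest_codeword(b, codebook) for b in blocks]

def vq_decode(indices, codebook):
    """Reconstruction is a simple lookup of the stored codewords."""
    return codebook[np.asarray(indices)]
```

With M = 256 codewords and 3x3 sub-images, as in the simulations below, each 9-pixel block is reduced to a single 8-bit index.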
The purpose of the self-organization process is to find the position vectors such that the resulting mapping (correspondence between an input vector X and the cell which lies nearest in R^N) is a topology-preserving mapping (adjacent vectors in R^N are mapped on adjacent, or identical, cells in the array "S"). The learning algorithm that forms feature maps selects the best matching (or winning) cell according to the minimum Euclidean distance between its position and the input vector X. All position vectors in the neighborhood of the winning cell are adjusted in order to make them more responsive to the current input. The quantized Lenna image using a 16x16 SOFM is given in figure 2 (Image 1). The codebook trained by the SOFM algorithm presents an internal order, which means that the Euclidean distance between codewords increases with the topological distance in the codebook (see figure 1); this order can be employed to increase error tolerance. In the next section, each codeword will be referenced by its topological position (i,j) on the SOFM.

III - Image transmission
In the case of a vector quantized image, the image transmission is done by transmitting the coordinates (i,j) of the different codewords representing the image. At the receiver end, the codewords corresponding to the received coordinates are used to reconstruct the transmitted image. It is clear that the received codeword can be different from the transmitted one when the received coordinates are subject to transmission errors. Furthermore, if we do not take any precautions, these codewords can be completely different, that is, a white block may be transformed into a black one and vice-versa ("salt and pepper" noise). This can lead to a very bad subjective quality of the received image, with black dots in white zones and vice-versa, as illustrated by Image 3 in figure 2. To reduce the effect of transmission errors on the received image, the probability of a transition between two codewords must be a decreasing function of the Euclidean distance between them. To obtain this characteristic, the internal order of the bi-dimensional (16x16) codebook obtained with the SOFM algorithm was used in conjunction with a 256QAM modulation. In this particular case, each codeword is associated with one specific point in the 256QAM constellation (see figure 1). This means that the topology of the SOFM is preserved in the modulation space. Thus, and since the symbol error probability is a decreasing function of the Euclidean distance between the constellation points, the transition probability between two codewords will be a decreasing function of the Euclidean distance between them. The performance of this approach is illustrated by Image 2 in figure 2. We observe that the subjective quality of the reconstructed image is very good for a bit error rate of 10^-2.
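The internal order exploited here is produced during SOFM training by the neighbourhood update: the winning cell and its grid neighbours are all pulled toward each input, so cells that are close on the map end up with similar codewords. A minimal sketch, with my own simplified decay schedule rather than the authors' exact algorithm:

```python
import numpy as np

def train_sofm(data, grid=(16, 16), epochs=10, lr0=0.5, sigma0=4.0, seed=0):
    """Minimal Kohonen SOFM: cells on a 2-D grid, each holding a weight
    (codeword) in R^N; winner = nearest cell, neighbours pulled toward
    the input with a Gaussian neighbourhood that shrinks over time."""
    rng = np.random.default_rng(seed)
    H, W = grid
    n = data.shape[1]
    weights = rng.standard_normal((H, W, n)) * data.std() + data.mean()
    gy, gx = np.mgrid[0:H, 0:W]          # grid coordinates of every cell
    steps, t = epochs * len(data), 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            # winning cell: minimum Euclidean distance to the input
            d = np.linalg.norm(weights - x, axis=2)
            wy, wx = np.unravel_index(np.argmin(d), (H, W))
            # learning rate and neighbourhood radius decay linearly
            frac = 1.0 - t / steps
            lr, sigma = lr0 * frac, max(sigma0 * frac, 0.5)
            h = np.exp(-((gy - wy) ** 2 + (gx - wx) ** 2) / (2 * sigma ** 2))
            weights += lr * h[..., None] * (x - weights)
            t += 1
    return weights.reshape(H * W, n)     # the trained codebook
```

Because neighbouring cells are updated together, Euclidean distance between codewords grows with distance on the (i,j) grid, which is exactly the property the constellation mapping relies on.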
Figure 1: Mapping of the bi-dimensional (16x16) SOFM codebook onto the 256QAM constellation
However, the 256QAM modulation is rarely used in practical transmission systems. We therefore propose to transmit the codeword coordinates using a QAM modulation with a smaller number of states, for example 16QAM modulation. In this case, each coordinate is represented by 4 bits and associated with a specific point in the 16QAM constellation by using a Gray mapping. The resulting reconstructed image is shown in figure 2 (Image 3). The degradation of the image is severe because the bi-dimensional codebook is not adapted to 16QAM modulation. In order to improve the quality of the received image, we have adapted the SOFM codebook topology to the type of modulation without increasing the complexity of the modulation and source coding [4]. The main idea is to minimize the transmission error effects, so two adjacent codewords must have adjacent points in the QAM constellation. In the best case, the number of codewords is equal to the number of modulation states. This was the case with the 256QAM modulation, and the reconstructed image presented good subjective quality even at a BER of 10^-2. However, when the number of codewords is greater than the number of modulation states, the SOFM topology must be adapted to the modulation. For 16QAM modulation, a four-dimensional codebook is required, and each codeword has 4 coordinates. Each coordinate takes 4 values, and each specific constellation point is associated with two coordinates. Thus, the four-dimensional codebook is trained for 16QAM modulation. Image 4 in figure 2 shows the image reconstructed using this ordered codebook for a BER of 10^-2. We clearly observe an improvement in the subjective quality: the PSNR is 5.7 dB higher than for the unordered codebook.

IV - Simulation results
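The Gray mapping mentioned above can be sketched as follows. The binary-reflected Gray code and the amplitude levels (-3, -1, 1, 3) are common conventions assumed here, not taken from the paper; the useful property is that adjacent amplitude levels differ in exactly one bit, so the most likely symbol errors, those landing on a neighbouring constellation point, corrupt only a single bit:

```python
def gray(n):
    """Binary-reflected Gray code."""
    return n ^ (n >> 1)

# 16QAM as a 4x4 grid: two Gray-coded bits per axis, amplitude
# levels -3, -1, 1, 3 (an assumed, common labelling).
LEVELS = (-3, -1, 1, 3)
POS = {gray(k): k for k in range(4)}   # Gray label -> grid position

def qam16_point(sym):
    """Map a 4-bit symbol to its (I, Q) constellation point."""
    return LEVELS[POS[(sym >> 2) & 0b11]], LEVELS[POS[sym & 0b11]]
```

For example, `qam16_point(0)` lands on the corner point (-3, -3), and walking along one axis of the grid changes exactly one bit per step.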
We simulated the effects of transmission errors and their compensation by joint optimization of the SOFM codebook and QAM modulation in image compression [5][6], using codebooks consisting of 256 codewords for 3 by 3 pixel subimages. The codebooks were trained using two images (boat and bridge) and were tested on the Lenna image. All the images were 512 by 512 pixels, with 256 grey levels. Distortion in the decoded images was measured using a peak signal-to-noise ratio (PSNR) defined as:
$$PSNR = 10 \log_{10} \frac{255^2}{MSE} \; \mathrm{dB},$$

where MSE is the mean square error.
V - Conclusion

The optimal association of a two-dimensional codebook containing 16x16 elements with a 256QAM modulation is very robust to transmission errors. When using a 16QAM modulation, the overall performance of the system can be improved by using a 4-dimensional codebook specifically trained for 16QAM modulation. However, we obtain lower performance than with the 256QAM constellation. This is due to the fact that in a 4-dimensional codebook of 256 elements, each codeword has 8 closest neighbors instead of 4. In this case it is difficult to minimize the VQ distortion and reduce the transmission error effect.
Figure 2: The reconstructed VQ image after transmission through a Gaussian noisy channel. Image 1: the reconstructed image without transmission errors, PSNR = 30 dB. Image 2: the reconstructed image with ordered codebook for 256QAM modulation (BER = 10^-2), PSNR = 29.1 dB. Image 3: the reconstructed image with unordered codebook for 16QAM modulation (BER = 10^-2), PSNR = 21.12 dB. Image 4: the reconstructed image with ordered codebook for 16QAM modulation (BER = 10^-2), PSNR = 26.82 dB.
References
[1] G. Ungerboeck, "Channel Coding With Multilevel/Phase Signals", IEEE Trans. on Information Theory, vol. IT-28, 1982, pp 55-67.
[2] R.M. Gray, "Vector quantization", IEEE Acoustics, Speech and Signal Processing Magazine, vol. 1, pp 4-29, Apr. 1984.
[3] T. Kohonen, Self-Organization and Associative Memory, New York, Springer-Verlag, 1984.
[4] J. Kangas, "Increasing the Error Tolerance in Transmission of Vector Quantized Images by Self-Organizing Map", ICANN 95, pp 287-291, Paris.
[5] J. Kangas and T. Kohonen, "Developments and applications of the Self-Organizing Map and related algorithms", in Proc. IMACS Int. Symp. on Signal Processing, Robotics and Neural Networks, pp 19-22, 1994.
[6] D.S. Bradburn, "Reducing transmission error effects using a self-organizing network", in Proc. IJCNN'89, Int. Joint Conf. on Neural Networks, vol. II, pp 531-537, Piscataway, NJ, 1989.
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
Visual Vector Quantization For Image Compression Based on Laplacian Pyramid Structure
Z. He†, G. Qiu‡ and S. Chen†
†University of Portsmouth, U.K.  ‡University of Derby, U.K.
Abstract In this paper, we propose a new image coding scheme based on the Laplacian pyramid structure (LPS) and visual vector quantization (VVQ). In this new scheme, the LPS is used to generate the residual image sequence, and the VVQ is used to code these residual images. Compared with other block-based coding methods, the new scheme produces much less blocking effect on the reconstructed image, since coding is performed on the basis of hierarchical multiresolution blocks. The new scheme also has the additional advantage of a much lower computational cost than traditional vector quantization (VQ) techniques, since encoding and decoding are based on much smaller dimensional 'visual vectors'. Experimental results show that the new scheme can achieve rate-distortion performance comparable to that of traditional VQ techniques, while its computational complexity is only a fraction of theirs.
1 Introduction
In recent years, the demand for image transmission and storage has increased dramatically, and research into efficient techniques for image compression has attracted extensive interest. Among many coding techniques, the LPS [1] and the VVQ [2] are two efficient coding techniques in terms of compression ratio, fidelity and computational expense. In this paper, we propose a new image coding scheme combining the LPS and the VVQ, which inherits the advantages of both techniques. In this new scheme, the LPS is employed to generate the residual image sequence and the VVQ is used to code these residual images. Experimental results show that the new scheme can achieve rate-distortion performance comparable to that of traditional VQ techniques, while its computational cost is much lower, since the encoding and decoding are based on much smaller dimensional 'visual vectors'. Because the coding operation is performed on the basis of hierarchical multiresolution blocks, the new scheme produces much less blocking effect on the reconstructed image than traditional VQ techniques. The remainder of the paper is organized as follows. Section 2 summarizes the LPS, and the VVQ system for coding Laplacian residual images is described in section 3. Section 4 discusses the image reconstruction. Section 5 presents experimental results and section 6 gives some concluding remarks.
2 The Pyramid Structure
The generation of the pyramid structure includes the generation of the Gaussian pyramid and the generation of the Laplacian pyramid. The process is illustrated in Fig.1.
Gaussian Pyramid Generation The original image G_0 of size M x N pixels becomes level 0 of the Gaussian pyramid. Upper-level images are generated by applying the reduction function R(.) [1], defined in (1), iteratively:

$$G_l(i,j) = \sum_{m=-2}^{2} \sum_{n=-2}^{2} w(m,n)\, G_{l-1}(2i+m,\, 2j+n), \qquad 0 < l \le L,\; 0 \le i < M_l,\; 0 \le j < N_l. \tag{1}$$
L is the number of levels in the pyramid, M_l and N_l are the dimensions of the l-th level, and w(m,n) are weighting kernels. Fig.2 shows a 5-level Gaussian pyramid of "Lena".
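A direct implementation of the reduction step (1) might look as follows. The 5-tap separable kernel with a = 0.4 is an assumption (the paper does not list w(m,n)); border replication is likewise one of several possible boundary treatments:

```python
import numpy as np

# Assumed 5-tap Burt-Adelson kernel with a = 0.4; the paper does not list w(m,n).
w1 = np.array([0.05, 0.25, 0.4, 0.25, 0.05])
W = np.outer(w1, w1)            # separable w(m, n), sums to 1

def reduce_(G):
    """One REDUCE step of Eqn. (1): weight a 5x5 neighbourhood, subsample by 2."""
    M, N = G.shape
    Gp = np.pad(G, 2, mode="edge")          # replicate borders
    out = np.empty((M // 2, N // 2))
    for i in range(M // 2):
        for j in range(N // 2):
            # padded indices 2i+m+2 for m = -2..2  ->  slice [2i : 2i+5]
            out[i, j] = (W * Gp[2 * i: 2 * i + 5, 2 * j: 2 * j + 5]).sum()
    return out
```

Because the kernel weights sum to one, a constant image reduces to the same constant, which is a quick sanity check on the indexing.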
Laplacian Pyramid Generation The reverse of the reduction function R(.) is the expansion function E(.) [1], defined in (2). Let G_{l,n} be the result of expanding G_l n times. Then
Figure 1: Pyramid Structure Generation
Figure 2: 5-Level Gaussian Pyramid of "Lena"
$$G_{l,n}(i,j) = 4 \sum_{m=-2}^{2} \sum_{p=-2}^{2} w(m,p)\, G_{l,n-1}\!\left(\frac{i-m}{2},\, \frac{j-p}{2}\right), \qquad 0 < l \le L,\; 0 < n \le l,\; 0 \le i < M_{l-n},\; 0 \le j < N_{l-n}, \tag{2}$$

where only terms for which (i-m)/2 and (j-p)/2 are integers are included in the sum.
The Laplacian pyramid is a sequence of residual images I_0, I_1, ..., I_L, each being the difference of two adjacent levels of the Gaussian pyramid. Thus, for 0 <= l <= L-1,

$$I_l = G_l - G_{l+1,1}, \qquad I_L = G_L. \tag{3}$$

Fig.3 shows a 5-level Laplacian pyramid of the "Lena" image generated by Eqn.(3).
Figure 4: VVQ Image Coding System
Figure 3: 5-Level Laplacian Pyramid of "Lena"
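The expansion step (2) and the residual of Eqn. (3) can be sketched as below. The same assumed a = 0.4 kernel is used as for REDUCE, and only terms where (i-m)/2 and (j-p)/2 are integers contribute:

```python
import numpy as np

w1 = np.array([0.05, 0.25, 0.4, 0.25, 0.05])   # assumed 5-tap kernel (a = 0.4)
W = np.outer(w1, w1)

def expand(G):
    """One EXPAND step (Eqn. 2): upsample by 2, interpolating with w."""
    M, N = G.shape
    Gp = np.pad(G, 2, mode="edge")
    out = np.zeros((2 * M, 2 * N))
    for i in range(2 * M):
        for j in range(2 * N):
            s = 0.0
            for m in range(-2, 3):
                for p in range(-2, 3):
                    # only integer-valued (i-m)/2 and (j-p)/2 contribute
                    if (i - m) % 2 == 0 and (j - p) % 2 == 0:
                        s += W[m + 2, p + 2] * Gp[(i - m) // 2 + 2, (j - p) // 2 + 2]
            out[i, j] = 4.0 * s       # factor 4 restores the mean level
    return out

# Residual of Eqn. (3): I_l = G_l - expand(G_{l+1})
```

Since the surviving kernel taps along each axis sum to 0.5 for both even and odd output positions, the factor 4 makes a constant image expand to the same constant.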
3 Visual Vector Quantization
The design of the VVQ coding system consists of the design of the coding-book used in the coding phase and the design of the decoding-book used in the decoding phase.

Design of Coding-book The residual image I_l of size M_l x N_l is divided into P_l x Q_l blocks of size m_l x n_l, where P_l = M_l/m_l, Q_l = N_l/n_l. The horizontal and vertical derivatives [2] of each block are calculated as
$$D_h(p,q) = \frac{1}{16} \sum_{i=0}^{3} \sum_{j=0}^{3} I_l(4p+i,\, 4q+j)\, g_h(i,j), \qquad 0 \le p < P_l,\; 0 \le q < Q_l \tag{4}$$

$$D_v(p,q) = \frac{1}{16} \sum_{i=0}^{3} \sum_{j=0}^{3} I_l(4p+i,\, 4q+j)\, g_v(i,j), \qquad 0 \le p < P_l,\; 0 \le q < Q_l \tag{5}$$
where the values of the kernels g_h and g_v can be written collectively in matrix form as

$$G_h = \begin{bmatrix} 1 & 1 & -1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & 1 & -1 & -1 \\ 1 & 1 & -1 & -1 \end{bmatrix}, \qquad G_v = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 \\ -1 & -1 & -1 & -1 \\ -1 & -1 & -1 & -1 \end{bmatrix}$$
The horizontal and vertical derivatives of a block are used to form a "visual vector", (D_h, D_v), to represent the block. The visual vectors of all blocks of residual image I_l are partitioned into N_c clusters using competitive learning [3], and these cluster centers comprise the coding-book for residual image I_l.
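A sketch of Eqns. (4)-(5) for 4x4 blocks follows. The step-edge kernel matrices and the 1/16 normalisation are assumptions where the source is garbled, so treat the exact values as illustrative:

```python
import numpy as np

# Assumed edge kernels: GH responds to left/right contrast, GV to top/bottom.
GH = np.array([[1, 1, -1, -1]] * 4, dtype=float)
GV = GH.T.copy()

def visual_vectors(I, block=4):
    """Per-block (D_h, D_v) visual vectors (Eqns. 4-5), normalised by 16."""
    P, Q = I.shape[0] // block, I.shape[1] // block
    V = np.empty((P, Q, 2))
    for p in range(P):
        for q in range(Q):
            b = I[block * p: block * (p + 1), block * q: block * (q + 1)]
            V[p, q] = (b * GH).sum() / 16.0, (b * GV).sum() / 16.0
    return V
```

A uniform block maps to (0, 0), while a block with a bright left half produces a large D_h and zero D_v, which is the behaviour the clustering relies on.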
Design of Decoding-book A multilayer neural network is trained using backpropagation [4] to reproduce the residual blocks of I_l when the corresponding visual vectors are presented at the input. The decoding-book is obtained from the output of the trained network by feeding the cluster centers to the input. The network has 2 input neurons and m_l x n_l output neurons. The number of hidden-layer neurons is decided by experiment. The whole VVQ image coding system is illustrated in Fig.4.
4 Image Reconstruction
The reconstruction of the original image G_0 from the decoded residual image sequence I_0, I_1, ..., I_L is achieved by reversing the operations of the Laplacian pyramid generation as follows:

$$G_L = I_L, \qquad G_l = G_{l+1,1} + I_l \;\; (l = L-1, \ldots, 1, 0), \qquad \text{with } G_{l+1,1} = E(G_{l+1}), \tag{6}$$

where E(.) is defined as in (2).
5 Experimental Results
Two 512x512 monochrome images, "Lena" and "Peppers", with 8 bits per pixel, were used to evaluate the proposed new scheme. The "Lena" image was used to train the system and the "Peppers" image was used to test it. A 5-level pyramid structure and a 4-level pyramid structure were investigated separately. The highest-level residual image of the pyramid structure was coded directly using 8 bits per pixel. All other lower-level residual images were coded using VVQ with a 4x4 block size for levels 0 and 1 and a 2x2 block size for levels 2 and 3, respectively. The coding-book size was chosen as N_c = 9. A 3-layer neural network with 2 input neurons, 10 hidden neurons and 16 or 4 output neurons was used to generate the decoding-book, depending on whether the 4x4 or 2x2 block size was used. In the test, it was found that a large number of blocks fell into the class with the least significant edge content. Hence a variable bit rate coding strategy was used: one bit was used to code the blocks belonging to this class, and 4 bits for each of the other 8 classes. Fig.5 and Fig.7 are the images reconstructed by the proposed coding scheme using 5-level and 4-level pyramids, respectively. As a comparison, Fig.6 and Fig.8 show the images reconstructed by the traditional VQ technique LBG [5] using codebooks of size 8 and 16, respectively, with a block size of 4x4. The performance and the computational cost of the proposed new scheme and the LBG are summarized in Table 1. Experimental results show that the proposed new scheme has much less blocking effect on the reconstructed image compared with the conventional VQ technique. This results in a smoother reconstructed image, as is evident in Figs.5-8. From Table 1, it can be seen that, at similar bit rate and peak signal-to-noise ratio (PSNR), the computational requirements of the new scheme are only a small fraction of those of the LBG scheme.
6 Conclusions
A new image coding scheme has been proposed based on a combined LPS and VVQ approach. Since the coding is performed on the basis of hierarchical multiresolution blocks, the proposed scheme has much less blocking effect on the reconstructed image compared with other block-based techniques. The computational cost of the proposed scheme is also much lower than that of traditional VQ techniques, because the new scheme uses much smaller dimensional visual vectors to represent image blocks.
References [1] P. J. Burt and E. H. Adelson, " The Laplacian pyramid as a compact image code," IEEE Trans. Comm., Vol. COM-31, No. 4, pp532-540, April 1983. [2] G. Qiu, M. R. Varley and T. J. Terrell, "Image coding based on visual vector quantization," Image Processing and Its Applications, IEE Conference Publication No.410, pp301-305, July 1995 [3] T. Kohonen, Self-Organization and Associative Memory. Second Edition, Berlin, Springer-Verlag, 1988.
[4] D. E. Rumelhart, G. E. Hinton and R. J. Williams, "Learning internal representations by error propagation," Chapter 8 in Parallel Distributed Processing, Vol. 1, Cambridge, MA, MIT Press, 1986. [5] Y. Linde, A. Buzo and R. Gray, "An algorithm for vector quantizer design," IEEE Trans. Comm., Vol. COM-28, No. 1, pp 84-95, January 1980.

Table 1: Comparison of Proposed Scheme and LBG Coding Scheme

Scheme    | Parameters       | bitrate (bpp) | PSNR (dB) | Additions  | Multiplications
Proposed  | 5-L Pyramid      | 0.18          | 24.71     | 153,600    | 25,600
LBG       | codebook size 8  | 0.19          | 25.60     | 8,388,608  | 8,388,608
Proposed  | 4-L Pyramid      | 0.27          | 27.49     | 137,216    | 24,576
LBG       | codebook size 16 | 0.25          | 26.63     | 16,777,216 | 16,777,216
Figure 5: New scheme reconstruction (5-L pyramid). Figure 6: LBG reconstruction (codebook size 8).
Figure 7: New scheme reconstruction (4-L pyramid). Figure 8: LBG reconstruction (codebook size 16).
Kohonen's Self Organizing Feature Maps with variable learning rate. Application to image compression

A. Czihó*#, B. Solaiman*, G. Cazuguel*, C. Roux* and I. Loványi#
*Ecole Nationale Supérieure des Télécommunications de Bretagne, FRANCE, Dépt. Image et Traitement de l'Information (I.T.I.), B.P. 832, 29285 Brest Cedex
#Technical University of Budapest, HUNGARY, Department of Process Control, Budapest XI, Muegyetem rkp. 9, 1521

I. Introduction The most important task encountered in digital image transmission systems is image compression, which aims at reducing the amount of information to be transmitted. The overall goal is to represent an image with the smallest possible number of bits and thereby to speed up transmission and minimize storage requirements. In the past decade, a promising compression approach using vector quantization (VQ) [1-2] has received great attention. In this approach, images to be encoded are first divided into small n x n blocks. Each block is considered as an N-dimensional vector (with N = n^2) in the ℜ^N vector space. VQ is a mapping from ℜ^N onto a finite subset Ω of ℜ^N, where Ω = {w_1, ..., w_i, ..., w_M} is a set of prototype vectors. The set Ω is generally called the VQ codebook, and each vector w_i in Ω is called a codeword or codevector. For each source block x, a codeword w_i in Ω is selected as the representation of x, and only the address (or index) of this codeword in the codebook has to be transmitted, instead of the whole source block x in ℜ^N. The effectiveness of VQ is mainly determined by the set of codevectors. Therefore, the codebook design is a key question in this approach. It is generally performed by applying a learning method to a training set. This data base is formed by a number of blocks issued from images that are supposed to be representative of the images to be encoded.
Recently, the use of neural networks for codebook design has been investigated [3]. Kohonen's Self Organizing Feature Map (SOFM) [4] is one of the most promising neural networks for this type of application. This is mainly due to its ability to form ordered topological feature maps in a self-organizing fashion. In this paper, a codebook design approach based on a Kohonen learning algorithm with a variable learning rate is proposed. It will be referred to as the Distance-Dependent Learning Rate (DDLR) model. Simulation results show that the proposed approach is extremely promising.
II. Kohonen network and codebook design The Self Organizing Feature Map (SOFM) introduced by T. Kohonen [4] is one of the most successful vector quantization neural models for forming clusters in the input space using an unsupervised learning approach. This model builds up a mapping from the N-dimensional input vector space of real numbers ℜ^N to a two-dimensional array "S" of cells. Each cell is associated with a synaptic weight vector connecting the cell to the input vector (i.e. the codeword) in ℜ^N. The purpose of the self-organization process is to find the weight vectors such that the resulting mapping (the correspondence between an input vector x and the cell which lies nearest to it in ℜ^N) is a topology-preserving mapping (adjacent vectors in ℜ^N are mapped onto adjacent, or identical, cells in the array "S"). The basic idea behind this model is to move the weights toward the centroids of the learning set by updating the weights on each input value. The learning algorithm selects the best matching cell according to the minimum Euclidean distance between its weight vector w_k and the input vector x. This cell is referred to as the winning cell. All the weight vectors in the neighborhood of the winning cell are adjusted in order to make them more responsive to the current input. The neighborhood decreases in size with time. The updating rule is:

$$w_i(t) = w_i(t-1) + \beta(t)\,\big(x - w_i(t-1)\big), \qquad \forall i \in N_k(t) \tag{1}$$
where w_i is the weight vector of the i-th neuron, β is the learning rate that decreases with time, x is the input pattern, N_k is the neighborhood of the winning neuron and t is the time (the number of accomplished iterations).
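One update of rule (1) can be sketched as follows. The square neighbourhood of a given radius on the map grid is one common choice; the neighbourhood shape and its shrinking schedule are design decisions not fixed by the text:

```python
import numpy as np

def sofm_step(W, grid, x, beta, radius):
    """Apply Eqn. (1): move the winner and its map neighbours toward x.

    W: (M, N) weight vectors; grid: (M, 2) cell positions on the 2-D map.
    """
    k = int(((W - x) ** 2).sum(axis=1).argmin())         # winning cell
    nbhd = np.abs(grid - grid[k]).max(axis=1) <= radius  # N_k(t) on the map
    W[nbhd] += beta * (x - W[nbhd])
    return k
```

Calling this repeatedly while shrinking `radius` and decaying `beta` reproduces the usual SOFM training schedule.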
This model has shown promising results in image compression. In this case the training set consists of sub-image blocks issued from some learning images, and at the end of the learning process the synaptic weights of each cell represent a reproduction block (codeword). The topology-preserving property makes it possible to exploit the map in the so-called finite-state VQ scheme. This feature allows fast codeword searching as well. The purpose of the SOFM algorithm described above is to find position vectors that approximate the probability density function (p.d.f.) of the training set while preserving a topology. However, some blocks in a typical training set are much more frequent than others. For example, homogeneous blocks typically occupy a large portion of the image area, while blocks with large variation, such as edges, are few. Consequently, in a training set created from such images, the p.d.f. in the area of homogeneous blocks is much more important than around blocks representing edges. This produces a quite unbalanced codebook: the SOFM creates too many homogeneous codewords and not enough edge blocks. However, edges, and other parts of images occupying a relatively small portion of the image area, are visually very important. This suggests that the SOFM learning algorithm in VQ codebook generation is not visually optimal. We propose to introduce a modification to the updating rule (1) by varying the learning rate according to the distance between the input pattern and the winning cell. Rather than approximating the p.d.f. in the input space, we try to avoid creating many similar blocks while finding codewords that are visually very important, even if they are ill-represented in the training set. This idea is applied by inserting a parameter α in the updating rule as follows:

$$w_i(t) = w_i(t-1) + \alpha(x, w_k)\,\beta(t)\,\big[x - w_i(t-1)\big], \qquad \forall i \in N_k(t) \tag{2}$$

This parameter must be low if the current pattern of the training set is near the winning neuron. If the distance is large, α becomes higher. One possibility is to set α to 0 or 1 depending on a distance threshold. However, the following choice seems more adequate:

$$\alpha(x, w_k) = f\!\left( d(x, w_k) \,/\, d_{max}^{t-1} \right) \tag{3}$$

where d(.) is the Euclidean distance, d_max^{t-1} is the maximum value of the Euclidean distances between each input sample and the corresponding winning neuron, and f(.) is the function:
$$f(y) = \begin{cases} y & \text{if } y \le 1 \\ 1 & \text{if } y > 1 \end{cases}$$
With this definition the parameter α always has a value between 0 and 1. The closer the presented vector is to the winning cell, the smaller the modification applied to the weights of the winning cell and its neighborhood. However, if the training vector is far from the winning cell, a greater modification is applied. In this way, the weight updating rule is guided so as to better represent the whole training set. Including d_max^{t-1} in the parameter definition provides a simple normalization and allows the DDLR rule to adapt to the training set.
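A sketch of the DDLR rule (2)-(3); here d_max stands for the previous iteration's maximum winner distance and is supplied by the caller, and the square map neighbourhood is an assumption carried over from the classical rule:

```python
import numpy as np

def ddlr_step(W, grid, x, beta, radius, d_max):
    """Apply Eqns. (2)-(3): scale the update by alpha = f(d(x, w_k) / d_max)."""
    d = np.linalg.norm(W - x, axis=1)
    k = int(d.argmin())                      # winning cell
    alpha = min(d[k] / d_max, 1.0)           # Eqn. (3): f clips the ratio at 1
    nbhd = np.abs(grid - grid[k]).max(axis=1) <= radius
    W[nbhd] += alpha * beta * (x - W[nbhd])  # Eqn. (2)
    return k, alpha
```

Patterns already well represented (small winner distance) barely move the map, while rare, distant patterns such as edge blocks receive the full learning rate.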
III. Simulation results In order to justify the proposed DDLR model, two tests are proposed. First, the approach is tested using an artificial training set of two-dimensional vectors given in Fig.1a, where 12 Gaussian clusters containing 200 points and 4 clusters containing only 20 points are created. Figs.1b and 1c show the synaptic weights of a 4x4 SOFM using the classical and the DDLR models, respectively. It is clearly shown that in the second case the map has also found the four poorly represented clusters.
Fig. 1. Simulation results with the artificial training set
The second simulation was done using medical images. Training vectors were issued from four ultrasonographic endoscopy images. The image of this type presented in Fig.2 was subsequently compressed. The VQ was done using both the classical and DDLR approaches. Since we used a 4x4 block size and 16x16 maps, the codebooks contain 256 16-dimensional vectors. This corresponds to a bit rate of 0.5 bit/pixel, without applying any entropy code. A comparison of the two constructed SOFMs with quantitative data is given in Table I. The resulting codebooks and the reconstructed images are presented in Fig.3 and Fig.4, respectively.
Fig. 2. Original image
The effect of our proposed modification is clearly shown in Fig.3. The classical learning rule provides unnecessarily many completely dark and homogeneous codewords, which is due to the large proportion of this type of block in the training set. The DDLR model avoids this problem and permits obtaining light blocks as well as large-variation codewords.
Fig. 3. Generated codebook provided by the classical as well as the DDLR rules
          Classical model   DDLR model
Max wins  1241              2838
Min ED    0.38              4.48
Max ED    164.45            124.10

Table I. Comparison of the two SOFMs
The way the created SOFMs represent the training set is also illustrated in Table I through some numeric data. These data are: the number of wins of the most active cell (i.e. the cardinality of the cluster containing the most of the 13464 training vectors, Max wins), and the minimum and maximum Euclidean distances between each training block and its winning codeword (Min ED and Max ED). These results show that in the DDLR case the most frequent codeword represents a cluster that contains many more training blocks. In fact, this codeword is the completely black one. Representing many similar black blocks with one codeword permits the creation of small but visually important clusters. This also means that the worst case (maximum distortion) is better represented, at the cost of degrading the quality of reproduction in the better cases. The compressed images are shown in Fig.4 and a comparison in terms of objective measures is given in Table II. Even though our aim was to improve visual compression quality, an objective distortion criterion such as peak signal-to-noise ratio (PSNR) is also moderately increased. Its definition is:
$$PSNR = 10 \log_{10} \frac{255^2}{\frac{1}{T}\sum_{i=1}^{T} (x_i - x'_i)^2} \; \mathrm{dB}$$

where 255 is the peak signal value, T denotes the total number of pixels, and x_i and x'_i denote the original and reproduction pixels, respectively. The subjective image quality is enhanced when applying the DDLR method (see Fig.4): the blocking artifact is reduced (see the smallest circle in the middle or the contour of the oesophagus wall) and the details are less masked in the visually important areas. However, it is also visible that the quality of the second and fourth circles is slightly degraded. This is because these parts are quite dark and therefore these blocks are clustered with other dark blocks. However, these circles do not belong to the oesophagus and therefore this degradation is an acceptable cost considering the quality improvement in other important areas of the image. Another interesting effect of our model is that the block entropy of the compressed images is decreased (see Table II). The entropy is defined as

$$E = -\sum_{i=1}^{256} p_i \log p_i$$

where p_i is the occurrence probability of the i-th block. The smaller entropy indicates that, by applying an entropy coding method (e.g. Huffman coding) on the block indexes, a greater compression ratio could be obtained while using the DDLR approach.

          Classical model   DDLR model
PSNR      27.78             27.91
Entropy   7.39              6.25

Table II. Image compression quantitative results
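The block entropy can be computed directly from the stream of transmitted block indexes. This sketch uses log base 2 (bits per index), an assumption consistent with the Huffman-coding context:

```python
import numpy as np

def block_entropy(indices, n_codewords=256):
    """First-order entropy of the block indexes, in bits per index."""
    counts = np.bincount(np.asarray(indices), minlength=n_codewords)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())
```

A uniform distribution over 256 codewords gives the maximum of 8 bits per index; the more unbalanced the index usage, the lower the entropy and the more an entropy coder can save.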
Fig. 4. Reconstructed images using the classical (a) and the DDLR (b) codebooks
IV. Conclusions In this paper we proposed a modification of Kohonen's learning algorithm in order to improve the visual quality of VQ codebooks. Using the Distance-Dependent Learning Rate, the codebook becomes more balanced and contains a larger variety of codewords. This was shown through an artificial as well as a real training set. The main effect of the proposed approach is the improvement of the visual quality of compressed images while decreasing the entropy, which was illustrated by the application to medical images.
[1] N.M. Nasrabadi and R.A. King, "Image coding using vector quantization: a review", IEEE Trans. Com., Vol. 36, pp. 957-971, Aug. 1988.
[2] R.M. Gray, "Vector Quantization", IEEE ASSP Mag., pp. 4-29, Apr. 1984.
[3] R.D. Dony and S. Haykin, "Neural network approaches to image compression", Proceedings of the IEEE, Vol. 83, No. 2, February 1995.
[4] T. Kohonen, Self-Organization and Associative Memory, 3rd ed., Springer-Verlag, 1989.
An Efficient Training Algorithm Design For General Competitive Learning Neural Networks

J. Jiang, Department of Computer Studies, Loughborough University, United Kingdom
D. Butler, School of Engineering, Bolton Institute, United Kingdom

Abstract:
This paper presents an efficient algorithm design for all competitive learning based image compression neural networks. Like conventional vector quantization algorithms, this type of LVQ neural network computes a Euclidean distance to select a winner each time a block of image pixels is processed. The proposed algorithm introduces a simplified distance definition as a pre-test to exclude most of the neurons that are unlikely to be the winner throughout the training cycle in the competitive learning process. This provides a significant efficiency improvement over the standard algorithm.
Keywords: Neural networks, image compression and algorithm design.
1. Introduction Vector quantization has been used as a major technique in image coding and compression for many years. It attracts numerous publications and research interest each year, and one area of research is the design of fast algorithms to improve its computing efficiency and running speed in constructing the optimised code-book. A straightforward implementation of a VQ algorithm uses the full-search method, which involves an exhaustive search of the distances to all available centroids. In this way, the complexity of processing each input vector grows exponentially with its dimension N and the overall bit rate. Hence, numerous ideas and methods have been developed to address the issue under the principle of the so-called nearest neighbour search [1-4]. In summary, all previous work can be classified into two basic groups. One is to seek a sub-optimal solution which is almost as good as the full search in terms of mean square error (MSE), instead of solving the nearest neighbour search problem itself. The other is to use a tree-structured code-book search to divide the full search into a number of stages, where each stage excludes a substantial subset of the candidate vectors with a relatively small number of operations. Based on the general competitive learning algorithm, a family of LVQ neural networks has been developed for direct image vector quantization to achieve data compression [5-7]. The basic idea of the neural network image compression system is illustrated in Fig. 1. Input images are split into blocks of 4x4, 8x8 or 16x16 pixels which constitute the input vectors for the neural networks. A number of sample images are often used to train the network and obtain the best possible code-book, represented by the neuron coupling weights {w_ij: i = 1, 2, ..., N; j = 1, 2, ..., M}, where N is the dimension of each codeword and M is the code-book size. The code-book can also be described by M codewords of N-dimensional vectors {W_j: j = 1, 2, ..., M}.
No matter how differently each individual neural network is developed, the basic training algorithm can be summarised as follows:

Step 1 Initialisation of Neuron Weights: The M neuron coupling weights are initialised as the starting code-book: W_j(0); j = 1, 2, ..., M.

Step 2 Competition: Compute the distances D_ij = d(X_i, W_j(t)), j = 1, 2, ..., M. The winning neuron k is selected with D_ik = min_j D_ij.

Step 3 Learning and Updating: W_k(t+1) = W_k(t) + α(t) I_k(t) [X_i - W_k(t)], where α(t) is the learning rate at iteration t, and I_k(t) is a scaling function specifying the sign and magnitude of the difference vector being updated for the winning neuron k.

Step 4 Termination: Repeat steps 2-3 until the terminating criterion is met.

Figure 1: Image compression neural network.
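Steps 1-4 can be sketched as a minimal competitive learning loop. Here I_k(t) = 1, a linearly decaying learning rate, and a fixed epoch count as the terminating criterion are all assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def train_clvq(X, M, epochs=20, lr0=0.5):
    """Basic competitive learning: Steps 1-4 of the summarised algorithm."""
    W = X[rng.choice(len(X), M, replace=False)].astype(float)  # Step 1: init
    T, t = epochs * len(X), 0
    for _ in range(epochs):
        for x in X:
            d = ((W - x) ** 2).sum(axis=1)           # Step 2: distances D_ij
            k = int(d.argmin())                      # winning neuron k
            W[k] += lr0 * (1 - t / T) * (x - W[k])   # Step 3: update winner
            t += 1
    return W                                         # Step 4: fixed epoch count
```

On data drawn from well-separated clusters, the returned weights settle near the cluster centres, i.e. the code-book of the quantizer.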
Without an efficient algorithm, the training often takes a long computing time. As a fundamental problem for all VQ algorithms, whether conventional VQ or LVQ neural networks, the Euclidean distance calculation or its modified version can always be identified as the major bottleneck:

$$d(X_i, W_j) = \sum_{n=1}^{N} (x_{ni} - w_{nj})^2 \tag{1}$$
According to the standard definition of the Euclidean distance given in (1), its computation requires N multiplications and (2N-1) additions/subtractions. Hence, for each input vector, the distance calculation in training the neural network takes M x N multiplications and M x (2N-1) additions/subtractions. For one image sample of size 256 x 256 with N = 4 x 4, the number of input vectors will be 4096. Considering that the training of neural networks often takes a group of T images, the total computation cost will require at minimum 4096 x T x M x N multiplications and 4096 x T x M x (2N-1) additions, without considering repeated training cycles. It is clear that efficiency in neural network training is an important issue. Although various fast searching algorithms [1-4] have been developed for conventional vector quantization as discussed above, direct introduction of those techniques often makes the whole training algorithm more sophisticated, as all the neurons have to be organised into tree structures. In this paper, we propose a substantially simplified distance calculation as the solution to the problem, without involving any tree structures or reorganising the neurons. The rest of the paper is arranged as follows: section 2 describes the new algorithm design, and section 3 reports the experimental results and gives conclusions.
2. Algorithm Design

To improve the efficiency of neural network training, we expand equation (1) into:

d(X_i, W_j) = Σ_{n=1}^{N} x_ni² + Σ_{n=1}^{N} w_nj² - 2 Σ_{n=1}^{N} x_ni w_nj = ||X_i||² + ||W_j||² - 2 Σ_{n=1}^{N} x_ni w_nj    (2)
where X_i = [x_1i, x_2i, ..., x_Ni] and W_j = [w_1j, w_2j, ..., w_Nj] are N-dimensional vectors corresponding to the input image block and the neuron coupling weights. The first term in the above equation, ||X_i||², is derived from the input vector. It is a constant and has nothing to do with the competitive learning operation. Hence the equation can be further arranged as:
d(X_i, W_j) - ||X_i||² = ||W_j||² - 2 Σ_{n=1}^{N} x_ni w_nj    (3)
To select a winning neuron, we only need to calculate equation (3) rather than (1). Further analysis shows that the term ||W_j||² can be pre-calculated and stored in the network prior to the competitive learning search, since it is only related to the neurons. The last term in equation (3), Σ_{n=1}^{N} x_ni w_nj, relates to both the input vector and the neuron weights. It plays an important role in selecting the winner at each iteration. However, when all the vector elements x_ni and w_nj are positive, we have:

Σ_{n=1}^{N} x_ni w_nj ≤ X_max Σ_{n=1}^{N} w_nj    (4)
where X_max corresponds to the maximum element in the input vector X_i. This can also be pre-selected before the training starts. Therefore, if we define:

D(X_i, W_j) = d(X_i, W_j) - ||X_i||² = ||W_j||² - 2 Σ_{n=1}^{N} x_ni w_nj    (5)

as the modified distance between X_i and W_j, then equation (4) can be used to define an approximate distance D̄(X_i, W_j) as follows:

D̄(X_i, W_j) = ||W_j||² - 2 X_max Σ_{n=1}^{N} w_nj ≤ D(X_i, W_j)    (6)
Since both terms, ||W_j||² and Σ_{n=1}^{N} w_nj, can be pre-calculated and stored in the network, equation (6) can be further simplified as:

D̄(X_i, W_j) = A_j - X_max B_j    (7)

where A_j = ||W_j||² and B_j = 2 Σ_{n=1}^{N} w_nj. These terms only relate to the neurons in the network, and hence are constants as far as the competitive learning is concerned. After all these modifications, an efficient algorithm can be designed to complete step 2 of the general competitive learning procedure given in the first section:
Step 2:
    min_distance = A_1 - X_max B_1;  winner = 1;
    for (j = 2; j <= M; j++) {
        distance = A_j - X_max B_j;
        if (distance < min_distance) {
            distance = A_j - 2 Σ_{n=1}^{N} x_ni w_nj;
            if (distance < min_distance) {
                winner = j;
                min_distance = distance;
            }
        }
    }
For most of the neurons, only the approximate distance D̄(X_i, W_j) is calculated, which requires 1 multiplication and 1 subtraction. The real distance D(X_i, W_j) is not needed until the condition D̄(X_i, W_j) < min_distance is satisfied. Under this circumstance, the calculation takes N+1 multiplications and N additions/subtractions. In the real distance calculation, further efficiency can be achieved when the partial distortion technique is incorporated [8]. In step 3, learning and updating occur to W_j as well as A_j and B_j. Since this updating only concerns one winning neuron, the overall training algorithm is more efficient than the standard one. Finally, to make equation (3) correct, all the elements in X and W are required to be positive. The simplest solution is to add a positive offset to the two vectors [3]. Hence the new vectors become:
x'_ni = x_ni + |p|,    w'_nj = w_nj + |p|,    n = 1, 2, ..., N

It is not difficult to prove that this modification will not affect the calculation of distances between X and W, namely:

d(X'_i, W'_j) = Σ_{n=1}^{N} (x'_ni - w'_nj)² = d(X_i, W_j)    (8)
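As an illustration, the training-step search described above can be sketched in NumPy; the function and variable names are ours, not the paper's, and here the first neuron's exact modified distance seeds the search, so the result provably matches an exhaustive Euclidean search whenever all elements are positive:

```python
import numpy as np

def fast_winner(x, W, A, B):
    """Winner search using the approximate distance D_bar_j = A_j - x_max * B_j
    (Eq. (7)) to reject most neurons before the full dot product is computed.
    Requires x >= 0 and W >= 0, with A_j = ||W_j||^2 and B_j = 2 * sum_n w_nj."""
    x_max = x.max()
    winner = 0
    min_dist = A[0] - 2.0 * np.dot(x, W[0])   # exact modified distance for neuron 0
    for j in range(1, W.shape[0]):
        approx = A[j] - x_max * B[j]          # 1 multiplication, 1 subtraction
        if approx < min_dist:                 # only now pay for the full product
            real = A[j] - 2.0 * np.dot(x, W[j])
            if real < min_dist:
                winner, min_dist = j, real
    return winner

rng = np.random.default_rng(0)
M, N = 128, 16
W = rng.random((M, N))                        # neuron weights, all positive
A = (W ** 2).sum(axis=1)                      # A_j = ||W_j||^2, precomputed
B = 2.0 * W.sum(axis=1)                       # B_j, updated only for the winner
x = rng.random(N)                             # input block, all positive

# agrees with the exhaustive Euclidean search
assert fast_winner(x, W, A, B) == int(np.argmin(((W - x) ** 2).sum(axis=1)))
```

Because D̄_j ≤ D_j, a neuron whose approximate distance already exceeds the current minimum can never be the winner, so skipping it is safe.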
3. Experimental Results and Conclusions

To assess the efficiency improvement, we tested the proposed algorithm by training the general competitive learning neural network with Lena (256 x 256), as shown in Fig. 2. When M = 128, the total numbers of multiplications and additions/subtractions obtained are given in Table I. The experimental results are compared with the straightforward implementation using the full Euclidean distance calculation, given as the standard algorithm in Table I. The efficiency improvement is expressed as the percentage obtained by dividing the figures for the proposed algorithm by those for the standard algorithm. It is clear that the advantage of the proposed algorithm over the full Euclidean distance one is significant.
Table I. Experimental results

    Computing cost    Proposed algorithm    Standard algorithm    Efficiency improvement
    multiplication    3119513               8388608               37%
    addition          5714739               16252928              35%
In this paper, we have designed an efficient algorithm for training competitive learning neural networks, with image compression examples. The efficiency improvement is significant in comparison with the straightforward full-search implementation, without affecting the optimisation of the final code-book. The basic idea introduced is to define an approximate version of the Euclidean distance and use this approximation to exclude most of the neurons from being the possible winner in the training process. The approximate Euclidean distance only requires one multiplication and one subtraction, which is substantially simplified in comparison with both the standard definition and other simplified versions [4].
Figure 2. Image sample

References
[1] Linde Y., Buzo A. and Gray R., 'An algorithm for vector quantizer design', IEEE Trans. on Communications, Vol. COM-28, No. 1, pp. 84-95, 1980.
[2] Lee C.H. and Chen L.H., 'Fast closest codeword search algorithm for vector quantization', IEE Proceedings - Vision, Image and Signal Processing, Vol. 141, No. 3, pp. 143-148, June 1994.
[3] Katsavounidis I., Kuo C.C.J. and Zhang Z., 'Fast tree-structured nearest neighbor encoding for vector quantization', IEEE Trans. on Image Processing, Vol. 5, No. 2, pp. 398-404, February 1996.
[4] Torres L. and Huguet J., 'An improvement on codebook search for vector quantization', IEEE Trans. on Communications, Vol. 42, No. 2/3/4, pp. 208-210, 1994.
[5] Ahalt S.C., Krishnamurthy A.K. et al., 'Competitive learning algorithms for vector quantization', Neural Networks, Vol. 3, pp. 277-290, 1990.
[6] Chung F.L. and Lee T., 'Fuzzy competitive learning', Neural Networks, Vol. 7, No. 3, pp. 539-551, 1994.
[7] Fang W.C., Sheu B.J. et al., 'A VLSI neural processor for image data compression using self-organization networks', IEEE Trans. on Neural Networks, Vol. 3, No. 3, pp. 506-519, 1992.
[8] Bei C.D. and Gray R.M., 'An improvement of the minimum distortion encoding algorithm for vector quantization', IEEE Trans. on Communications, Vol. COM-33, October 1985.
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
Architecture Design for Polynomial Approximation Coding of Image Compression

Chung-Yen Lu and Kuei-Ann Wen

Abstract

Polynomial approximation coding (PAC) is well-known for image compression. A 2-D natural polynomial is chosen to approximate the shape of block images, and the polynomial coefficients are computed by a regressive method. A fast method for calculating the regressive coefficients for blocks of size 8x8 is derived, so that the encoder design for the PAC algorithm can achieve not only small-area but also high-speed advantages. The architecture of PAC is presented in this paper together with its performance analysis. It will be shown that PAC is well suited for very low bit rate transmission.
I. Introduction

Polynomial approximation coding (PAC) is derived from the polynomial regression technique. A set of polynomial coefficients is obtained by regression from a set of image data, and the coefficients are then used to represent that set of image data. In comparison, the bits for the coefficients are fewer than the bits for the original image. Although polynomial regression (PR) [1]-[2] is a well-known technique, it is not widely applied in image compression. This is because the high frequency components of block images are abandoned automatically by the PR method, so the coding quality of PR is worse than that of transform-based algorithms, e.g. JPEG [3]. However, for low bit-rate compression applications, most of the high frequency components are quantized to zero in transform-based algorithms as well, and the performance degradation of transform-based algorithms brings them close to PR. Since a simplified computation method for PR is derived here, the architecture of PR is much simpler than most fast DCT algorithms. Hence, a high-speed and low-cost encoding process can be provided by PAC.
II. PAC system

There are three processes in the PAC encoding system: the regression coefficients estimator (RCE), quantization and variable-length coding. An overall system model of PAC is illustrated in Fig. 1.

Fig. 1 The overall system of PAC

At the input to the encoder, source image samples are grouped into square blocks of size 8x8 and fed into the regression coefficient estimator (RCE). At the end of the decoder, the regression surface generator (RSG) outputs 8x8 sample blocks to form the reconstructed image. Each 8x8 block of source image samples is effectively a 64-point discrete signal, which is approximated by a function of two spatial dimensions x and y. The RCE takes such a signal as its input and obtains the parameters of the surface by the regression method. The output of the RCE is a set of regression coefficients whose values are uniquely determined by the particular 64-point input signal.
A 2-D polynomial regression equation in PAC is expressed as:

p(x,y) = β0 + β1 x + β2 y + β3 x² + β4 y² + β5 xy + β6 x²y + β7 xy² + β8 x²y²    (1)

The regressive model is expressed in matrix form as the following equation:

F = XBY' + e    (2)

where

F = [ f(0,0) f(0,1) ... f(0,7) ; f(1,0) f(1,1) ... f(1,7) ; ... ; f(7,0) f(7,1) ... f(7,7) ] is the image data of size 8x8,

B is the 3x3 matrix of the polynomial coefficients β0, ..., β8, arranged so that p(x,y) = [1 x x²] B [1 y y²]',

X = [ 1 x0 x0² ; 1 x1 x1² ; ... ; 1 x7 x7² ] and Y = [ 1 y0 y0² ; 1 y1 y1² ; ... ; 1 y7 y7² ] are the position matrices, and

e = [ e00 e01 ... e07 ; e10 e11 ... e17 ; ... ; e70 e71 ... e77 ] is the matrix of error terms.

The least squares normal equations for the linear regression model are

X'FY = (X'X) B (Y'Y)    (3)

and the least squares estimate of the polynomial coefficients is given by

B = [(X'X)⁻¹ X'] F [Y (Y'Y)⁻¹]    (4)
Let x_i = y_i = i - 3.5, i = 0, 1, ..., 7; then we observe X = Y. Letting G = (X'X)⁻¹ X', we obtain the simple form from Eq. (4):

B = G F G'    (5)

where G is the generator matrix which is used to compute the 1-D polynomial coefficients. From Eq. (5), the 2-D polynomial coefficients can be computed by the row-column decomposition method. The block diagram is illustrated in Fig. 2: the block image data pass through a 1-D regression (row operation), a transpose buffer, and a 1-D regression (column operation), yielding the 2-D polynomial coefficients.

Fig. 2 The row-column architecture of RCE

Each row vector of the generator matrix G is used to compute one of the 1-D polynomial coefficients; the rows are equivalent to products of scaling factors S_i and kernel vectors W_i, i = 0, 1, 2:
G = [ S0 W0 ; S1 W1 ; S2 W2 ]    (6)

where

S0 = 1/32,   W0 = [-3  3  7  9  9  7  3 -3]
S1 = 1/84,   W1 = [-7 -5 -3 -1  1  3  5  7]
S2 = 1/168,  W2 = [ 7  1 -3 -5 -5 -3  1  7]
Since the scaling operations can be merged with the quantization process, only the computation of the W kernels needs to be implemented. The weighting coefficients of W are easily implemented with addition and shift operations. The architecture of the row computation for the RCE is illustrated in Fig. 3.
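As an illustrative sketch (not the paper's hardware implementation), the generator matrix G and the row-column computation B = GFG' can be reproduced in NumPy; the check below verifies that a surface lying in the degree-2 model's span is recovered exactly, and that the linear-term row of G reduces to an integer shift/add kernel:

```python
import numpy as np

# position vectors: x_i = y_i = i - 3.5, i = 0..7, so X = Y
xs = np.arange(8) - 3.5
X = np.stack([np.ones(8), xs, xs ** 2], axis=1)   # 8x3 design matrix [1, x, x^2]
G = np.linalg.inv(X.T @ X) @ X.T                  # 3x8 generator matrix

def rce(F):
    """Row-column regression coefficient estimator: B = G F G' (Eq. (5))."""
    return G @ F @ G.T

# sanity check: a surface the model represents exactly is recovered
xg, yg = np.meshgrid(xs, xs, indexing="ij")
F = 3 + 0.5 * xg - 1.25 * yg ** 2 + 0.1 * xg ** 2 * yg ** 2
B = rce(F)
F_hat = X @ B @ X.T                               # regression surface generator
assert np.allclose(F_hat, F)

# the linear-term row of G is an integer kernel scaled by 1/84
assert np.allclose(G[1] * 84, 2 * xs)             # [-7 -5 -3 -1 1 3 5 7] / 84
```

The same least-squares construction yields the constant and quadratic rows as integer kernels scaled by 1/32 and 1/168 respectively, which is what makes a multiplier-free, add-and-shift row computation possible.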
Fig. 3 The pipelined architecture for the row computation of the polynomial coefficients

III. Performance and architecture complexity of PAC
PAC is an approximation technique for image representation. It is not the same as an orthogonal transform, e.g. the DCT, which can preserve high-frequency coefficients. For low bit-rate compression, however, the coding performance is comparable to transform coding. We illustrate the coding performance in terms of the peak signal-to-noise ratio (PSNR) in Table 1, where

PSNR = 10 log10 ( 255² / [ (1/(MN)) Σ_x Σ_y ( f(x,y) - f̂(x,y) )² ] )

and f(x,y), f̂(x,y) are the original image and the coded image respectively.

Table 1. The coding performance of LENA by JPEG and PAC

    Rate (bits/pixel)    0.5 bpp    0.4 bpp    0.25 bpp    0.2 bpp    0.16 bpp
    JPEG images          34.8 dB    33.1 dB    30.4 dB     28.1 dB    26.3 dB
    PAC images           31.2 dB    30.7 dB    29.5 dB     28.2 dB    26.7 dB
It is shown in Table 1 that PAC is a good candidate to replace JPEG in very low bit rate applications. First, when low bit-rate compression is required, the performance of PAC is close to that of JPEG. Second, the simple and fast computation process provides high-speed and low-area implementations. The architecture complexity of PAC is compared with DCT-based algorithms. The key issue is the number of operations in the DCT and the RCE. For a 1-D 8-point DCT, there are at least 12 multiplications and 26 additions, but for the 1-D 8-point RCE there are only 22 additions. The numbers of multiplications and additions in several DCT algorithms and in the RCE are given in Table 2.

Table 2. The comparisons of architecture complexity in DCT and RCE (1-D 8-point)

                          Chen's [4]    Lee's [5]    Chan's [6]    RCE
    No. of multipliers    16            12           12            0
    No. of adders         26            29           29            22
IV. Conclusions

PAC is well suited for low bit-rate image compression, because its performance is close to that of the JPEG standard under high compression ratios. Besides, the complexity of PAC is much lower than that of DCT-based algorithms. A high-speed and small-area encoder can therefore be provided by PAC.

References
[1] M. Eden, M. Unser, and R. Leonardi, "Polynomial representation of pictures," Signal Processing, Vol. 10, No. 4, 1986, pp. 385-393.
[2] M. Kocher and R. Leonardi, "Adaptive region growing technique using polynomial functions for image approximation," Signal Processing, Vol. 11, No. 1, July 1986, pp. 47-60.
[3] G. K. Wallace, "The JPEG still picture compression standard," IEEE Trans. Consumer Electronics, Vol. 38, No. 1, Feb. 1992, pp. xviii-xxxiv.
[4] W. H. Chen, C. H. Smith, and S. C. Fralick, "A fast computational algorithm for the discrete cosine transform," IEEE Trans. Commun., Vol. COM-25, pp. 1004-1009, Nov. 1977.
[5] B. G. Lee, "A new algorithm to compute the discrete cosine transform," IEEE Trans. Acoust., Speech, Signal Process., Vol. ASSP-32, pp. 1243-1245, Dec. 1984.
[6] Y. Chan and W. Siu, "A cyclic correlated structure for the realization of discrete cosine transform," IEEE Trans. Circuits and Systems II: Analog and Digital Signal Processing, Vol. 39, No. 2, pp. 109-113, Feb. 1992.
Application of Shape Recognition to Fractal Based Image Compression S. Morgan, A. Bouridane Department of Computer Science The Queen's University of Belfast Belfast BT7 1NN Northern Ireland
Keywords: fractal, image compression, shape recognition, Sobel operators.

ABSTRACT

This paper describes the application of fractal geometry and shape recognition to still-frame image compression. The technique is based on edge detection using Sobel operators to identify the high frequency components of an image. Using this information, image pieces are classified according to detail, and the Partitioned Iterated Function System (PIFS) is computed by searching the appropriate class. Further exploitation of redundant image data is achieved by identifying shapes within each range block and compressing, using only a subsample of that range, when the block contrast is low. The result is a decrease in encoding time while maintaining image fidelity. The technique has been implemented successfully using a set of test images of various contrast levels.
INTRODUCTION

With the advent of technologies such as multimedia systems and the Internet, there is a practical need for image compression tools. These tools can improve effective bandwidth by providing rapid transmission speeds and reduced storage requirements for digital images. There are many still-frame compression algorithms available; however, methods such as JPEG and wavelets rely on storing only the low frequency components of a signal and eliminate the high frequency components. This results in the loss of sharp edges, causing a blurring effect when applied to high contrast images [1]. Fractal compression algorithms overcome this problem by applying the concepts of Iterated Function Systems (IFS) theory and focusing on the self-similarity in real-world images [2]. Images are viewed as a collage of self-similar parts that can be mapped onto each other using concepts of IFS theory. An IFS consists of a set of N contractive affine transformations, denoted by { a1, a2, ..., aN }, on a subset of points in the plane R² (i.e. an image F). This set of N transformations defines a map A(F) which approximates the original image:

F ≈ A(F) = ∪_{i=1}^{N} a_i(F),  where a_i : R² → R²    (1)

Each transformation a_i is contractive, thus A is also contractive, determining the structure of the unique attractor [4]. The encoding process involves partitioning an image into a collection of range blocks r_i ∈ R and domains d_i ∈ D, with certain restrictions applied [3]. Then, for each range, the IFS code of the domain block that most resembles it is computed. The problem is finding a minimum metric distance between each range and domain in an acceptable time period, known as the "inverse problem". Current fractal compression algorithms consist of a partitioning scheme coupled with some classification method. Though effective, these approaches do not exploit the redundancy within range blocks, which are limited in size by the partition scheme. A technique is required to identify and take advantage of this redundancy by using only a subset of the range block information, thus improving encoding times. In this paper we describe an image block classification routine based on shape recognition, using the cumulative measure of a block's gradient magnitude obtained by applying Sobel operators. This provides a measure of the block contrast along with the degree of redundancy. In the case of low contrast range blocks, a subsample of the pixels can be used for compressing the image block. This improves compression speeds, as the number of computations is reduced.
THE TECHNIQUE IN DETAIL

When determining the PIFS codes to encode an image, the problem is to find, within a reasonable timespan, the range-domain block combination with a minimum or acceptable metric distance d_metric, where

d_metric( F ∩ (r_i ∈ R), a_i(F) ) ≤ L    (2)
The variable L can be a fixed constant, a minimum value relative to the problem domain, or a combination of both. The mapping of domain to range blocks is achieved using contractive affine transformations. The inference that the set of maps { a1, a2, ..., aN } is contractive means there is a single fixed point solution x_w which is independent of the original image, as proved by the Contractive Mapping Fixed-Point Theorem:

x_w = A(x_w) = lim_{n→∞} A^n(x)    (3)

During decompression the image A(F) is successfully reconstructed with the help of the Collage Theorem, as defined in equation (4):

h(S, x_w) ≤ (1 / (1 - s)) h(S, A(S))    (4)
given a metric space (E, h) with contractivity factor s and fixed point x_w, such that S ∈ E and A(S) is the collage of the image [3]. This ensures that the contractive affine mappings for all range blocks, when recursively applied to any arbitrary-valued initial image, will piece together forming A(F) = F. The process of finding A(F) is computationally intensive, and improvements are constantly being sought to address this problem. The essence of the proposed technique is the reduction in size of the set of domain blocks D by the application of an edge or "shape" based classification scheme using Sobel gradient operators and multiple thresholds [5]. The basic algorithm consists of a quadtree partitioning scheme with a maximum block size of 2⁶ and a minimum size of 2² pixels, with a domain block spacing of 2 pixels. The compression process involves convolving each domain block with both horizontal and vertical Sobel operators. The combined gradients ∇f provide a single measure of the strength of the edge and are accumulated over all edges within each block.
∇f = |Gx| + |Gy|    (5)

where

Gx = [ -1 0 1 ; -2 0 2 ; -1 0 1 ]   and   Gy = [ -1 -2 -1 ; 0 0 0 ; 1 2 1 ]
The cumulative gradient values of all blocks of a particular size are then sorted in order of gradient magnitude and divided into N equal classes. For the purpose of this paper, N was chosen to be four. The class thresholds t_i ∈ T are the cumulative gradient values at positions 0.25, 0.5 and 0.75 in the ordered list of gradients. The thresholds are sensitive to the image being compressed and, more importantly, to the other image blocks of the same size. This ensures automated classification relative to the image in question. The computational overhead of sorting the gradients is minimal when using a quicksort routine and is justified later, when the size of the set D being searched is reduced by a factor of N. As each range r_i is being compressed, it is first analysed to determine whether it contains any objects, using edge detection. If there are objects, its cumulative gradient value is checked against the table of thresholds t_i ∈ T, the range r_i is classified, and it is then evaluated against all d_i within the same class and size. Each domain being evaluated undergoes 8 transformations, consisting of 4 rotations, a flip and a further 4 rotations. If a satisfactory match cannot be found, the range is divided into 4 and each quadrant evaluated once more. If, however, there are no objects, then the block is of low contrast and only a subsample of the range block pixels is used for compression. The ratio of the block size to the subsample size is 4:1, providing the reduction is not below the minimum acceptable partition size (i.e. 4x4), in which case the reduction defaults to the minimum size. Next the range is classified and the domain pool searched as described previously. The size of the domains being searched is that of the subsample size and not the original block size. This helps improve compression times for low contrast images. After finding the best range-domain match, the affine map is written to file.
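The classification step described above can be sketched in Python/NumPy (function names and the test data are ours, not the paper's): each block's cumulative |Gx| + |Gy| gradient is accumulated over 3x3 windows, and the sorted values are cut at the quarter, half and three-quarter positions into four classes:

```python
import numpy as np

GX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])
GY = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]])

def cumulative_gradient(block):
    """Sum of |Gx| + |Gy| over all interior 3x3 windows of a block."""
    h, w = block.shape
    total = 0.0
    for r in range(h - 2):
        for c in range(w - 2):
            win = block[r:r + 3, c:c + 3]
            total += abs((win * GX).sum()) + abs((win * GY).sum())
    return total

def classify(blocks, n_classes=4):
    """Class thresholds at the .25/.5/.75 positions of the sorted gradients."""
    grads = np.array([cumulative_gradient(b) for b in blocks])
    ordered = np.sort(grads)
    cuts = [ordered[len(ordered) * k // n_classes] for k in range(1, n_classes)]
    return np.searchsorted(cuts, grads, side="right"), cuts

rng = np.random.default_rng(1)
blocks = [rng.integers(0, 256, (8, 8)).astype(float) for _ in range(32)]
labels, cuts = classify(blocks)
assert set(labels) <= {0, 1, 2, 3}
assert len(cuts) == 3
```

A range block is then compared only against domain blocks carrying the same class label, which is what shrinks the searched pool by roughly a factor of N.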
Each mapping consists of the domain position Xi, brightness Si, contrast Oi and finally the transformation code Ri. The partition structure is coded by writing a single bit to file, where 1 signifies that the block mapping follows and 0 that the block was divided. Both Si and Oi are quantised to 5 bits and 7 bits respectively, as recommended by Fisher [5].
RESULTS

Evaluating the similarity of the original and reconstructed images required a quantifiable measure of success, and this was achieved using objective fidelity criteria which allow data loss to be represented as a function of both images. The metrics used are the RMSE (root mean square error) and the PSNR (peak signal-to-noise ratio). Both are referred to in the evaluation table below. The RMSE is an induced metric of the l2 norm: given two images of size MxN, an original image F(x,y) and a reconstructed image G(x,y), the distance or error e(x,y) between them at any point (x,y) is
e(x,y) = G(x,y) - F(x,y)    (6)

Thus, the RMSE can be calculated as the square root of the squared error averaged over all the pixels (MxN) and is defined as

e_rms = sqrt( (1/(MN)) Σ_{x=0}^{M-1} Σ_{y=0}^{N-1} [G(x,y) - F(x,y)]² )    (7)

The PSNR, in decibel units (dB), gives the ratio of the peak signal and the difference between the two images. It includes the RMS metric and is defined as

PSNR = 20 log10 ( max / e_rms )    (8)
where max is the maximum gray-level value. To determine the effectiveness of this technique, extensive experiments were performed using a number of 256x256 gray-scale PGM image files with varying contrast levels. The experiments involved varying the number of image classes; for illustration purposes only two pictures are shown. Figures 1, 2 and 3 overleaf show the reconstructed Goldhill image using 4 classes, 16 classes and the original image. Figures 4, 5 and 6 show the reconstructed Peppers image using 4 classes, 16 classes and the original image.
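For reference, the RMSE and PSNR metrics of equations (6)-(8) can be written directly (a minimal sketch, assuming 8-bit images with peak value 255):

```python
import numpy as np

def rmse(f, g):
    """Root-mean-square error between two images, Eq. (7)."""
    return np.sqrt(np.mean((g.astype(float) - f.astype(float)) ** 2))

def psnr(f, g, peak=255.0):
    """Peak signal-to-noise ratio in dB, Eq. (8)."""
    return 20.0 * np.log10(peak / rmse(f, g))

f = np.full((4, 4), 100.0)
g = f + 5.0                     # constant error of 5 gray levels
assert rmse(f, g) == 5.0
assert abs(psnr(f, g) - 20.0 * np.log10(255.0 / 5.0)) < 1e-12
```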
Table 1: Evaluation figures
From the evaluation figures it is evident that increasing the number of classes has a negative effect on the compression ratios. The greater the number of classes, the lower the probability of finding a good match, which results in an image block being divided and each quadrant evaluated once more. The RMSE values were higher than expected, and this is due to the thresholds being fixed once they are initially calculated. A more robust method would define a subset of domains to be searched using a standard deviation based on the current range block gradient (e.g. evaluate the 25% of the domain blocks whose cumulative gradient values lie either side of the range gradient). Some sharp edges of each image are lost because of the domain averaging and also the limited number of domains per class. In addition, spurious pixels show up at sharp edges. This is caused by the least squares regression, where the minimum of the combined brightness and contrast variables is found. This is not the true minimum of each variable independently, but rather of both together. Finally, this research was carried out cross-platform using Sun SPARC stations and PCs. For this reason encoding times are not included at this stage but will be disseminated in due course.
CONCLUSION AND FURTHER WORK
This paper describes a novel technique based on shape recognition, which reduces the encoding time of fractal based image compression. A classification scheme using Sobel gradient operators is proposed and the concept of shape
recognition is applied to exploit the redundancy within low contrast range blocks. The technique has been successfully applied to images of varying contrasts. Work is currently under way to extend this technique allowing automatic determination of image block size based on object recognition. Also block classification based on entropy values is being investigated.
REFERENCES

[1] Gonzalez, R.C. and Woods, R.E., "Digital Image Processing", Addison-Wesley, 1992.
[2] Peitgen, H.-O., Jürgens, H. and Saupe, D., "Chaos and Fractals: New Frontiers of Science", Springer-Verlag, 1992.
[3] Fisher, Y., "Fractal Image Compression", Springer-Verlag, 1995.
[4] Barnsley, M.F., "Fractals Everywhere", 2nd Edition, Academic Press Professional, 1993.
[5] Morgan, S., "Fractal based Coding of Still-Frame Video", MSc Thesis, Faculty of Engineering, The Queen's University of Belfast, UK, 1995.
Figure 1: Goldhill 256x256, 4 classes
Figure 2: Goldhill 256x256, 16 classes
Figure 3: Goldhill 256x256, original
Figure 4: Peppers 256x256, 4 classes
Figure 5: Peppers 256x256, 16 classes
Figure 6: Peppers 256x256, original
Chrominance Vector Quantization for Coding of Images and Video at Very Low Bitrates

Maciej Bartkowiak*, Marek Domański* and Peter Gerken°

* Politechnika Poznańska, Instytut Elektroniki i Telekomunikacji, Poznań, Poland
° Universität Hannover, Institut für Theoretische Nachrichtentechnik und Informationsverarbeitung, Hannover, Germany
Abstract
The paper describes vector quantization in the two-dimensional space of the chrominance coordinates U and V. This vector quantization results in one scalar signal which is then processed independently of the luminance in any coder, e.g. a JPEG or H.263 coder. The coder processes only one chrominance component instead of two. Nevertheless, this scalar chrominance signal usually exhibits a broader spectrum than each of the two input chrominance signals U and V. Proper ordering of the codebook decreases the frequency band of this scalar chrominance signal. The paper also describes an efficient and noniterative technique to generate the codebook which is then used to encode the chrominance pairs.
1. INTRODUCTION

Vector quantization [1] is well-known as a powerful technique for image and video data compression. There are many practical possibilities to create vectors in an image. Vector quantization where the vector components are the luminances of some neighboring pixels is often used. Another approach is to use the color coordinates of a pixel as the vector value which is then quantized. Here, we deal with this approach, which in some other contexts has already been considered in [4-7,14]. A color video sequence is in fact a vector signal where each pixel of an individual frame is represented by a three-element vector. This representation is highly redundant, because usually only a small part of all possible combinations is present in an individual frame and even in a whole sequence [8,9]. The commonly used techniques process both chrominance channels almost separately, taking no advantage of their mutual dependencies. Our idea is to use vector quantization of chrominance for image and video coders where very high compression ratios are required. Before input to the source coder (for example a DCT-based still image or video coder [12,13], or an object-based analysis-synthesis video coder [3,11]), an image or a video sequence is preprocessed such that the two chrominance components of the sequence are converted into a scalar signal being a stream of chrominance labels. The technique includes automatic codebook design and, in the case of video coding, codebook update according to changes of the frame content. The assumption for this work is that there is as little interaction as possible with the following coder, so that any type of source coder can be used. Similarly, at the decoder side only some postprocessing is performed for recalculation of the actual color coordinates.

2. CODING STRATEGY

The approach is based on two-dimensional vector quantization with nearest neighbor search using the Euclidean distance in the UV plane.
The vector quantizer transforms the two input chrominance components into a scalar signal being a stream of labels of (U,V) pairs. At first, a basic codebook is designed, and then it is enlarged by inserting interpolated entries. The size of the basic codebook has been fixed to 32, because experimental results show that for many natural images and sequences this is a reasonable value that does not lead to visible degradation of the picture quality. Note that the codebook entries represent chrominance values only and that, even with this small number of entries, there is still the possibility to generate lots of colors in combination with the individual luminance values. In the case of video sequences, the basic codebook is computed for each frame and it has to be transmitted at least for the first frame. For the next frames, all the entries are compared with those from the previous frame. A new codebook is sent if a dramatic change of the scene is detected. Otherwise, the same codebook is used for consecutive frames. The codebook entries are losslessly encoded and transmitted as side information. The 32 basic codebook entries are ordered and mapped onto the range from 16 to 240 in order to be compliant with standard video input data formats. The order of the entries is very important, because it deeply affects the performance of the system. The differences between the values representing consecutive codebook entries are proportional to their distance in the chrominance plane. The intermediate values represent chrominances which can be calculated by linear interpolation between two neighboring codebook entries. These intermediate values together with the basic codebook entries form the enlarged codebook. The pixel chrominances are mapped, using nearest neighbor search, to chrominance labels being integers which range from 16 to 240.
Due to quantization in the video coder, the values of the chrominance labels can be changed. The decoder assigns to the decoded chrominance labels those pairs of interpolated chrominance values which are represented by them. For this purpose, the basic codebook is interpolated at the decoder side in the same way as at the coder side.
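A minimal sketch of the label mapping described above (function names and the toy four-entry codebook are ours): basic entries are mapped onto 16..240 with spacing proportional to their UV distance, intermediate labels are filled by linear interpolation, and (U,V) pairs are quantized by nearest-neighbour search over the enlarged codebook:

```python
import numpy as np

def enlarge(basic):
    """Map ordered basic entries onto labels 16..240 with spacing proportional
    to the Euclidean distance between consecutive (U,V) entries, then fill the
    intermediate labels by linear interpolation."""
    steps = np.linalg.norm(np.diff(basic, axis=0), axis=1)
    pos = np.concatenate([[0.0], np.cumsum(steps)])
    labels = 16 + np.round(pos / pos[-1] * (240 - 16)).astype(int)
    table = np.zeros((241, 2))
    for k in range(len(basic) - 1):
        a, b = labels[k], labels[k + 1]
        t = np.linspace(0.0, 1.0, b - a + 1)[:, None]
        table[a:b + 1] = (1 - t) * basic[k] + t * basic[k + 1]
    return labels, table[16:]                     # rows 0..224 <-> labels 16..240

def quantize(uv, enlarged):
    """Nearest-neighbour label (16..240) for each (U,V) pair."""
    d = np.linalg.norm(uv[:, None, :] - enlarged[None, :, :], axis=2)
    return 16 + np.argmin(d, axis=1)

basic = np.array([[30.0, 200.0], [60.0, 180.0], [120.0, 130.0], [200.0, 60.0]])
labels, enlarged = enlarge(basic)
assert labels[0] == 16 and labels[-1] == 240
assert list(quantize(basic, enlarged)) == list(labels)  # entries keep their labels
```

Because label spacing mirrors UV distance, a small quantization error on a label decoded by the video coder maps back to a nearby chrominance, which is what keeps the scalar signal well-behaved.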
Fig. 1. General structure of the transmitter and receiver (block diagram: basic codebook design, codebook ordering and basic codebook interpolation feed the vector quantizer and the image or video coder at the transmitter; the receiver comprises the video decoder, an entropy decoder for the basic codebook, codebook interpolation and the chrominance decoder).
3. DESIGN OF BASIC CODEBOOK
There exist several techniques to design codebooks in color spaces [5-7,14]. A common approach is to start with a relatively poor (in the sense of total square error) codebook and improve it using the Linde-Buzo-Gray (LBG) algorithm [10]. In contrast to those techniques, some algorithms based on splitting of the color space have been developed [5]. The method proposed here is a binary-split technique. As a measure of vector distance, the Euclidean norm in the chrominance plane is used. At the beginning, the codebook has only one entry, a vector whose components are the mean values of all chrominances in a picture. In the first step, the set of all chrominances is optimally divided into two subsets according to the rule described in [8]. This procedure is repeated recursively. At each step, only the set of chrominances with the highest total square error of its vector representation is divided into two subsets. In general, n steps of this algorithm result in an (n+1)-element basic codebook (Fig. 2). Consecutive frames of video sequences usually show similar chrominance histograms and therefore produce codebooks with similar tree structure. This property makes the frame-to-frame comparison of the codebooks easy. This comparison is necessary to decide whether the codebook (or optionally a part of it) must be retransmitted. This algorithm results in very good codebooks in the sense of total square error (cf. Fig. 3).
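The recursive splitting loop can be sketched as follows. The optimal split rule of [8] is not reproduced here; as a stand-in this sketch splits along the axis of largest variance at the median, which is only an illustrative choice.

```python
import numpy as np

def binary_split_codebook(chroma, n_entries=32):
    """Binary-split codebook design sketch. `chroma` is an (N, 2) array
    of (U, V) pairs; entries are the subset means, as in the paper."""
    subsets = [chroma]
    while len(subsets) < n_entries:
        # pick the subset with the highest total square error
        errs = [((s - s.mean(axis=0)) ** 2).sum() for s in subsets]
        worst = subsets.pop(int(np.argmax(errs)))
        # illustrative split rule: median cut along the dominant axis
        axis = int(worst.var(axis=0).argmax())
        median = np.median(worst[:, axis])
        a = worst[worst[:, axis] <= median]
        b = worst[worst[:, axis] > median]
        if len(a) == 0 or len(b) == 0:   # subset cannot be split further
            subsets.append(worst)
            break
        subsets += [a, b]
    return np.array([s.mean(axis=0) for s in subsets])
```

With n_entries = 32 this performs 31 splits, matching the (n+1)-entry property stated above.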
Fig. 2. Binary splitting example (chrominance samples with the starting point and the division lines for consecutive steps).
4. BASIC CODEBOOK ORDERING
Random ordering of the basic codebook would lead to a very broad power spectrum of the signal fed into the coder. This signal is an image whose pixel values are the labels of the codebook entries. The goal of proper ordering is to keep this image as low-frequency as possible. In our approach we use a strategy of simultaneous basic codebook generation and ordering. At each step, one codebook entry is replaced by two new entries as described above. The set of codebook entries is already ordered, and the two new entries are put in the place of the removed one. Of the two possible orderings of the two new entries, the one that minimizes the distances to the neighboring entries is chosen. In practice, this strategy leads to relatively good results (Fig. 4).

5. CODEBOOK INTERPOLATION

In order to obtain finer quantization, the basic codebook is augmented with interpolated codebook entries. The ordered set of the basic codebook entries is mapped onto the set of integers in the range of 16 to 240. The integers assigned to the codebook entries are henceforth called labels. The label difference of two consecutive basic codebook entries is set proportional to their distance in the chrominance plane. All other integers from the above-mentioned range are assigned to the interpolated codebook entries. Therefore, the longer the distance between two
consecutive codebook entries, the more interpolated entries are inserted between them. The chrominances corresponding to the interpolated codebook entries are evenly distributed along the straight lines between consecutive basic codebook entries (cf. Fig. 5). Note that only integer values of the U and V coordinates are allowed.
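The label assignment and interpolation described above can be sketched as follows; the function name and the dict return type are assumptions for illustration.

```python
import numpy as np

def interpolate_codebook(basic_entries, lo=16, hi=240):
    """Enlarge an ordered basic codebook: label gaps are proportional to
    the Euclidean distance in the (U, V) plane, and intermediate labels
    receive linearly interpolated chrominances (rounded to integers)."""
    basic = np.asarray(basic_entries, dtype=float)
    d = np.linalg.norm(np.diff(basic, axis=0), axis=1)
    # cumulative arc length along the ordered entries, mapped onto [lo, hi]
    pos = np.concatenate([[0.0], np.cumsum(d)])
    base_labels = np.rint(lo + (hi - lo) * pos / pos[-1]).astype(int)
    enlarged = {}
    for i in range(len(basic) - 1):
        l0, l1 = base_labels[i], base_labels[i + 1]
        for lab in range(l0, l1):
            t = (lab - l0) / (l1 - l0)
            chroma = basic[i] * (1 - t) + basic[i + 1] * t
            enlarged[lab] = tuple(np.rint(chroma).astype(int))
    enlarged[base_labels[-1]] = tuple(np.rint(basic[-1]).astype(int))
    return enlarged
```

Because this rule is deterministic, the decoder can reconstruct the same enlarged codebook from the basic entries alone, which is why only the basic codebook is transmitted.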
Fig. 3. Basic codebook entries (black dots) of the first frame of the test sequence "CLAIRE" shown on the set of all chrominances present in this frame (grey points).

Fig. 4. Ordered entries of the whole codebook of the first frame of the sequence "CLAIRE".
The above rule is known at the receiver side, therefore there is no need to transmit information about the interpolated codebook entries. Only in the case of video coding do frame-to-frame updates of the basic codebook entries have to be transmitted as side information, if needed.

6. APPLICATION TO IMAGE CODING

We use our technique as a preprocessing stage for DCT-based image coding in the range of very high compression ratios. In our experiments, the standard JPEG coder is used as the image coder both within our coding scheme and as the reference for comparison. Fig. 5 shows some results for the test images LENA and CLOWN. Note that for very high compression (0.02 bpp and less) we achieve a significant gain in SNR. Very strong color distortions are introduced by the JPEG coder, even when optimized quantization coefficients are applied. With our scheme we achieve a visible improvement in the subjective quality of the decoded images.
Fig. 5. Signal-to-noise ratio versus compression [bpp] for still test images LENA and CLOWN (JPEG vs. our scheme).

Fig. 6. Signal-to-noise ratio versus compression [bpp] for average frames of test sequences CLAIRE and MISSA (JPEG vs. our scheme).
Similar experiments were performed with single frames from the standard video sequences MISSA and CLAIRE in QCIF format (176 x 144 pels). In this case much less correlation is observed in the pictures, which results in lower compression. Nevertheless, our approach gives even better performance (cf. Fig. 6).
7. VERY LOW BITRATE CODING OF VIDEO

For the verification of our technique in the field of video sequence coding at very low bitrates, we performed a series of experiments with a standard H.263 video coder. In order to obtain the desired output bitrate, a control mechanism has to be applied in the stage of quantizing the DCT coefficients. For this task we scale the quantization factor for the interframe mode. The control loop keeps the scaling factor at a level which yields a bitstream similar to that of typical operation of the H.263 coder, i.e. below 64 kbps. The output sequences from our system and from the standard coding scheme are compared in terms of SNR averaged over 50 frames of the sequence. The results are promising.
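The paper does not specify its control law; one minimal form such a loop could take is a proportional update of the interframe quantization scale against the per-frame bit budget. The function name, gain and clamping range below are illustrative assumptions (the 1..31 range follows the usual H.263 quantizer parameter range).

```python
def update_quantizer(qp, bits_produced, bits_target, gain=0.1, qp_min=1, qp_max=31):
    """Illustrative proportional rate-control step: raise the quantization
    scale when the frame overshoots the bit budget, lower it when the
    frame undershoots, and clamp to the legal range."""
    error = (bits_produced - bits_target) / bits_target
    qp = qp * (1.0 + gain * error)
    return max(qp_min, min(qp_max, qp))
```

Running this once per frame keeps the average output rate near the target, e.g. a 64 kbps budget split over the frame rate.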
ACKNOWLEDGEMENTS

The work has been supported under the KBN Research Grant No. 8 S504 002 06 and the NATO Linkage Grant HTECH.LG 941338. Some of the computations were performed using the resources of the Poznań Supercomputing and Networking Centre.
REFERENCES
[1] H. Abut (ed.), Vector Quantization, IEEE Press, 1990.
[2] G. Wyszecki, W. Stiles, Color Science, Wiley, 1982.
[3] H.G. Musmann, Object based analysis synthesis coding, IEEE Int. Symposium on Circuits and Systems, Tutorials, eds. C. Toumazou et al., London, 1994.
[4] J. Barilleaux, R. Hinkle, S. Wells, Efficient vector quantization for color image encoding, Proc. ICASSP 1987, vol. 2, pp. 740-743.
[5] R.S. Gentile, E. Walowit, J.P. Allebach, Quantization and multilevel halftoning of color images for near original image quality, J. Opt. Soc. Amer. A, vol. 7, no. 6, 1990.
[6] M.T. Orchard, C.A. Bouman, Color quantization of images, IEEE Transactions on Signal Processing, December 1991.
[7] M. Domański, M. Bartkowiak, Color image archivization for medical purposes, Journal on Communications, vol. XLV, pp. 66-68, July-August 1994.
[8] M. Bartkowiak, M. Domański, Palette representation for data compression of color video, Proc. XVIII Nat. Conf. Circuit Theory Elec. Syst., pp. 473-478, Polana Zgorzelisko, 1995.
[9] M. Bartkowiak, M. Domański, Color statistics in image sequences and their implementations for VLBC, 2nd Int. Workshop on Image Processing, pp. 68-72, Budapest, 1995.
[10] Y. Linde, A. Buzo, R. Gray, An algorithm for vector quantizer design, IEEE Trans. on Commun., vol. COM-28, pp. 84-95, 1980.
[11] P. Gerken, Object-based analysis-synthesis coding of image sequences at very low bit rates, IEEE Trans. Circuits Syst. Video Techn., vol. 4, pp. 228-235, 1994.
[12] ISO International Standard 10918, Digital compression and coding of continuous-tone still images, Geneva, 1994.
[13] ITU, Draft recommendation H.263, Video coding for narrow telecommunication channels at < 64 kbit/s, April 1995.
[14] E. Roytman, C. Gotsman, Dynamic color quantization of video sequences, IEEE Trans. Visualization and Computer Graphics, vol. 1, pp. 274-286, 1995.
[15] I.H. Godlove, Improved color-difference formula, with applications to perceptibility and acceptability of fadings, J. Opt. Soc. Am., vol. 41, no. 11, pp. 760-772, November 1951.
Region-of-Interest Based Compression of Magnetic Resonance Imaging Data

Nikos G. Panagiotidis and Stefanos D. Kollias
National Technical University of Athens, Department of Electrical and Computer Engineering, Computer Science Division, Zografou 15773, Athens, Greece.
Tel: +30-1-7722491, 7722488, Fax: +30-1-7722459, 7722492
e-mail: [email protected], [email protected]
1. Introduction

Current picture processing, archiving and communication systems in hospitals and medical care centres deal with large volumes of information, obtained every day from a variety of disciplines such as chest, breast or bone X-rays, magnetic resonance imaging, tomography or other radiology, pathology and cardiology examinations. This gives rise to the need for efficient coding and compression of this information. Until recently, the requirement for retaining every detail of the encoded medical images has restricted interest to lossless compression techniques, which allow exact recovery of the original image from its compressed version. However, these methods achieve compression ratios of only approximately 3:1. In this paper we propose an efficient coding scheme which takes advantage of the difference in visual importance between areas of the same image, classifies them into distinct categories and reproduces the image with variable spatial reconstruction quality. The scheme is based on the fact that most medical images consist of areas of minimal contribution to the perceived information and of regions which are of extreme interest (regions of interest, ROI) to medical experts. Depending on the ratio of ROI to low-importance (background) regions, substantial savings can be achieved both for storage and transmission. It is shown in the paper that progressive DCT coding can be smoothly interwoven with the use of regions of interest so as to increase the compression ratios obtained in a variety of cases, while causing user-acceptable degradation in image quality. A ROI-JPEG coder is implemented, which provides the means for encoding regions of low/high interest in the image by differentiating the quantisation tables among these areas. This goal is achieved by using different quantisation quality factors (QF), as defined in the baseline JPEG algorithm, for each category of image regions.
Thus, while blocks belonging to important regions are coded with high quality quantisation tables, a substantial gain in bitstream volume is achieved by quantising the rest of the image blocks with low quality quantisation tables. Further reduction in the volume of information transmitted or stored may be achieved by further filtering of the low importance regions. This is achieved through the DCT transform, and does not affect the perceived quality of these regions, since coarse quantisation already incorporates a low-pass filtering process. Additionally, for off-line storage applications, visually optimal quantisation tables on a bits/pixel target rate basis can be computed for both the high interest and background regions of the image. The classification of image regions into ROI and background regions can be implemented either through unsupervised automated procedures or by user interactivity.
2. Regions of Interest in Medical Imaging

Medical images can be segmented into the following two discrete categories of regions:
- Regions of Interest (ROI), which are areas containing parts or features of the image having a maximal contribution to the perceived information. These regions are typically rich in high-frequency content and thus require a fine quantisation process in order to maintain an acceptably high reconstruction quality. If coding is extended to moving image sequences, ROI usually correspond to moving parts of the image. As a rule of thumb, if only the areas of the image corresponding to ROI were to be transmitted or stored, at least 70% of the image information could be perceived.
- Background Regions (BR), which contain information of reduced importance, as is the case of statistically uniform image background or texture. These regions contribute to the perceived information by acting as placeholders or boundaries for the ROI, especially by depicting information concerning the relative location of ROI within the image frame. In the case of moving image sequences, BR present minimal or no motion. Consequently, in a
multiple-quality coding scheme, BR can be coded at medium fidelity, thus yielding a significant reduction in the volume of information to be transmitted or stored. The aforementioned image modelling allows the implementation of particularly attractive coding schemes. More specifically, particularly high compression ratios with simultaneously high image quality can be achieved by using lossy coders to encode regions of interest with high fidelity, while reducing the representation fidelity of background regions. The classification of image regions into regions of interest and background can be achieved either interactively through user intervention or by the use of an automated classification procedure. In the simplest case, an expert end user chooses the regions of interest in every image through a graphical user interface (GUI). The information gathered by this process, as well as the actual image data, are fed to the coding unit that performs the actual processing. A more sophisticated approach consists of utilising an automated classification system based on appropriate feature extraction (e.g. edge detection, contour following), statistical processing (e.g. presence of certain coefficients or couplings in the frequency domain), or non-linear artificial neural networks. Neural network classifiers, even though requiring a complex and computationally demanding supervised training process, can yield the most satisfactory results, since neural network architectures are able to classify images even in adverse or noisy environments with particularly high success ratios. Regions of interest can be represented in two different modes: the first uses a set of points and vertices corresponding to one or more polygons that closely follow the shape of the ROI in the image. Even though this representation is the most precise, it presents the following disadvantages.
First, non-uniform convex polygon regions are difficult to represent and, additionally, if the image is to be processed by any block-based coding scheme, a preprocessing step transforming edges and vertices to blocks is required. The second mode initially segments the image into square blocks of 8x8 or 16x16 pixels; these are subsequently tagged as blocks corresponding to ROIs or background regions. The result of this process is a bitmap image of size N/8 x N/8 or N/16 x N/16 pixels (where N x N is the size of the original image). This bitmap image will be referred to, in the following, as a classification map (CM) and will be coded using run-length coding in addition to the original image, at a generally negligible coding cost. Each white pixel of the classification map marks a ROI block in the corresponding region of the original image.
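The second mode can be sketched briefly. The block-tagging rule below (a block is a ROI block if any of its pixels falls in a ROI) and the run-length format are illustrative assumptions; the paper leaves both to the implementer.

```python
import numpy as np

def classification_map(roi_mask, block=8):
    """Build a classification map: one bit per `block` x `block` image
    block, set if any pixel of the block lies inside a ROI. `roi_mask`
    is an N x N boolean array with N a multiple of `block`."""
    n = roi_mask.shape[0]
    cm = roi_mask.reshape(n // block, block, n // block, block).any(axis=(1, 3))
    return cm.astype(np.uint8)

def run_length_encode(cm):
    """Simple run-length code of the flattened binary map, transmitted
    as side information at negligible cost."""
    flat = cm.ravel()
    runs, count, cur = [], 0, flat[0]
    for bit in flat:
        if bit == cur:
            count += 1
        else:
            runs.append((int(cur), count))
            cur, count = bit, 1
    runs.append((int(cur), count))
    return runs
```

For a 256x256 image with 8x8 blocks the map is 32x32 bits, so even uncoded it adds only 128 bytes to the stream.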
3. Incorporation of ROI in DCT-based Coding Schemes

The JPEG standard for coding of still, grey-scale or colour images is based on the Discrete Cosine Transform (DCT). Colour images represented in the R,G,B colour space are transformed to the Y,Cr,Cb luminance-chrominance colour space and are subsampled into a 4:1:1 format prior to coding. According to the baseline encoder model, the input image is divided into blocks of 8x8 pixels which are transformed into the DCT domain. The coefficients obtained are then quantised using either standard JPEG quantisation tables or user-defined quantisation tables. In the latter case it is possible to improve the quality of the encoded picture in specific cases by selecting appropriate quantisation tables that take into account image-dependent information. Subsequently, the quantised coefficients go through a lossless coding procedure, using either the standard Huffman coding method or an arithmetic coder. All of the above procedures are, however, applied on a global basis, since quantisation matrices are fixed within the whole image. Therefore, one cannot exploit advantageous properties that appear locally in certain regions of the image. In contrast, the ROI-DCT based coder presented in this paper provides the means for encoding regions of low/high interest in the image by differentiating the quantisation tables among these areas. This goal is achieved by using different quality factors (QF), as defined, for example, in the baseline JPEG algorithm, for each category of image regions. Thus, while blocks belonging to important regions are coded with high-quality quantisation tables, a substantial gain in bitstream volume is achieved by quantising the rest of the image blocks with low-quality quantisation tables. The image is coded on a per-block basis, as in the case of, say, baseline JPEG.
Horizontal and vertical sampling factors are defined for each colour component, specifying the number of samples of the component compared to the other colour components; all sampling factors are equal to one in the case of grey-scale images. Blocks of different colour components, which correspond to the same physical area of the image, are grouped into minimum coded units (MCU), which are very similar to the macroblock (MB) entity defined in the MPEG and H.261 coding standards. On a higher syntax level, MCUs are grouped into slices. Each MCU may contain as many as 10 blocks, belonging to any colour component, depending on the colour sampling factors. By default, for a YUV 4:2:0 image, an MCU consists of 4 luminance blocks and two chrominance blocks. Given the architecture described above and the pipeline-like operation of the JPEG model, both coder and decoder may only be aware of the total number of blocks already coded/decoded, irrespective of the colour component to which these blocks belong. It is, however, imperative for the proposed ROI coding scheme to establish whether the currently decoded block corresponds to a ROI or not. This
information is obtained through the classification map (CM); high-quality quantisation tables are used for each block matching a white pixel in the classification map, while coarse quantisation is applied to the remaining areas. Progressive or hierarchical implementations of the JPEG standard are considered within the ROI-JPEG coder proposed in the next section, for further increasing the achieved compression ratios, while introducing imperceptible reconstruction errors.
4. A Progressive ROI-JPEG Coding Scheme

A ROI-JPEG coder is described next, which provides the means for encoding regions of low/high interest in the image by differentiating the quantisation tables among these areas. This goal is achieved by using different quantisation quality factors (QF), as defined in the baseline JPEG algorithm, for each category of image regions. Thus, while blocks belonging to important regions are coded with high-quality quantisation tables, a substantial gain in bitstream volume is achieved by quantising the rest of the image blocks with low-quality quantisation tables. Further reduction in the volume of information transmitted or stored may be achieved by further filtering of the low-importance regions. This is achieved through the DCT transform, and does not affect the perceived quality of these regions, since coarse quantisation already incorporates a low-pass filtering process. Additionally, for off-line storage applications, visually optimal quantisation tables on a bits/pixel target-rate basis can be computed for both the high-interest and background regions of the image. The proposed ROI-JPEG procedure includes the typical components of the JPEG system, i.e., the DCT transform, the quantiser and the entropy coder (Huffman or arithmetic), applied to each block of the image. A decision step is added, which classifies each image block either into the ROI category, requiring high reconstruction quality, or into the category of relatively low importance. Using more than two categories is possible; however, in most applications of interest, two categories seem to be enough for achieving high compression ratios. Let us assume first that the decision is based on information which an expert interactively gives to the system; as already mentioned, this can, for example, be performed by marking the important areas on the captured image before applying the compression procedure to it.
The selection of regions of interest results in a classification map of the image blocks; this map uses one bit (when classifying blocks into two categories of high/low importance) per block to denote whether it belongs to a ROI or not. The encoder stores this classification map in the image header bit stream according to the JPEG standard, so that the decoder is capable of recognising the category of each decoded block. The quantisation tables which are used for coding and reconstructing blocks belonging to ROIs are also stored in the image header, according to the JPEG specifications. Different quantisation tables can be defined by letting the user specify a quality factor value (QF) for low-importance regions, and a window quality factor (WQF) for high-quality regions (ROI); the use of more than two categories of regions is possible if respective quality factors are defined for each category. The properties of QF and WQF are similar to those of the standard JPEG QF; both are used for the derivation of quantisation tables from the standard templates incorporated into the JPEG baseline. In general QF and WQF lie in the intervals [30,~i0] and [70,85] respectively. Progressive, or hierarchical, implementations of the JPEG standard are of great importance for transmission, storage and retrieval of medical images. Such implementations can be produced by further filtering, i.e., separating the DCT coefficients of each image block into groups, which are subsequently processed sequentially, using conventional zig-zag scanning. A frequently met case is to generate three groups of coefficients, corresponding to low, medium and high frequency content of the image block. The boundaries of each group can be adapted so as to describe a corresponding frequency band. In general, coefficients from the latter groups, which correspond to higher frequencies, can be set to zero, yielding imperceptible errors while achieving a significant increase in the compression ratios.
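One common way to derive a quantisation table from a quality factor is the IJG/libjpeg scaling convention; the paper only says tables are derived from the standard JPEG templates, so the exact formula below is an assumption borrowed from that convention.

```python
def scale_quant_table(base_table, qf):
    """Scale a base JPEG quantisation table by quality factor `qf`
    (1..100), using the IJG/libjpeg convention: higher QF -> finer
    quantisation. Entries are clamped to the legal range 1..255."""
    s = 5000 / qf if qf < 50 else 200 - 2 * qf
    return [int(max(1, min(255, (q * s + 50) // 100))) for q in base_table]
```

In the proposed scheme, one table scaled with WQF (e.g. around 80) would serve the ROI blocks and another scaled with the lower QF would serve the background blocks.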
5. Medical Applications - MRI

The proposed ROI-based coding scheme can be applied to a variety of medical applications, where regions of interest can be effectively defined and reconstructed with very good quality. Such applications include X-rays, where specific parts of the chest, breast, bones or skulls of patients are of major importance for medical diagnosis or monitoring, pathology imaging, radiology examinations, as well as ultrasonography or angiography images used for cardiosurgical applications. In the following we illustrate the performance of the proposed methods by applying the proposed coding schemes to MRI data. Magnetic Resonance Imaging (MRI) is a non-invasive imaging technique based on a combination of a magnetic field and an RF (radio frequency) excitation field. Under these circumstances, certain nuclei behave in a manner that provides information about their chemical nature and environment in the tissues of the human body in vivo. In principle, magnetic resonance imaging consists of submitting the region of the body to a broadband RF-magnetic excitation. This results in a situation where the protons of the nuclei in the body tissues absorb energy, which is
radiated in the form of electromagnetic waves later, when the external RF-magnetic excitation is terminated. The transformed values of the spectrum of the resulting electromagnetic waves, expressed as integer values, correspond to 256 grey levels (8-bit depth), composing the resulting MR image. Typical sizes of such images are in the range of 2^n x 2^n pixels, where n = 5,...,10. The MRI field is particularly suited for imaging sensitive regions of the human body such as the brain and the spinal cord. Additionally, the ability to generate sagittal, coronal, oblique and transverse views, as well as excellent soft-tissue contrast representation, makes MRI a perfect complement to both anatomical and physiological diagnostic tools.
6. Simulation Results - Conclusions

Pictures of size 256x256 pixels were encoded by the proposed ROI-JPEG coder using quality factors in the ranges [50,75] and [75,95] for background regions and ROI respectively. A wide variety of cases was examined first for determining the percentage of regions that are important, from a medical point of view, in the source images: in the worst case, ROI represented 43% of the image area, whereas in the simplest case only 4% of the image was of any particular interest. It should be noted that the above observations stand under the condition that no pre-processing is performed on the original images. Such a measure could be advantageous in MRI images, where a black-coloured background always surrounds the important image data. In contrast, in pathology imaging, the whole of the image consists of pixels contributing to the perceived information. The available image data set consisted of 50 MR images and 16 frames taken from pathology examinations.
These results indicate that the proposed approach constitutes a powerful tool for compressing images in medical experiments, provided that the expert performing the experiment, or receiving the visual information, defines the regions of interest in the image before compressing and storing it. The proposed algorithm can be accommodated into modern PACS as well as used as an extension to the widely accepted DICOM 3.0 standard. We are currently extending our approach for compressing medical video, obtained from ultrasonic measurements in a hospital environment. The MPEG-1 coding scheme, which is also DCT-based, is combined with the proposed ROI and classification map definitions, for effective capturing, storage, retrieval and visualisation of the medical video information.
35
7. Acknowledgement The authors wish to thank PHILIPS Medical Systems, Greece for providing the MR images, as well as expert consultation on raw data.
SCALABLE PARALLEL VECTOR QUANTIZATION FOR IMAGE CODING APPLICATIONS
D. G. SAMPSON
A. C. UHADAR
A. C. DOWNTON
Democritus University of Thrace, GREECE
University of Gaziantep, TURKEY
University of Essex, ENGLAND
Abstract

In this paper, we show that the encoding complexity of vector quantization can be conveniently distributed as sub-codebooks over general-purpose MIMD parallel processors, to provide almost linearly scalable throughput and flexible configurability. A particular advantage of this approach is that it makes feasible the use of higher-dimensional image blocks and/or larger codebooks, leading to improved coding performance with no penalty in execution speed compared with the original sequential implementation. As an example, we show that an implementation with 32 transputers using 8x8 blocks and 4096 codebook entries reduces the bit-rate by a factor of 2.625 and runs 79% faster than a sequential implementation based upon 4x4 blocks and 256 codebook entries, while producing a similar PSNR.
1. Introduction

Vector quantization (VQ) has been extensively investigated for audio, speech, image and video coding applications [1]. VQ offers a simple decoding process, where the index of the selected code vector is used to produce the output vector through a look-up table operation. On the other hand, the selection of the best-matched code vector typically involves expensive computations. The encoding complexity of full-codebook search VQ increases exponentially with the vector dimension and the coding rate. The main drawback of VQ is the fact that the complexity of the encoder imposes restrictions on the size of the codebook that can be used in practice. This can restrict the efficiency of VQ-based compression systems for two main reasons: (i) only blocks of small dimension (typically 4x4) can be used, although operating on vectors of larger size (e.g. 8x8) can result in higher compression ratios due to the fact that the dependencies between neighbouring vectors can be exploited; (ii) moreover, a large codebook is essential for applications where high-quality coding (e.g. super-high-definition images) is required, or in image sequence coding, where the VQ codebook should be able to respond to changes in the input statistics. Different methods have been suggested to reduce the encoding complexity at the expense of suboptimal coding performance. Typically, these techniques involve imposing a certain structure on the VQ codebook, so that unconfined access to all effective code vectors is restricted [1]. An alternative approach reported in the literature has been to exploit parallelism in special-purpose VLSI implementations of VQ [2]. The approach described in this paper is to employ general-purpose Multi-Instruction Multi-Data (MIMD) parallel processors in a pipeline-processor farm (PPF) configuration [3] which utilises a form of VQ codebook parallelism.
The advantage of using general purpose processors is that they perform the encoding task of full-codebook-search VQ, so that a high throughput optimal vector quantizer can be realised, but at the same time they provide the flexibility to allow any desired trade-off to be made between algorithm speedup, PSNR and bit rate. Furthermore, it is relatively straightforward to apply fast codebook search algorithms to processor farms (which exhibit automatic load balancing between processors) to achieve further speedups, whereas this is often impractical for synchronised dedicated VLSI implementations. Parts of the work described in this paper have been published in references [4] and [5].
2. Approach adopted for parallelisation 2.1 The Pipeline Processor Farm (PPF) design model. The PPF design model is part of a parallel design methodology which can be used to decompose existing sequential applications onto any type of Multi-Instruction Multi-Data (MIMD) parallel processor network. The design model emerges from the observation that embedded signal processing applications with continuous data flow may be decomposed into a series of independent stages. The sequential application algorithm is then mapped onto a generalised multiprocessor architecture based upon a
pipeline of stages with well-defined communication patterns between them. The parallelism within each stage is exploited in the most appropriate way: for example, data parallelism or algorithm parallelism can be applied at various levels, or temporal multiplexing can be applied to complete input data sets, or a combination of these approaches can be implemented as appropriate. In a homogeneous MIMD processor implementation, processor farming is used to implement all these forms of parallelism, because it allows indefinite incremental scaling, provides automatic load balancing and results in a single tractable design model. 2.2 Parallelisation schemes for VQ. The design strategy for the parallelisation of the VQ algorithm should be capable of meeting the requirements for execution speed-up as well as efficient codebook storage. Two different schemes are possible for parallelising the VQ encoding algorithm:
• Applying image data parallelism, the entire image is partitioned into a number of sub-images which are distributed over the worker processors. Each worker processor then needs to perform an exhaustive search of the entire codebook to select the best-matched available code vector.
• Applying codebook parallelism, each worker processor can perform the encoding process on its own portion of the codebook. Upon receiving the same image block, each worker processor then needs to search only a smaller part of the entire codebook to select the closest codevector in the corresponding sub-codebook. However, the partial encoding results from the worker processors need to be compared at a final stage, where the best-matched available codevector is computed according to the minimum distortion criterion. Figure 1 illustrates this scheme assuming four worker processors.
Figure 1: Block diagram of the codebook-parallel approach
The first approach is straightforward to apply, since there is no need to further process the encoding results received from the worker processors, but it has the disadvantage that the entire codebook needs to be stored at each worker processor. This can impose a limitation on the size of the codebook that can be employed for the particular application. In order to alleviate this drawback, we have implemented the second parallelisation scheme. To achieve further speed-up, the selection of the final codevector (through comparison of the intermediate encoding results) is assigned to a separate processor, referred to as the collector. This offers the advantage that the encoding task of the next input vector is overlapped with the final comparison process of the current input vector. Hence, the parallelisation of the VQ encoding algorithm comprises three processes, which are mapped onto a 3-stage pipeline configuration as follows:
• Distributor. This process partitions the input image into rectangular blocks of m × n size (e.g. 4×4 or 8×8) and sends each block (input vector) to every worker processor.
• Worker. This process performs the encoding task on the received input vector using its own sub-codebook and sends the index of the selected codevector and the corresponding distortion value for the particular input vector to the collector process. The worker process is duplicated S−2 times, where S is the total number of processors in the configuration.
• Collector. This process receives the indices and the corresponding distortion values for the particular image block from the worker processors and compares the partial results to find the best-matched coding index according to the minimum distortion criterion.
3. Experimental results
The VQ encoder was parallelised in two steps. In the first step, the sequential Sparc2 implementation was ported to a single T800 transputer, running on a Meiko Computing Surface.
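The codebook-parallel decomposition above can be simulated in a single process. This is a hedged sketch: the real workers run concurrently on separate transputers, and the names and the 4-worker split here are illustrative only. The key correctness property is that the collector's minimum over the partial results equals a full-codebook search:

```python
import numpy as np

def collector_encode(block, sub_codebooks):
    """Simulate the 3-stage pipeline on one block: each 'worker' searches
    its own sub-codebook; the 'collector' keeps the global minimum."""
    best_index, best_dist = -1, np.inf
    offset = 0
    for sub in sub_codebooks:                   # one iteration per worker
        d = np.sum((sub - block) ** 2, axis=1)  # worker's partial search
        i = int(np.argmin(d))
        if d[i] < best_dist:                    # collector's comparison
            best_index, best_dist = offset + i, float(d[i])
        offset += len(sub)
    return best_index

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))   # 16 codevectors of dimension 4 (k=2x2)
block = rng.normal(size=4)
subs = np.split(codebook, 4)          # distribute over 4 hypothetical workers
# The codebook-parallel result matches an exhaustive full search:
full = int(np.argmin(np.sum((codebook - block) ** 2, axis=1)))
assert collector_encode(block, subs) == full
```

Because each worker only ever stores its own sub-codebook, total codebook storage is spread across the farm, which is what makes the large N=4096 codebooks practical.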
Then, the implementation was decomposed into three different processes as outlined in the previous section. The parallel application is designed such that the number of processors in the configuration is defined by the user as a runtime argument. Hence, the user does not need to modify the application as the size of the transputer network is altered. Although the method described for the parallelisation of VQ is applicable to any data compression application that employs vector quantization, the results reported in this paper are based on the encoding of still images. The
spatial resolution of the test images used was 512×512 pixels. For our experiments, three different codebook populations, namely N=256, 1024 and 4096, for vector dimensions of 4×4 and 8×8, were used to evaluate the performance of the parallel implementation. Figure 2 illustrates the speed-up performance of the algorithm as the number of worker processors is increased when the vector dimension is set at 4×4. As can be seen, the performance of the implementation increases fairly linearly up to the point where the communication links become saturated. Saturation occurs when there are 10 workers and 20 workers in the parallel configuration for codebook populations of 256 and 1024 codevectors, respectively. Since the communication requirements are fixed, but computations increase linearly with the codebook size (for the same vector dimension), as the codebook size is increased the load at the worker processors becomes larger and hence better speed-ups are obtained. The maximum speed-up achieved with the codebook population of 4096 is 25.6. Further increases in execution speed could be achieved for this codebook if more transputers were available, as the implementation does not yet have saturated communication links.
Figure 2: Speed-up graphs for k=4×4
Figure 3: Speed-up graphs for k=8×8
When the vector dimensions are increased to 8×8, the corresponding speed-up performance of the implementation as the number of worker processors is increased for different codebook populations is shown in Figure 3. The graphs exhibit similar characteristics to the case of the 4×4 block size; however, better speed-up figures are obtained for the 8×8 block size due to the increased work load. As the task size increases, the execution time required to perform the sub-codebook search by the workers increases, whereas the cost of transmitting the intermediate results to the collector remains the same. For a codebook population of 4096 codevectors, a maximum speed-up of 27.75 is obtained.
Figure 4: Execution timings for k=4×4
Figure 5: Execution timings for k=8×8
Figures 4 and 5 illustrate the execution timings obtained by the parallel implementation for the 4×4 and 8×8 block size cases, respectively. By selecting the points where the execution time is a minimum for the particular codebook population, the effect of increased codebook population on the execution time of both the sequential and parallel implementations can be examined. Figure 6 shows the execution time of the encoding process as the codebook population is increased for the sequential and the 32-processor parallel implementation. It can be seen that the execution time of the parallel encoding process, even for the largest (e.g. N=4096) codebook population, is still well below the execution time of the sequential implementation with the smallest (e.g. N=256) codebook population. In general, a larger VQ codebook population results in better quality of the compressed image at the expense of extra bit rate. There are applications, such as super high definition TV and medical imaging, where perceptually transparent quality is essential. In our experiments, for the test image LENA 512×512×8 and vector dimension k=4×4, using N=4096 rather than N=256 codevectors provided a peak signal-to-noise ratio (PSNR) of 33.78 dB instead of 30.11 dB.
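The PSNR figures quoted above follow the standard definition for 8-bit images, PSNR = 10 log10(255²/MSE). A small illustrative implementation (not taken from the paper):

```python
import numpy as np

def psnr(original, decoded, peak=255.0):
    """Peak signal-to-noise ratio in dB: PSNR = 10*log10(peak^2 / MSE)."""
    mse = np.mean((original.astype(float) - decoded.astype(float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

a = np.full((8, 8), 100, dtype=np.uint8)
b = a.copy()
b[0, 0] = 110          # one pixel off by 10, so MSE = 100/64 = 1.5625
# psnr(a, b) is about 46.19 dB; identical images would give infinite PSNR
```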
Figure 6: Comparison between the sequential and the parallel implementation for k=4×4 and k=8×8
Finally, Table 1 illustrates the advantage of using large-dimensional blocks in low bit rate coding. In the table, two vector quantisers operating on blocks of different size are compared in terms of PSNR, compression ratio and execution time. It can be seen that the vector quantiser which operates on 8x8 blocks and N=4096 codevectors gives similar PSNR results to the one operating on 4x4 blocks and N=256 codevectors. However, the former leads to a compression ratio of 42:1 rather than the 16:1 of the latter. This corresponds to a reduction of 2.625 in the total amount of data required to represent the compressed image. Yet, although the sequential implementation of VQ8x8 is 15.49 times slower than that of VQ4x4, the parallel VQ8x8 is 1.79 times faster than the sequential VQ4x4. Hence we can conclude that parallel processing can be used to enhance the overall performance of VQ-based compression systems, as well as to speed up their execution, and that by trading off between image compression, PSNR and speedup, improvements to all three parameters can be achieved simultaneously.

                                      k=4x4      k=8x8
  Codebook population                 N=256      N=4096
  PSNR (dB)                           30.114     29.105
  Bit rate (bits per pixel)           0.50       0.1875
  Compression ratio                   16:1       42:1
  Execution time (sec), sequential    591.36     9164.80
  Execution time (sec), parallel      88.32      328.96

Table 1: Performance evaluation of parallel VQ for still image coding
4. Conclusions Parallelising the VQ encoder aims to alleviate the encoding complexity and to allow the practical implementation of vector quantisers which operate on large block sizes and/or codebook populations. We have presented a scalable parallel approach to vector quantization. A three-stage pipeline implementation of the VQ encoder, which offers the advantage of both increased execution speeds and efficient storage of large codebooks, was described.
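The bit rates and compression ratios in Table 1 follow directly from bpp = log2(N)/(m·n) for an 8-bit source, since only the codevector index is transmitted per block. A short check (illustrative code, not the paper's; note the exact 8x8/N=4096 ratio is 42.67:1, which the table presumably rounds to 42:1):

```python
import math

def vq_rates(N, m, n, source_bpp=8):
    """Bit rate (bits/pixel) and compression ratio of a VQ with an
    N-entry codebook on m x n blocks of an 8-bit-per-pixel image."""
    bpp = math.log2(N) / (m * n)   # index bits spread over the block's pixels
    return bpp, source_bpp / bpp

# k = 4x4, N = 256: 8 index bits per 16 pixels -> 0.5 bpp, 16:1
bpp1, cr1 = vq_rates(256, 4, 4)
# k = 8x8, N = 4096: 12 index bits per 64 pixels -> 0.1875 bpp, ~42.7:1
bpp2, cr2 = vq_rates(4096, 8, 8)
```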
Simulation results for still image coding applications demonstrated that a parallel implementation of a vector quantiser operating on large codebooks (e.g. N=4096) and large vector dimensions (e.g. k=8x8) can be faster than the sequential VQ for smaller codebooks (e.g. N=256) and block sizes (e.g. k=4x4). This is very encouraging, since it indicates that parallelising VQ can offer an improvement to the overall efficiency of VQ-based coding systems.
5. References
[1] A. Gersho and R.M. Gray, Vector Quantization and Signal Compression, Kluwer Academic Publishers, New York, USA, 1991.
[2] K. Dezhgosha, M.M. Jamali and S.C. Kwatra, "A VLSI architecture for real-time image coding using a vector quantization based algorithm," IEEE Trans. on Signal Processing, vol. 40, no. 1, pp. 181-189, January 1992.
[3] A.C. Downton, R.W. Tregidgo and A. Cuhadar, "Generalised parallelism for embedded vision applications," in Parallel Computations: Paradigms and Applications, A. Zomaya, Editor, Chapman and Hall, 1995.
[4] A. Cuhadar, D.G. Sampson and A.C. Downton, "A scalable parallel approach to vector quantization," to appear in Journal of Real-Time Imaging, Academic Press, 1996.
[5] A. Cuhadar, D.G. Sampson and A.C. Downton, "Scalable vector quantization architecture for image compression," IEEE Second International Conference on Algorithms and Architectures for Parallel Processing, Singapore, 11-13 June 1996.
Session: B
WAVELETS IN IMAGE/SIGNAL PROCESSING
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
Real-Time Image Compression Methods Incorporating Wavelet Transforms
D. T. Morris and M. D. Edwards
Department of Computation, UMIST, PO Box 88, Manchester, M60 1QD, United Kingdom
Abstract The aims of the work described here are to develop and implement new generic image compression methods, based on the use of wavelet transforms and vector quantisation techniques, to achieve very high compression results whilst maintaining satisfactory image quality. A major problem with current wavelet-based compression methods is the large amount of computation required to process a single image. This problem is exacerbated with the requirement to process video sequences in real-time. We intend to use previously developed hardware/software co-design techniques to partition the compression and decompression algorithms between hardware and software implementations in order to maximise the compression performance at acceptable cost. The paper describes the different software and hardware architectures we propose to investigate.
Introduction There is an increasing use of multimedia computing techniques in a wide range of diverse application areas, for example, manufacturing and commerce, publishing, education, and leisure services. Central to these techniques is the processing, storage, and transmission of both still photographs and video sequences. It is well known that the volume of data required to describe such images in their raw form makes storage prohibitively expensive and greatly slows transmission. For example, a standard 35 mm digitised photograph requires nearly 20 Mbytes of storage, and a one-second high resolution video sequence about 30 Mbytes, giving a required data transmission rate of at least 240 Mbits/sec. It is, therefore, evident that the information contained in images must be compressed in some way: usually by eliminating redundant information and encoding the remaining entropy. The goal is to reduce the bit rate for storage and transmission whilst maintaining acceptable quality when the images are subsequently decompressed. It is possible to achieve relatively high compression ratios using current international standard techniques, for example, JPEG [1] and MPEG [2]. These standards are normally used in commercial multimedia applications with medium resolution images, for example, 352 by 288 pixel video sequences at 25 frames per second. Unfortunately, very high compression ratios, in excess of 100, can only be achieved at the expense of significant losses in image quality. In addition, the compression/decompression of video information in real-time can only be realistically performed using expensive special-purpose video processing components [3]. The major objective of the work described here is to develop and implement new generic image compression methods, based on the use of wavelet transforms and vector quantisation techniques, which will achieve very high compression results whilst maintaining satisfactory image quality.
A major problem with current wavelet-based compression methods is the large amount of computation required to compress/decompress single images; this is especially true when processing video sequences in real-time. We believe a key issue in the realisation of feasible compression methods is the partitioning of the algorithms between hardware and software implementations in order to maximise compression performance at acceptable cost. Previous research in the area of hardware/software codesign has indicated that "software acceleration" methods [4, 5] can be used to enhance the performance of software-based systems, using relatively inexpensive programmable hardware as a special-purpose coprocessor. We intend to investigate the implementation of these compression algorithms as coprocessors for conventional microprocessor systems. This will give us a range of implementation alternatives for image compression systems with differing cost/performance characteristics. This work represents the first stage of our ongoing research into the design and implementation of distributed multimedia systems using novel technologies and methods. It is hoped that the results from this work will allow high-quality images to be stored in a computer system using less disk memory, and will permit images to be transmitted on computer networks using cheap Ethernet-based communications media.
Previous Work The discrete wavelet transform has recently received considerable attention in the context of sub-band coding in image compression. An image is decomposed into a set of sub-images with different resolutions, corresponding to different frequency ranges in the original image. A number of researchers [6, 7, 8] have used wavelet transforms and vector quantisation techniques to compress still images. They identify two key areas of research: (i) the choice of suitable wavelet filters for image compression, and (ii) methods for encoding wavelet coefficients using scalar and vector quantisation techniques [9]. Both of these areas will be addressed in our research. There have been software implementations of MPEG decoders; for example, 160 by 120 video sequences were processed in real-time on a RISC-based workstation [10]. It is estimated that 320 by 240 sequences could be processed at 10-15 frames per second on a powerful state-of-the-art workstation. Wavelet techniques can also be employed to decompose a video frame into multiple layers with different resolutions and frequency bands [11]. A motion estimation scheme can then be used to track motion activities at the different layers across multiple frames. Good compression results at real-time frame rates were achieved using different variations of this motion-compensated wavelet video compression system. We intend to investigate the use of 3-D wavelet transformations to assist with the compression of video sequences. A range of VLSI devices have been developed for performing JPEG, MPEG-I, and MPEG-II image compression tasks [12]. It is possible to generate a range of hardwired and programmable architectures, which provide a range of cost/performance trade-offs. A video co-processor for a conventional microprocessor can prove to be a viable option. Our proposed use of programmable hardware (FPGAs) to act as a co-processor will allow us to explore different implementations in an efficient manner.
Some researchers have developed effective VLSI architectures [13] for implementing the discrete wavelet transform. We intend to take account of this work in the design and implementation of our coprocessors.
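The sub-band decomposition described above can be illustrated with a one-level 2-D Haar transform, the simplest wavelet filter pair; the cited work uses longer filters, so this fragment is only a sketch of the principle:

```python
import numpy as np

def haar2d_level(img):
    """One level of a 2-D Haar wavelet transform: split an image (with even
    dimensions) into LL (coarse), LH, HL and HH (detail) sub-bands."""
    a = img.astype(float)
    # pairwise rows: low-pass = average, high-pass = half-difference
    lo_r = (a[0::2] + a[1::2]) / 2
    hi_r = (a[0::2] - a[1::2]) / 2
    # then the same filtering on pairwise columns of each result
    ll = (lo_r[:, 0::2] + lo_r[:, 1::2]) / 2
    lh = (lo_r[:, 0::2] - lo_r[:, 1::2]) / 2
    hl = (hi_r[:, 0::2] + hi_r[:, 1::2]) / 2
    hh = (hi_r[:, 0::2] - hi_r[:, 1::2]) / 2
    return ll, lh, hl, hh

# Each sub-band is a quarter-size image; for a smooth (here constant) image
# nearly all the energy lands in LL and the detail bands are zero.
ll, lh, hl, hh = haar2d_level(np.full((4, 4), 7.0))
```

Repeating the split on the LL band gives the multiresolution pyramid; it is the many near-zero detail coefficients that scalar or vector quantisation then exploits.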
Proposed Work Our ultimate aim is to encode full resolution colour video data at real-time frame rates, reduced by a factor that allows transmission over public networks whilst allowing high quality decoded data to be derived. We anticipate that encoding still images and video sequences using the 2-D discrete wavelet transform will have the same relationship as the JPEG and MPEG compression methods using the 2-D discrete cosine transform. Therefore, we shall initially investigate using the wavelet transform to compress still images in a manner similar to that employed in the JPEG method: that is, processing the image by transforming sub-blocks and encoding the wavelet coefficients. We intend to investigate the interplay between various coding parameters, for example, the size of the sub-block and the choice of wavelet filter, together with the quality and degree of compression that may be achieved. Whilst we realise that much of this work has already been performed, it is our intention to implement these new algorithms on a Pentium-based workstation to obtain cost/performance benchmarks for future implementations of the algorithms using special-purpose programmable co-processors. Subsequently, we will investigate methods of encoding sequences of related images. The MPEG standard suggests that a sequence of images is encoded essentially by interpolating between JPEG encoded frames. The MPEG standard specifies that the JPEG versions of a number of frames are derived (I-frames), together with 'predicted' frames, known as P-frames and B-frames, which allow the motion of objects between I-frames to be taken into account. This approach could be used in our scheme simply by replacing the JPEG encoded frames by wavelet encoded ones. However, Hilton et al. [14] suggest that temporal redundancy is better exploited by encoding the difference between two sequential frames, as shown in Figure 1.
Since images in a sequence are highly correlated, the difference between any adjacent pair of images will contain much less information than either of the original images. Encoding these difference images will, therefore, be highly efficient. The 'support analysis' operation is concerned with the identification of those wavelet coefficients that are required by the inverse wavelet transform to reconstruct an approximation of the i-th difference image. By altering the threshold value, different compression ratios can be achieved. The decoder can rapidly reconstruct the difference image by computing the inverse wavelet transform for only those pixels that are influenced by the coefficients sent by the encoder.
frame i+1 − frame i = Δframe i → encoder
Figure 1: Video compression using frame differencing
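A minimal sketch of the frame-differencing idea of Figure 1, with a plain threshold standing in for the 'support analysis' step. The names and threshold value are illustrative only; the scheme described above thresholds wavelet coefficients of the difference image, whereas this fragment thresholds raw pixel differences to show the principle:

```python
import numpy as np

def encode_difference(prev_frame, next_frame, threshold):
    """Keep only the significant part of the inter-frame difference:
    entries below the threshold are zeroed, and so compress very well."""
    diff = next_frame.astype(float) - prev_frame.astype(float)
    significant = np.abs(diff) >= threshold
    return np.where(significant, diff, 0.0)

prev = np.zeros((4, 4))
nxt = prev.copy()
nxt[1, 1] = 50.0   # one large change the decoder must see
nxt[2, 2] = 0.5    # one tiny change that may be discarded
delta = encode_difference(prev, nxt, threshold=1.0)
```

Raising the threshold zeroes more coefficients, trading reconstruction quality for bit rate, which is exactly the knob the support-analysis threshold provides.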
Prior to this stage of the project, we have been concerned with achieving efficient image sequence encoding/decoding algorithms (efficient with respect to compression rates and quality of decoded images); their execution times have been of secondary interest. We do not anticipate that a purely software implementation of the wavelet transforms will achieve our goal of real-time operation. We shall, therefore, examine methods for speeding up the sequential execution of the compression/decompression algorithms using special-purpose co-processors. At the two extremes of performance are software and hardware implementations of video encoding. A pure software execution has been shown to be cheap but slow, whilst a pure hardware implementation is fast but can be prohibitively expensive. We aim to investigate compromises between pure hardware and pure software solutions. Previous work [4, 5] has indicated that, by identifying the performance-critical regions of a sequential algorithm and transferring them to a special-purpose hardware implementation, a speedup of about three times over the software-only implementation can be achieved. We envisage that such a performance enhancement will allow us to achieve close to real-time performance. The system hardware architecture will take the form of a conventional Pentium-based workstation and includes the 'software acceleration architecture', which is interfaced to the system's PCI bus, as shown in Figure 2.
Figure 2: System hardware architecture (Pentium processor with cache, bridge, DRAM, frame store with video input/output, and the software acceleration architecture attached to the PCI bus)
In the software acceleration architecture, the processor (P) will execute the main components of the compression and decompression algorithms. The associated code and data will reside in the memory (M). The FPGA will execute the identified time-critical software components in hardware. By having programmable hardware (the FPGA), it will be possible to experiment with different hardware/software trade-offs in the implementations of the algorithms in order to optimise the overall system performance. Finally, we plan to perform an extensive series of experiments in which we shall compare various wavelet transformation and vector quantisation methods with the industry-standard JPEG and MPEG schemes. This will permit us to determine the best choices with respect to execution speed, volume of compressed data, quality of decompressed data, and cost. We shall further be able to report on the cost/benefit of various hardware and software compromises. Conclusions
In this paper we have proposed a plan of work which will allow us to develop and evaluate new generic image compression methods, based on the use of wavelet transforms and vector quantisation techniques, which will achieve very high compression results whilst maintaining satisfactory image quality. The novel aspects of our work include the ability to explore trade-offs between hardware and software implementations in order to maximise performance and achieve close to real-time processing of video sequences at acceptable cost. Future work will include the implementation of these new algorithms using a multicomputer system. We will investigate methods for partitioning the algorithms, based on an analysis of 'spatial' and 'temporal' parallelism requirements, for execution on a network of T9000 transputers. We shall also develop special-purpose hardware that will be integrated into the T9000 network and thereby speed up the execution of the algorithms as necessary. References
[1] G. Wallace, "The JPEG Still Image Data Compression Standard", Communications of the ACM, vol. 34, no. 4, pp. 30-44, 1991.
[2] D. LeGall, "MPEG: A Video Compression Standard for Multimedia Applications", Communications of the ACM, vol. 34, no. 4, pp. 45-68, 1991.
[3] CL450 MPEG Video Decoder User's Manual, C-Cube Microsystems, Milpitas, USA, 1995.
[4] M. D. Edwards and J. Forrest, "Software Acceleration using Programmable Hardware Devices", IEE Proceedings - Computers and Digital Techniques, vol. 143, no. 1, pp. 55-63, 1996.
[5] M. D. Edwards and J. Forrest, "A Practical Hardware Architecture to Support Software Acceleration", Microprocessors and Microsystems, vol. 20, no. 3, pp. 167-174, 1996.
[6] M. Antonini, M. Barlaud, P. Mathieu, and I. Daubechies, "Image Coding Using Wavelet Transform", IEEE Transactions on Image Processing, vol. 1, no. 2, pp. 205-220, 1992.
[7] J. D. Villasenor, B. Belzer, and J. Liao, "Wavelet Filter Evaluation for Image Compression", IEEE Transactions on Image Processing, vol. 4, no. 8, pp. 4-15, 1995.
[8] A. Averbuch, D. Lazar, and M. Israeli, "Image Compression Using Wavelet Transform and Multiresolution Decomposition", IEEE Transactions on Image Processing, vol. 5, no. 1, pp. 4-15, 1996.
[9] W. Li and Y. Q. Zhang, "Vector-Based Signal Processing and Quantization for Image and Video Compression", Proceedings of the IEEE, vol. 83, no. 2, pp. 317-335, 1995.
[10] K. Patel, B. C. Smith, and L. A. Rowe, "Performance of a Software MPEG Video Decoder", Computer Science Division, University of California at Berkeley, USA, 1993.
[11] Y. Q. Zhang and S. Zafar, "Motion-Compensated Wavelet Transform Coding for Color Video Compression", IEEE Transactions on Circuits and Systems for Video Technology, vol. 2, no. 3, pp. 285-296, 1992.
[12] P. Pirsch, N. Demassieux, and W. Gehrke, "VLSI Architectures for Video Compression - A Survey", Proceedings of the IEEE, vol. 83, no. 2, pp. 220-246, 1995.
[13] K. K. Parhi and T. Nishitani, "VLSI Architectures for Discrete Wavelet Transforms", IEEE Transactions on Very Large Scale Integration Systems, vol. 1, no. 2, pp. 191-202, 1993.
[14] M. L. Hilton, B. D. Jawerth, and A. Sengupta, "Compressing Still and Moving Images Using Wavelets", Multimedia Systems, vol. 2, no. 3, 1994.
Custom Wavelet Packet Image Compression Design*
Mladen Victor Wickerhauser†
July 11, 1996
Abstract
This tutorial paper presents a meta-algorithm for designing a transform coding image compression algorithm specific to a given application. The goal is to select a decorrelating transform which performs best on a given collection of data. It consists of conducting experimental trials with adapted wavelet transforms and the best basis algorithm, evaluating the basis choices made for a training set of images, then selecting a transform that, on average, delivers the best compression for the data set. A crude version of the method was used to design the WSQ fingerprint image compression algorithm.
1 Introduction
No single image compression algorithm can be expected to work well for all classes of digital images. The sampling rates, frequency content, and pixel quantization all influence the compressibility of the original data. Subsequent machine or human analyses of the compressed data, or its presentation at various magnifications, all influence the nature and visibility of distortion and artifacts. Thus compression standards like those of the JPEG committee [1], established for "natural" images intended to be viewed by humans, do not satisfy the requirements for compressing fingerprint images intended to be scanned by machines. In that particular example, it was necessary to develop a new algorithm, WSQ [2]. Both JPEG and WSQ are examples of transform coding image compression algorithms. That class provides a rich selection from which custom compression algorithms may be chosen. This paper presents a meta-algorithm for rationally and automatically choosing one of them to suit a particular application. It focuses on the transform portion of the compression algorithm: the best basis method is used to optimize it to provide the best average compression of a representative set of images, subject to speed constraints. A crude version of the method was used to design the WSQ fingerprint image compression algorithm.
2 Transform coding image compression
The generic transform coding compression scheme is depicted in Figure 1. It consists of three pieces:
• Transform: Apply a function, invertible (lossless) in exact arithmetic, which should decorrelate the pixels in the image. It does this by decomposing the image into a superposition of independent patterns; it produces a sequence of floating-point amplitudes which are the intensities of the new components.
• Quantize: Replace the transform amplitudes with (small) integer approximations. This is the lossy, or non-invertible, part of the algorithm, where all the distortion is introduced.
• Code: Rewrite the integer stream of quantized transform coefficients into a more efficient alphabet, so as to approach the information-theoretic minimum bit rate. This operation is akin to a table look-up, and is invertible.
These three steps are depicted in Figure 1.
*Research partially supported by NSF, AFOSR, and Southwestern Bell Corporation
†Department of Mathematics, Washington University, St. Louis, Missouri, 63130 USA
Scanned image
Transform
Quantize
Code
Figure 1: Generic transform coding image compression device.
Decode
Unquantize
Untransform
.•Restored image
Figure 2: Inverse of the generic transform coder: the decoder.
To recover an image from the coded, stored data, the steps in Figure 1 are inverted as shown in Figure 2. The first and third blocks of the compression algorithm are exactly invertible in exact arithmetic, but the Unquantize block does not in general produce the same amplitudes that were given to the Quantize block during compression. The errors thus introduced can be controlled both by the fineness of the quantization (which limits the maximum size of the error) and by favoritism (which tries to reduce the errors for certain amplitudes at the expense of greater errors for others). The compression ratio produced by such an algorithm is computed by dividing the size of the input file by the size of the output file. It thus takes into account all of the side information stored with the output file that is needed for reconstruction. Roughly speaking, if the coding step is perfectly efficient, the compression ratio is maximized for a given distortion when the transform and quantize steps produce a sequence with minimal entropy. However, since minimal entropy is hard to characterize and harder to achieve, it is better to aim at a broader target: a sequence with almost all of the values being zero. Such a sequence will have a low, if not minimal, entropy, since its value distribution will be highly peaked at zero. This paper concentrates on the Transform operation. The goal is to choose, from a large family of wavelet, wavelet packet, and local trigonometric transforms, the one which can be expected to yield the largest fraction of negligible amplitudes on data represented by a training set. Those will be quantized to zero in exchange for a given degree of distortion, yielding the biggest peak at zero in the value distribution and resulting in the best compression. It will be assumed that the transforms are orthogonal or nearly orthogonal, so that their condition number is close to 1 and they introduce no significant redundancy.
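The "peaked at zero" target can be made concrete by comparing the first-order entropy of a quantized sequence dominated by zeros against a uniform one. A small illustration (the distributions are invented for the demo, not taken from the paper):

```python
import numpy as np

def empirical_entropy_bits(seq):
    """First-order entropy (bits/symbol) of an integer sequence: the
    information-theoretic lower bound that an ideal coder approaches."""
    _, counts = np.unique(seq, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
# A sequence sharply peaked at zero (70% zeros) vs. a uniform one.
peaked = rng.choice([0, 0, 0, 0, 0, 0, 0, 1, -1, 2], size=10000)
uniform = rng.integers(-5, 5, size=10000)

h_peaked = empirical_entropy_bits(peaked)
h_uniform = empirical_entropy_bits(uniform)
```

The peaked sequence needs far fewer bits per symbol, which is why a transform that zeroes most amplitudes yields high compression ratios even with a generic coder.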
3 Custom transforms
There are two fast ways to decompose images at the transform step: splitting into small blocks of pixels and then applying some fast transform to the blocks, or splitting the whole image into frequency subbands by convolving with short filters. Both methods cost O(P log P) operations for a P-pixel image. Detailed formulas and a proof of the complexity statement can be found in Reference [5], so only a brief summary will be presented here.

In the pixel splitting scheme, the image is cut into blocks, either of fixed or variable size, but small enough so that the intensities of all pixels contained within a block are correlated. This cutting is depicted in Figure 3. Then decorrelation is performed by applying the two-dimensional discrete cosine transform (DCT) to the blocks. This method is used in the JPEG still picture image compression standard [3]. The resulting amplitudes represent spatial frequency components in the blocks. Because digitized images are often limited in their spectral content, most of the amplitudes in each block will be negligible. To maximize the proportion of negligible amplitudes, the blocks should be chosen as large as possible, subject to the constraints that (1) only a few spatial frequencies are present in each block, and (2) describing the block boundaries does not create too much side information.

Figure 3: Division of a 128 x 128 pixel image into 8 x 8 blocks, as in JPEG, or into blocks varying from 4 x 4 to 32 x 32.

In the subband splitting scheme, a low-pass and a high-pass filter are used along rows and columns to split the image into four subimages characterized by restricted frequency content. This process is repeated on the subimages, down to some maximum depth of decomposition, resulting in a segmentation of frequency space into subbands. Two such segmentations are depicted in Figure 4; the one on the right is used in the WSQ fingerprint image compression algorithm [2]. The resulting amplitudes again represent spatial frequency components, computed over portions of the picture determined by the depth of the subband and the location of the amplitude in its subband. Again, for images of limited spectral content, most of these amplitudes will be negligible. The two example subband decompositions are approximately radial with respect to the "origin" in the upper left-hand corner; this works well for isotropic images, i.e., where no direction is favored over any other.
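The block-DCT pixel-splitting scheme can be sketched in a few lines. The following Python fragment (a minimal illustration with fixed 8 x 8 blocks, not the variable-size version) applies an orthonormal 2-D DCT to each block of a smooth synthetic image and checks that the energy concentrates in a few amplitudes:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    m = np.cos(np.pi * (2 * j + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    m[0, :] /= np.sqrt(2.0)
    return m

def block_dct2(img, bs=8):
    """Cut the image into bs x bs blocks and apply the 2-D DCT to each,
    as in the JPEG pixel-splitting scheme."""
    d = dct_matrix(bs)
    out = np.empty_like(img, dtype=float)
    for i in range(0, img.shape[0], bs):
        for j in range(0, img.shape[1], bs):
            out[i:i+bs, j:j+bs] = d @ img[i:i+bs, j:j+bs] @ d.T
    return out

# A smooth synthetic image: the amplitudes should concentrate in a few
# low-frequency coefficients of each block.
x = np.linspace(0, 1, 32)
img = np.outer(np.sin(np.pi * x), np.cos(np.pi * x))
amps = block_dct2(img)
flat = np.sort(np.abs(amps).ravel())[::-1]
top_energy = float((flat[:flat.size // 10] ** 2).sum() / (flat ** 2).sum())
```

For spectrally limited images like this one, the largest 10% of amplitudes carry essentially all the energy, i.e. most amplitudes are negligible, as the text asserts.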
4 The joint best basis
Both splitting schemes can be organized as quadtrees to a specified depth, with the selected transform determined by the leaves of a subtree like the one depicted in Figure 5. To choose the subtree and thus the transform, each member of a representative training set of images is decomposed into the complete quadtree of amplitudes. Then the squares of these amplitudes are summed into a sum-of-squares quadtree. Using an information cost function such as "number of nonnegligible amplitudes", the sum-of-squares quadtree is searched for its best basis, which is the one that minimizes this cost ([5], p. 282). Figure 6 depicts this algorithm. The best basis for the sum-of-squares quadtree is the joint best basis for the training set of images 1, 2, ..., N. That is the transform which produces, on average, the largest number of negligible output coefficients. To find the best basis requires examining each coefficient in the quadtree and examining each subband or pixel block at most twice, which means that the complexity is O(P log P) for P-pixel images. To find the joint best basis requires building the sum-of-squares tree first, which dominates the total complexity with its O(NP log P) cost for a training set of N P-pixel images.

Figure 4: Division of an image into orthogonal wavelet subbands to level 5, or into the WSQ subbands. Frequencies increase down and to the right.

Figure 5: Splitting schemes produce quadtrees; custom bases are determined by the leaves of a subtree such as the one shown here, shaded for emphasis.

Figure 6: A joint best basis from a class of splitting algorithms is determined by a sample set of N images.

Of course, the joint best basis transform is only optimal within its own class, and the class is determined by the technical details and mathematical properties of the splitting algorithm. If these constraints were removed and the search performed over all orthonormal transforms, then the joint best basis would be the Karhunen-Loève (KL) or principal orthogonal basis [4], which is known to be the minimizer of the number of nonnegligible amplitudes. With the constraints, whose purpose is to speed things up, the chosen transform is just an approximation to KL.

Figure 7: A meta-algorithm for deciding which splitting algorithm to use with a particular class of images. [Schematic: the training set feeds the subband-splitting classes 1, ..., n and the pixel-splitting classes 1, ..., n; each path ends in the cost of that class's joint best basis, and the least cost determines the winning transform.]
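The best-basis search itself is a bottom-up comparison of a parent node's cost against the sum of its children's best costs. A 1-D Python sketch with Haar splits follows (the 2-D quadtree version of the paper replaces each pair of children with four; the cost function and threshold here are illustrative):

```python
import numpy as np

def haar_split(v):
    """One orthonormal Haar split of a vector into low/high halves."""
    s = 1.0 / np.sqrt(2.0)
    return (v[0::2] + v[1::2]) * s, (v[0::2] - v[1::2]) * s

def cost(v, eps=1e-2):
    """Information cost: number of nonnegligible amplitudes."""
    return int(np.count_nonzero(np.abs(v) > eps))

def best_basis(v, depth):
    """Keep a node if its cost beats the best total cost of its children;
    otherwise recurse, in the Coifman-Wickerhauser best-basis style."""
    if depth == 0 or v.size < 2:
        return cost(v), [v]
    lo, hi = haar_split(v)
    c_lo, b_lo = best_basis(lo, depth - 1)
    c_hi, b_hi = best_basis(hi, depth - 1)
    c_here = cost(v)
    if c_here <= c_lo + c_hi:
        return c_here, [v]
    return c_lo + c_hi, b_lo + b_hi

# A piecewise-constant signal compresses well under Haar splitting:
# 16 nonnegligible samples reduce to 2 nonnegligible coefficients.
sig = np.concatenate([np.ones(8), -np.ones(8)])
c, basis = best_basis(sig, depth=3)
```

The joint best basis of the paper runs exactly this search, but on the tree of summed squares over the whole training set rather than on a single signal.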
5 Choosing the best transform from multiple classes
There is a meta-algorithm for relaxing the constraints a bit while preserving the speed. Namely, a custom transform can be chosen by checking many classes of splitting algorithms in order to further increase the expected number of negligible coefficients. This scheme was first proposed by Yves Meyer, and is depicted in Figure 7. At the end of each path is a cost figure, the expected number of nonnegligible coefficients for the training set of images. The path that leads to the lowest cost determines which algorithm should be used to find the custom transform for compressing the images represented by the training set. Examples of different classes are the different subband splitting schemes associated to different conjugate quadrature filters ([5], Chapter 5 and Appendix C), or the adapted local trigonometric bases determined by different windows ([5], Chapters 3 and 4).
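In the same spirit, the meta-algorithm reduces to auditioning each class of splitting algorithms on the training set and keeping the cheapest. A toy Python sketch with two hypothetical "classes" (a Haar split versus no transform at all; real classes would be different filter banks or window families):

```python
import numpy as np

def count_big(coeffs, eps=1e-2):
    """Cost figure: expected number of nonnegligible coefficients."""
    return int(np.count_nonzero(np.abs(coeffs) > eps))

def haar_coeffs(v):
    s = 1.0 / np.sqrt(2.0)
    return np.concatenate([(v[0::2] + v[1::2]) * s, (v[0::2] - v[1::2]) * s])

def identity_coeffs(v):
    return v.copy()

# Hypothetical "classes": each entry is one splitting scheme to audition.
classes = {"haar": haar_coeffs, "none": identity_coeffs}

training = [np.concatenate([np.full(8, a), np.full(8, -a)])
            for a in (1.0, 2.0, 3.0)]

costs = {name: sum(count_big(t(x)) for x in training)
         for name, t in classes.items()}
winner = min(costs, key=costs.get)
```

The path with the lowest total cost over the training set determines the transform to use, just as in Figure 7.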
6 Conclusion
Given a training set of images, a transform coding image compression algorithm may be rationally chosen from a class of fast splitting algorithms. The choice criterion is a cost function that, when low, yields high compression ratios for transform coding image compression. The method works for wavelet packet and local trigonometric transforms and thus produces well-conditioned compression and decompression methods of complexity O(P log P) for P-pixel images. Searching for the best choice itself costs O(NP log P), where N is the number of training images.
References

[1] ISO/IEC JTC1 Draft International Standard 10918-1. Digital compression and coding of continuous-tone still images, part 1: Requirements and guidelines. Available from ANSI Sales, (212) 642-4900, November 1991. ISO/IEC CD 10918-1 (alternate number SC2 N2215).

[2] IAFIS-IC-0110v2. WSQ gray-scale fingerprint image compression specification. Version 2, US Department of Justice, Federal Bureau of Investigation, 16 February 1993.

[3] Gregory K. Wallace. The JPEG still picture compression standard. Communications of the ACM, 34:30-44, April 1991.

[4] Mladen Victor Wickerhauser. Fast approximate factor analysis. In Martine J. Silbermann and Hemant D. Tagare, editors, Curves and Surfaces in Computer Vision and Graphics II, volume 1610 of SPIE Proceedings, pages 23-32, Boston, October 1991. SPIE.

[5] Mladen Victor Wickerhauser. Adapted Wavelet Analysis from Theory to Software. A K Peters, Ltd., Wellesley, Massachusetts, 9 May 1994. With optional diskette.
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
Two-dimensional directional wavelets and image processing

Jean-Pierre Antoine
Institut de Physique Théorique, Université Catholique de Louvain
B-1348 Louvain-la-Neuve, Belgium
E-mail: [email protected]

Abstract

The two-dimensional continuous wavelet transform (CWT) is characterized by a rotation parameter, in addition to the usual translations and dilations. This enables it to detect edges and directions in images, provided a directional wavelet is used. In this paper, we review the general properties of the 2-D CWT, with special emphasis on the directional aspects. We discuss, in particular, the problem of wavelet calibration and we present several applications of directional wavelets.

1. The Continuous Wavelet Transform in two dimensions
Both in 1-D (signal analysis) and 2-D (image processing), the wavelet transform (WT) has become by now a standard tool (see [1]-[4] for a review). Although the discrete version, based on multiresolution analysis, is probably better known, the continuous WT (CWT) plays a crucial role for the detection and analysis of particular features in a signal, and we will focus here on the latter, with particular emphasis on the directional aspects. Indeed the CWT is a very efficient tool for detecting oriented features in a signal, provided one uses a directional wavelet, that is, a wavelet which has itself an intrinsic orientation. We refer the reader to [5, 6] for a detailed analysis.
1.1. Mathematical properties

By an image, we mean a 2-D signal of finite energy, represented by a function $s \in L^2(\mathbb{R}^2, d^2\vec{x})$. In practice, a black and white image will be represented by a bounded non-negative function: $0 \le s(\vec{x}) \le M$, $\forall \vec{x} \in \mathbb{R}^2$ ($M > 0$), the discrete values of $s(\vec{x})$ corresponding to the level of gray of each pixel. A wavelet is a function $\psi \in L^2(\mathbb{R}^2)$ which is admissible, that is:

$$c_\psi \equiv (2\pi)^2 \int \frac{d^2\vec{k}}{|\vec{k}|^2}\, |\hat\psi(\vec{k})|^2 < \infty. \tag{1.1}$$

If $\psi$ is regular enough, the admissibility condition simply means that the wavelet has zero mean:

$$\hat\psi(\vec{0}) = 0 \iff \int d^2\vec{x}\ \psi(\vec{x}) = 0. \tag{1.2}$$

In practice, both $\psi$ and its Fourier transform $\hat\psi$ are supposed to be well localized, and, in addition, the wavelet $\psi$ is often required to have a few vanishing moments, as in the 1-D case [2]. This condition improves the capacity of the WT to detect singularities.

Let now $s \in L^2(\mathbb{R}^2, d^2\vec{x})$ be an image. Its continuous wavelet transform with respect to the fixed wavelet $\psi$, $S \equiv W_\psi s$, is the scalar product of $s$ with the transformed wavelet $\psi_{a,\theta,\vec{b}}$, considered as a function of $(a, \theta, \vec{b})$ (for simplicity, we assume $\psi$ to be normalized by $c_\psi = 1$):

$$S(a, \theta, \vec{b}) = a^{-1} \int \overline{\psi\big(a^{-1} r_{-\theta}(\vec{x} - \vec{b})\big)}\, s(\vec{x})\, d^2\vec{x} = a \int e^{i\vec{b}\cdot\vec{k}}\, \overline{\hat\psi\big(a\, r_{-\theta}(\vec{k})\big)}\, \hat{s}(\vec{k})\, d^2\vec{k}. \tag{1.3}$$
In these relations, $\vec{b} \in \mathbb{R}^2$ is a translation, $a > 0$ a dilation, and $r_{-\theta}$ ($0 \le \theta < 2\pi$) denotes the usual $2 \times 2$ rotation matrix. The parameter space $G = \{(a, \theta, \vec{b})\}$ is in fact the similitude group of $\mathbb{R}^2$. Indeed the CWT, including the admissibility condition (1.1), originates from group theory, namely the natural representation of $G$ in the Hilbert space $L^2(\mathbb{R}^2, d^2\vec{x})$. The main properties of the wavelet transform $W_\psi : s \mapsto S$ may be summarized as follows [5, 6]:

• Since the wavelet $\psi$ is required to have zero mean, $W_\psi$ provides a filtering effect, exactly as in 1-D, i.e. the analysis is local in all four parameters $a, \theta, \vec{b}$, and it is particularly efficient at detecting discontinuities in images.

• $W_\psi$ is linear, contrary, for instance, to the Wigner-Ville transform, which is bilinear.

• $W_\psi$ is covariant under translations, dilations and rotations.

• $W_\psi$ conserves energy:

$$\iiint \frac{da}{a^3}\, d\theta\, d^2\vec{b}\; |S(a, \theta, \vec{b})|^2 = \int d^2\vec{x}\; |s(\vec{x})|^2, \tag{1.4}$$

i.e. it is an isometry from the space of signals into the space of transforms, which is a closed subspace of $L^2(G, dg)$, where $dg = a^{-3}\, da\, d\theta\, d^2\vec{b}$ is the natural invariant measure on $G$.

• As a consequence, $W_\psi$ is invertible on its range and the inverse transformation is simply the adjoint of $W_\psi$. Thus one has an exact reconstruction formula:

$$s(\vec{x}) = \iiint \frac{da}{a^3}\, d\theta\, d^2\vec{b}\; \psi_{a,\theta,\vec{b}}(\vec{x})\, S(a, \theta, \vec{b}). \tag{1.5}$$

In other words, the 2-D CWT provides a decomposition of the signal in terms of the analyzing wavelets $\psi_{a,\theta,\vec{b}}$, with coefficients $S(a, \theta, \vec{b})$.

• The projection from $L^2(G, dg)$ onto the range of $W_\psi$ is an integral operator, whose kernel $K$ is the autocorrelation function of $\psi$ (also called reproducing kernel):

$$K(a', \theta', \vec{b}'\,|\,a, \theta, \vec{b}) = \langle \psi_{a',\theta',\vec{b}'} \mid \psi_{a,\theta,\vec{b}} \rangle. \tag{1.6}$$

Therefore, transforms satisfy the reproduction property:

$$S(a', \theta', \vec{b}') = \iiint \frac{da}{a^3}\, d\theta\, d^2\vec{b}\; K(a', \theta', \vec{b}'\,|\,a, \theta, \vec{b})\, S(a, \theta, \vec{b}). \tag{1.7}$$
1.2. Interpretation and implementation: the various representations

The first problem one faces in practice is one of visualization. Indeed $S(a, \theta, \vec{b})$ is a function of 4 variables: two position variables $\vec{b} \in \mathbb{R}^2$, and the pair $(a, \theta) \in \mathbb{R}_+^* \times [0, 2\pi)$. This splitting has an intrinsic geometrical meaning [5, 6]. Indeed, the pair $(a^{-1}, \theta)$ plays the role of spatial frequency, expressed in polar coordinates, exactly as $a^{-1}$ defines the frequency scale in the 1-D case [7, 8]. Thus the full 4-D parameter space of the 2-D WT may be interpreted as a phase space, in the sense of classical mechanics. Now, to compute and visualize the full CWT in all four variables is hardly possible. Therefore, in order to obtain a manageable tool, one must restrict oneself to a section of the parameter space $\{a, \theta, b_x, b_y\}$. The geometrical considerations made above indicate that two of them are more natural: either $(a, \theta)$ or $(b_x, b_y)$ are fixed, and the WT is treated as a function of the two remaining variables. The corresponding representations have the following characteristics.

(i) The position representation, where $a$ and $\theta$ are fixed and the CWT is considered as a function of position $\vec{b}$ alone. This is the standard representation, and it is useful for the general purposes of image processing: detection of position, shape and contours of objects; pattern recognition; image enhancement by resynthesis after elimination of unwanted features (such as noise). Alternatively, one may use polar coordinates, in which case the variables are interpreted as range $|\vec{b}|$ and perception angle $\alpha$, another familiar representation of images.
(ii) The scale-angle representation: for fixed $\vec{b}$, the CWT is considered as a function of scale and angle $(a, \theta)$, i.e. of spatial frequency. In other words, one looks at the full CWT from $\vec{b}$, and observes all scales and all directions at once. The scale-angle representation will be interesting whenever scaling behavior (as in fractals) or angular selection is important, in particular when directional wavelets are used.

In addition to these two familiar representations, there are four other ones, corresponding to two-dimensional sections. Among these, the angle-angle representation might be useful for applications [10]. Here one fixes the range $|\vec{b}|$ and the scale $a$ and considers the CWT at all perception angles $\alpha$ and all anisotropy angles $\theta$.

For the numerical evaluation, discretization of the WT in any of these representations, together with systematic use of the FFT algorithm, leads to a numerical complexity of $3N_1N_2\log(N_1N_2)$, where $N_1$, $N_2$ denote the number of sampling points in the two remaining variables. The natural discretization is linear for the position variables $(b_x, b_y)$ and the angles $\theta, \alpha$, but logarithmic in the scale variable $a$.
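The FFT evaluation of the position representation is only a few lines. A minimal Python sketch of (1.3) at fixed $(a, \theta)$ follows (normalization constants are ignored and scales are in pixel units; the Laplacian-of-Gaussian wavelet, $|\vec{k}|^2 e^{-|\vec{k}|^2/2}$ in frequency, is used purely as an example):

```python
import numpy as np

def cwt2d(signal, psi_hat, a, theta):
    """2-D CWT at fixed (a, theta) for all positions b at once, using the
    Fourier-side form of (1.3) (up to the discretization constant).
    `psi_hat(kx, ky)` returns the wavelet's Fourier transform."""
    n = signal.shape[0]
    k = 2 * np.pi * np.fft.fftfreq(n)
    kx, ky = np.meshgrid(k, k, indexing="ij")
    c, s = np.cos(theta), np.sin(theta)
    kxr = a * (c * kx + s * ky)          # a * r_{-theta} applied to k
    kyr = a * (-s * kx + c * ky)
    return a * np.fft.ifft2(np.conj(psi_hat(kxr, kyr)) * np.fft.fft2(signal))

def log_hat(kx, ky):
    """Fourier transform of a Laplacian-of-Gaussian wavelet, up to a constant."""
    k2 = kx**2 + ky**2
    return k2 * np.exp(-k2 / 2)

# The WT of a square should vanish inside and outside and light up at edges.
img = np.zeros((64, 64))
img[20:44, 20:44] = 1.0
S = cwt2d(img, log_hat, a=1.0, theta=0.0)
```

Since the wavelet has zero mean, a constant signal gives a zero transform, and the response concentrates near the square's contour: the filtering effect described above.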
2. Choice of the analyzing wavelet

The next step is to choose an analyzing wavelet $\psi$. At this point, there are two possibilities, depending on the problem at hand, namely isotropic or directional wavelets.
2.1. Isotropic wavelets

If one wants to perform a pointwise analysis, that is, when no oriented features are present or relevant in the signal, one may choose an analyzing wavelet $\psi$ which is invariant under rotation. Then the $\theta$ dependence drops out, for instance, in the reconstruction formula (1.5). A typical example is the isotropic 2-D mexican hat wavelet, which is simply the Laplacian of a Gaussian:

$$\psi_H(\vec{x}) = (2 - |\vec{x}|^2)\, \exp\!\big(-\tfrac{1}{2}|\vec{x}|^2\big). \tag{2.1}$$
This is a real, rotation invariant wavelet, introduced by Marr [11]. The anisotropic version is obtained by replacing $\vec{x}$ in (2.1) by $A\vec{x}$, where $A = \mathrm{diag}[\epsilon^{-1/2}, 1]$, $\epsilon > 1$, is a $2 \times 2$ anisotropy matrix. However, this wavelet is of little use in practice, because it still acts as a second order operator and detects singularities in all directions. Indeed it is not a directional wavelet, in the technical sense defined below. Hence the mexican hat will be efficient for a fine pointwise analysis, but not for detecting directions.
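As a quick numerical sanity check (an illustration, not from the paper), one can sample (2.1) on a grid and verify the zero-mean admissibility condition (1.2), together with the location of the wavelet's minimum at $|\vec{x}| = 2$:

```python
import numpy as np

# Sample the isotropic mexican hat (2.1) and check the zero-mean
# admissibility condition (1.2) numerically.
x = np.linspace(-8, 8, 257)
X, Y = np.meshgrid(x, x)
r2 = X**2 + Y**2
psi_h = (2.0 - r2) * np.exp(-r2 / 2.0)

dx = x[1] - x[0]
mean = float(psi_h.sum()) * dx * dx   # approximates the integral of psi_h
min_val = float(psi_h.min())          # attained at |x| = 2: -2 e^{-2}
```

The integral vanishes because $\int (2 - r^2) e^{-r^2/2}\, d^2\vec{x} = 2 \cdot 2\pi - 4\pi = 0$, confirming (1.2) for this wavelet.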
2.2. Directional wavelets

When the aim is to detect oriented features (segments, edges, vector fields, ...) in an image, for instance to perform directional filtering, one has to use a wavelet which is not rotation invariant. The best angular selectivity will be obtained if $\psi$ is directional. By this we mean that the effective support of its Fourier transform $\hat\psi$ is contained in a convex cone in spatial frequency space $\{\vec{k}\}$, with apex at the origin, or a finite union of disjoint such cones (in that case, one will usually call $\psi$ multidirectional). According to this definition, the anisotropic mexican hat is not directional, since the support of $\hat\psi_H$ is centered at the origin, no matter how big its anisotropy is, and, indeed, detailed tests confirm its poor performance in selecting directions [5]. Typical directional wavelets are the 2-D Morlet wavelet and the Cauchy wavelets [6].
2.2.1. The 2-D Morlet wavelet

This is the prototype of a directional wavelet:

$$\psi_M(\vec{x}) = \exp(i\vec{k}_0 \cdot \vec{x})\, \exp\!\big(-\tfrac{1}{2}|A\vec{x}|^2\big), \tag{2.2}$$

$$\hat\psi_M(\vec{k}) = \sqrt{\epsilon}\, \exp\!\big(-\tfrac{1}{2}[\epsilon k_x^2 + (k_y - k_0)^2]\big). \tag{2.3}$$
The parameter $\vec{k}_0$ is the wave vector, and $A$ the anisotropy matrix as above. As in 1-D, we should add a correction term to (2.2) and (2.3) to enforce the admissibility condition $\hat\psi_M(\vec{0}) = 0$. However, since it is numerically negligible for $|\vec{k}_0| \ge 5.6$, we have dropped it altogether. The modulus of the (truncated) wavelet $\psi_M$ is a Gaussian, elongated in the $x$ direction if $\epsilon > 1$, and its phase is constant along the direction orthogonal to $\vec{k}_0$. Thus the wavelet $\psi_M$ smoothes the signal in all directions, but detects the sharp transitions in the direction perpendicular to $\vec{k}_0$. The angular selectivity increases with $|\vec{k}_0|$ and with the anisotropy $\epsilon$. The best selectivity will be obtained by combining the two effects, i.e. by taking $\vec{k}_0 = (0, k_0)$. The effective support of $\hat\psi_M$ is centered at $\vec{k}_0$ and is contained in a convex cone, which becomes narrower as $\epsilon$ increases.
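The angular selectivity claim can be checked directly on (2.3): evaluating $|\hat\psi_M|$ along the ring $|\vec{k}| = k_0$ shows how quickly the response drops as the direction rotates away from $\vec{k}_0$. A small numerical illustration (the values $k_0 = 6$, $\epsilon = 5$ are just representative choices):

```python
import numpy as np

def morlet_hat(kx, ky, k0=6.0, eps=5.0):
    """Fourier transform (2.3) of the truncated 2-D Morlet wavelet, with
    wave vector (0, k0) and anisotropy eps."""
    return np.sqrt(eps) * np.exp(-0.5 * (eps * kx**2 + (ky - k0)**2))

# Response on the ring |k| = k0 as the direction rotates away from k_0.
angles = np.deg2rad(np.arange(0, 91, 5))
resp = morlet_hat(6.0 * np.sin(angles), 6.0 * np.cos(angles))
```

The response peaks at perfect alignment and falls by more than two orders of magnitude within 15 degrees, which is the quantitative content of "the cone becomes narrower as $\epsilon$ increases".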
2.2.2. The Cauchy wavelet

Let $\mathcal{C} \equiv \mathcal{C}(\alpha, \beta) = \{\vec{k} \in \mathbb{R}^2 \mid \alpha \le \arg\vec{k} \le \beta\}$ be the convex cone determined by the directions $\alpha$ and $\beta$. The dual cone $\tilde{\mathcal{C}} = \tilde{\mathcal{C}}(\tilde\alpha, \tilde\beta) = \{\vec{x} \in \mathbb{R}^2 \mid \vec{x} \cdot \vec{k} > 0,\ \forall \vec{k} \in \mathcal{C}(\alpha, \beta)\}$ is also convex. Given a fixed vector $\vec\eta \in \tilde{\mathcal{C}}$, we define the Cauchy wavelet in spatial frequency variables [6]:

$$\hat\psi^{(\mathcal{C})}_{lm}(\vec{k}) = \begin{cases} (\vec{k} \cdot \vec{e}_\alpha)^l\, (\vec{k} \cdot \vec{e}_\beta)^m\, e^{-\vec{k}\cdot\vec\eta}, & \vec{k} \in \mathcal{C}(\alpha, \beta) \\ 0, & \text{otherwise}, \end{cases} \tag{2.4}$$

where $\vec{e}_\alpha$ (resp. $\vec{e}_\beta$) denotes the unit vector in the direction $\alpha$ (resp. $\beta$). The Cauchy wavelet $\hat\psi^{(\mathcal{C})}_{lm}$ is strictly supported in the cone $\mathcal{C}(\alpha, \beta)$, and the parameters $l, m \in \mathbb{N}^*$ give the number of vanishing moments on the edges of the cone. An explicit calculation then yields the following result:

$$\psi^{(\mathcal{C})}_{lm}(\vec{x}) = \mathrm{const.}\ (\vec{z} \cdot \vec{e}_\alpha)^{-l-1}\, (\vec{z} \cdot \vec{e}_\beta)^{-m-1}, \tag{2.5}$$

where we have introduced the complex variable $\vec{z} = \vec{x} + i\vec\eta \in \mathbb{R}^2 + i\tilde{\mathcal{C}}$. We show in Figure 1 the wavelet $\hat\psi^{(\mathcal{C})}_4$ for $\mathcal{C} = \mathcal{C}(-20°, 20°)$; this is manifestly a highly directional filter.
Figure 1: Two directional wavelets, in spatial frequency space: (left) the Morlet wavelet ($\epsilon = 5$, $\theta = 45°$); (right) the Cauchy wavelet ($\alpha = 20°$).
3. Evaluation of the performances of the CWT
Given a wavelet, what is its resolving power, in particular what is its angular and scale selectivity? What is the minimal discretization grid for the reconstruction formula (1.5) that guarantees that no information is lost? The answer to both questions resides in a quantitative knowledge of the properties of the wavelet at hand; that is, the tool must be calibrated. To that effect, one takes the WT of particular, standard signals. Three such tests have proven useful [5], and in each case the outcome may be viewed either at fixed $(a, \theta)$ (position representation) or at fixed $\vec{b}$ (scale-angle representation).

• Point signal: for a snapshot of the wavelet itself, one takes as signal a delta function, i.e. one evaluates the impulse response of the filter:

$$\langle \delta \mid \psi_{a,\theta,\vec{b}} \rangle = a^{-1}\, \psi\big(a^{-1} r_{-\theta}(-\vec{b})\big). \tag{3.1}$$
This yields the effective support of $\psi$, and from there one may define the resolving power of $\psi$.

• Reproducing kernel: taking as signal the wavelet $\psi$ itself, one obtains the reproducing kernel $K$, which measures the correlation length in each variable $a, \theta, \vec{b}$:

$$K(a, \theta, \vec{b} \mid 1, 0, \vec{0}) = \langle \psi_{a,\theta,\vec{b}} \mid \psi \rangle = a^{-1} \int d^2\vec{x}\ \overline{\psi\big(a^{-1} r_{-\theta}(\vec{x} - \vec{b})\big)}\, \psi(\vec{x}). \tag{3.2}$$

A detailed analysis of $K$ also yields the resolving power of the wavelet $\psi$ in each variable.

• Benchmark signals: for testing particular properties of the wavelet, such as its ability to detect a discontinuity or its angular selectivity in detecting a particular direction, one may use appropriate 'benchmark' signals.
3.1. The scale and angle resolving power

Suppose the wavelet $\hat\psi$ has its effective support in spatial frequency in a vertical cone of aperture $\Delta\varphi$, corresponding to $\vec{k}_0 = (0, k_0)$, and that the width of $\hat\psi$ in the $k_x$ and $k_y$ directions is given by $2w_x$, resp. $2w_y$. Then the wavelet $\hat\psi$ is concentrated in an ellipse of semi-axes $w_x$, $w_y$, and its radial support is $k_0 - w_y \le \rho \le k_0 + w_y$. Thus the scale width or scale resolving power (SRP) of $\psi$ is defined as:

$$\mathrm{SRP}(\psi) = \frac{k_0 + w_y}{k_0 - w_y}. \tag{3.4}$$

In the same way, one defines the angular width or angular resolving power (ARP) by considering the tangents to that ellipse. Then a straightforward calculation yields:

$$\mathrm{ARP}(\psi) = 2 \cot^{-1} \frac{\sqrt{k_0^2 - w_y^2}}{w_x} \equiv \Delta\varphi. \tag{3.5}$$

For instance, if $\psi$ is the (truncated) Morlet wavelet (2.2), one obtains:

$$\mathrm{SRP}(\psi_M) = \frac{k_0\sqrt{2} + 1}{k_0\sqrt{2} - 1}, \qquad \mathrm{ARP}(\psi_M) = 2 \cot^{-1} \sqrt{\epsilon(2k_0^2 - 1)}, \tag{3.6}$$

and, for $k_0 \gg 1$:

$$\mathrm{ARP}(\psi_M) \simeq 2 \cot^{-1}\!\big(k_0\sqrt{2\epsilon}\big). \tag{3.7}$$

This last expression coincides with the empirical result of [5]: the angular sensitivity of $\psi_M$ depends only on the product $k_0\sqrt{\epsilon}$. Notice also that the SRP is independent of the anisotropy factor $\epsilon$. If $\psi$ is the Cauchy wavelet (2.4) with support in the cone $\mathcal{C}(-\alpha, \alpha)$, the ARP is simply the opening angle $2\alpha$ of the supporting cone.

3.2. The reproducing kernel and the resolving power of the wavelet

A natural way of testing the correlation length of the wavelet is to analyze systematically its reproducing kernel. Let the effective support of the wavelet $\hat\psi$ in spatial frequency be, in polar coordinates, $\Delta\rho$ and $\Delta\varphi$. Then an easy calculation [6] shows that the effective support of $K$ is given by $a_{\min} = (\Delta\rho)^{-1} \le a \le a_{\max} = \Delta\rho$ for the scale variable, and $-\Delta\varphi \le \theta \le \Delta\varphi$ for the angular variable. Thus we may define the wavelet parameters (or resolving power) $\Delta\rho$, $\Delta\varphi$ in terms of the parameters $\Delta a$, $\Delta\theta$ of $K$, as:

• scale resolving power (SRP): $\Delta\rho = \sqrt{a_{\max}/a_{\min}}$;

• angular resolving power (ARP): $\Delta\varphi = \tfrac{1}{2}\Delta\theta$.

This result may be exploited for determining the minimal discretization grid needed for the numerical evaluation of the reconstruction integral (1.5). In particular, one may design a wavelet filter bank $\{\hat\psi_j(\vec{k})\}$ which yields a complete tiling of the spatial frequency plane in polar coordinates [6, 9]. Clearly this analysis is only possible within the scale-angle representation. Thus it requires the use of the CWT, and it is outside of the scope of the DWT, which is essentially limited to a Cartesian geometry.
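Assuming the reconstruction of (3.6)-(3.7) above, the resolving powers are easy to tabulate numerically; a small Python helper (using $\cot^{-1}(x) = \arctan(1/x)$ for $x > 0$):

```python
import numpy as np

def srp_morlet(k0):
    """Scale resolving power of the Morlet wavelet, per (3.6);
    note that it does not depend on the anisotropy eps."""
    return (np.sqrt(2) * k0 + 1) / (np.sqrt(2) * k0 - 1)

def arp_morlet(k0, eps):
    """Angular resolving power of the Morlet wavelet, per (3.6)."""
    return 2 * np.arctan(1 / np.sqrt(eps * (2 * k0**2 - 1)))

srp = srp_morlet(6.0)
arp_iso = arp_morlet(6.0, 1.0)     # isotropic envelope
arp_aniso = arp_morlet(6.0, 5.0)   # eps = 5: sharper angular selectivity
approx = 2 * np.arctan(1 / (6.0 * np.sqrt(2 * 5.0)))   # large-k0 form (3.7)
```

For $k_0 = 6$ the exact ARP and the large-$k_0$ approximation (3.7) already agree to within about one percent, and increasing $\epsilon$ shrinks the ARP while leaving the SRP untouched, as the text states.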
3.3. Calibration of a wavelet with benchmark signals

The capacity of the wavelet at detecting a discontinuity may be measured on a benchmark signal consisting of an infinite rod (see [5] for the full discussion). The result is that both the mexican hat and the Morlet wavelet are efficient in this respect. For testing the angular selectivity of a wavelet, one computes the WT of a segment, as a function of the difference in orientation, $\Delta\phi$, between the wavelet and the segment. The conclusion is that the Morlet wavelet is highly sensitive to orientation, but the mexican hat is not. For an eccentricity $\epsilon = 5$, $\psi_M$ detects the orientation of a segment with a precision of the order of 5°. That is, the WT reproduces the segment if $\Delta\phi < 5°$, but the latter becomes essentially invisible for $\Delta\phi > 15°$, except for the tips. In the end, the image of the segment reduces to two peaks corresponding to the two endpoints. The same test performed with an anisotropic mexican hat gives a result almost independent of $\Delta\phi$. Another way of comparing the angular selectivity of the two wavelets is to analyze a directional signal in the angle-angle representation $(\alpha, \theta)$ described above. The result confirms the previous one.
4. Application 1: Directional filtering

As a consequence of its good directional selectivity, the Morlet wavelet is quite efficient for directional filtering. A good illustration is the analysis of a pattern made of rods in many different directions [6]. Applying the CWT with a fixed direction selects all those rods with roughly the same direction, whereas the other ones, which are misaligned, yield only a faint signal corresponding to their tips. The same two operations are then repeated with various successive orientations of the wavelet. In this way, one can count the number of objects that lie in any particular direction. This method yields an elegant solution to a standard problem in fluid dynamics, namely, to measure the velocity field of a 2-D turbulent flow around an obstacle [12].

The directional selectivity of a wavelet may also be used for evaluating the symmetry of a given object. Let $S(a, \theta, \vec{b})$ be the wavelet transform of such an object with respect to the Cauchy wavelet. Define the following positive valued function, called the angular measure of the signal:

$$\mu_s(a, \theta) = \int d^2\vec{b}\ |S(a, \theta, \vec{b})|^2.$$

This is different from using the scale-angle representation, where the position parameter $\vec{b}$ is fixed [6]. Here, on the contrary, $\mu_s$ averages over all points in the plane, thus eliminating the dependence on the point of observation. For any signal of finite energy, it is clear that $\mu_s$ is a continuous bounded function of $a$ and $\theta$. Let us fix the scale, $a = a_0$, and consider $\mu_s(a_0, \theta)$ as a function of the rotation angle only. In general, it is a $2\pi$-periodic function of $\theta$. But when the analyzed object has rotational symmetry $n$, that is, it is invariant under a rotation of angle $\frac{2\pi}{n}$, the angular measure is in fact $\frac{2\pi}{n}$-periodic. To give a simple example, we consider three geometrical figures [13]: a square, a rectangle and a regular hexagon.
The square has symmetry $n = 4$; its angular measure $\mu_s(a_0, \theta)$ is thus $\frac{\pi}{2}$-periodic and shows four identical peaks at $\theta = 0°, 90°, 180°, 270°$. The width of these peaks is simply the aperture of the cone defined in (2.4). The rectangle has symmetry $n = 2$, and indeed its angular measure has two large peaks corresponding to the long edges and two smaller peaks corresponding to the short ones. Finally the hexagon has symmetry $n = 6$, and its angular measure shows six equal peaks. The same technique also allows one to identify the symmetry of a lattice or a quasi-lattice.
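The angular measure is straightforward to compute with the FFT. A sketch follows, using a Morlet-type wavelet as a stand-in for the paper's Cauchy wavelet, on a rectangle (the pixel-unit values $k_0 = 2.5$, $\epsilon = 10$ are illustrative choices, and the strict admissibility correction is omitted):

```python
import numpy as np

def angular_measure(img, theta, a=1.0, k0=2.5, eps=10.0):
    """mu_s(a, theta) = integral over b of |S(a, theta, b)|^2, computed via
    the FFT with a Morlet-type directional wavelet (a stand-in for the
    paper's Cauchy wavelet)."""
    n = img.shape[0]
    k = 2 * np.pi * np.fft.fftfreq(n)
    kx, ky = np.meshgrid(k, k, indexing="ij")
    c, s = np.cos(theta), np.sin(theta)
    kxr = a * (c * kx + s * ky)          # r_{-theta} rotation, then dilation
    kyr = a * (-s * kx + c * ky)
    psi_hat = np.exp(-0.5 * (eps * kxr**2 + (kyr - k0)**2))
    S = np.fft.ifft2(np.conj(psi_hat) * np.fft.fft2(img))
    return float(np.sum(np.abs(S) ** 2))

# A rectangle has symmetry n = 2: mu_s is pi-periodic in theta, and the
# two peaks from the long edges exceed those from the short edges.
img = np.zeros((64, 64))
img[24:40, 8:56] = 1.0
thetas = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
mus = [angular_measure(img, t) for t in thetas]
```

Note that for a real image and this modulus-squared measure, $\pi$-periodicity also follows directly from $\hat{s}(-\vec{k}) = \overline{\hat{s}(\vec{k})}$; the rectangle's $n = 2$ symmetry shows up in the long-edge orientation dominating the short-edge one.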
5. Application 2: Disentangling of a wave train

A second example concerns the disentangling of a wave train, i.e. a linear superposition of damped plane waves. The problem originates from underwater acoustics: when a point source emits a sound wave above the surface of water, the wave hitting the surface splits into several components of very different characteristics, and the goal is to measure the parameters of all components. In the 2-D case [6], the underwater wave train is represented by the following signal:

$$f(\vec{x}) = \sum_{n=1}^{N} c_n\, e^{i\vec{k}_n \cdot \vec{x}}\, e^{-\vec{l}_n \cdot \vec{x}}, \tag{5.1}$$
where, for each component, $\vec{k}_n$ is the wave vector, $\vec{l}_n$ is the damping vector, and $c_n$ a complex amplitude. The method proceeds in three steps. First one computes the CWT of the signal (5.1) with a Morlet wavelet. Of course, by linearity, the result is the linear superposition of the contributions of the various components. Now we go to the scale-angle representation and write the WT, for fixed $\vec{b}$, as:
$$F(a, \theta, \vec{b}) = \sum_{n=1}^{N} c_n\, F_n(a, \theta). \tag{5.2}$$
We notice that each term $F_n(a, \theta)$ in this superposition admits a unique local maximum. Suppose that these local maxima are well separated. Then, barring some interference effects (which may often be alleviated by increasing the selectivity of the wavelet), one may write:

$$|F(a, \theta, \vec{b})| \simeq \sum_{n=1}^{N} |c_n|\, |F_n(a, \theta)|. \tag{5.3}$$
One then reverts to the position representation, choosing for $(a, \theta)$ successively each of the maxima. Then the filtering effect of the CWT essentially eliminates all components except the $n$-th one, which is then easy to treat. In this way, one is able to measure easily all the $6N$ parameters of the signal [6].
6. Application 3: Character recognition

Exactly as in the 1-D case, the WT is especially useful to detect discontinuities in images, for instance the contour or the edges of an object [5]. Here an isotropic wavelet may be chosen, e.g. the radial mexican hat $\psi_H$. In that case the effect of the WT consists in smoothing the signal with a Gaussian and taking the Laplacian of the result. Thus large values of the amplitude will appear at the location of the discontinuities, in particular the contour of objects (which is a discontinuity in luminosity). In order to test this property, we compute the WT of a set with the shape of a thick letter 'A', represented by its characteristic function [5]. For large values of the scale parameter $a$, the WT sees only the object as a whole, thus allowing the determination of its position in the plane. When $a$ decreases, increasingly finer details appear. The WT vanishes both inside and outside the contour, since the signal is constant there; thus only the contour remains, and it is perfectly seen at $a = 0.075$ (Figure 2).
Figure 2: Detecting the contour of the letter A with the radial mexican hat: (left) the CWT at $a = 0.075$, in level curves; (right) the same, in 3D perspective.

Of course, when $a$ gets too small, numerical artefacts (aliasing) appear and spoil the result. The corners of the figure are highlighted in the WT by sharp peaks. The amplitude is larger at these points, since
the signal is singular there in two directions, as opposed to the edges. In addition the WT detects the convexity of each corner. The six convex corners give rise to positive peaks, the concave ones yield negative peaks, because we use a real wavelet and plot the WT itself, not its modulus.

This exercise leads to an algorithm for automatic character recognition [10]. The letter 'A', for instance, is entirely characterized by the succession of its 12 corners and a logical flag (concavity or convexity) for each of them. The algorithm consists in locating the local maxima of the CWT and eliminating everything else by thresholding, and it is able to detect an 'A' unambiguously. Actually, since only the corners are needed, we may as well use a wavelet that sees only the corners, not the edges: typically, a directional wavelet (when misaligned), or a real wavelet such as the gradient wavelets $\partial_x \exp(-|\vec{x}|^2)$ or $\partial_y \exp(-|\vec{x}|^2)$. This simple technique may be further improved by adding some denoising and the inclusion of a second wavelet capable of dealing with letters of arbitrary shape (for instance, a ring-shaped wavelet sensitive to circular shapes). In addition, the automatic recognition device will need some training. An elegant solution would then be to use the simple wavelet treatment as a preprocessing for some sort of 'intelligent' device, such as a neural network.

References

[1] I. Daubechies, Ten Lectures on Wavelets, SIAM, Philadelphia, PA, 1992
[2] Y. Meyer, Wavelets: Algorithms and Applications, SIAM, Philadelphia, PA, 1993
[3] Y. Meyer (ed.), Wavelets and Applications, Springer-Verlag, Berlin, and Masson, Paris, 1991
[4] Y. Meyer and S. Roques (eds.), Progress in Wavelet Analysis and Applications, Ed. Frontières, Gif-sur-Yvette, 1993
[5] J-P. Antoine, P. Carrette, R. Murenzi and B. Piette, Image analysis with 2-D continuous wavelet transform, Signal Proc. 31 (1993) 241-272
[6] J-P. Antoine and R. Murenzi, Two-dimensional directional wavelets and the scale-angle representation, Signal Proc. 53 (1996) (to appear)
[7] S.G. Mallat, A theory for multiresolution signal decomposition: the wavelet representation, IEEE Trans. Pattern Anal. Machine Intell. 11 (1989) 674-693
[8] I. Daubechies, The wavelet transform, time-frequency localization and signal analysis, IEEE Trans. Inform. Theory 36 (1990) 961-1005
[9] J-P. Antoine and R. Murenzi, The continuous wavelet transform, from 1 to 3 dimensions, in Subband and Wavelet Transforms: Design and Applications, pp. 149-187; A.N. Akansu and M. Smith (eds.), Kluwer, Dordrecht, 1995
[10] J-P. Antoine, P. Vandergheynst, K. Bouyoucef and R. Murenzi, Alternative representations of an image via the 2D wavelet transform. Application to character recognition, Proc. Conf. "Visual Information Processing IV", SPIE's 1995 Symposium on Optical Engineering/Aerospace Sensing and Dual Use Photonics, 2488 (1995) 486-497
[11] D. Marr, Vision, Freeman, San Francisco, 1982
[12] W. Wisnoe, P. Gajan, A. Strzelecki, C. Lempereur and J-M. Mathé, The use of the two-dimensional wavelet transform in flow visualization processing, in [4], pp. 455-458
[13] J-P. Antoine and P. Vandergheynst, 2-D Cauchy wavelets and symmetries in images, Proc. 1996 IEEE Intern. Conf. on Image Processing (ICIP-96), Lausanne (to appear)
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors). © 1996 Elsevier Science B.V. All rights reserved.
The Importance of the Phase of the Symmetric Daubechies Wavelets Representation of Signals

J.-M. Lina¹,²* and P. Drouilly¹
1: Centre de Recherches Mathématiques, Univ. de Montréal, C.P. 6128 Succ. Centre-Ville, Montréal (Québec), H3C 3J7, Canada
2: Atlantic Nuclear Services Ltd., Fredericton, New Brunswick, E3B 5C8, Canada
Abstract
The multiscale representations of signals based on Symmetric Daubechies Wavelets (SDW) are complex-valued. This work investigates the role of the phase in this representation and describes an iterative algorithm that restores the signals from the phase information only. We then discuss some applications in signal processing based on the information encoded in the phase of the complex wavelet coefficients.

1. Introduction
The title of this article deliberately refers to some early works related to complex representations of signals. In the early eighties, Oppenheim, Lim and Hayes investigated the "importance of the phase in signal processing" using the usual Fourier representation of signals [1]. More recently, the "Marseille group" also promoted the importance of the phases in the continuous wavelet representation of analytic signals [2]. Their so-called "ridge and skeleton" representation, based on both modulus and phase of the complex wavelet modes, is by now an important tool for the analysis of nonstationary signals. The recently studied Symmetric Daubechies Wavelets (SDW) [3] also exhibit complex-valued wavelets; the present work investigates the role of the phase in such an orthogonal basis. The paper is organized as follows. Section 2 briefly describes the SDW wavelet basis; more details and results can be found in Refs. [3,4]. In Section 3, the "phase reconstruction" algorithm is described and commented on. The "importance of the phases" and the applications it suggests are finally discussed in the conclusion.

2. The SDW basis

The SDW multiresolution basis is endowed with the usual Daubechies properties: the scaling function φ(x) and the wavelet ψ(x) are compactly supported inside the interval [−J, J+1] for some integer J, the set {ψ_{j,k}(x) = 2^{j/2} ψ(2^j x − k)} is an orthonormal basis of L²(ℝ), and ψ has J vanishing moments. The symmetry condition forces J to be even, and the so-called SDW solutions are complex-valued but not in quadrature. In general, a field with finite energy will be "empirically known" at some scale and expanded in some approximation space V_{j_max} spanned by the suitably scaled scaling functions φ_{j_max,k}(x) = 2^{j_max/2} φ(2^{j_max} x − k):

f(x) = Σ_k c_k^{j_max} φ_{j_max,k}(x)    (1)

The discrete multiresolution analysis of f then consists in the computation of the coefficients of the expansion

f(x) = Σ_k c_k^{j_0} φ_{j_0,k}(x) + Σ_{j=j_0}^{j_max−1} Σ_k d_k^j ψ_{j,k}(x)    (2)

where j_0 is a given low resolution scale and

c_k^j = ⟨φ_{j,k}|f⟩,    d_k^j = ⟨ψ_{j,k}|f⟩    (3)

*e-mail: [email protected]
are the orthogonal projection components of f(x) onto the multiresolution basis. The change of basis is done through the well-known fast wavelet decomposition algorithm (FWT), composed of the low-pass complex filter a_k and the corresponding high-pass complex filter b_k = (−1)^k ā_{1−k}:

c_n^{j−1} = √2 Σ_k ā_{k−2n} c_k^j    and    d_n^{j−1} = √2 Σ_k b̄_{k−2n} c_k^j    (4)

Conversely, the reconstruction is expressed by the inverse FWT:

c_n^j = √2 Σ_k a_{n−2k} c_k^{j−1} + √2 Σ_k b_{n−2k} d_k^{j−1}    (5)
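One FWT stage (4) and its inverse (5) can be sketched with a periodized implementation and a generic orthonormal filter. The alternating-flip high-pass below is a common variant of the convention b_k = (−1)^k ā_{1−k} used in the text, and the real Haar filter stands in for a complex SDW filter; this is an illustrative sketch, not the authors' code:

```python
import numpy as np

def analysis(c0, h):
    """One FWT step (cf. Eq. (4)): periodized filtering and downsampling by 2.
    h is the low-pass filter; g is its alternating flip (high-pass)."""
    N, L = len(c0), len(h)
    g = np.array([(-1) ** k * np.conj(h[L - 1 - k]) for k in range(L)])
    c1 = np.array([sum(np.conj(h[k]) * c0[(2 * n + k) % N] for k in range(L))
                   for n in range(N // 2)])
    d1 = np.array([sum(np.conj(g[k]) * c0[(2 * n + k) % N] for k in range(L))
                   for n in range(N // 2)])
    return c1, d1

def synthesis(c1, d1, h):
    """Inverse FWT step (cf. Eq. (5)): upsampling by 2 and filtering."""
    N, L = 2 * len(c1), len(h)
    g = np.array([(-1) ** k * np.conj(h[L - 1 - k]) for k in range(L)])
    rec = np.zeros(N, dtype=complex)
    for n in range(N // 2):
        for k in range(L):
            rec[(2 * n + k) % N] += h[k] * c1[n] + g[k] * d1[n]
    return rec

# Perfect reconstruction, checked here with the (real) Haar filter:
x = np.arange(8, dtype=complex)
h = np.array([1.0, 1.0]) / np.sqrt(2)
assert np.allclose(synthesis(*analysis(x, h), h), x)
```

Any orthonormal filter of this family gives perfect reconstruction; only the coefficient values change when a complex SDW filter is substituted.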
The real and imaginary parts of the complex scaling function and wavelet are endowed with many interesting properties, studied in Ref. [4]. Figures 1 and 2 show examples of those functions for J = 2 and J = 4.

Figure 1: SDW for J = 2. Left: φ, right: ψ.

Figure 2: SDW for J = 4. Left: φ, right: ψ. The corresponding filter coefficients √2 a_k are given in the following table:
  k        J = 2                      J = 4
  1    0.662912 + 0.171163i       0.643003 + 0.182852i
  2    0.110485 − 0.085581i       0.151379 − 0.094223i
  3    0.066291 − 0.085581i      −0.080639 − 0.117947i
  4    0.000000                   0.017128 + 0.008728i
  5    0.000000                   0.010492 + 0.020590i
It is worth mentioning that the second centered moment of the real part of the scaling function always vanishes. As a consequence, assuming that f is a real function sampled at the scale 2^{−j_max}, i.e. x_k = (2k−1)/2^{j_max+1}, the coefficients of the expansion (1) are given by

c_k^{j_max} = f(x_k) + O(2^{−4 j_max}) + i O(2^{−2 j_max})    (6)
The scaling coefficients are thus well approximated by the values of the function at the regular sampling points.

3. Phase and POCS

We first define two projectors, P_R and P_Γ. The projector P_R extracts the first order approximation of the scaling coefficients of the expansion (1) at the finest resolution (ℜ denotes the real part):

P_R(c^{j_max}) = ℜ(c^{j_max})    (7)

Let us now consider the wavelet expansion (2) of a given field f_0 and define the phases of the wavelet coefficients, θ_k^j = Arg(d_k^j). We observe that the new set of functions Ψ_{j,k}(x) = e^{iθ_k^j} ψ_{j,k}(x) is also an orthonormal basis of L²(ℝ): this "local rotation" of the wavelet basis leads to a multiwavelet basis adapted to the signal. Indeed, we define the isophase space Γ as the set of all expansions

f(x) = Σ_k c_k^{j_0} φ_{j_0,k}(x) + Σ_{j=j_0}^{j_max−1} Σ_k r_k^j Ψ_{j,k}(x)    (8)

where the coefficients r_k^j are now positive real numbers. P_Γ is the orthogonal projector onto this space; it depends on the phases of the wavelet coefficients of the original field we start with. Given an arbitrary wavelet expansion of the form (2) with d_k^j = w_k^j + i v_k^j, the projection onto the isophase space is defined by the closest point on Γ, i.e.

P_Γ(f) = Σ_k ⟨φ_{j_0,k}|f_0⟩ φ_{j_0,k} + Σ_{j=j_0}^{j_max−1} Σ_k r_k^j Ψ_{j,k},  with  r_k^j = cos θ_k^j w_k^j + sin θ_k^j v_k^j if this quantity is positive, and r_k^j = 0 otherwise.    (9)

We further observe that both P_R and P_Γ project onto convex spaces (POCS). Considering an arbitrary point f̃_0 in Γ, a well-known theorem [1] states that the sequence of alternate projections

f_n = (P_R P_Γ)^n P_R(f̃_0)    (10)

converges. In the present case, the limit point is the original real signal f_0 from which we defined Γ. The two-dimensional generalization of this algorithm is straightforward, using the usual cross-product of the 1-D multiresolution bases. To illustrate the "phase reconstruction algorithm", Fig. 3 displays the original picture (f_0), the initial point P_R(f̃_0) (obtained by setting to zero all the moduli of the wavelet coefficients of the four-level decomposition, i.e. j_0 = j_max − 4, with SDW J = 2) and the POCS reconstructions f_{n=100} and f_{n=1000}.

Figure 3: From left to right: f_0, P_R(f̃_0), f_{100} and f_{1000}.

We first notice that the POCS gradually restores the details of the image from coarse to fine. As we can see in Eq. (9), the projector P_Γ "shrinks", even to 0, the moduli of the wavelet coefficients. It is worth recalling that shrinkage techniques are nowadays an efficient tool for denoising [5]. Phases thus encode the "coherent" structures of the signal, and the POCS algorithm reconstructs the original image through the coherency of the encoded information. In order to quantify this process, Fig. 4(a) shows the evolution of the theoretical dimension (exponential of the entropy) and Fig. 4(b) displays the ratio of phases effectively used at each iteration of POCS. The restoration of the moduli of the wavelet coefficients is illustrated in Fig. 4(c) for the level j = j_0 = j_max − 4 and in Fig. 4(d) for j = j_max − 1, both for f_{n=1000}. We can observe the resulting shrinkage of the wavelet coefficients, which depends on the scale of the details.
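The alternating-projection iteration (10) can be sketched with the Fourier basis standing in for the SDW basis (the isophase projector then has exactly the per-coefficient form of Eq. (9)). Since both projectors fix f_0 and project onto convex sets, the error ‖f_n − f_0‖ is non-increasing; this is an illustrative sketch, not the authors' implementation:

```python
import numpy as np

def proj_real(f):
    # P_R: orthogonal projection onto real-valued signals (cf. Eq. (7)).
    return np.real(f)

def proj_isophase(f, theta):
    # P_Gamma: closest signal whose Fourier coefficients carry the phases
    # `theta` with non-negative moduli -- the per-coefficient rule of Eq. (9):
    # r = max(0, cos(theta) w + sin(theta) v) = max(0, Re(F e^{-i theta})).
    F = np.fft.fft(f)
    r = np.maximum(np.real(F * np.exp(-1j * theta)), 0.0)
    return np.fft.ifft(r * np.exp(1j * theta))

rng = np.random.default_rng(0)
f0 = rng.standard_normal(64)
theta = np.angle(np.fft.fft(f0))                 # the retained phase information

f = proj_real(np.fft.ifft(np.exp(1j * theta)))   # phase-only starting point
errors = []
for _ in range(20):                              # f_{n+1} = P_R P_Gamma f_n, cf. Eq. (10)
    f = proj_real(proj_isophase(f, theta))
    errors.append(np.linalg.norm(f - f0))

# Projections onto convex sets containing f0 are non-expansive:
assert all(e2 <= e1 + 1e-9 for e1, e2 in zip(errors, errors[1:]))
```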
Figure 4: (a) theoretical dimension vs. iteration; (b) phases vs. iteration (%); (c) coarse scale wavelet modulus of f_{1000} vs. original wavelet modulus; (d) finer scale wavelet modulus of f_{1000} vs. original wavelet modulus.

Let us further mention two points: (i) the significant speed-up of the POCS algorithm obtained by adaptively fixing a relaxation parameter in the isophase projector and, last but not least, (ii) the alternative possibilities in choosing the Symmetric Daubechies Wavelet basis since, as shown in Ref. [4], there exist several such solutions for a given number of vanishing moments J.

4. Conclusion

The present work emphasizes the role of the phase of the wavelet modes in the rather new context of Symmetric Daubechies Wavelet analyses. An immediate application is an iterative process for denoising signals for which little information about the noise content is available. Another application presently under development is edge enhancement by modifying the phases of the wavelet coefficients. This technique, which relies on the phases of the complex scaling coefficients at each scale of the decomposition [4], can be coupled with the shrinkage of the wavelet coefficients, leading to simultaneous denoising. Most of the material presented here is detailed in Ref. [6]. This work was supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada.

REFERENCES
1. A.V. Oppenheim and J.S. Lim, "The importance of phase in signals", Proc. IEEE, vol. 69, pp. 529-541, 1981; M. Hayes, "The reconstruction of a multidimensional sequence from the phase or magnitude of its Fourier transform", IEEE Trans. ASSP, vol. 30, pp. 140-154, 1982; D.C. Youla and H. Webb, "Image restoration by the method of convex projections", IEEE Trans. on Medical Imaging, vol. 1(2), pp. 81-94, 1982.
2. A. Grossmann, R. Kronland-Martinet and J. Morlet, "Reading and Understanding Continuous Wavelet Transforms", in Wavelets, Time-Frequency Methods and Phase Space, J.M. Combes et al., Eds., Springer-Verlag, 1989.
3. J.M. Lina and M. Mayrand, "Complex Daubechies Wavelets", Appl. Comp. Harmonic Anal., vol. 2, pp. 219-229, 1995; W. Lawton, "Applications of Complex Valued Wavelet Transforms to Subband Decomposition", IEEE Trans. on Signal Proc., vol. 41, pp. 3566-3568, 1993.
4. J.M. Lina, "Image Processing with Complex Daubechies Wavelets", CRM preprint 2335, to appear in the special wavelet issue of the Jour. of Math. Imaging and Vision, 1996.
5. R. De Vore and B.J. Lucier, "Fast wavelet techniques for near-optimal image processing", Proc. 1992 IEEE Military Commun. Conf., IEEE Communications Soc., NY, 1992; D. Donoho and I. Johnstone, "Adapting to unknown smoothness via wavelet shrinkage", to be published in J. Amer. Statist. Assoc., 1995 (and references therein).
6. P. Drouilly, M.Sc. Thesis, Physics Dept. and CRM, Univ. of Montreal, 1996.
Contrast enhancement in images using the two-dimensional continuous wavelet transform

Jean-Pierre Antoine and Pierre Vandergheynst*
Institut de Physique Théorique, Université Catholique de Louvain, B-1348 Louvain-la-Neuve, Belgium
E-mail: [email protected]

Abstract

The 2-D Continuous Wavelet Transform (CWT) is now a standard tool in image analysis. In this paper we recall the definition of local contrast, introduced by M. Duval-Destin [3], and show how it can be used to obtain a discrete reconstruction formula.

1 The 2-D Continuous Wavelet Transform
Let us first recall the basic properties of the 2-D CWT. We represent an image by a function s ∈ L²(ℝ², d²x). In practice it is often a positive-valued function whose discrete values correspond to the gray level of each pixel. Then a wavelet is just a function ψ ∈ L²(ℝ², d²x) that satisfies the following admissibility condition:

c_ψ ≡ (2π)² ∫ |ψ̂(k)|² / |k|²  d²k < ∞    (1.1)

For a sufficiently regular function, this simply means that the wavelet has zero mean:

ψ̂(0) = 0  ⟺  ∫ d²x ψ(x) = 0.    (1.2)

The continuous wavelet transform of the image s with respect to the wavelet ψ is defined as the scalar product of s with the transformed wavelet ψ_{a,θ,b}:

S(a, θ, b) = ⟨ψ_{a,θ,b}|s⟩ = a⁻² ∫ ψ̄(a⁻¹ r_{−θ}(x − b)) s(x) d²x = ∫ e^{ib·k} ψ̂(a r_{−θ} k)‾ ŝ(k) d²k,    (1.3)

where we assumed ψ to be normalized (c_ψ = 1). This transform has nice natural properties (see [1] for a quick review); in particular, it is invertible:

s(x) = ∫∫∫ (da/a) dθ d²b  ψ_{a,θ,b}(x) S(a, θ, b).    (1.4)
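The zero-mean condition (1.2) can be checked numerically: implementing the CWT at a fixed scale in the Fourier domain with an isotropic Mexican hat, the transform of a constant image vanishes up to round-off. A minimal sketch (grid size and scale are arbitrary choices, not values from the paper):

```python
import numpy as np

# Isotropic Mexican hat in the Fourier domain: psi_hat(k) ~ |k|^2 exp(-|k|^2/2),
# so psi_hat(0) = 0, i.e. the wavelet has zero mean (Eq. (1.2)).
N, a = 64, 2.0
k = 2 * np.pi * np.fft.fftfreq(N)
kx, ky = np.meshgrid(k, k)
k2 = kx ** 2 + ky ** 2
psi_hat = (a ** 2 * k2) * np.exp(-(a ** 2 * k2) / 2)

s = np.full((N, N), 3.0)                              # constant "image"
S = np.fft.ifft2(np.fft.fft2(s) * np.conj(psi_hat))   # CWT at scale a, angle 0
assert np.max(np.abs(S)) < 1e-9                       # zero response to a constant
```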
2 Local Contrast

Psychophysicists usually think that our visual system is contrast sensitive, that is, it reacts to relative variations of the image intensity. On the other hand, it is well known that the 2-D CWT is particularly good at detecting absolute variations of the intensity, for example discontinuities [2]. Following this idea, it might be useful to introduce an adaptive normalization of the CWT which takes care of the surrounding luminance of each pixel. More precisely, let h ∈ L¹(ℝ²) ∩ L²(ℝ²) be a real, positive-valued function.* We also take h and the wavelet ψ isotropic, in order to get rid of any directional sensitivity. We define the luminance level of the image s around position b as

M_s(a, b) = ||h||₁⁻¹ ∫_{ℝ²} d²x s(x) h_{a,b}(x),    (2.1)

where h_{a,b}(x) = a⁻² h(a⁻¹(x − b)). Since the image and the normalization function are positive, M_s is also positive. M_s plays the role of a local mean of s, and the overall luminance of the image does not depend on a, the scale parameter fixing the size of the region taken into account. Introduce now the following functional:

C_s(a, b) = S(a, 0, b) / M_s(a, b).    (2.2)

C_s(a, b) is well defined for all a > 0 and b ∈ ℝ², if and only if the essential support of the wavelet is included in the corresponding support of the normalization function h. Then, by positivity of h and s,

M_s(a, b) = 0  ⟹  C_s(a, b) = 0.

C_s(a, b) is called the local contrast of the image s around b at scale a [3]. One easily verifies from (2.2) that C_s takes its largest values in regions of low luminance. A nice simplification arises when the wavelet ψ is taken as a difference of two positive functions,

ψ(x) = α⁻² h(α⁻¹ x) − h(x)    (0 < α < 1).

The function ψ satisfies (1.1) if the first moment of h vanishes at the origin. Taking the same function to compute the luminance, we have

C_s(a, b) = ⟨ψ_{a,b}|s⟩ / ⟨h_{a,b}|s⟩ = ⟨h_{αa,b}|s⟩ / ⟨h_{a,b}|s⟩ − 1.    (2.3)

*Boursier F.R.I.A., Belgium.
The support condition imposed on the wavelet now turns into a constraint on h alone:

supp{α⁻² h(α⁻¹ ·)} ⊆ supp h,

which means that the support of h is a star-shaped domain around the origin. Notice that it suffices that h decays radially for C_s to be bounded. In the real world, the approximation of an image at a fixed resolution is the only accessible data. This can be viewed as an estimation of the luminance of the image at a given scale. In this case, using difference wavelets, local contrast yields a simple reconstruction scheme. Let a₀ be the finest resolution (scale), that is, M_s(a₀, b) is the original dataset, and let M_s(a, b) be a low resolution approximation of the image, with a₀ = αⁿ a, α < 1. We have

M_s(αa, b) = M_s(a, b) · (C_s(a, b) + 1),
M_s(α²a, b) = M_s(αa, b) · (C_s(αa, b) + 1) = M_s(a, b) · (C_s(a, b) + 1) · (C_s(αa, b) + 1).

By recurrence, we find a multiplicative reconstruction formula:

M_s(a₀, b) = M_s(a, b) ∏_{l=0}^{n−1} (C_s(αˡ a, b) + 1).    (2.4)

That is, one reconstructs the original image at full resolution by starting with a low resolution approximation and adding successive details. This clearly mimics the usual multiresolution analysis scheme. Figure 1 shows an example of a local contrast chart of an image, and its reconstruction using (2.4).
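The recurrence above holds exactly for any positive smoothing kernel, so the multiplicative formula (2.4) can be verified directly. Here is a 1-D sketch with a Gaussian standing in for the isotropic 2-D function h (signal and scales are illustrative choices):

```python
import numpy as np

def luminance(s, a):
    # M_s(a, .): smoothing of the positive image s by a unit-mass Gaussian
    # of width a -- a 1-D stand-in for Eq. (2.1).
    x = np.arange(-int(4 * a), int(4 * a) + 1, dtype=float)
    h = np.exp(-(x / a) ** 2 / 2)
    return np.convolve(s, h / h.sum(), mode="same")

alpha = 0.5
s = np.abs(np.sin(np.linspace(0, 6, 128))) + 0.1      # positive "image"
scales = [8.0 * alpha ** l for l in range(4)]          # a, alpha a, alpha^2 a, ...
M = [luminance(s, a) for a in scales]
C = [M[l + 1] / M[l] - 1 for l in range(3)]            # local contrast, Eq. (2.3)

# Multiplicative reconstruction, Eq. (2.4): finest = coarsest * prod(1 + C)
rec = M[0] * np.prod([1 + c for c in C], axis=0)
assert np.allclose(rec, M[-1])
```

The check is exact by construction, since the factors (1 + C) telescope into the ratio of finest to coarsest luminance.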
Figure 1: An image (a) and its CWT using a Mexican hat wavelet with a = 0.125 (b). Local contrast of the same image (c) and reconstruction over five dyadic scales using a Gaussian normalization function (d).
3 Infinitesimal multiresolution analysis
In this section we briefly show how to generalize the previous formalism; the interested reader should refer to [4, 5] for more details. It is well known that the continuous reconstruction formula (1.4) allows one to use two different wavelets, one for decomposition and the other for reconstruction, provided they satisfy a cross admissibility condition:

(2π)² ∫_{ℝ²} ψ̂(k)‾ χ̂(k) / |k|²  d²k = 1.    (3.1)

This yields the so-called bilinear scheme¹ of the CWT [4]. If the analyzing wavelet ψ is regular enough, one can choose as reconstructing wavelet a Dirac delta measure, getting a simpler reconstruction formula, which is just the sum of the wavelet coefficients over all scales:

s(x) = ∫_{ℝ₊} (da/a) ⟨ψ_{a,x}|s⟩.    (3.2)

This is the so-called linear scheme, and we will use it in the sequel (although everything extends to the bilinear formalism). Let ρ be a positive-valued smoothing function, that is, ρ̂(0) = 1 and ρ̂ is continuous and derivable. We define the infinitesimal wavelet associated to ρ as

ψ̂(k) = −|k| (d/d|k|) ρ̂(k).

Then we introduce the approximation of s at scale (resolution) a,

(σ_a s)(b) = ⟨ρ_{a,b}|s⟩,

and the details of s at scale a,

(d_a s)(b) = ⟨ψ_{a,b}|s⟩.

The original image can be expressed as

s = lim_{a→0} σ_a s = σ_{a₀} s + ∫_0^{a₀} (da/a) (d_a s).

¹Bilinear with respect to the wavelet.

Now let us choose an arbitrary decreasing sequence of scales {a_j}_{j=0}^{J}, a_{j+1} < a_j < a_{j−1}, where a_J (resp. a₀) stands for the finest (resp. coarsest) resolution. We call wavelet packets the following integrated filters:

Ψʲ(x) = ∫_{a_{j+1}}^{a_j} (da/a) ψ_a(x).    (3.3)

These wavelet packets allow one to compute the approximation of the image at resolution a_{j+1} from the coarser approximation at resolution a_j, yielding a discrete reconstruction formula similar to that of ordinary multiresolution analysis:

s(x) = σ_{a₀}s(x) + Σ_{j=0}^{J} ⟨Ψʲ_x|s⟩.    (3.4)

Now, if we introduce the following contrast coefficients,

C_j = ⟨Ψʲ|s⟩ / σ_{a_j}s,

a multiplicative reconstruction formula similar to (2.4) naturally comes in:

s(x) = σ_{a₀}s(x) ∏_{j=0}^{J−1} (1 + C_j(x)).    (3.5)

Thus again, contrast coefficients form sufficient information for the characterization of the analyzed signal.

Remarks:
- If (a_j) is a geometric sequence, a_j = λʲ a₀, then the wavelet packets are simply wavelets in the usual sense, since they are generated from a unique function ψ by dilations by powers of λ. The choice a_j = 2⁻ʲ yields the familiar dyadic wavelet analysis [6].
- Wavelet packets do not form an orthogonal basis, but they characterize the signal without loss of information.
- Anisotropic wavelet packets can be used; this simply results in a sum over all possible orientations in (3.4).
- Fast pyramidal algorithms are available [5], just like in the usual discrete wavelet transform scheme.
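Since each wavelet packet integrates the details between two consecutive scales, the additive formula (3.4) is a telescoping sum of differences of approximations. A 1-D numerical sketch, with Gaussian smoothing standing in for σ_a and illustrative scales:

```python
import numpy as np

def sigma(s, a):
    # sigma_a s: approximation of s at scale a (Gaussian smoothing, 1-D sketch).
    x = np.arange(-int(4 * a), int(4 * a) + 1, dtype=float)
    g = np.exp(-(x / a) ** 2 / 2)
    return np.convolve(s, g / g.sum(), mode="same")

s = np.sin(np.linspace(0, 6, 256)) + 2.0
a = [16.0, 8.0, 4.0, 2.0, 1.0]                   # a_0 coarsest ... a_J finest
details = [sigma(s, a[j + 1]) - sigma(s, a[j])   # wavelet-packet details <Psi^j|s>
           for j in range(len(a) - 1)]

rec = sigma(s, a[0]) + sum(details)              # discrete reconstruction, Eq. (3.4)
assert np.allclose(rec, sigma(s, a[-1]))         # telescoping is exact
```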
References
[1] J.-P. Antoine, Two-dimensional directional wavelets and image processing, this volume.
[2] J.-P. Antoine, P. Carrette, R. Murenzi and B. Piette, Image analysis with two-dimensional wavelet transform, Signal Processing 31 (1993) 241-272.
[3] J.-P. Antoine, R. Murenzi, B. Piette and M. Duval-Destin, Image analysis with 2-D continuous wavelet transform: detection of position, orientation and visual contrast of simple objects, in Wavelets and Applications (Proc. Marseille 1989), pp. 144-159, Y. Meyer (ed.), Masson, Springer-Verlag, 1992.
[4] M. Duval-Destin, M.A. Muschietti and B. Torresani, Continuous wavelet decompositions, multiresolution and contrast analysis, SIAM J. Math. Anal. 24 (1993) 739-755.
[5] M.A. Muschietti and B. Torresani, Pyramidal algorithms for Littlewood-Paley decompositions, SIAM J. Math. Anal. 26 (1995) 925-943.
[6] S. Mallat and W.L. Hwang, Singularity detection and processing with wavelets, IEEE Trans. Inform. Theory 38 (1992) 617-643.
WAVELETS AND DIFFERENTIAL-DILATION EQUATIONS

T. Cooklev†, G. Berbecel‡, and A. N. Venetsanopoulos†

†Dept. Electr. & Comp. Eng., University of Toronto, 10 King's College Rd., Toronto, ON M5S 3G4, Canada
‡Genesis Microchip Inc., 200 Town Centre Boulevard, Suite 400, Markham, ON L3R 8G5, Canada
ABSTRACT
In this paper a wavelet is constructed starting from a differential-dilation equation. It has compact support and excellent time-domain and frequency-domain localization properties. The wavelet is infinitely differentiable and therefore cannot be obtained using digital filter banks. New sampling and differentiation techniques are also introduced.
1
INTRODUCTION
Wavelets are an important tool in signal processing. There are three types of wavelet transforms: the continuous-time wavelet transform (CWT), the discrete-time wavelet transform, and the discrete wavelet transform (DWT). The CWT is defined as [1]

X(a, b) = ⟨x(t), ψ_{a,b}(t)⟩ = (1/√a) ∫_ℝ x(t) ψ̄((t − b)/a) dt.    (1)

The continuous-time wavelet transform depends on two parameters: dilation a and shift b. The CWT is, in principle, invertible, provided the wavelet is admissible (i.e., it has sufficient decay). The wavelet transform involves basis functions which do not have a constant length: very short basis functions are used to achieve good time resolution, while longer basis functions can be used to obtain fine frequency analysis. When a and b are continuous, the set of basis functions does not constitute an orthonormal basis, i.e., the representation is redundant. The discrete-time wavelet transform can be obtained by discretizing a and b. The basis functions become ψ_{m,n}(t) = a₀^{−m/2} ψ(a₀^{−m} t − n b₀), where a = a₀^m and b = n b₀ a₀^m. The case where a₀ = 2 and b₀ = 1 is the most common, and the corresponding grid is called dyadic. The discrete wavelet transform (DWT) corresponds to a filter bank iterated along the lowpass channel. In this paper we shall be concerned only with the CWT. A very important question in wavelet analysis is choosing the basis function, and this is the focus of our concern in this paper. We are looking for a continuous-time wavelet ψ(t) that is infinitely differentiable, has compact support and provides good frequency localization. A wavelet with these three properties has not been used in signal processing. The Mexican hat and Morlet wavelets are infinitely differentiable, but do not have compact support. Wavelets generated from filter banks can have compact support, but cannot be infinitely differentiable.
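As an illustration of Eq. (1): the CWT of a pure sinusoid of frequency ω computed with a Mexican hat wavelet peaks at the scale a ≈ √2/ω, so a lower-frequency tone is matched by a larger scale. A minimal Fourier-domain sketch (the Mexican hat here is a stand-in, not the wavelet constructed in this paper; signal lengths and scale grid are arbitrary):

```python
import numpy as np

def cwt_mexhat(x, scales):
    # CWT of Eq. (1) evaluated in the Fourier domain with the Mexican hat,
    # whose Fourier transform is proportional to w^2 exp(-w^2/2).
    X = np.fft.fft(x)
    w = 2 * np.pi * np.fft.fftfreq(len(x))
    rows = []
    for a in scales:
        psi_hat = (a * w) ** 2 * np.exp(-((a * w) ** 2) / 2)
        rows.append(np.fft.ifft(X * psi_hat))
    return np.array(rows)

t = np.arange(1024)
scales = np.geomspace(0.5, 40.0, 80)
e_low = np.mean(np.abs(cwt_mexhat(np.sin(0.2 * t), scales)) ** 2, axis=1)
e_high = np.mean(np.abs(cwt_mexhat(np.sin(0.8 * t), scales)) ** 2, axis=1)

# |Psi(a w)|^2 is maximal at a w = sqrt(2): the low tone peaks at a larger scale.
assert scales[np.argmax(e_low)] > scales[np.argmax(e_high)]
```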
2 THE BASIC CONSTRUCTION

Iteration of a digital filter followed by downsampling leads to a limit function, provided the filter satisfies certain constraints [1]. Downsampling and upsampling are discrete-time multirate operations. We are going to find it useful to define a continuous-time decimator (Fig. 1(a)). While the term "continuous-time decimator" may not be ideally appropriate, the idea is clear: the support of the function f(t) shrinks by a factor of two, f(t) → f(2t). Note that the block in Fig. 1(a) is purely a mathematical tool that is only conceptually similar to the discrete-time decimator. Suppose now that the blocks of continuous-time filtering and decimation are cascaded and iterated (Fig. 1(b)). The impulse and frequency responses of the resulting system after two stages will

Figure 1: (a) Continuous-time decimator and (b) infinite iteration of a continuous-time system followed by decimator.
be φ₂(t) = 4h(4t) * 2h(2t) and Φ₂(ω) = H(ω/4) H(ω/2). The functions φ_i satisfy a dilation equation Φ_i(ω) = H(ω/2) Φ_{i−1}(ω/2). If we continue the iteration to infinity and assume convergence, the impulse and frequency responses of the system will be

φ(t) = lim_{i→∞} φ_i(t) = lim_{i→∞} [2^i h(2^i t) * 2^{i−1} h(2^{i−1} t) * ... * 2 h(2t)],    (2)

Φ(ω) = lim_{i→∞} Φ_i(ω) = ∏_{i=1}^{∞} H(ω/2^i).    (3)

Note that the support of φ(t) is equal to the support of h(t). The iterations make sense only if they converge. The simplest case is when

h(t) = 1/2 for −1 ≤ t ≤ 1, and 0 otherwise.    (4)
which can be written as O(w) -
- sin(w/2) 0a/2 ~ ( 2 ) ,
eJw/2 _ e-jw/2 9 2j~/2
or jw@(w)
de(t) The last equation in the time domain is ~ - 2 [r
+ 1) - r
-
~(0)-
(e j~12
-
1.
e-j 12)
- 1)].
(5)
(6)
(7)
The above equation is a differential-dilation equation, a new type of equation for a scaling function. The support of the function φ(t) is [−1, 1]. There is no analytic expression for φ(t), but there is an elegant formula for its Fourier transform:

φ(t) = (1/2π) ∫_{−∞}^{∞} e^{jωt} ∏_{i=1}^{∞} sinc(ω/2^i) dω    (8)
     = (1/π) ∫_0^{∞} cos(ωt) ∏_{i=1}^{∞} sinc(ω/2^i) dω.    (9)
Proposition 1. The function Φ(ω) = ∏_{i=1}^{∞} sinc(ω/2^i) is a well-defined function and is continuous.

The proof is based on the result that the Fourier transform of a compactly-supported function is continuous. It can be proven that ∫_{−∞}^{∞} φ(t) dt = 1, because the conditions of the theorem of Fubini are satisfied and the integral of the convolution is equal to the product of the integrals. In general, the function φ(t) has properties between polynomials (which are solutions of differential equations) and scaling functions (which are solutions of dilation equations): it is smoother than scaling functions, but less smooth than polynomials. The odd-indexed moments of the function φ(t) are zero, since in the expansion

Φ(ω) = Σ_{k=0}^{∞} c_k ω^k    (10)

we have

c_{2k+1} = Φ^{(2k+1)}(0) / (2k+1)! = ((−1)^k / (2k+1)!) ∫_{−∞}^{∞} t^{2k+1} φ(t) dt = 0.    (11)

The function φ(t) has a smaller second-order moment than B-splines, which means that it has better localization properties:

∫_{−∞}^{∞} t² φ(t) dt = − d²Φ(ω)/dω² |_{ω=0} = ... = 1/9.    (12)

For the B-spline function of order N,

β̂_N(ω) = sinc^N(ω/2),    N ≥ 1,    (13)

and it can be shown that

∫ t² β_N(t) dt = N/12,    N ≥ 1.    (14)
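The area and the second moment just quoted can be checked numerically by truncating the infinite convolution (2) of unit-area boxes (the grid spacing and truncation depth below are arbitrary choices; the second moment is the sum of the box variances, Σ_{i≥1} (2⁻ⁱ)²/3 = 1/9):

```python
import numpy as np

dt = 1.0 / 2048
phi = np.array([1.0 / dt])              # discrete delta of unit area
for i in range(1, 12):                  # truncate the infinite convolution (2)
    n = int(round(2.0 ** (-i) / dt))    # 2^i h(2^i t): box on [-2^-i, 2^-i]
    box = np.ones(2 * n + 1)
    box /= box.sum() * dt               # unit area on the grid
    phi = np.convolve(phi, box) * dt

t = (np.arange(len(phi)) - (len(phi) - 1) / 2) * dt
assert abs(phi.sum() * dt - 1.0) < 1e-9               # integral of phi is 1
assert abs((phi * t ** 2).sum() * dt - 1 / 9) < 1e-3  # second moment approx 1/9
assert t[0] >= -1.0 and t[-1] <= 1.0                  # support within [-1, 1]
```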
Figure 2: (a) The wavelet ψ(t) and (b) the CWT of a signal that is two sinusoids and two pulses.

2.1 The corresponding wavelet

The function ψ(t) = φ′(t) is a mother wavelet, since it satisfies the admissibility condition:

∫_0^{∞} |Ψ(ω)|² / ω  dω < ∞.    (15)
The wavelet ψ(t) has compact support, excellent localization characteristics and is infinitely differentiable. This wavelet has been used in signal analysis with very good results. A simple example is given in Fig. 2. The CWT gives a picture of how the frequency content of a signal varies with time. We hope that the function ψ(t) will be a useful "benchmark" wavelet in addition to the Morlet and Mexican hat wavelets.

2.2 Approximation of polynomials using the function φ(t)

An important property is that any polynomial can be represented by dilations and translations of the function φ(t). This is convenient, because polynomials are not square-integrable functions. If we start from

Σ_k φ(t − k) = 1,    (16)
then, by integrating the partition of unity (16) and repeatedly applying the differential-dilation equation (7), which expresses φ′(t) through the difference φ(2t + 1) − φ(2t − 1), one derives step by step an expansion of t in dilated and translated copies of φ (equations (17)-(21)); a further integration, involving the moments c₀ and c₁ of φ, then yields the corresponding expansion of t² (equations (22)-(25)).
It is plain to see that we can continue in a similar way for higher powers of t. The formulae, as well as other results, cannot be given here due to the space limitation.
2.3 Reconstruction of a continuous-time function from its samples using the function φ(t)

Consider now the space V = span{φ(t − k); k ∈ ℤ} and suppose f(t) ∈ V: f(t) = Σ_k c_k φ(t − k). We are looking for a function u(t) ∈ V which satisfies

f(t) = Σ_k f(k) u(t − k).    (26)

We want the samples of the function f(t) to be the coefficients in the expansion. From

f(k) = Σ_l c_l φ(k − l) = c_k    (27)

it follows that the function u(t) must be exactly φ(t). In the space V the interpolation function is φ itself. The space V can be written as V = V_odd ⊕ V_even, where V_odd = span{φ[2t − (2k + 1)], k ∈ ℤ} and V_even = span{φ(2t − 2k), k ∈ ℤ}. The functions {φ(2t − 2k − 1), k ∈ ℤ} and {φ(2t − 2k), k ∈ ℤ} are orthonormal bases for the spaces V_odd and V_even respectively.

2.4 A differentiation technique

We can derive an efficient algorithm for differentiation. If

f(t) = Σ_k c_k φ(t − k),    (28)

then

f′(t) = Σ_k 2 c_k [φ(2(t − k) + 1) − φ(2(t − k) − 1)] = Σ_k 2 (c_{k−1} − c_k) φ(2(t − k) − 1).    (29)

If d₁(k) is defined as

d₁(k) = 2[δ(k − 1) − δ(k)],    (30)

then

f′(t) = Σ_k [d₁ * c](k) φ(2(t − k) − 1).    (31)

For the second derivative,

f″(t) = 8 Σ_k (c_{k−1} − 2c_k + c_{k+1}) φ(4(t − k)),    (32)

and if we define d₂(k) in a similar way,

d₂(k) = 8[δ(k − 1) − 2δ(k) + δ(k + 1)],    (33)

then

f″(t) = Σ_k [d₂ * c](k) φ(4(t − k)).    (34)
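The kernels d₁ and d₂ act on the expansion coefficients by ordinary discrete convolution. With hypothetical coefficients c_k = k², the second-derivative coefficients come out constant, as expected for a parabola (kernel orientation and boundary samples follow NumPy's convolution conventions, so interior samples carry the shifted differences):

```python
import numpy as np

c = np.array([0.0, 1.0, 4.0, 9.0, 16.0])   # hypothetical coefficients c_k = k^2
d1 = np.array([-2.0, 2.0])                 # d1(k) = 2[delta(k-1) - delta(k)]
d2 = np.array([8.0, -16.0, 8.0])           # d2(k) = 8[delta(k-1) - 2 delta(k) + delta(k+1)]

first = np.convolve(c, d1)    # interior samples: 2 (c_{k-1} - c_k)
second = np.convolve(c, d2)   # interior samples: a scaled second difference of c

assert np.isclose(first[2], 2 * (c[1] - c[2]))                  # = -6
assert np.isclose(second[2], 16.0) and np.isclose(second[3], 16.0)
```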
3
CONCLUSIONS
In this paper a new continuous-time C^∞ wavelet with useful properties was constructed, starting from a differential-dilation equation. Useful sampling and differentiation schemes were also discussed. Applications of the constructed wavelet ψ(t) in image analysis will be discussed in another publication. Many opportunities exist for generalizing the results obtained here. In a similar way, as shown in Fig. 1, other functions with useful properties can be obtained. The two-dimensional non-separable case also looks interesting (and difficult). It is to be noted that systems described by differential, difference and, recently, dilation equations have been studied, while systems described by differential-dilation equations have not previously been studied and used.
References [1] I. Daubechies, Ten lectures on wavelets, CBMS-NSF Regional Conf. in Appl. Math., vol. 61, SIAM, Philadelphia, PA, 1992.
WAVELETS IN HIGH RESOLUTION RADAR IMAGING AND CLINICAL MAGNETIC RESONANCE IMAGING

Walter Schempp
Abstract. Coherent wavelets form a unified basis of the multichannel perfect reconstruction analysis-synthesis filter bank of high resolution radar imaging and clinical magnetic resonance imaging (MRI). The filter bank construction is performed by the Kepplerian spatiotemporal Hilbert bundle strategy which allows for the stroboscopic and synchronous cross sectional quadrature filtering of phase histories in local frequency encoding channels with respect to the rotating coordinate frame of quadrature reference. The Kepplerian strategy of dynamic physical astronomy and the associated filter bank construction take place in symplectic affine planes and are implemented by the polarized symbol calculus of the Heisenberg nilpotent Lie group G which extends Fourier analysis. Thus the pathway of this paper leads from Keppler to Heisenberg, and to the fascinating aspects of electronic engineering concerned with the implementation of the symmetries inherent to the semi-classical approach to clinical magnetic resonance tomography by large scale integrated (LSI) circuit technology. Where the telescope ends, the microscope begins, which of the two has the grander view? - VICTOR HUGO (1862)
A radar system employs a directional antenna that radiates energy within a narrow beam in a known direction. The radar antenna senses the return scattered by the target, and the receiver amplifies the echo and translates its energy band to the intermediate frequency of the radar. The intermediate frequency signal wavelet is operated on linearly by the predetection filter. Finally, the output of the predetection filtering is coherently detected by a linear phase-sensitive processing configuration. One unique feature of the synthetic aperture radar (SAR) imaging modality is that its spatial resolution capability is independent of the platform altitude over the subplatform nadir track. This is a result of the fact that the SAR image is formed by simultaneously storing the phase histories and the differential time delays in local frequency encoding subbands of wideband radar, none of which is a function of the range from the radar sensor to the scene. It is this unique capability which allows the acquisition of high resolution images from satellite altitude as long as the received echo response has sufficient strength above the noise level. The Kepplerian spatiotemporal Hilbert bundle strategy of dynamic physical astronomy centered on the sun is derived from the quadrature conchoid trajectory construction and the second fundamental law of planetary motion analysis ([1], [16]), as displayed in Johannes Keppler's famous Mars commentaries of 1609 entitled Astronomia nova seu physica coelestis. It allows for the stroboscopic and synchronous cross sectional quadrature filtering of phase histories in local frequency encoding channels with respect to the rotating coordinate frame of quadrature reference, and provides the implementation of a matched filter bank by orbit stratification in a symplectic affine plane. An application of this procedure leads to the landmark observation of the earliest SAR pioneer, Carl A.
Wiley, that motion is the solution of the high resolution radar imagery and phased array antenna problem of holographic recording. Whereas the Kepplerian spatiotemporal strategy is realized in SAR imaging by the range Doppler principle ([6], [10], [11], [12]), it is the Lauterbur encoding principle ([15]) which takes place in clinical MRI ([3], [4]). At the background of both high resolution imaging techniques lies the construction of a multichannel coherent wavelet perfect reconstruction analysis-synthesis filter bank of matched filter type ([7]). Beyond these applications to local frequency encoding subbands, the Kepplerian spatiotemporal Hilbert bundle strategy leads to the concept of Feynman path integral or summation over phase histories.
74
As approved by quantum electrodynamics, nilpotent Fourier analysis allows for a semi-classical approach to the interference pattern of quantum holography ([15]). Indeed, the unitary dual \(\hat{G}\) of the Heisenberg nilpotent Lie group G, consisting of the equivalence classes of irreducible unitary linear representations of G ([18]), allows for a coadjoint orbit fibration by symplectic affine planes \(O_\nu\) (\(\nu \neq 0\)).
• The hierarchy of energetic strata \(O_\nu\) (\(\nu \neq 0\)) is spatially located as a stack of tomographic slices inside the vector space dual Lie(G)* of the real Heisenberg Lie algebra Lie(G). This fact is a consequence of the Kirillov homeomorphism ([13])
\[
\hat{G} \cong \mathrm{Lie}(G)^* / \mathrm{CoAd}(G).
\]
In terms of standard coordinates, the Heisenberg nilpotent Lie group G consists of the set of unipotent matrices
\[
\left\{ \begin{pmatrix} 1 & x & z \\ 0 & 1 & y \\ 0 & 0 & 1 \end{pmatrix} \;\middle|\; x, y, z \in \mathbf{R} \right\}
\]
under the non-commutative matrix multiplication law
\[
\begin{pmatrix} 1 & x & z \\ 0 & 1 & y \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} 1 & x' & z' \\ 0 & 1 & y' \\ 0 & 0 & 1 \end{pmatrix}
=
\begin{pmatrix} 1 & x + x' & z + z' + x y' \\ 0 & 1 & y + y' \\ 0 & 0 & 1 \end{pmatrix},
\]
and Lebesgue measure \(dx \otimes dy \otimes dz\) as a Haar measure. If the unipotent matrices \(\{P, Q, I\}\) denote the canonical basis of the three-dimensional real vector space Lie(G), where
\[
\exp_G P = \begin{pmatrix} 1 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad
\exp_G Q = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{pmatrix}, \quad
\exp_G I = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\]
holds under the matrix exponential diffeomorphism \(\exp_G : \mathrm{Lie}(G) \to G\), the coadjoint action \(\mathrm{CoAd}_G\) of G on Lie(G)* reads in terms of the coordinates \(\{\alpha, \beta, \nu\}\) with respect to the dual basis \(\{P^*, Q^*, I^*\}\) of the real vector space dual Lie(G)* as follows:
\[
\mathrm{CoAd}_G \begin{pmatrix} 1 & x & z \\ 0 & 1 & y \\ 0 & 0 & 1 \end{pmatrix}
(\alpha P^* + \beta Q^* + \nu I^*) = (\alpha - \nu y) P^* + (\beta + \nu x) Q^* + \nu I^*.
\]
The linear varieties
\[
O_\nu = \mathrm{CoAd}_G(G)(\nu I^*) = \mathbf{R} P^* + \mathbf{R} Q^* + \nu I^* \qquad (\nu \neq 0)
\]
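The group law above can be checked numerically. The following sketch (NumPy; the helper name `heis` is an illustrative choice, not from the paper) multiplies two unipotent matrices, confirms the composition rule for the (x, y, z) coordinates, and shows that the commutator of two group elements is a central z-translation.

```python
import numpy as np

def heis(x, y, z):
    """Unipotent upper-triangular matrix of the Heisenberg group G."""
    return np.array([[1.0, x, z],
                     [0.0, 1.0, y],
                     [0.0, 0.0, 1.0]])

g1, g2 = heis(1.0, 2.0, 3.0), heis(0.5, -1.0, 2.0)
prod = g1 @ g2

# The product is again unipotent with coordinates
# (x + x', y + y', z + z' + x * y').
expected = heis(1.0 + 0.5, 2.0 + (-1.0), 3.0 + 2.0 + 1.0 * (-1.0))
assert np.allclose(prod, expected)

# Non-commutativity: the commutator is central (x = y = 0, z != 0).
comm = np.linalg.inv(g2) @ np.linalg.inv(g1) @ g2 @ g1
print(comm)
```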
actually are symplectic affine planes in the sense that they are in a natural way compatibly endowed with both the structure of an affine plane and a symplectic structure. Therefore the symplectic affine planes \(O_\nu\) (\(\nu \neq 0\)) in Lie(G)* are the predestined planar mathematical structures to implement the Kepplerian spatiotemporal strategy of Hilbert bundles sitting over the bi-infinite phase coordinate line R, and to carry quantum holograms acting as multichannel perfect reconstruction analysis-synthesis filter banks ([14]).
• In radar imaging, \(\nu \neq 0\) denotes the center frequency of the transmitted pulse train, whereas in clinical MRI the center frequency \(\nu\) is the frequency of the rotating coordinate frame of quadrature reference defined by tomographic slice selection. The stationary singular plane \(\nu = 0\) in Lie(G)* consists of the single point orbits or focal points
75
corresponding to the one-dimensional representations of G. As the reconstruction plane it plays a fundamental role in the coherent optical processing of radar data ([6]), morphological MRI, and functional MRI recording of synchronized neural activities in the brain ([15]). From this classification of the coadjoint orbits of G in Lie(G)* follows the highly remarkable fact that there exists no finite dimensional irreducible unitary linear representation of G having dimension > 1. Hence the irreducible unitary linear representations of G which are not unitary characters are infinite dimensional and unitarily induced. Their coefficient cross sections for the Hilbert bundle sitting over the bi-infinite phase coordinate line R define the holographic transforms. Let C denote the one-dimensional center of G transversal to the plane carrying the quantum holograms. Then \(C = \mathbf{R} \cdot \exp_G I\) is spanned by the central transvection \(\exp_G I\), and aligned parallel to the bore of the magnet. In coordinate-free terms, G forms the non-split central group extension
\[
C \hookrightarrow G \to G/C
\]
where the plane G/C is transversal to the line C. The irreducible unitary linear representations of G associated to the coadjoint orbit \(O_\nu\) are unitarily induced in stages by the unitary characters of closed normal abelian subgroups which provide a fibration of G sitting over the base line R. If the elements \(w \in O_\nu\) are represented by complex numbers of the form
\[
w = x + Jy \;\longleftrightarrow\; \begin{pmatrix} x & -y \\ y & x \end{pmatrix},
\]
including the differential phase x and the local frequency y as real coordinates with respect to the frame of quadrature reference rotating with center frequency \(\nu \neq 0\), and the Weyl matrix
\[
J = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}
\]
as imaginary unit, it becomes obvious that the multiplication law of the quadrature cell matrices reads
\[
\begin{pmatrix} x & -y \\ y & x \end{pmatrix}
\begin{pmatrix} x' & -y' \\ y' & x' \end{pmatrix}
=
\begin{pmatrix} x x' - y y' & -(y x' + x y') \\ y x' + x y' & x x' - y y' \end{pmatrix}.
\]
The conjugation identity
\[
\begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}
\begin{pmatrix} x & -y \\ y & x \end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}^{-1}
=
\begin{pmatrix} x & y \\ -y & x \end{pmatrix}
\]
yields the area law \(|w|^2 = \det w\) for \(w \in \mathbf{C}\). Hence, for \(w \neq 0\), the inverse of the associated quadrature cell matrix reads in terms of the complex conjugate
\[
\begin{pmatrix} x & -y \\ y & x \end{pmatrix}^{-1}
= \frac{1}{x^2 + y^2} \begin{pmatrix} x & y \\ -y & x \end{pmatrix}.
\]
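The identification of complex numbers with quadrature cell matrices, the area law \(|w|^2 = \det w\), and the conjugate-based inverse are easy to verify numerically. The sketch below uses NumPy; the helper name `cell` is illustrative.

```python
import numpy as np

def cell(w: complex) -> np.ndarray:
    """Quadrature cell matrix representing w = x + iy."""
    x, y = w.real, w.imag
    return np.array([[x, -y],
                     [y,  x]])

w1, w2 = 3.0 + 4.0j, 1.0 - 2.0j

# Matrix multiplication mirrors complex multiplication.
assert np.allclose(cell(w1) @ cell(w2), cell(w1 * w2))

# Area law: |w|^2 = det(cell(w)).
assert np.isclose(abs(w1) ** 2, np.linalg.det(cell(w1)))

# Inverse via the complex conjugate, scaled by 1/(x^2 + y^2).
assert np.allclose(np.linalg.inv(cell(w1)),
                   cell(w1.conjugate() / abs(w1) ** 2))
print(np.linalg.det(cell(w1)))
```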
• In spin echo protocols, refocussing of nuclear spin angular momenta is performed by conjugation with respect to the rotating coordinate frame of quadrature reference. The rotational curvature form of the coadjoint orbit \(O_\nu\) is exactly the standard symplectic form of \(\mathbf{R} \oplus \mathbf{R}\), dilated by the center frequency \(\nu \neq 0\). The bundle-theoretic interpretation of the inducing mechanism gives rise to the pair of isomorphic irreducible unitary linear representations
\[
(U^\nu, V^\nu)
\]
of G unitarily induced in quadrature by the unitary characters of the associated closed normal abelian subgroups of G. Then the commutation relations
76
\[
U^\nu \begin{pmatrix} 1 & x & z \\ 0 & 1 & y \\ 0 & 0 & 1 \end{pmatrix}
\circ
V^\nu \begin{pmatrix} 1 & x & z \\ 0 & 1 & y \\ 0 & 0 & 1 \end{pmatrix}
=
e^{4\pi i \nu x y}\,
V^\nu \begin{pmatrix} 1 & x & z \\ 0 & 1 & y \\ 0 & 0 & 1 \end{pmatrix}
\circ
U^\nu \begin{pmatrix} 1 & x & z \\ 0 & 1 & y \\ 0 & 0 & 1 \end{pmatrix}
\]
are satisfied for all triples \((x, y, z) \in \mathbf{R}^3\) and \(\nu \neq 0\). For non-zero center frequencies \(\nu \neq \nu'\), the irreducible unitary linear representations \(U^\nu\) and \(U^{\nu'}\) of G are non-isomorphic, and the same holds for \(V^\nu\) and \(V^{\nu'}\). The metaplectic representation of the commutator group of G lifts the Weyl quadrature matrix J to the square root of the symmetric phase factor
\[
(x, y) \mapsto e^{4\pi i \nu x y}
\]
occurring in the commutation relations supra. Hence the Fourier transform gives rise to an intertwining operator of the isomorphic irreducible unitary linear representations \(U^\nu\) and \(V^\nu\) of G acting on the standard complex Hilbert space \(L^2(\mathbf{R})\) and admitting the same central unitary character
\[
\chi_\nu = U^\nu | C = V^\nu | C
\]
of the form
\[
\chi_\nu : z \mapsto e^{2\pi i \nu z}
\]
with center frequency \(\nu\). An infinitesimal version of the unitary representation \(U^\nu\) is given by its differentiated form. The differentiated form of \(U^\nu\) provides by evaluation at
\[
P = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} \in \mathrm{Lie}(G)
\]
the temporal derivative \(dU^\nu(P) = \frac{d}{dt}\) on the bi-infinite real time scale, in accordance with the fact that the Kepplerian dynamic physical astronomy centered on the sun is in terms of magnetic forces which control local frequencies in orbital planes, not acceleration. Thus Kepplerian forces are not analogs of the Newtonian forces which are controlled by gravitation ([16]). Moreover, it provides by evaluation at
\[
Q = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix} \in \mathrm{Lie}(G)
\]
the multiplication \(dU^\nu(Q) = 2\pi i \nu t \times\) with the imaginary time scale, and finally by evaluation at
\[
I = \begin{pmatrix} 0 & 0 & 1 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} \in \mathrm{Lie}(G)
\]
the multiplication \(dU^\nu(I) = 2\pi i \nu \times\) with the imaginary angular frequency, with the usual maximal domains as skew-adjoint Hilbert space operators.
• The rotating coordinate frame of quadrature reference of the coadjoint orbit \(O_\nu\) (\(\nu \neq 0\)) of G is defined by the polarized cross section G/C.
• The rotating coordinate frame of quadrature reference implements in the coadjoint orbit \(O_\nu\) (\(\nu \neq 0\)) of G the second Kepplerian fundamental law of planetary motion analysis, the area law.
Similarly, an application of the Weyl quadrature matrix J yields for the differentiated form of the unitary representation \(V^\nu\) of G
\[
dV^\nu(P) = 2\pi i \nu t \times, \qquad dV^\nu(Q) = \frac{d}{dt}, \qquad dV^\nu(I) = 2\pi i \nu \times
\]
when evaluated at the canonical basis \(\{P, Q, I\}\) of the Heisenberg Lie algebra Lie(G). The induced Hilbert bundle sitting over the bi-infinite phase coordinate line R is G-homogeneous in the sense that the Heisenberg group G moves its fibers around by linear transformations. Therefore Johannes Keppler needed Tycho Brahe's data base of observations of the planets in all different configurations, far surpassing any data base that had ever before been accumulated, in accuracy and, equally important, in quantity. Previous astronomers had a more restricted program, chiefly concerned with critical moments of the planets' motions
77
such as oppositions to the sun. The linear representation \(U^\nu\) of G and its swapped copy \(V^\nu\) are globally square integrable mod C. Indeed, it is well known that a coadjoint orbit is a linear variety if and only if one (and hence all) of the corresponding irreducible unitary linear representations is globally square integrable modulo its kernel. It is reasonable to regard global square integrability as an essential part of the Stone-von Neumann theorem of quantum mechanics, because a representation of a nilpotent Lie group is determined by its central unitary character \(\chi_\nu\) if and only if it is globally square integrable modulo center. Thus \(\chi_\nu\) allows for selection of the tomographic slice \(O_\nu\) with rotating coordinate frame of quadrature reference of center frequency \(\nu \neq 0\). The corresponding equivalence classes of irreducible, unitarily induced, linear representations \(U^\nu\) of G acting on the complex Hilbert space of globally square integrable cross sections for the Hilbert bundle sitting over the bi-infinite phase coordinate line R are infinite dimensional and can be realized as Hilbert-Schmidt integral operators with kernels \(K^\nu \in L^2(\mathbf{R} \oplus \mathbf{R})\) ([13]).
• The kernel function \(K^\nu \in L^2(\mathbf{R} \oplus \mathbf{R})\) associated to the irreducible unitary linear representation \(U^\nu\) of central unitary character \(\chi_\nu = U^\nu | C\) implements a multichannel coherent wavelet perfect reconstruction analysis-synthesis filter bank of matched filter type. The reconstruction of the phase histories in local frequency encoding subbands of \(K^\nu\) is performed by the symplectically reformatted two-dimensional Fourier transform.
• Application of the symplectic Fourier transform which is inherent to the polarized symbol with respect to the rotating coordinate frame of quadrature reference precludes the ability to directly relate signal intensities to the number of excited protons within the selected tomographic slice \(O_\nu\) (\(\nu \neq 0\)).
The Heisenberg nilpotent Lie group approach leads to the non-locality phenomenon of quantum mechanics displayed by the double-slit interference experiment ([14]), to a quantum vacuum radiation explanation of the sonoluminescence phenomenon where a light pulse is emitted during every cycle of the sound wave with extremely small jitter ([8], [2]), and to major application areas of pulsed signal recovery methods: the corner turn algorithm in the digital processing of high resolution SAR data ([17]), the spin-warp procedure in clinical MRI, and finally the variants of the ultra-high-speed echo-planar imaging technique of functional MRI recording of synchronized neural activities in the brain ([5], [15]). Combined with multi-slice acquisition, it is the spin-warp version of Fourier transform MRI which is used almost exclusively in current routine clinical examinations to acquire tomographic images from the distribution of nuclear spin angular momenta of protons. Switching from the Schrödinger representations \(U^\nu\), \(V^\nu\) of G acting on complex Hilbert spaces of equivalence classes of globally square integrable cross sections, to their alternative realizations, which are given by the Bargmann-Fock model of creation and annihilation operators acting on complex Hilbert spaces of holomorphic functions on symplectic affine planes ([13]), provides a spatial localization of the excited proton clusters. Because holomorphic functions allow for direct point evaluations, the positions of these clusters can be projected onto the stationary singular plane \(\nu = 0\) of focal points, and therefore provide a quantum mechanical localization approach to the functional MRI recording of synchronized neural activities in the brain.
• In terms of the symbolic calculus on the Heisenberg nilpotent Lie group G, the transition from morphological MRI of soft tissues to functional MRI recording of synchronized neural activities in the brain is performed by transition from the polarized symbol with respect to the rotating coordinate frame of quadrature reference to the isotropic symbol associated to the plane G/C transversal to the line C.
• Morphological MRI requires a Fourier transform evaluation procedure; the quantum mechanical localization requires a statistical evaluation procedure.
The speed with which clinical MRI spread throughout the world as a diagnostic imaging tool was phenomenal. In the early 1980s, it burst onto the scene with even more intensity than X-ray computed tomography (CT) in the 1970s. Whereas at the end of 1981 there were only three working MRI scanners available in the United States, presently there are more than 4,000 imagers performing, in a non-invasive manner, more than 8.5 million examinations per year. At the Division of MRI in the Department of Radiology of Johns Hopkins Medical Institutions, for instance, there are 5 MRI scanners but only one X-ray CT imager currently working. This clearly illustrates the trend towards MRI in the field of clinical diagnostic radiology: there are only a few circumstances under which X-ray CT still plays a role. The speed of growth is a testimony to the clinical significance of the technique. Today MRI, the fastest growing imaging modality in radiodiagnostics, is firmly established as a core diagnostic tool in the fields of neuroradiology and musculoskeletal imaging ([9]), routinely used in all medical centers in Western Europe and the United States.
78
Acknowledgment. The author is grateful to Professor George L. Farre (Georgetown University) for his generous advice as well as the Austrian Society for Cybernetic Studies (Vienna) for continuing support of this work.
Figure 1: Phased array MRI - long spine imaging
79
Figure 2: High resolution cranial MRI - sagittal and coronal tomographic slices
80

References

[1] E.J. Aiton, The elliptical orbit and the area law. In: Kepler - Four hundred years. Proceedings of conferences held in honour of Johannes Kepler. A. Beer and P. Beer, Editors, Vistas in Astronomy, Vol. 18, pp. 573-583, Pergamon Press, Oxford, New York, Toronto 1975
[2] G. Barton, C. Eberlein, On quantum radiation from a moving body with finite refractive index. Ann. Phys. 227 (1993), 222-274
[3] J. Beltran, Editor, Current Review of MRI. First edition, Current Medicine, Philadelphia 1995
[4] M.A. Brown, R.C. Semelka, MRI: Basic Principles and Applications. Wiley-Liss, New York, Chichester, Brisbane 1995
[5]
M.S. Cohen, Rapid MRI and functional applications. In: Brain Mapping - The Methods, A.W. Toga, J.C. Mazziotta, editors, pp. 223-255, Academic Press, San Diego, New York, Boston 1996
[6]
L.J. Cutrona, E.M. Leith, L.J. Porcello, and W.E. Vivian, On the application of coherent optical processing techniques to synthetic-aperture radar. Proc. IEEE 54, 1026-1032 (1966)
[7]
E.R. Davies, Electronics, Noise and Signal Recovery. Academic Press, London, San Diego, New York 1993
[8]
C. Eberlein, Sonoluminescence as quantum vacuum radiation. Phys. Rev. Lett. 76, 3842 - 3845 (1996)
[9]
R.R. Edelman, J.R. Hesselink, and M.B. Zlatkin, Clinical Magnetic Resonance Imaging. Two volumes, second edition, W.B. Saunders Company, Philadelphia, London, Toronto 1996
[10]
M. King, Fourier optics and radar signal processing. In: Applications of Optical Fourier Transforms. H. Stark, editor, pp. 209-251, Academic Press, Orlando, San Diego, San Francisco 1982
[11]
E.N. Leith, Synthetic aperture radar. In: Optical Data Processing, D. Casasent, editor, pp. 89-117, Topics in Applied Physics, Vol. 23, Springer-Verlag, Berlin, Heidelberg, New York 1978
[12]
E.N. Leith, Optical processing of synthetic aperture radar data. In: Photonic Aspects of Modern Radar, H. Zmuda, E.N. Toughlian, editors, pp. 381-401, Artech House, Boston, London 1994
[13]
W. Schempp, Harmonic Analysis on the Heisenberg Nilpotent Lie Group, with Applications to Signal Theory. Pitman Research Notes in Mathematics Series, Vol. 147, Longman Scientific and Technical, London 1986
[14]
W. Schempp, Geometric analysis: The double-slit interference experiment and magnetic resonance imaging. Cybernetics and Systems '96, Vol. 1, pp. 179-183, Austrian Society for Cybernetic Studies, Vienna 1996
[15]
W. Schempp, The Structure-Function Problem of Fourier Transform Magnetic Resonance Imaging: Coherent Wavelet Filter Banks, and Spatiotemporally Encoded Synchronized Neural Networks. John Wiley & Sons, New York, Chichester, Brisbane (in print)
[16]
B. Stephenson, Kepler's Physical Astronomy. Princeton University Press, Princeton, NJ 1994
[17] D.R. Wehner, High Resolution Radar. Artech House, Norwood, MA 1987
[18] A. Weil, Sur certains groupes d'opérateurs unitaires. Acta Math. 111 (1964), 143-211. Also in: Œuvres Scientifiques, Collected Papers, Volume III (1964-1978), pp. 1-69, Springer-Verlag, New York, Heidelberg, Berlin 1980
Walter Schempp, Lehrstuhl für Mathematik I, Universität Siegen, D-57068 Siegen, Germany
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
81

Wavelet Transform based Information Extraction from 1-D and 2-D Signals

Adam Dąbrowski
Poznań University of Technology
Institute of Electronics and Telecommunication
ul. Piotrowo 3a, PL-60 965 Poznań, Poland
[email protected]
Abstract

Information extraction from signals requires sufficiently high resolution in both the time (space) domain and the frequency domain. This is, however, impossible in a single analysis because of a general principle, the so-called uncertainty principle, which states the impossibility of satisfactorily good simultaneous signal representations in both time (space) and frequency. The difficulty following from the uncertainty principle can be overcome using the multiresolution signal representation, a concept resulting in the so-called wavelet transformation. The multiresolution information extraction procedure consists of at least two steps: information is first only roughly extracted from the lowest resolution signal. Then the interesting signal parts or components are analyzed precisely in the highest resolution signal in order to get satisfactorily exact information [4]. In this paper, this idea is applied to efficient edge detection in images (a 2-D example) and to the detection of DTMF (dual tone multifrequency) signals used for signaling in touch-tone telephones [8, 9] (a 1-D example).
1 Introduction
The classical way for signal analysis, say of a signal x(t), is by its spectral representation \(X(\omega)\), i.e. by the Fourier transform
\[
X(\omega) = \int_{-\infty}^{\infty} x(t)\, e^{-j\omega t}\, dt . \qquad (1a)
\]
Thus
\[
x(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} X(\omega)\, e^{j\omega t}\, d\omega . \qquad (1b)
\]
Signals which are short in time (or narrow in space) yield a wide spectrum and vice versa. This property is well known as the uncertainty principle [1] and can formally be formulated as
\[
\sigma_t \sigma_\omega \ge \frac{1}{2} \qquad (2a)
\]
where
\[
\sigma_t^2 = \frac{\int_{-\infty}^{\infty} (t - \langle t \rangle)^2\, |x(t)|^2\, dt}{\int_{-\infty}^{\infty} |x(t)|^2\, dt}
\quad \text{and} \quad
\sigma_\omega^2 = \frac{\int_{-\infty}^{\infty} (\omega - \langle \omega \rangle)^2\, |X(\omega)|^2\, d\omega}{\int_{-\infty}^{\infty} |X(\omega)|^2\, d\omega} , \qquad (2b)
\]
moreover
\[
\langle t \rangle = \int_{-\infty}^{\infty} t\, |x(t)|^2\, dt
\quad \text{and} \quad
\langle \omega \rangle = \int_{-\infty}^{\infty} \omega\, |X(\omega)|^2\, d\omega . \qquad (2c)
\]
The uncertainty principle states that the time (or space) waveform width \(\sigma_t\) and the frequency spectrum width \(\sigma_\omega\) cannot both be arbitrarily small simultaneously. For extraction of information from signals, this
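The lower bound in (2a) can be illustrated numerically. The sketch below evaluates \(\sigma_t\) and \(\sigma_\omega\) for a sampled Gaussian pulse, for which the product \(\sigma_t \sigma_\omega\) attains the bound 1/2 up to discretization error; the function name `widths` is an illustrative choice.

```python
import numpy as np

def widths(t, x):
    """Return (sigma_t, sigma_w), computed as in (2b)/(2c) via Riemann sums."""
    dt = t[1] - t[0]
    p = np.abs(x) ** 2
    e = p.sum() * dt
    t_mean = (t * p).sum() * dt / e
    var_t = ((t - t_mean) ** 2 * p).sum() * dt / e

    # Spectrum on an angular-frequency grid via the FFT.
    X = np.fft.fftshift(np.fft.fft(x)) * dt
    w = 2 * np.pi * np.fft.fftshift(np.fft.fftfreq(len(t), d=dt))
    dw = w[1] - w[0]
    pw = np.abs(X) ** 2
    ew = pw.sum() * dw
    w_mean = (w * pw).sum() * dw / ew
    var_w = ((w - w_mean) ** 2 * pw).sum() * dw / ew
    return np.sqrt(var_t), np.sqrt(var_w)

t = np.linspace(-20, 20, 4096)
s_t, s_w = widths(t, np.exp(-t ** 2 / 2))   # Gaussian pulse
print(s_t, s_w, s_t * s_w)                  # product close to the bound 0.5
assert abs(s_t * s_w - 0.5) < 1e-3
```

For non-Gaussian pulses the same routine returns a product strictly above 1/2, which is exactly the content of the uncertainty principle.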
82
means that for high frequency resolution, a long time (or space) analysis is necessary, and vice versa [3]. Therefore, in order to reduce the analysis time, a hierarchical, i.e. multiresolution analysis should be made: first the signal is only roughly analyzed and then its interesting parts are analyzed precisely
[4]. Two particular illustrative examples of this approach to the extraction of information from signals are presented:
• edge detection in images serves as a 2-D example,
• DTMF (dual tone multifrequency) signaling detection, consisting in the detection of two sinusoidal components of equal magnitude, is a 1-D example.
2 Wavelet transform
The continuous wavelet transform of a signal x(t) is defined by
\[
X(a, \tau) = \frac{1}{\sqrt{a}} \int x(t)\, \Psi\!\left( \frac{t - \tau}{a} \right) dt \qquad (3)
\]
where \(\Psi(t)\) is the so-called mother or basic wavelet and \(\sqrt{a^{-1}}\, \Psi((t - \tau)/a)\) are the baby wavelets determined by the mother wavelet using shift \(\tau\) and scale \(a\); \(\sqrt{a^{-1}}\) being the energy correction factor [2]. The discrete wavelet transform is given by
\[
X(m, n) = \int x(t)\, \Psi_{mn}(t)\, dt \qquad (4)
\]
where
\[
\Psi_{mn}(t) = 2^{m/2}\, \Psi(2^m t - n) .
\]
The most important case is the orthonormal wavelet basis. We can write
\[
x(t) = \sum_{m,n=-\infty}^{\infty} a_{mn}\, \Psi_{mn}(t) . \qquad (5)
\]
In order to introduce the multiresolution concept we say that the \(m_0\)th order resolution signal \(x_{m_0}(t)\) approximating x(t) is the part of the sum (5) composed of terms with \(m < m_0\). Introducing the scaling functions \(\Phi_{mn}(t) = 2^{m/2}\, \Phi(2^m t - n)\), where the function \(\Phi(t)\) is frequently called the father wavelet, we obtain
\[
x_m(t) = \sum_{n=-\infty}^{\infty} c_{mn}\, \Phi_{mn}(t) \qquad (6)
\]
and
\[
x_m(t) = x_{m-1}(t) + w_{m-1}(t)
= \sum_{n=-\infty}^{\infty} c_{(m-1)n}\, \Phi_{(m-1)n}(t) + \sum_{n=-\infty}^{\infty} d_{(m-1)n}\, \Psi_{(m-1)n}(t) . \qquad (7)
\]
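The splitting \(x_m = x_{m-1} + w_{m-1}\) is easy to see in the discrete Haar case: one analysis step separates the signal into coarse (scaling) and detail (wavelet) coefficients, and the synthesis step restores the original samples exactly. A minimal sketch (my own Haar instance, not taken from the paper):

```python
import numpy as np

def haar_split(c):
    """One analysis step: scaling coeffs c -> (coarse c1, detail d1)."""
    c = np.asarray(c, dtype=float)
    c1 = (c[0::2] + c[1::2]) / np.sqrt(2)   # low-pass (L) branch
    d1 = (c[0::2] - c[1::2]) / np.sqrt(2)   # high-pass (H) branch
    return c1, d1

def haar_merge(c1, d1):
    """Synthesis step: perfect reconstruction of the finer-level coeffs."""
    c = np.empty(2 * len(c1))
    c[0::2] = (c1 + d1) / np.sqrt(2)
    c[1::2] = (c1 - d1) / np.sqrt(2)
    return c

x = np.array([4.0, 2.0, 5.0, 5.0, 1.0, 3.0, 2.0, 0.0])
c1, d1 = haar_split(x)
assert np.allclose(haar_merge(c1, d1), x)   # x_m = x_{m-1} + w_{m-1}
print(c1, d1)
```

Because the Haar basis is orthonormal, the energy of the signal is preserved across the split, which is the discrete counterpart of expansion (5).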
3 Edge detection
Discrete signals \(c_{(m-1)n}\) and \(d_{(m-1)n}\) in (7) can be considered as outputs of a low-pass filter (L) and a high-pass filter (H), respectively, excited with the signal \(c_{mn}\). These filters form a QMF filter bank. In order to represent 2-D signals (images), we can use separable scaling functions and separable wavelets defined as follows
\[
\Phi^{LL}_{mn_1n_2}(t_1, t_2) = \Phi_{mn_1}(t_1)\, \Phi_{mn_2}(t_2) , \quad
\Psi^{LH}_{mn_1n_2}(t_1, t_2) = \Phi_{mn_1}(t_1)\, \Psi_{mn_2}(t_2) ,
\]
\[
\Psi^{HL}_{mn_1n_2}(t_1, t_2) = \Psi_{mn_1}(t_1)\, \Phi_{mn_2}(t_2) , \quad
\Psi^{HH}_{mn_1n_2}(t_1, t_2) = \Psi_{mn_1}(t_1)\, \Psi_{mn_2}(t_2) . \qquad (8)
\]
83
Figure 1: Original image boats
Figure 2: Edge detection using FDG masks: (a) (−1, 0, 1), (b) (−1, −15, 15, 1)

For edge detection, or generally, for information extraction, we have more freedom in the choice of wavelet and scaling functions than for other wavelet transform applications, such as subband coding. This means that these functions no longer need to form a QMF filter bank, because in our case we do not need to reconstruct the signal after it has been split into subbands by the wavelet decomposition. The scaling function can therefore be as simple as possible. It can, e.g., be defined by a (1/2, 1/2) or even a (1, 0) mask. The latter has been chosen in the example presented below. Wavelet filtering is, however, more complicated because it is used for the edge detection itself. Two alternative mask classes were tested: the first derivative of Gaussian (FDG) type mask and the second derivative of Gaussian (i.e. separable Laplacian) type mask (the LOG type mask for short). The latter is, in fact, the most common type of mask for this application. As an edge detection criterion, the extremum search was used in both cases (for LOG type masks we have, however, to take absolute values), and the zero-crossing search for LOG type masks only. Extremum search was realized very simply by an appropriate threshold. Results of experiments with the image boats (Fig. 1) are shown in Fig. 2a and b for the FDG type masks (−1, 0, 1) and (−1, −15, 15, 1), respectively, with threshold edge detection, and in Fig. 3a and b for the LOG type mask (−1, −4, −13, 36, −13, −4, −1) with the zero-crossing and the threshold criteria, respectively.
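The FDG-based detection with the threshold criterion reduces, per image row, to a convolution with the mask followed by an absolute-value threshold. A minimal sketch (pure NumPy; the fixed threshold value is an illustrative assumption, since the paper chooses it adaptively):

```python
import numpy as np

def fdg_edges(img, mask=(-1, 0, 1), thresh=40.0):
    """Row-wise FDG filtering followed by an absolute-value threshold."""
    img = np.asarray(img, dtype=float)
    kernel = np.array(mask[::-1], dtype=float)  # correlate = convolve reversed
    resp = np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, img)
    return np.abs(resp) > thresh                # extremum search by threshold

# Synthetic image with a vertical step edge between columns 3 and 4.
img = np.zeros((5, 8))
img[:, 4:] = 100.0
edges = fdg_edges(img)
print(edges.astype(int))   # responses flank the step
# (the last column may also fire due to the filter's border effect)
assert edges[:, 3:5].all() and not edges[:, :3].any()
```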
84
Figure 3: Edge detection using LOG mask (−1, −4, −13, 36, −13, −4, −1) with (a) zero-crossing and (b) threshold criterion

4 DTMF detection
In DTMF signaling, each signal consists of a couple of sinusoidal signals with proper frequencies. They belong to two separate frequency groups: the low frequency group: 697, 770, 852, 941 Hz and the high frequency group: 1209, 1336, 1477, 1633 Hz. Signal detection can be interpreted as filtering with a nonuniform filter bank. For our goal, we discretize equation (3) in the following way
\[
X(a, N_\tau, k) = \frac{1}{\sqrt{a}} \sum_{\nu} x(\nu)\, \Psi_k(\nu - N_\tau) . \qquad (9)
\]
Assuming the sampling rate \(F = 1/T\), the correspondence between the continuous time t and the discrete time \(\nu\) is \(\nu = \lfloor t/(aT) \rfloor\), where by \(\lfloor r \rfloor\) the greatest integer less than or equal to r is denoted. Thus we can also write \(N_\tau = \lfloor \tau/(aT) \rfloor\). For the detection of the kth DTMF component, a set of mother wavelet functions of the form
\[
\Psi_k(\nu) = w_k(\nu)\, W_{N_k}^{k\nu} \qquad (10)
\]
is assumed in equation (9), where
\[
W_{N_k} = e^{-j(2\pi/N_k)}
\]
and \(w_k(\nu)\) is a smoothing window function, e.g., the Blackman window
\[
w_k(\nu) = 0.42 - 0.5 \cos\!\left( \frac{2\pi\nu}{N_k - 1} \right) + 0.08 \cos\!\left( \frac{4\pi\nu}{N_k - 1} \right)
\]
where \(\nu = 0, 1, \ldots, N_k - 1\). Furthermore, we assume \(N_\tau = m N_k\), m being any integer, provided that the block size \(N_k\) is optimally chosen for minimization of the passband center frequency error \(\delta\). Thus, without any lack of generality, we shall henceforth assume that m = 0. The proposed algorithm is a modified Goertzel algorithm [6] for computation of the wavelet transform X(a, k) (i.e. for \(N_\tau = 0\)). It follows from equation (9) and the following manipulations
\[
X(a, k) = \frac{1}{\sqrt{a}} \sum_{\nu=0}^{N_k - 1} w_k(\nu)\, x(\nu)\, W_{N_k}^{k\nu}
= \frac{1}{\sqrt{a}}\, W_{N_k}^{-k N_k} \sum_{\nu=0}^{N_k - 1} w_k(\nu)\, x(\nu)\, W_{N_k}^{-k(N_k - \nu)} . \qquad (11)
\]

85

Equation (11) is a convolution of the signals \(w_k(\nu) x(\nu)\) and \(\frac{1}{\sqrt{a}} W_{N_k}^{-k\nu}\), \(\nu = 0, 1, \ldots, N_k - 1\). The latter can be interpreted as the impulse response of an IIR filter with the following transfer function
\[
H_k(z) = \frac{1}{\sqrt{a}} \cdot \frac{1}{1 - W_{N_k}^{-k} z^{-1}} . \qquad (12)
\]
Thus, the value X(a, k) in equation (11) is the \((N_k - 1)\)th output sample of this filter, i.e.
\[
X(a, k) = y_k(N_k - 1) , \qquad (13)
\]
with
\[
y_k(n) = \frac{1}{\sqrt{a}} \sum_{\nu=0}^{n} w_k(\nu)\, x(\nu)\, W_{N_k}^{-k(n - \nu)} , \qquad (14)
\]
where \(n = 0, 1, \ldots, N_k - 1\). Transfer function (12) can be expanded to the form
\[
H_k(z) = \frac{1}{\sqrt{a}} \cdot \frac{1 - W_{N_k}^{k} z^{-1}}{1 - \gamma_k z^{-1} + z^{-2}} , \qquad (15)
\]
with
\[
\gamma_k = 2 \cos\!\left( \frac{2\pi}{N_k}\, k \right) . \qquad (16)
\]
Thus, the basic algorithm step, given by
\[
q_k(n) = \gamma_k\, q_k(n-1) - q_k(n-2) + w_k(n)\, x(n) , \qquad (17)
\]
consists exclusively of real computations. Next, the signal energy, but only once for a block, i.e. for \(n = N_k - 1\), must be computed, given by
\[
E_k = a^{-1} \left( [q_k(N_k - 1) - q_k(N_k - 2)]^2 - 2\, q_k(N_k - 1)\, q_k(N_k - 2) \left( \cos \frac{2\pi k}{N_k} - 1 \right) \right) \qquad (18a)
\]
or
\[
E_k = a^{-1} \left( [q_k(N_k - 1) + q_k(N_k - 2)]^2 - 2\, q_k(N_k - 1)\, q_k(N_k - 2) \left( \cos \frac{2\pi k}{N_k} + 1 \right) \right) . \qquad (18b)
\]
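The recursion (17) together with the block-energy evaluation (18a) already gives a complete single-frequency detector. A sketch (assuming a = 1, the Blackman window, and a fixed block size; in the paper the block size is tuned per frequency, and all names here are illustrative):

```python
import numpy as np

def goertzel_energy(x, k, n_k):
    """Windowed Goertzel: recursion (17) plus energy formula (18a), a = 1."""
    nu = np.arange(n_k)
    w = (0.42 - 0.5 * np.cos(2 * np.pi * nu / (n_k - 1))
             + 0.08 * np.cos(4 * np.pi * nu / (n_k - 1)))   # Blackman window
    gamma = 2.0 * np.cos(2.0 * np.pi * k / n_k)
    q1 = q2 = 0.0
    for n in range(n_k):
        q0 = gamma * q1 - q2 + w[n] * x[n]                  # step (17)
        q2, q1 = q1, q0
    c = np.cos(2.0 * np.pi * k / n_k)
    return (q1 - q2) ** 2 - 2.0 * q1 * q2 * (c - 1.0)       # energy (18a)

fs, n_k = 8000, 110
t = np.arange(n_k) / fs
tone = np.sin(2 * np.pi * 697 * t) + np.sin(2 * np.pi * 1209 * t)  # DTMF "1"

k697 = round(697 * n_k / fs)   # bin closest to 697 Hz
k941 = round(941 * n_k / fs)   # 941 Hz is absent from the tone
e_present = goertzel_energy(tone, k697, n_k)
e_absent = goertzel_energy(tone, k941, n_k)
print(e_present, e_absent)
assert e_present > 10 * e_absent
```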
In the experimental program, equation (18a) was used. Now we shall determine the necessary block size \(N_k\). This may be done by analysis of the required frequency resolution or of the permissible relative error \(\delta\) of the filter passband center frequencies. As the required frequency resolution we can define the minimum difference between the DTMF frequencies, i.e. 73 Hz = (770 − 697) Hz. Thus, we get N ≥ 110. The relative error \(\delta\) of the passband center frequency \(f_p\) for the DTMF frequency \(f_{DTMF}\) is defined as
\[
\delta = \frac{f_p - f_{DTMF}}{f_{DTMF}} \cdot 100\% . \qquad (19)
\]
Since DTMF frequencies are generated with a relative error less than ±1.8%, approximately the same bounds can be postulated for \(\delta\). Assuming some margin of, say, 45%, we get the bounds ±1.8 × 1.45% ≈ ±2.6%, and then we obtain N ≥ 104, i.e., approximately the same value as before from the frequency resolution requirement.
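The block-size choice can be illustrated by scanning candidate lengths near N = 110 and evaluating the center-frequency error of equation (19) for each DTMF frequency (a sampling rate of 8 kHz and the scanned range are my assumptions):

```python
FS = 8000.0
DTMF = (697, 770, 852, 941, 1209, 1336, 1477, 1633)

def center_error(f, n):
    """Relative error (19), in percent, of the Goertzel bin nearest to f."""
    k = round(f * n / FS)
    fp = k * FS / n          # passband center frequency of bin k
    return (fp - f) / f * 100.0

# For each frequency, pick the block size with the smallest |delta|.
best = {f: min(range(104, 117), key=lambda n: abs(center_error(f, n)))
        for f in DTMF}
for f, n in best.items():
    print(f, n, round(center_error(f, n), 3))

# Every frequency admits a block size in the scanned range within the bound.
assert all(abs(center_error(f, n)) <= 2.6 for f, n in best.items())
```

This mirrors the idea of the paper's modified algorithm: varying the block length by only a few percent per frequency keeps every passband center within the postulated ±2.6% tolerance.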
86
5 Conclusions

A multiresolution approach has been proposed for information extraction from signals and was applied to edge detection in images. By this means pseudoedges can easily be removed while the proper edges are located precisely. Two mask types, the FDG type and the LOG type, and also two edge-detection criteria, extremum search and zero-crossing, have been considered. FDG type masks surprisingly turned out to be better in the considered application than the commonly used LOG type masks. This is because even very simple FDG masks like (−1, 0, 1) (see Fig. 2a) yield quite satisfactory results and the threshold can be adaptively chosen for edges with the prescribed thickness. On the other hand, LOG type masks are very noise sensitive and the threshold criterion works very badly with them (see Fig. 3b), so it cannot practically be used for them. It should be stressed that the threshold operation is much easier than the zero-crossing detection.

The second example considered in this contribution is the DTMF detection. The main idea proposed for that application is a modification of the classical Goertzel algorithm. This modification consists in varying the analysed block length (to some extent only, say by 5-6%) according to the frequency to be detected. This leads to a massive reduction of the average block size and to a reduction of the center frequency errors \(\delta\).
References

[1] L. Cohen, Time-Frequency Analysis, Prentice-Hall, Englewood Cliffs, NJ, 1995.
[2] Y. T. Chan, Wavelet Basics, Kluwer Academic Publ., Boston, 1995.
[3] S. G. Mallat, Multifrequency channel decompositions of images and wavelet models, IEEE Trans. Acoustics, Speech, a. Signal Proc., vol. 37, no. 12, Dec. 1989, pp. 2091-2110.
[4] A. Dąbrowski, A. Franc, A. Czajka, Realization of wavelet transform for Windows with application to edge detection in images, First International Symposium "Mathematical Models in Automation and Robotics", Międzyzdroje, Sep. 1-3, 1994.
[5] A. Dąbrowski, Wavelet transform-based modification of Goertzel algorithm for detection of DTMF signals, The International Conference "Signal Processing Applications & Technology", Boston, MA, 1995.
[6] G. Goertzel, "An algorithm for the evaluation of finite trigonometric series", The American Math. Monthly, vol. 65, pp. 34-35, Jan. 1958.
[7] A. Dąbrowski, "Nonuniform digital filter bank for DTMF receiver", Proc. Workshop on Multirate Systems, Filter Banks and Wavelet Analysis, ETH Zurich, Oct. 26, 1992, pp. 10-13.
[8] A. Dąbrowski, W. Kabaciński, "Experiences with DTMF receivers and tone senders in Poland using DSP's", Proc. Int. Conf. Signal Process. Appl. & Technology, ICSPAT'93, Santa Clara, USA, Sep. 1993, pp. 193-198.
[9] S. Bagchi and S. K. Mitra, "An efficient algorithm for DTMF decoding using the subband NDFT", Proc. ISCAS'95, Seattle, USA, April 1995, pp. 1936-1939.

Work supported in part by the grant KBN 0452/P4/94/07 and in part by the project DPB 44-443.
Invited Session C:
GENERAL TECHNIQUES AND ALGORITHMS
89
Computational Methods and Tools for Simulation and Analysis of Complex Processes
(applying \(\omega_n^k\) criteria, ANN and CA)

V.V. Ivanov
Laboratory of Computing Techniques and Automation
Joint Institute for Nuclear Research, 141980 Dubna, Russia
FAX: 007-096-21-65145; E-mail: ivanov

Abstract

The tutorial is devoted to computational methods and tools for simulation and analysis of different complex processes in physics, in medicine and in social life. There are considered: 1) multivariate data analysis methods based on \(\omega_n^k\) criteria and artificial neural networks, 2) neural network applications for solving problems of data classification and one-dimensional function approximation, and 3) cellular automata usage in pattern recognition and complex system simulation. These methods and tools are developed in the Laboratory of Computing Techniques and Automation of the Joint Institute for Nuclear Research (Dubna, Russia) in collaboration with the International Solvay Institutes for Physics and Chemistry (Brussels, Belgium).
1. Multivariate data analysis based on \(\omega_n^k\) criteria and ANN

The primary goal of experimental data processing consists in the identification of useful events among all events obtained in the experiment. Under an event we mean the set of features characterizing the analyzed pattern. The classification of events in the one-dimensional case is carried out with the help of a simple cut on a feature variable. When an event is characterized by more than one variable, the procedure for constructing a multivariate classifier is not trivial. In paper [1] we have suggested and investigated a class of new nonparametric \(\omega_n^k\) statistics
ω_n^k = n^{k/2}/(k+1) · Σ_{i=1}^{n} { [i/n − F(x_i)]^{k+1} − [(i−1)/n − F(x_i)]^{k+1} },
where F(x) is the theoretical distribution function of x, x_1 < x_2 < ... < x_n is an ordered sample, and n is the sample size. On their basis, corresponding goodness-of-fit criteria were constructed, which are usually applied for testing the correspondence of each sample event to an a priori known distribution. On the basis of the ω_n^k criteria, a method was developed for extracting low-probability multivariate events from a background of predominant processes [1]; it was successfully applied in several experiments for the selection of rare events [2, 3]. Recently, the use of artificial neural networks (ANN) in multi-dimensional data analysis has become widespread [4]. One such problem consists in classifying individual events, represented by empirical samples of finite volume, as pertaining to one of the different partial distributions composing the analyzed distribution. A feed-forward multilayer network, the multilayer perceptron (MLP), is a convenient tool for constructing multivariate classifiers, although its learning speed and power of recognition depend critically on the choice of input data [5].
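Using the formula above as reconstructed here, the statistic can be evaluated directly from a sample; a minimal sketch (the sample and F are illustrative), noting that for k = 2 the expression reduces to the classical Cramér-von Mises nω² statistic:

```python
def omega_n_k(sample, F, k=2):
    """omega_n^k = n^(k/2)/(k+1) * sum_i [ (i/n - F(x_i))^(k+1)
                                         - ((i-1)/n - F(x_i))^(k+1) ],
    where F is the hypothesized distribution function and the sample
    is ordered internally."""
    xs = sorted(sample)
    n = len(xs)
    s = 0.0
    for i, x in enumerate(xs, start=1):
        u = F(x)
        s += (i / n - u) ** (k + 1) - ((i - 1) / n - u) ** (k + 1)
    return n ** (k / 2) / (k + 1) * s
```

For k = 2 the result agrees with the usual computing formula 1/(12n) + Σ[(2i−1)/(2n) − F(x_i)]², which provides a convenient cross-check of the implementation.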
Such a network involves an input layer corresponding to the processed data, an output layer dealing with the results and, also, hidden layers. A network architecture is presented in Fig. 1.
Figure 1: Architecture of a multilayer perceptron with one hidden layer

Here x_k, h_j and y_i denote the input, hidden and output neurons, respectively; w_jk are the weights of the connections between the input neurons and the hidden layer, and w_ij are the weights of the connections between the hidden and the output neurons. The signals a_j = Σ_k w_jk x_k and a_i = Σ_j w_ij h_j are fed to the inputs of the hidden and output neurons, respectively. The output signals from these neurons are determined by the expressions h_j = g[(a_j + θ_j)/T] and y_i = g[(a_i + θ_i)/T], where g(a, T) is a transfer function, T is the "temperature" determining its slope, and θ is the threshold of the corresponding node. Typically, g(a, T) is a sigmoid, for example of the form g(a, T) = tanh(a/T). The tuning of the MLP to the problem being solved (a procedure known as ANN learning) consists in minimizing the following error functional with respect to the weights:

E = (1/2) Σ_p [y^(p) − t^(p)]²,
where p = 1, 2, ..., N_train indexes the training patterns, and t^(p) is the desired value of the output signal. A comparative study of multidimensional classifiers based on the ω_n^k goodness-of-fit criteria and multilayer perceptrons (MLP) was carried out in work [6]. It was shown that the MLP exhibits an "instantaneous" learning effect and a power close to the limit when the input data are represented in the form of variational series. The reasons underlying these effects are analyzed, and recommendations for joint usage of the ω_n^k criteria and of the MLP are given [6]. The identification of rare events on a background of dominant processes is an important problem of applied mathematical statistics. The practical impossibility of ANN training on data with significantly different contributions of the separated classes strongly restricts the wide adoption of neural computational methods in this field. A method for solving this problem was developed in work [7]. It is based on the application of an MLP with a single layer of
hidden neurons having a step-like transition function. The procedure includes two stages: 1) network learning on data with identical contributions of each separated class, and 2) transformation of the calculated bias matrix. It is shown that the developed approach allows neural networks to be used for the identification of rare events with a contribution of the order of 0.1%.

2. ANN in data classification and function approximation

In recent years artificial neural networks have acquired widespread application in the natural sciences, in medicine, etc. Here we present some examples of ANN usage for solving problems of data classification and pattern recognition, and for function approximation. A two-level trigger was developed for suppression of the background and for effective selection of events involving short-lived A-, E- and C-particles in the experiment DISTO. The first-level trigger is intended for selection of events by their multiplicity: only four-prong events are selected. Events accepted by the first-level trigger are then examined with the help of the second-level trigger, which is applied for track recognition, for searching for a secondary vertex, and for identifying the secondary particles. It is based: 1) on recognition of straight tracks applying a specialized cellular automaton (see details in the next section), 2) on momentum variables permitting effective selection of events containing a secondary vertex, and 3) on identification of the secondary charged particles applying an MLP [8]. A simple and efficient algorithm for identifying events with a secondary vertex making use of an MLP was developed in paper [9]. The differences R_x, R_y (in the XOZ and YOZ projections, respectively) between the largest and the smallest impact parameters¹ D_i (i = 1, 2, ..., n) of all tracks belonging to each of the analyzed events were used in establishing the identification criteria of signal and background events.
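The multilayer perceptron used throughout this section (Fig. 1, transfer function g(a, T) = tanh(a/T), error functional E) can be sketched as follows; the layer sizes, weights and data here are hypothetical placeholders, not values from the paper:

```python
import math

def mlp_forward(x, W1, theta1, W2, theta2, T=1.0):
    """Forward pass of a one-hidden-layer MLP with g(a, T) = tanh(a/T).
    W1[j][k] = w_jk (input -> hidden), W2[i][j] = w_ij (hidden -> output)."""
    h = [math.tanh((sum(w * xk for w, xk in zip(row, x)) + th) / T)
         for row, th in zip(W1, theta1)]
    return [math.tanh((sum(w * hj for w, hj in zip(row, h)) + th) / T)
            for row, th in zip(W2, theta2)]

def error_functional(data, net):
    """E = 1/2 * sum_p [y^(p) - t^(p)]^2 over training patterns (x, t)."""
    return 0.5 * sum((mlp_forward(x, *net)[0] - t) ** 2 for x, t in data)
```

Training (minimizing E with respect to the weights) would be layered on top of this, e.g. by gradient descent; the sketch only fixes the forward computation and the functional defined in the text.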
An effective method for identifying the tracks associated with a particular secondary vertex in an event was also developed. The method is based on the differences between the asymmetries exhibited by the sets D_i of individual signal events and background events. A procedure for recognition of features in the ECG of one heart beat, and from a single channel, using an MLP was developed in work [10]. The main idea of the method is to present to the network not the raw data but transformed data. We believe that a system of polynomials orthogonal on a set of uniformly spaced points is the adequate formalism for the analysis of electrocardiograms, as measurements are taken at equal time intervals and all points can be denoted 0, 1, 2, ..., n. The above-mentioned polynomials P_{k,n}(x), k = 0, 1, 2, ..., m ≤ n are related by the following recurrence equation (see, for instance, [11]):
(m+1)(n−m)/(2(2m+1)) · P_{m+1,n}(x) + (x − n/2) · P_{m,n}(x) + m(n+m+1)/(2(2m+1)) · P_{m−1,n}(x) = 0,  1 ≤ m < n,

P_{0,n}(x) = 1,  P_{1,n}(x) = 1 − 2x/n.
The polynomial P_m(x) approximating the function f(x) in this case is

P_m(x) = c_0 P_{0,n}(x) + c_1 P_{1,n}(x) + ... + c_m P_{m,n}(x),
¹The impact parameter of a track is defined in the plane passing through the center of the target and perpendicular to the beam.
where

c_i = (2i+1) n^{(i)} / (i+n+1)^{(i+1)} · Σ_{k=0}^{n} f(k) P_{i,n}(k),  i = 0, 1, 2, ..., m,

and a^{(s)} is the factorial polynomial of the form a^{(s)} = a(a−1)···(a−s+1).
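A sketch of the expansion described above, using the recurrence relation and coefficient formula as reconstructed here (the helper names are ours, and the data is an arbitrary stand-in for an ECG trace):

```python
def gram_poly(m, n, x):
    """P_{m,n}(x) via the recurrence
    (m+1)(n-m) P_{m+1} = (2m+1)(n-2x) P_m - m(n+m+1) P_{m-1},
    with P_0 = 1 and P_1 = 1 - 2x/n."""
    p_prev, p = 1.0, 1.0 - 2.0 * x / n
    if m == 0:
        return p_prev
    for j in range(1, m):
        p_prev, p = p, ((2 * j + 1) * (n - 2 * x) * p
                        - j * (n + j + 1) * p_prev) / ((j + 1) * (n - j))
    return p

def falling(a, s):
    """Factorial polynomial a^{(s)} = a(a-1)...(a-s+1)."""
    r = 1.0
    for t in range(s):
        r *= a - t
    return r

def fit(values, m):
    """Coefficients c_i of f(k) ~ sum_i c_i P_{i,n}(k) for k = 0..n."""
    n = len(values) - 1
    cs = []
    for i in range(m + 1):
        norm = (2 * i + 1) * falling(n, i) / falling(n + i + 1, i + 1)
        cs.append(norm * sum(f * gram_poly(i, n, k)
                             for k, f in enumerate(values)))
    return cs

def evaluate(cs, n, x):
    return sum(c * gram_poly(i, n, x) for i, c in enumerate(cs))
```

As a sanity check, data that is exactly a quadratic in k is reproduced exactly by an m = 2 expansion, while keeping only m + 1 coefficients instead of n + 1 samples illustrates the compression mentioned in the text.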
The proposed transformation provides a significantly simpler data structure, and stability to noise and to other accidental factors. The method was tested on data generalizing features of normal and modified ECGs and provided a high level of recognition, unveiling barely noticeable pathologies. A by-product of the method is compression of the raw data and reduction of its amount; the compression coefficient has a value of 5-10 and can be improved. The procedure adopted for the parametrization of functions defined on a finite set of argument values plays an essential role in the problem of experimental data processing. Diverse methods have been developed and are widely applied for constructing approximating functions in the form of algebraic or trigonometric polynomials. In our work [12], a nontraditional approach to the interpolation of one-dimensional functions is presented. It is based on the application of a specialized feed-forward neural network, which realizes an expansion in the set of orthogonal Chebyshev polynomials of the first kind. This approach permits the expansion coefficients to be calculated during the network training process, for which arbitrary points (for instance, measured in experiments) from the function's domain are used. The neural network provides an accuracy of function approximation practically coinciding with the accuracy that can be achieved within the traditional approach, when the values of the function at the nodal points are known.

3. CA in pattern recognition and complex system simulation

Cellular automata arose from numerous attempts to create a simple mathematical model describing complex biological structures and processes [13]. A cellular automaton is one of the simplest discrete dynamical systems, whose behavior is totally dependent upon the local interconnections between its elementary parts [14].
This model turned out to be very productive and has been widely and successfully applied in describing various complex structures and processes in physics, biology, chemistry, etc. A typical cellular automaton is constructed in accordance with the following algorithm:

1. cells and their possible discrete states are defined; usually, each cell may assume one of two states, 0 or 1; however, there may be cellular automata with more states;
2. interconnections between cells are defined; usually, each cell can only communicate with neighbouring cells;
3. rules determining the evolution of the cellular automaton are fixed; they depend on the actual problem considered and usually have a simple functional form;
4. the cellular automaton is a timed system, in which all cells change states simultaneously.

A model of the cellular automaton for recognition of straight tracks was developed in paper [15]. In this case a cell was identified with the straight-line segment connecting two hits in neighbouring coordinate detectors. To take into account the inefficiency of the detectors one must also consider the segments connecting hits that skip one detector.
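The algorithm above, specialized to the track-finding model of [15], can be sketched as follows. This is a simplified 2D stand-in, not the published implementation: hits are y-coordinates in equally spaced layers, a cell is a segment between hits in neighbouring layers, and the slope difference of adjacent segments replaces the angle cut; all names and tolerances are hypothetical:

```python
def build_segments(hits):
    """Cells: one segment per pair of hits in neighbouring layers."""
    return [(l, i, j) for l in range(len(hits) - 1)
            for i in range(len(hits[l])) for j in range(len(hits[l + 1]))]

def slope(hits, seg):
    l, i, j = seg
    return hits[l + 1][j] - hits[l][i]  # layers at unit spacing

def evolve(hits, slope_tol=0.2, steps=5):
    """A segment keeps state 1 while it has a continuation on each
    existing side whose direction differs by less than slope_tol."""
    state = {seg: 1 for seg in build_segments(hits)}
    for _ in range(steps):
        new = {}
        for seg in state:
            l, i, j = seg
            # neighbours: live segments sharing an endpoint with this one
            left = [s for s in state if s[0] == l - 1 and s[2] == i and state[s]]
            right = [s for s in state if s[0] == l + 1 and s[1] == j and state[s]]
            ok_l = any(abs(slope(hits, s) - slope(hits, seg)) < slope_tol
                       for s in left)
            ok_r = any(abs(slope(hits, s) - slope(hits, seg)) < slope_tol
                       for s in right)
            keep = (l == 0 or ok_l) and (l == len(hits) - 2 or ok_r)
            new[seg] = 1 if (state[seg] and keep) else 0
        state = new  # synchronous update: all cells switch together
    return [s for s, v in state.items() if v]
```

On a toy event with one straight track plus two noise hits, only the two segments of the real track survive the evolution; segments built from noise hits have no angle-compatible continuation and die out.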
Clearly, only those segments can be considered neighbours which have a common point serving as the end of one segment and the beginning of the other. At each step a cell can assume one of two possible states: 1, if the segment can be considered part of a track, and 0 otherwise. The angle between two adjacent segments was taken as the criterion for assigning segments to a track. Owing to the discrete structure of the coordinate detectors and to multiple scattering in the material of the experimental apparatus, the angles between track segments in the real experiment are not zero, but an upper limit can be imposed. Upon completion of the work of the cellular automaton, additional testing of the quality of the reconstructed tracks (for instance, for the presence of at least two hits belonging only to each individual track) is carried out. This permits rejecting "phantom" tracks, accidentally constructed from hits belonging to different tracks.
Figure 2: Initial configuration of the cellular automaton for a typical Monte-Carlo event in the spectrometer DISTO
Figure 3: Resultant configuration of the cellular automaton for the event presented in the previous figure
The program realization of the described approach has shown high efficiency and speed on simulated data for the experiment DISTO. Its working speed provides for the processing of approximately 1000 events/sec using a 50 MIPS RISC processor. This makes it suitable for track recognition in the second-level trigger of the DISTO spectrometer. In paper [16] the implementation of Probabilistic Cellular Automata in the study of multispecies agent groups is investigated. As a first step we consider the communication between two species governed by a probabilistic rather than a deterministic process. In this way we implement the kind of coupling suggested in [17], following the spirit of probabilistic control for unstable systems. Here, as the controller one can consider the population of agents following a specific pattern, and as the unstable system the population which tends to cover all available space in an ergodic-like fashion. From the variety of all possible realizations of the above idea we start our investigations, for the sake of clarity and simplicity (helping us fix ideas by simple examples), by considering first a two-species system, one of the species being represented by a single agent and the other by a small group of agents (50-100).
Acknowledgments This work has been partly supported by the Commission of the European Community within the framework of the EU-RUSSIA Collaboration, in accordance with the project ESPRIT 21042: Computational Tools and Industrial Applications of Complexity.
References

[1] V.V. Ivanov and P.V. Zrelov:
"Nonparametric Integral Statistics ω_n^k: Main Properties and Applications", Int. Journal "Computers & Mathematics with Applications" (in press); JINR Communication P10-92-461, 1992 (in Russian).
[2] P.V. Zrelov and V.V. Ivanov:
"The Relativistic Charged Particles Identification Method Based on the Goodness-of-Fit ω_n^3-Criterion", Nucl. Instr. and Meth. in Phys. Res., A310 (1991) 623-630.
[3] P.V. Zrelov, V.V. Ivanov, V.I. Komarov, A.I. Puzynin and A.S. Khrykin:
"Modelling of Experiment on Investigation of Processes of Subthreshold K⁺-Mesons Production". JINR Preprint P10-92-369, Dubna, 1992; "Mathematical Modelling", v. 4, No. 11, 1993, p. 56-74 (in Russian).
[4] B. Denby: "Tutorial on Neural Networks Applications in High Energy Physics: The 1992 Perspective". In: Proc. of II Int. Workshop on "Software Engineering, Artificial Intelligence and Expert Systems in High Energy Physics"; New Comp. Tech. in Phys. Res. II, edited by D. Perret-Gallix, World Scientific, 1992, p. 287.
[5] A.Yu. Bonushkina, V.V. Ivanov and P.V. Zrelov: "Input Data for a Multilayer Perceptron in the Form of Variational Series". In: Proc. of the Fourth Int. Workshop on Software Engineering, Artificial Intelligence and Expert Systems for High Energy and Nuclear Physics, April 3-8, 1995, Pisa, Italy; "New Computing Techniques in Physics Research IV", edited by B. Denby & D. Perret-Gallix, World Scientific, 1995, p. 751.
[6] V.V. Ivanov: "Multidimensional Data Analysis Based on the ω_n^k Criteria and Multilayer Perceptron". In: Proc. of the Fourth Int. Workshop on Software Engineering, Artificial Intelligence and Expert Systems for High Energy and Nuclear Physics, April 3-8, 1995, Pisa, Italy; "New Computing Techniques in Physics Research IV", edited by B. Denby & D. Perret-Gallix, World Scientific, 1995, p. 765. A.Yu. Bonushkina, V.V. Ivanov and P.V. Zrelov: "Multivariate Data Analysis Based on the ω_n^k Criteria and Multilayer Perceptron", Int. Journal "Computers & Mathematics with Applications" (in press).
[7] V.V. Ivanov and P.V. Zrelov: "Rare Events Selection on a Background of Dominated Processes Applying Multilayer Perceptron". Report at this conference.
[8] M.P. Bussa, L. Fava, L. Ferrero, A. Grasso, V.V. Ivanov, I.V. Kisel, E.V. Konotopskaya, G.B. Pontecorvo: "On a Possible Second-Level Trigger for the Experiment DISTO", Nuovo Cimento, vol. 109A, 1996, p. 327.
[9] A.Yu. Bonushkina, V.V. Ivanov, Yu.K. Potrebenikov, T.B. Progulova and G.T. Tatishvili: "Identification of Events with Secondary Vertex in the Experiment EXCHARM". JINR Communications, P1-96-56, Dubna, 1996 (in Russian).
[10] A. Babloyantz, V.V. Ivanov, P.V. Zrelov and P. Maurer: "A New Approach to ECG's Analysis Involving Neural Network", Neural Networks Letters (in press).
[11] I.S. Berezin and N.P. Zhydkov: "Computing Methods", vol. 1, Moscow, 1959 (in Russian).
[12] V. Basios, A.Yu. Bonushkina and V.V. Ivanov: "On a Method for Approximating One-Dimensional Functions", Int. Journal "Computers & Mathematics with Applications" (in press).
[13] S. Wolfram (ed.): "Theory and Applications of Cellular Automata". World Scientific, 1986.
[14] T. Toffoli and N. Margolus: "Cellular Automata Machines: A New Environment for Modelling". MIT Press, Cambridge, Mass., 1987.
[15] M.P. Bussa, L. Fava, L. Ferrero, A. Grasso, V.V. Ivanov, I.V. Kisel, E.V. Konotopskaya, G.B. Pontecorvo: "Application of a Cellular Automaton for Recognition of Straight Tracks in the Spectrometer DISTO", Int. Journal "Computers & Mathematics with Applications" (in press).
[16] V. Basios, F. Bosco, V.V. Ivanov and I.V. Kisel: "From Individual Interactions to a Collective Behaviour of Autonomous Agents Group". Report at this conference.
[17] "Probabilistic Control of Chaos: Chaotic Maps Under Control", to appear in: The Int. Journal "Computers & Mathematics with Applications", Special Issue, Eds. I. Prigogine, I. Antoniou, et al. (in Press).
RARE EVENTS SELECTION ON A BACKGROUND OF DOMINATED PROCESSES APPLYING MULTILAYER PERCEPTRON Ivanov V.V. and Zrelov P.V. Joint Institute for Nuclear Research, Dubna, Russia Abstract
Rare events identification on a background of dominant processes is an important problem of applied mathematical statistics. The practical impossibility of training a neural network on data with significantly different contributions of the separated classes strongly restricts the wide adoption of neural computational methods. In this work a uniform approach for solving the mentioned problem is developed. Our approach is based on the application of one type of neural network, a multilayer perceptron with a single layer of hidden neurons having a step transfer function.

1. Introduction

The present work deals with problems involving the application of neural-network classifiers for identifying rare events on a background of dominant processes. By an event we mean the set of features characterizing the analyzed pattern. The main difficulty in applying neural networks to the indicated problem is connected with the "reluctance" of the network to learn when samples corresponding to different classes with strongly differing a priori probabilities are supplied to its input. The network effectively does not "observe" patterns presented in relatively small quantities. In this case the source of errors is the investigator's intent to train the network on the basis of equal probabilities (P(ω1) = P(ω2)), and then to apply it for the separation of classes with unequal contributions (P(ω1) ≠ P(ω2)). We call this procedure the "approximate" Bayesian classification. The investigator usually does not take into account the fact that the separating boundaries for these two cases can differ significantly, and this can lead to incorrect classification. All problems considered in the paper correspond to Bayesian classification with a minimal level of error [1]. The results of this work correspond to the case when the Bayes limit exists and the separating boundary can be found.

2. Criteria of Identification of Rare Events

In the theory of pattern recognition, a key criterion characterizing the quality of the obtained result is the so-called level of recognition R. It represents the fraction of correctly identified events out of the whole number of events presented for classification and can be written in the form
R = [1 − α + m(1 − β)] / (1 + m),   (1)
where m = N2/N1 and N1, N2 are the numbers of events of the first and the second class, respectively, while α and β are the fractions of misclassified events of the first and the second class. However, it must be noted that in the problem of identification of rare events the value R cannot be the basic or, at least, the only criterion, because in such problems it is necessary to extract useful events with minimum losses and to suppress the background to a level at which the examined signal is displayed quite well. The fraction of signal events in the whole number of selected events can be used as a convenient criterion. It can be represented in the form:

η = (1 − α) / (1 − α + βm).   (2)
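Criteria (1) and (2) are straightforward to compute; a minimal sketch (function names are ours):

```python
def recognition_level(alpha, beta, m):
    """Level of recognition, eq. (1): R = [1 - alpha + m(1 - beta)] / (1 + m),
    with alpha, beta the misclassification fractions and m = N2/N1."""
    return (1 - alpha + m * (1 - beta)) / (1 + m)

def signal_fraction(alpha, beta, m):
    """Signal fraction among selected events, eq. (2):
    eta = (1 - alpha) / (1 - alpha + beta * m)."""
    return (1 - alpha) / (1 - alpha + beta * m)
```

The second criterion makes the rare-event difficulty explicit: with a 99:1 background-to-signal ratio, even a 1% background acceptance dilutes the selected sample below a 50% signal fraction, although R remains close to 1.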
Depending on the subject field and on the concrete problem, the roles of the parameters R and η can change. It can also be convenient to consider some modification of the criterion η.

3. Bayesian classification of distributions with different contributions

Let us carry out a quantitative consideration of the indexes of Bayesian classification on the example of the classification of multidimensional Gaussian distributions with diagonal covariance matrices Σ_j = σ_j² I, where j = 1, 2, and I is the unit matrix. Let σ1 ≠ σ2. The boundary separating the classes is a hypersphere with radius r and center at a point b:

r = { σ1²σ2²/(σ1² − σ2²)² · Σ_{i=1}^{n} (μ1i − μ2i)² + 2σ1²σ2²/(σ1² − σ2²) · ln[(σ1/σ2)^n · P(ω2)/P(ω1)] }^{1/2},

b_i = (σ1² μ2i − σ2² μ1i)/(σ1² − σ2²),   (3)
i = 1, 2, ..., n; here μ_j is the vector of mean values, P(ω_j) is the a priori probability of events ω_j, j = 1, 2, and n is the dimension of the space. It can be shown that in the general case μ1 ≠ μ2 (for definiteness σ2 < σ1, μ1 = 0), a good approximation for the value α is given by the expression

α ≈ I( (1/2) · (n+2a)/(n+3a) · [ (r/σ1)² + a²/(n+3a) ], f/2 ),   (4)
where I is the incomplete gamma function, a = Σ_{i=1}^{n} (b_i/σ1)², and f = (n+2a)³/(n+3a)². Similarly, the approximate expression for β has the form

β ≈ 1 − I( (1/2) · (n+2a')/(n+3a') · [ (r/σ2)² + a'²/(n+3a') ], f'/2 ),   (5)
where a' = Σ_{i=1}^{n} ((b_i − μ2i)/σ2)², and f' = (n+2a')³/(n+3a')². Expressions (1)-(5) connect the values σ1, σ2, μ1, μ2 with the variables R and η characterizing the quality of classification.

4. Rare events classification applying a neural network

The general scheme of the method involves a two-step procedure. At the first stage the training of a network is performed for a ratio of 1:1 between the classes being separated, while at the second stage a correction is carried out on a certain group of network parameters termed shifts. In a number of simple cases the correction permits simple analytic transformations, and in more complicated cases it requires minimization of a functional in the space of shifts for given weights of the neural connections. In the general case a change of
Fig. 1. The efficiency of the method for two relations between the contributions of the events: 1) P(ω1) = 0.1, P(ω2) = 0.9 (two top charts); 2) P(ω1) = 0.001, P(ω2) = 0.999 (two bottom charts).
the relation between P(ω1) and P(ω2) leads to some transformation of the separating surface, and in a special case it is determined by a relation of similarity. In the process of constructing the separating boundary with the help of a multilayer perceptron with one layer of hidden neurons having a step transfer function, fitting of the parameters approximating this boundary is carried out. A change of the relation between P(ω1) and P(ω2) corresponds to a parallel translation of each hyperplane. The value of this shift is determined by the threshold of the network. The method is best considered using the example of the previous chapter. In this case a change in the relationship between P(ω1) and P(ω2) only results in a change of the radius of the separating hypersphere, i.e. it is determined by similarity relations. It can be readily shown that, for known radii r and r' of the Bayes hyperspheres, the
shift θ_j of each neuron of the hidden layer should be recalculated by the formula

θ'_j = θ_j + (r' − r) · ( Σ_{i=1}^{n} w_ij² )^{1/2},   (6)
where wij is the weight connecting the j-th neuron of the hidden layer with the i-th input neuron.
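Assuming the reconstruction of formula (6) above (including its sign convention, which we could not verify against the original typesetting), the shift correction is a one-line recalculation per hidden neuron:

```python
import math

def recalc_shifts(thetas, W, r_old, r_new):
    """Eq. (6) as reconstructed: theta'_j = theta_j + (r' - r) * ||w_j||,
    translating each hidden-layer hyperplane along its normal as the
    separating hypersphere radius changes from r to r'.
    W[j] is the weight vector of hidden neuron j over the inputs."""
    return [th + (r_new - r_old) * math.sqrt(sum(w * w for w in wj))
            for th, wj in zip(thetas, W)]
```

The weight matrix is left untouched, which is the point of the method: only the thresholds (shifts) are corrected after the 1:1 training stage.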
For the practical realization of the method, a multilayer perceptron simulator from the package JETNET-3 [2] was used. Two versions of the method were considered: 1) with recalculation of the values of the shifts after network training for a ratio P(ω1) = P(ω2), and 2) with determination of the indicated values by means of minimization of the network functional in the process of its repeated training. In the second case the same data were supplied to the input of the neural network with a fixed weight matrix. The problem of separating two Gaussians with diagonal matrices Σ_j = σ_j² I, j = 1, 2, in a space of dimension n = 5 and with σ1 = 1, σ2 = 0.3 was considered. In Fig. 1 the values of the variables R and η are presented for the case μ1 = μ2 = 0 and P(ω1) = 0.1, P(ω2) = 0.9 (figures 1a and 1b) and for the case |Δμ_i| = 1, i = 1, 2, ..., 5 and P(ω1) = 0.001, P(ω2) = 0.999 (figures 1c and 1d). The presented values are marked by asterisks for the case of minimization of the functional, and by squares for the case of recalculating the shifts using formula (6). Contour notations concern the training pattern, shaded ones the control sample. The results concerning the approximate Bayesian classification (denoted by circles) are presented for comparison. Moreover, the results of training and testing of the network after the first stage of the correction procedure, concerning a ratio P(ω1) = P(ω2) = 50% (denoted by triangles), as well as curves concerning Bayesian classification, are presented. Some deviations from theory are caused by insufficiently well carried out training at the first stage of the correction procedure, which characterizes the accuracy of the method.

5. Conclusion

The method for solving the problem of small-probability events classification was developed. It is based on the application of one type of neural network, a multilayer perceptron with a single layer of
The developed approach allows effectively use the neural network for identification of rare events, which contribution does not exceed 0.1%. This method is mostly actual in the case of small sample sizes n< 10+20. Acknowledments This work has been partly supported by the Commission of the European Community within the framework of the E U - R U S S I A Collaboration, in accordance with the project ESPRIT 21042: Computational Tools and Industrial Applications of Complexity.
Bibliography

[1] R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis, Wiley, New York, 1973.
[2] C. Peterson, T. Rögnvaldsson and L. Lönnblad: "JETNET 3.0 - A Versatile Artificial Neural Network Package", CERN-TH.7135/94, December 1993.
Cellular Automaton and Elastic Neural Network Application for Event Reconstruction in High Energy Physics

I. Kisel, E. Konotopskaya, V. Kovalenko
Joint Institute for Nuclear Research, Dubna, 141980 Russia
Abstract

We use a cellular automaton for filtering data and an elastic net for geometrical reconstruction of events in high energy physics. The advantages of the methods are the simplicity of the algorithms, fast and stable convergence, and a reconstruction efficiency close to 100%. These methods were tested with success on simulated events and real data obtained in the experiments NEMO (Modane, France) and MI0 (PSI, Switzerland).
1 Introduction
The rapid development during the last 10-15 years of various theories of artificial neural networks [1] was a reflection of an attempt to overcome the gulf between the huge amount of factual material relating to the biological mechanisms of brain operation accumulated in neurophysiology and the inadequate existing mathematical formalism and computational means for technical realization of that formalism. The principal advantages of the brain in fulfilling logical, recognition, and computational functions, using capabilities that are essentially parallel, nonlinear, and nonlocal, did not match the prevailing principle of sequential calculations with orientation of the mathematical formalism toward locality, linearity, and stationarity of the descriptions. Included among these are problems whose solution is complicated precisely by nonlinearity, nonlocality, discreteness, and, often, nonstationarity of the situation: for instance, problems of pattern recognition, construction of associative memory, and optimization. Essentially, the theory of artificial neural networks is part of the general theory of dynamical systems, in which particular attention is devoted to the investigation of the complicated collective behavior of a very large number of comparatively simple logical objects. Having significance in their own right, cellular automata can be regarded as a local discrete form of neural networks. They are used in high energy physics particularly for data filtering and track searching. Here we describe an application of a cellular automaton for searching for tracks and an elastic neural net for fitting tracks in the NEMO experiment [2], and for searching for a vertex in the MI0 experiment [3].
2 NEMO experiment
The goal of the NEMO collaboration¹ is to study ββ decays of ¹⁰⁰Mo and other nuclei to probe the effective Majorana neutrino mass down to 0.1 eV. The collaboration is building the NEMO-3 detector to realize this. A prototype detector, NEMO-2, designed for ββ studies, has already provided some measurements and is presently running in the Fréjus Underground Laboratory. The detector is a 1 m³ volume made of tracking chambers composed of drift cells operating in the Geiger mode and two arrays of plastic scintillators for energy and time-of-flight measurements. A typical event in this experiment has a small number of tracks, usually well separated in space. But this situation is complicated by the essential effect of multiple scattering and even hard scattering on wires.

2.1 Cellular automaton for track searching
Searching for tracks in the presence of the left-right ambiguity of drift tubes and significant effects of multiple scattering in the gas, and even hard scattering on wires, is a task lying outside the typical problems of event reconstruction in high energy physics. Therefore the method of cellular automata was chosen, being a flexible one that has recommended itself well in nonstandard situations. Cellular automata are dynamical systems that evolve in discrete, usually two-dimensional, spaces consisting of cells. Each cell can take several values; in the simplest case one has a single-bit cell: 0 and 1.

¹http://nuweb.jinr.dubna.su/LNP/NEMO

The laws of
evolution are local, i.e., the dynamics of the system is determined by an unchanging set of rules (for example, a table) in accordance with which the new state of a cell is calculated on the basis of the states of the nearest neighbours surrounding it. It is important that this change of states is made simultaneously and in parallel, with time proceeding discretely. Cellular automata became particularly popular in the 1970s through the publication by M. Gardner in Scientific American devoted to Conway's game, Life. Specific features of the experiment make preferable the segment model of a cellular automaton, where the elementary unit (cell) is a segment connecting two fired wires in neighbouring layers. To construct a cellular automaton for track searching in NEMO-2 data, one proceeds following the logic of cellular automata. First, note that the cellular automaton is three-dimensional. A cell is identified with a straight-line segment connecting two fired wires in neighbouring drift tube layers, making the cellular automaton essentially local. To take into account Geiger tube inefficiencies one must also include segments connecting wires which skip one layer. At each step an individual cell has two possible states: 1, if a segment can be a part of a track, and 0 otherwise. Second, in establishing the criterion for assigning segments to a track, it is obvious that only segments with a common extremity can be considered as neighbours. Owing to the discrete structure of the coordinate detectors and to multiple scattering in the material of the experimental apparatus, the angles between track segments in the real experiment are not zero, but an upper limit φ_max can be imposed. Third, all cells are initialized with state 1, and at each step of the evolution they look at their neighbours and change state to 0 if there are no neighbours with state 1 on both sides. Neighbouring segments forming a small angle are preferred during the evolution.
Fourth, the definition of time is as usual: it proceeds discretely, and all cells change their states simultaneously. Compared with the previous track finder, the cellular automaton has the following features: a 9% increase in tracking efficiency; a factor of 35 increase in processing speed; operation in 3D space; good reconstruction of tracks with hard scattering on wires; reconstruction of short tracks; and simplicity of modification.
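The survival rule described above can be sketched as a toy two-dimensional version. The segment representation, the neighbor definitions, and the angle cut phi_max are illustrative assumptions; hits are taken as (layer, position) points, not the actual NEMO-2 geometry.

```python
import math

def seg_angle(s, t):
    """Angle between the direction vectors of two segments sharing an endpoint."""
    v1 = (s[1][0] - s[0][0], s[1][1] - s[0][1])
    v2 = (t[1][0] - t[0][0], t[1][1] - t[0][1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n = math.hypot(*v1) * math.hypot(*v2)
    return math.acos(max(-1.0, min(1.0, dot / n)))

def evolve(segments, phi_max=0.3, max_steps=10):
    """Segment-model cellular automaton: a cell dies when a neighboring
    segment exists on some side but no live neighbor forms a small angle."""
    state = {s: 1 for s in segments}
    for _ in range(max_steps):
        new = {}
        for s in segments:
            left_cand = [t for t in segments if t[1] == s[0]]
            right_cand = [t for t in segments if t[0] == s[1]]
            ok_left = (not left_cand) or any(
                state[t] and seg_angle(t, s) < phi_max for t in left_cand)
            ok_right = (not right_cand) or any(
                state[t] and seg_angle(s, t) < phi_max for t in right_cand)
            new[s] = 1 if (ok_left and ok_right) else 0
        if new == state:
            break
        state = new  # synchronous update: all cells change at once
    return [s for s in segments if state[s]]
```

After a few synchronous steps, isolated or badly bent segments die out and only chains of well-aligned segments survive.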
2.2 Elastic net for tracking
Let us consider only single straight tracks outside the magnetic field. We assume there is no noise, which removes the need for track searching, and no missing wires, which slightly simplifies the algorithm. It is natural to define a track with multiple scattering as the smoothest line touching all circles around fired wires and crossing all layers. Let us try to construct the elastic net as a line which is deformed under the influence of two types of force: 1. the first pulls it to the edges of the circles; 2. the second smooths out the line. In the case of the left-right ambiguity of drift tubes, the task can be considered as the minimization of a function of many variables with many local minima. To solve this problem we propose to start from two points surrounding the global minimum and covering the whole physical region of the parameters to be found. These points are not independent but attract each other, so they pull each other out of all local minima until the global minimum is reached. According to this idea we start from two bounding tracks which restrict the geometrical area of the real track. One of them touches the circles at their upper sides, the other at their lower sides. Then we introduce a third type of force: 3. attraction between these bounding tracks, which squeezes the geometrical area onto the real track. This method allows one to find an optimal trajectory corresponding to our model of a track. The elastic net can easily be modified to reconstruct broken tracks: we have only to switch off the track smoothing at the break point, which has to be found during a preliminary analysis. An example of the evolution of the elastic net for a multiply scattered track is presented in Fig. 1. Layers are numbered from left to right and iterations go from top to bottom. There are two starting tracks, the upper and the lower one. One can see smoothing of the upper track at the first layer after the first iteration; then this track becomes almost linear at the left group of layers and is stable, being attracted to the neighboring edges of the circles. It is in a local minimum and moves down only under the pressure of the third type of force, the attraction to the lower track.
The middle part of the upper track is smoothed at the beginning and then evolves mainly due to the attraction to the lower track. The right part of both tracks is in an equilibrium state in the middle of the evolution and reaches the global minimum only due to smoothing.
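The three forces can be sketched in a minimal one-dimensional version, assuming one fired wire per layer and a single transverse coordinate per layer point. The gains k_edge, k_smooth, k_attr and the fixed iteration count are illustrative choices, not parameters from the paper.

```python
import numpy as np

def relax(wires, radii, n_iter=200, k_edge=0.5, k_smooth=0.3, k_attr=0.05):
    """Toy elastic net: wires[i], radii[i] give the anode position and
    drift radius in layer i; returns the converged common trajectory."""
    up = wires + radii          # upper bounding track touches circles from above
    dn = wires - radii          # lower bounding track touches circles from below
    for _ in range(n_iter):
        for y, sign in ((up, +1), (dn, -1)):
            edge = wires + sign * radii                         # circle edge on this side
            y += k_edge * (edge - y)                            # force 1: pull to edges
            y[1:-1] += k_smooth * (y[:-2] - 2*y[1:-1] + y[2:])  # force 2: smoothing
        mid = 0.5 * (up + dn)
        up += k_attr * (mid - up)   # force 3: mutual attraction of the
        dn += k_attr * (mid - dn)   # bounding tracks squeezes the corridor
    return 0.5 * (up + dn)
```

By symmetry of the two bounding tracks, the returned mean trajectory stays inside the drift-radius corridor around the true track.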
Figure 1: Example of the elastic net evolution for a track with multiple scattering.
Figure 2: Number of iterations needed for convergence.
The elastic net method converges quickly to the track within a few iterations (see Fig. 2) and has a reconstruction efficiency close to 100%.
3 MM̄ experiment
This is a new experimental search for the lepton-number-violating process M = (μ+, e-) → M̄ = (μ-, e+), a spontaneous conversion of muonium (M) into antimuonium (M̄). This process is forbidden in the standard electroweak theory but allowed in some modern theories beyond the standard model. The experiment is performed at the proton cyclotron of the Paul Scherrer Institute (PSI) in Villigen, Switzerland. The detector consists mainly of two parts. The first is a magnetic spectrometer with a large solid angle, consisting of five cylindrical multiwire proportional chambers and one scintillator hodoscope built up of 64 strips surrounding the chambers. The target is located at the center of the detector. The second part, the positron detector, consists of a position-sensitive microchannel plate and 12 segmented CsI crystals surrounding it.
3.1 Elastic net for vertex search
We use the elastic net for vertex search in the case of arc tracks. This kind of task appears in many experiments, so the algorithm can be applied widely. The main problem here is caused by the large target (~10 cm in diameter) used in the experiment, so we have no good initial approximation for the least squares method usually applied to such tasks. Another feature of the experiment is a varying number of tracks per event, up to 10, so the algorithm must search for the vertex in events with any number of tracks. Let us define the vertex as the geometrical point with the maximum density of tracks. We construct the algorithm on the basis of an elastic ring, introducing only two types of force: 1. attraction of the ring to all tracks, which places it at the condensation area of the tracks; 2. attraction to the nearest tracks, which localizes the vertex region. An example of testing the algorithm on simulated events is presented in Fig. 3. Here one sees 3 tracks crossing the large target. The iterational procedure is also shown in the picture by circles converging to the vertex. A comparison of the elastic net method with the fast vertex search method [4] based on the Chebyshev metric was also made. The good correlation between the vertex errors of the elastic net method and of the Chebyshev-metric method (Fig. 4) shows that the method works reliably.
Figure 3: Vertex search in a 3-track event. The iterational procedure is shown by circles converging to the vertex.
Figure 4: Correlation between the vertex errors of the elastic net method and of the method based on the Chebyshev metric.

4 Conclusion
The results of testing on simulated and real NEMO tracks and on simulated and real MM̄ events demonstrate the reliable operation of the cellular automaton and the elastic net methods. The advantages of the methods are: simplicity of the algorithms; fast and stable convergence; high reconstruction efficiency. This work is partially supported by the Commission of the European Community within the framework of the EU-RUSSIA Collaboration under the ESPRIT contract P9282-ACTCS.
References
[1] I. Kisel, V. Neskoromnyi and G. Ososkov, Applications of Neural Networks in Experimental Physics. Phys. Part. Nucl. 24 (6), November-December 1993, p. 657.
[2] R. Arnold et al. (NEMO Collaboration), Performance of a Prototype Tracking Detector for Double Beta Decay Measurements. Nucl. Instr. and Meth. A354 (1995) 338.
[3] W. Bertl et al. (MM̄ Collaboration), Searching for Muonium-Antimuonium Oscillations. Proc. IV Int. Symp. on Weak and Electromagnetic Interactions in Nuclei (WEIN'95), Osaka, ed. H. Ejiri et al., World Scientific Publ., Nov. 1995.
[4] N. Chernov, A. Glazov, I. Kisel, E. Konotopskaya, S. Korenchenko and G. Ososkov, Track and Vertex Reconstruction in Discrete Detectors Using Chebyshev Metrics. Comp. Phys. Commun. 74 (1993) 217.
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
RECOGNITION OF TRACKS DETECTED BY DRIFT TUBES IN A MAGNETIC FIELD
Baginyan S.A., Ososkov G.A.
Joint Institute for Nuclear Research, Dubna, Russia.
Abstract
An algorithm of track recognition in a uniform magnetic field is proposed for a drift straw tube detecting system of solenoidal geometry. The problem is solved in the (x,y) plane perpendicular to the magnetic field. Our algorithm is elaborated on the basis of (1) the sequential histogramming method, which is in fact a modification of the Hough transform, and (2) a modification of the deformable template method followed by a special procedure of parameter correction. Tested on simulated events, our method shows satisfactory efficiency and accuracy in the determination of particle momenta.
Introduction
The efficiency of a track reconstruction algorithm depends on the suitability of the clustering method applied to group measured data points into track candidates. As examples of such reasonable algorithms one can point out well-known methods like variable-slope histogramming or stringing (track following) methods [1], as well as relatively new approaches like Hopfield neural networks [2]. One of the detector systems widely used in modern high energy physics experiments (ATLAS, EVA/E850) is the drift straw tube detector (DSTD). Each time a passing particle track hits a tube, the tube registers two data: its own center coordinate and the drift radius, i.e. the drift distance between the particle track and the anode wire situated in the center of this tube. The main problem, which hinders applications of the above-mentioned conventional track recognition methods, is the so-called left-right ambiguity of the drift radii: they do not contain the information about which side of the anode wire the track passed. In this report an algorithm of track recognition in a uniform magnetic field is proposed for a DSTD system of solenoidal geometry. The problem is solved in the (x,y) plane perpendicular to the magnetic field and to the anodes of the drift straw tubes. Our algorithm is elaborated on the basis of modifications of the Hough transform and deformable template methods. However, the main features of the proposed algorithm are of a general character and are independent of the experimental setup geometry.
2 Formulation of the Problem
The set S = {x_i, y_i; r_i, i = 1,...,N}, where (x_i, y_i) are the coordinates of the hit tube centers and r_i are the drift radii, is the result of the event measurements. Geometrically the set S can be considered as a set of circles on the plane with centers (x_i, y_i) and radii r_i. Thus the mathematical formulation of the problem is to draw the track line as a circle (a, b, R) tangential to the maximum number of these little circles from S. Let us introduce, as a measure of the tangency of two circles on the plane, the minimum distance between the crossing points of these circles with the straight line joining the centers of both circles. If two circles are tangential, their tangency measure is obviously equal to zero. Then our problem can be reformulated as follows: to find such a circle
(a, b, R) that minimizes the sum of its tangency measures with all circles from the set S. Let us denote by D_i(a, b, R) the distance from the center of the circle (x_i, y_i; r_i) to the circle (a, b, R). This variable can take both positive and negative values. Therefore the squared tangency measure of the two circles (x_i, y_i; r_i) and (a, b; R) is twofold: if D_i(a, b; R) > 0, then d_i^- = (D_i(a, b; R) - r_i)^2, otherwise d_i^+ = (D_i(a, b; R) + r_i)^2. As in [3] we define the two-dimensional vector s_i = (s_i^+, s_i^-) with admissible values (1, 0), (0, 1), (0, 0). Here s_i = (0, 0) means that the i-th tube is a noise tube, and the combination s_i = (1, 1) is forbidden. Let us denote by Δ the measurement error of the drift radius and define a functional L depending on the five parameters (a, b, R, s_i^-, s_i^+):

L = Σ_{i=1}^{N} [ s_i^+ d_i^+ + s_i^- d_i^- + Δ^2 (1 - s_i^+ - s_i^-) ]     (1)
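The tangency measures and the functional above can be sketched numerically. This is a minimal sketch: the tube list, the assignment vectors s_i and the value of Δ are caller-supplied assumptions for illustration.

```python
import math

def tangency_terms(circle, tube):
    """Return (d_minus, d_plus) for a drift tube (x, y, r) and a track circle (a, b, R)."""
    a, b, R = circle
    x, y, r = tube
    D = math.hypot(x - a, y - b) - R   # signed distance from tube centre to the circle
    return (D - r) ** 2, (D + r) ** 2

def functional_L(circle, tubes, assignments, delta):
    """L = sum_i [ s+ d+ + s- d- + delta^2 (1 - s+ - s-) ] over all tubes."""
    total = 0.0
    for tube, (sp, sm) in zip(tubes, assignments):
        dm, dp = tangency_terms(circle, tube)
        total += sp * dp + sm * dm + delta ** 2 * (1 - sp - sm)
    return total
```

A tube tangent to the track from outside gives d_i^- = 0, so assigning it s_i = (0, 1) contributes nothing to L; a noise tube assigned (0, 0) contributes the fixed penalty Δ².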
Thus to recognize a track one has to: (1) from the set of all measurements extract a subset S which contains, as far as possible, all the data for one of the tracks; (2) find the global minimum of L (although it would be enough to reach its close vicinity). To solve the first problem we modify the Hough transform method [4], which, following [5], we call the method of sequential histogramming by parameters (SHPM). Besides extracting a subset S, SHPM also provides the starting values of the circle (a_0, b_0; R_0) needed to solve the problem at the next step. The second problem is solved by the deformable template method (DTM) with a special correction of the parameters of the obtained tracks.
3 Sequential histogramming method
Let Ω = {X_i, Y_i, i = 1,...,N} be a set of coordinates X_i, Y_i measured in the process of registering an event. The so-called sequential histogramming method [5] gives the following algorithm for finding the initial track parameters: 1. Circles are drawn through all admissible point triplets. Then the first coordinate a_j of each circle is histogrammed, and the value a_m corresponding to the maximum of this histogram is obtained. 2. With a_m fixed, circles are drawn through all admissible pairs of points from Ω. Then the second coordinate b_j of each circle is histogrammed, and the value b_m corresponding to the maximum of this second histogram is obtained.
3. With the coordinates of the center a_m, b_m fixed, all admissible radii R_j of the set are histogrammed, and the value R_m corresponding to the maximum of this third histogram is obtained. The obtained parameters (a_m, b_m; R_m) are then subjected to more sophisticated tests and refinements. If the results are positive, i.e. the parameters (a_m, b_m; R_m) are accepted as a true track, all measurements corresponding to it are eliminated from the set Ω and the whole procedure is repeated starting from step 1. If the circle (a_m, b_m; R_m) is rejected by the testing, the next combination of parameters is selected. In order to apply SHPM the results of the measurements must have the format of the Ω-set, i.e. be a set of track point coordinates. However, we have instead the set S of little circles {x_i, y_i; r_i, i = 1,...,N}, so we have to determine on each of these circles a point associated
with one of the tracks. Supposing the vertex area, from which all tracks of the given event emanate, is known, one can roughly determine such a point as the tangent point of the tangent line drawn to each little circle (x_i, y_i; r_i) from the center of the vertex area. So we have two possible track points per circle. This does not prevent us from applying the SHPM, but it should be kept in mind that the left-right uncertainty doubles the number of elements of the set Ω = {X_i, Y_i, i = 1,...,2N} in comparison with the number of elements in the original set S = {x_i, y_i; r_i, i = 1,...,N}.
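The three histogramming steps can be sketched as follows. This is an illustrative version only: it assumes plain track points (no left-right doubling), and the bin count and the circumcircle helper are choices made here, not prescribed by the paper.

```python
import numpy as np
from itertools import combinations

def circumcenter(p, q, r):
    """Centre of the circle through three points (None if collinear)."""
    (x1, y1), (x2, y2), (x3, y3) = p, q, r
    d = 2 * (x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2))
    if abs(d) < 1e-12:
        return None
    ux = ((x1**2 + y1**2) * (y2 - y3) + (x2**2 + y2**2) * (y3 - y1)
          + (x3**2 + y3**2) * (y1 - y2)) / d
    uy = ((x1**2 + y1**2) * (x3 - x2) + (x2**2 + y2**2) * (x1 - x3)
          + (x3**2 + y3**2) * (x2 - x1)) / d
    return ux, uy

def hist_mode(vals, nbins=50):
    h, edges = np.histogram(vals, bins=nbins)
    k = int(np.argmax(h))
    return 0.5 * (edges[k] + edges[k + 1])

def shpm(points, nbins=50):
    """Sequential histogramming: first a, then b with a fixed, then R."""
    pts = [tuple(p) for p in points]
    a_m = hist_mode([c[0] for t in combinations(pts, 3)
                     if (c := circumcenter(*t)) is not None], nbins)
    # with a fixed, requiring the centre (a_m, b) to be equidistant
    # from a point pair determines b
    bs = [((x1 - a_m)**2 + y1**2 - (x2 - a_m)**2 - y2**2) / (2 * (y1 - y2))
          for (x1, y1), (x2, y2) in combinations(pts, 2) if abs(y1 - y2) > 1e-12]
    b_m = hist_mode(bs, nbins)
    R_m = hist_mode([np.hypot(x - a_m, y - b_m) for x, y in pts], nbins)
    return a_m, b_m, R_m
```

Each stage collapses one parameter dimension, which is the point of the method: three 1D histograms instead of one expensive 3D accumulator.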
4 Deformable template method
After obtaining by SHPM the initial values of the track parameters and choosing an area where this track could lie, we proceed to look for the global minimum of the functional L (1). One of the main problems here is how to avoid local minima of L provoked by the stepwise character of the behaviour of the vector s_i = (s_i^+, s_i^-). One known way to avoid this obstacle is the standard mean field theory (MFT) approach, which leads to the simulated annealing schedule [6]. As shown in [3], the parameters s_i^+ and s_i^- of the functional L with fixed (a, b; R) can be calculated by formulae in which the stepwise behaviour of the vector s_i is in fact replaced by a sigmoidal one. The global minimum of L is calculated according to the following scheme: 1. Three temperature values are taken: high, middle and a temperature in the vicinity of zero, as well as three noise levels corresponding to them [3, 6]. 2. According to the simulated annealing schedule our scheme starts from the high temperature. With the initial circle values (a_0, b_0; R_0) the parameters s_i^+, s_i^- are calculated. 3. For the obtained s_i^+, s_i^- new circle parameters a, b; R are calculated by the standard gradient descent method. 4. The ending rule is standard. 5. If the conditions of step 4 are not satisfied, then with the new circle parameters (a_{k+1}, b_{k+1}, R_{k+1}) the next values of s_i^+, s_i^- are calculated and we go back to step 3. 6. After the process converges at the given temperature, the temperature is changed (the system is cooled), the values of (a, b, R) achieved at the previous temperature are taken as starting values, and we go to step 2 again. 7. At each temperature value, after completing step 5, the condition L < L_cut is tested. If it is satisfied, our scheme is completed and the algorithm proceeds to the next stage of correcting the obtained track parameters (a, b, R). Otherwise, if at the temperature in the vicinity of zero we obtain L > L_cut, a diagnostic is issued that the track finding scheme has failed.
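The MFT replacement of the stepwise s_i by sigmoidal values can be sketched as a Potts-style soft assignment for one tube. This follows the general elastic-arms recipe of the mean field approach rather than a formula quoted verbatim in the text; the penalty lam (playing the role of the Δ²-related noise level) is an assumed parameter.

```python
import math

def soft_assign(d_plus, d_minus, lam, T):
    """Mean-field assignment for one tube at temperature T.
    Returns (s_plus, s_minus); the implicit third state (noise)
    gets the remaining probability mass via its penalty lam."""
    wp = math.exp(-d_plus / T)
    wm = math.exp(-d_minus / T)
    w0 = math.exp(-lam / T)      # Boltzmann weight of the noise state s = (0, 0)
    z = wp + wm + w0
    return wp / z, wm / z
```

At high temperature all three states are nearly equally weighted, so the fit explores freely; as T is lowered toward zero the assignment hardens into the stepwise (1,0)/(0,1)/(0,0) decision, which is exactly what the annealing schedule exploits.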
5 Procedure of the track parameter correction
The deformable template method provides us with the track parameters (a, b; R). However, these parameters could turn out to be rather far from the global minimum of L. Therefore we have to elaborate an extra stage for the track parameter correction.
On each circle of the set S = {x_i, y_i; r_i, i = 1,...,N}, taking into account the corresponding values of s_i, the point nearest to the track candidate is found. Then all these points are approximated by a circle and a χ² value is calculated as a criterion of their smoothness and fit quality. If χ² < χ²_cut holds, the approximating parameters (a_c, b_c; R_c) are accepted as true. Otherwise the track candidate is rejected.
6 Results
The proposed algorithm for finding tracks detected by a DSTD system in a magnetic field was tested on simulated events. 990 tracks were modelled as circle arcs with radii in the range from 40 cm to 5 m, emanating from a target under various angles. 955 tracks out of 990 were recognized correctly, which means an algorithm efficiency of 96.4%. The distribution of the relative radius error shows that its mean and RMS are of the order of 10^-2 of the radius, which is satisfactory for the considered experimental setup.
References
[1] H. Grote, Pattern Recognition in High Energy Physics, CERN 81-03, 1981.
[2] C. Peterson, B. Söderberg, Int. J. of Neural Syst. 1, 3 (1989).
[3] S. Baginyan et al., Application of Deformable Templates for Recognizing Tracks Detected with High Pressure Drift Tubes, JINR Commun. E10-94-328, Dubna, 1994.
[4] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, Wiley, New York, 1973.
[5] Yu. A. Yatsunenko, Nucl. Instr. and Meth. A287 (1990) 422-430.
[6] M. Ohlsson et al., Track Finding with Deformable Templates - The Elastic Arms Approach, LU TP 91-27.
Session D: ADAPTIVE SYSTEMS I: IDENTIFICATION AND MODELING
A UNIFIED CONNECTIVE REPRESENTATION FOR LINEAR AND NONLINEAR DISCRETE-TIME SYSTEM IDENTIFICATION
Jacques FANTINI
L.A.M.A., Université d'Orléans, E.S.E.M., rue Léonard de Vinci, 45072 ORLEANS Cedex 2
Abstract
System identification is the subject of much research, and many articles propose often complex computation algorithms. Moreover, the models and identification methods used differ according to whether the real system is linear or nonlinear. This paper presents an identification methodology based on a single model deduced from neural network theory. It measures and analyses the degree of precision obtained, and defines the parameters that influence the convergence of the network errors.
1 INTRODUCTION
In order to regulate and control a real system, a mathematical representation is required which provides a satisfactory estimation of the process and which can be obtained by identification. The particularity of nonlinear systems lies in the fact that the principle of superposition is not applicable. Thus the identification algorithms currently proposed are based either on approximation principles or on methods which cannot be generalized.
2 DISCRETE-TIME REPRESENTATION AND CONNECTIVE MODELISATION
2.1 Linear systems
Let [Y] = [F][U] be the representation by transfer matrix of a multivariable system, defined as controllable and observable, with [Y] the output vector of dimension ny, [U] the command vector of dimension nu, and
f_ij = (Σ_{l=1}^{m} b_l z^{-l}) / (Σ_{l=0}^{n} a_l z^{-l}), m ≤ n, the element ij of the matrix F (ny × nu). The output y^i is linked to the command u^j by the following characteristic polynomial of degree n:

y^i_k = Σ_{l=1}^{n} a_l y^i_{k-l} + Σ_{l=1}^{m} b_l u^j_{k-l}     (1)

where y^i_r and u^j_r are the values of y^i and u^j in the time range [rΔ, (r+1)Δ[, with Δ the sample period. The determination of all the coefficients a_l and b_l of [F] and of the degree n of each characteristic polynomial is a necessary and sufficient condition for defining a satisfactory representation of the system. Consider the neural network of figure 2.1, with p = 2n, q = 2m, and the transformation of variables (T)

Y^i_r = (1/N) exp(y^i_r) / (exp(y^i_r) + 1),   U^j_r = (1/N) exp(u^j_r) / (exp(u^j_r) + 1),

bijections of the set of real numbers into [0, 1/N[. The activation function of each neuron is f_act(x) = sh(x), an odd, increasing, differentiable function with the property

f_act(x) = Σ_{n≥0} x^{2n+1} / (2n+1)! = x + xε,  with ε → 0 when x → 0.

The output expression of neuron r is O_r = f(I_r) = sh(w_{r,t} E^i_{k-t} + w_{r,t+1} E^i_{k-t+1}). Given hypothesis (H1): sh(x) ≈ x, whose conditions and limits of validity are set out in section 4,

O_r = w_{r,t} E^i_{k-t} + w_{r,t+1} E^i_{k-t+1}     (2)

so that the network output is

y^i_k = Σ_{l=1}^{n} hy_l (wy_{l,2l-1} Y^i_{k-(2l-1)} + wy_{l,2l} Y^i_{k-2l}) + Σ_{l=1}^{m} hu_l (wu_{l,2l-1} U^j_{k-(2l-1)} + wu_{l,2l} U^j_{k-2l}).

From the identity (1) = (2) it follows that:

a_r = hy_{δ(r)} wy_{δ(r),r} Y^i_{k-r} / y^i_{k-r},   b_r = hu_{δ(r)} wu_{δ(r),r} U^j_{k-r} / u^j_{k-r},   r ∈ [1, m],  δ(r) = whole part of (r+1)/2.
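The characteristic polynomial (1) can be simulated directly once its coefficients are known. A minimal sketch, with a holding a_1,...,a_n and b holding b_1,...,b_m (zero initial conditions and m ≤ n assumed):

```python
def simulate(a, b, u):
    """Simulate y_k = sum_l a[l] y_{k-l} + sum_l b[l] u_{k-l} for a command sequence u.
    a[0] stands for a_1, b[0] for b_1; the first len(a) outputs are taken as zero."""
    n, m = len(a), len(b)
    y = [0.0] * n
    for k in range(n, len(u)):
        y.append(sum(a[l] * y[k - 1 - l] for l in range(n)) +
                 sum(b[l] * u[k - 1 - l] for l in range(m)))
    return y
```

Such a recursion is what the trained network has to reproduce: identifying the system amounts to recovering the a_l and b_l from input/output samples.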
2.2 Nonlinear systems
Let e(t) = e_0 sin(ωt) be applied to the input e of a nonlinear system. The output y is a non-sinusoidal periodic function decomposable into a Fourier series y(t) = Σ_{i=1}^{∞} s_i sin(iωt + ψ_i). Its discrete-time representation is:
S(z^-1) = Σ_{i=1}^{∞} s_i [sin ψ_i + z^-1 sin(iωT - ψ_i)] / (1 - 2z^-1 cos(iωT) + z^-2),  and  E(z^-1) = e_0 z^-1 sin(ωT) / (1 - 2z^-1 cos(ωT) + z^-2).

figure 2.1: connective network for a linear system (input layer, hidden layer, output).

The transfer function F(z^-1) = S(z^-1)/E(z^-1) of the nonlinear system becomes

F(z^-1) = Σ_{i=1}^{∞} (1/(e_0 sin ωT)) (Σ_{j=1} b_{j,i} z^-j) / (Σ_{j=0} a_j z^-j).

The output y is linked to the command e by the following characteristic polynomial:

y_k = Σ_i (1/(e_0 sin ωT)) Σ_j b_{j,i} e_{k-j} + Σ_j a_j y_{k-j}     (3)

For a multivariable system, each output y^i is linked to the input e^j by the same polynomial statement.

figure 2.2: connective network for a nonlinear system (input layer, hidden layers, output).

Given the neural network in figure 2.2, with the same definitions as stated in 2.1, the output expression of neuron r is O_r = f(I_r) = sh(w_{r,t} E^i_{k-t} + w_{r,t+1} E^i_{k-t+1}). Hypothesis (H1) implies

O_r = w_{r,t} E^i_{k-t} + w_{r,t+1} E^i_{k-t+1}     (4)

From the identity (3) = (4) it follows that:

a_r = hy_{δ(r)} wy_{δ(r),r} Y^i_{k-r} / y^i_{k-r},   b_r = e_0 sin(ωT) he_{δ(r)} we_{δ(r),r} E^i_{k-r} / e^i_{k-r}.
2.3 Identification methodology
The set of data y^i_k, the discrete-time responses of a system subjected to the commands u^j_k (e^j_k), is known and defines the information vectors of the neural network. Thus, for all k > 0 and bounded, the training sample is defined by the s couples (X^i_k, y^i_k), with X^i_k = {Y^i_l, E^j_l, l ∈ [k-1, k-m]} the input vector of the network and y^i_k the output.
The weights wy_{l,r}, wu_{m,r}, we_{l,r}, hy_l, hu_l and he_l determined by the training stage allow direct calculation of the coefficients of the characteristic polynomials. However, the quality of the identification performed will depend respectively on the behaviour of the learning error and on the incidence of the error generated by the approximation hypothesis.
3 PROPAGATION OF THE APPROXIMATION ERROR
3.1 Expression of the approximation error of the activation function
f_act(x) = Σ_{n≥0} x^{2n+1} / (2n+1)! = x + xε, with ε < exp(x) - 1 and ε → 0 for x → 0, x being any variable treated by the neural network. The application of Y_k and E_k to the inputs of the neural network makes it possible to reduce the numeric values of the variables of the system without modifying the identification results, through the transformation (T), with N a scale factor defining the adjustment parameters of all variables x of the network, and ε sufficiently weak to result in a satisfactory identification of the system.
3.2 Expression of the approximation error in feedforward propagation
For a neuron j of the hidden layer: I_j = w_{j,r} E_{k-r} + w_{j,r+1} E_{k-(r+1)}, and O_j = I_j + I_j ε_j, where ω_j = I_j ε_j is the approximation error on O_j, with ε_j < exp(I_j) - 1. For the output neuron: I = Σ_{j=1}^{p+q} w_j O_j = Σ_{j=1}^{p+q} w_j (I_j + I_j ε_j) and S^i_k = I + Iε with ε < exp(I) - 1, so

|Iε| < Σ_{j=1}^{p+q} w_j I_j (exp(I_j) - 1),  with L → 0 when I_j → 0,

the approximation error on S^i_k.
Therefore, the approximation error propagated through the neural network to the output S^i_k tends towards 0 for variables of small dimensions.
3.3 Expression of the approximation error in back-propagation
The back-propagation algorithm results in the following equation modifying the weights:

w_j(r) = w_j(r-1) - 2ε(r)(S^i - y^i) f'_act(I) O_j,

with ε(r) the gradient step at stage r, S^i the output calculated by the network, y^i the desired output, f'_act the derivative of the activation function, and I = Σ_{j=1}^{p+q} w_j O_j. Given that the approximations obtained are respectively O_j = I_j and I = Σ_{j=1}^{p+q} w_j I_j, the approximation error on the update of w_j becomes:

δw_j = (I_j + I_j ε_j) ch(W + V) - I_j ch(W),  with W = Σ_{j=1}^{p+q} w_j I_j and V = Σ_{j=1}^{p+q} w_j I_j ε_j, hence
|δw_j| < I_j exp(W)[exp(L) - 1],  with L → 0 when I_j → 0.

Therefore, the approximation error propagated during the training stage, for the updating of the weights w_j, tends towards 0 for variables of small dimensions.
3.4 Generalization
Hypothesis H1, the approximation of the activation function, generates an error ε such that its property of tending towards 0 is maintained both during feedforward propagation and during the training stage. Thus, the error generated is linked to a controlled condition: the size of the values y_k and u_k treated. The following section analyses the evolution and the incidence of this error on the identification of a real system.
4 APPLICATION AND CONCLUSION
4.1 Presentation
Identification of a real system by neural networks is the search for the coefficients a_j and b_{j,i} of the polynomial

y_k = Σ_i (1/(e_0 sin ωT)) Σ_j b_{j,i} e_{k-j} + Σ_j a_j y_{k-j}.

The analysis of the effect of the approximation error on the precision of the identification result involves the study of two parameters: the scale factor N, and the dimension i of the network. For each of these parameters, the end condition for the training stage relies on a satisfaction index (the quadratic sum of the errors of each output neuron) or, if this is not reached, on a maximum number of iterations.
4.2 Scale factor influence
Given N ∈ [1, ∞[ and the dimension i fixed, the effect of the approximation error and the behaviour of the neural network are analysed using the satisfaction index of the training stage (with randomly initialised weights). The difference Δ between the response of the real system and that of the identified process subjected to the same commands is equal to the quadratic sum of the errors of each sample, and is studied in order to test the quality of the identification. Examination of these two parameters brings out a minimum for a specific value of the scale factor N = N_min (figure 4.1). The satisfactory result of the identification performed is confirmed by the value of Δ obtained at N_min, and by the precision of the coefficients of the identified system calculated by the network compared to those of the real process (< 1%).
4.3 Network dimension influence
Given N = N_min, the satisfaction index and the difference Δ between the real and identified process are analysed in terms of the dimension i of the network. Each parameter tends towards 0 as the dimension increases (figure 4.2); therefore the precision of the identification increases with i.
The convergence of these parameters towards 0 is linked to a growing dimension of the training vector samples, and to a richer topology in terms of neurons and network links, as the dimension i increases. However, if the dimension is too high, the model obtained will not be very easy to use, and it will be difficult to evaluate the most significant parameter of the system.
4.4 Conclusion
Identification of processes using this model produces a simple computation method which can be extended to linear and nonlinear systems, and which obtains a satisfactory representation of the system. Moreover, this method allows for the evaluation of the precision of the model according to the dimension chosen, which is linked to a growing dimension of the training vector samples and a richer topology in terms of neurons and network links as the degree n increases. Finally, the use of this computation model simplifies the identification mechanism for multivariable systems, as it represents the extension of the identification of a monovariable sub-system by duplication of neural networks.
Predicting a Chaotic Time Series Using a Dynamical Recurrent Neural Network
Roberto TERAN, Jean-Philippe DRAYE, Gustavo CALDERON, Davor PAVISIC, Gaëtan LIBERT
Parallel Information Processing Laboratory, Faculté Polytechnique de Mons, B-7000 Mons (BELGIUM)
Abstract. In this paper we present two kinds of recurrent neural networks for time series forecasting. Both associate a time constant to each neuron. The first recurrent network is able to adapt its time constants (Dynamical Recurrent Neural Network), and the other one keeps all its time constants at a fixed value (Static Recurrent Neural Network). Our results, using a chaotic time series, show that adapting the time constants decreases the training time and improves the prediction capability compared with the static recurrent neural network.
1 Introduction
Feedforward networks are able to perform many tasks well, and have several practical implementations in control, recognition problems and time series forecasting ([4], [11], [14]). However, they have their limitations for predicting nonlinear time series. Recurrent neural networks have appeared, showing better performance compared with traditional or feedforward networks ([2], [3], [7]); they are able to learn attractor dynamics, and they can store information for later use. Moreover, Dynamic Recurrent Neural Networks (DRNN) are able to enhance the capabilities of Static Recurrent Neural Networks (SRNN), especially when handling time-dependent problems or temporal tasks. The main difference between a DRNN and an SRNN is the fact that a DRNN uses an adaptive time constant associated to each neuron. These time constants act as a linear filter, and we can consider a DRNN as a FIR network (see [13]), but with recurrent connections. SRNN have limited storage capabilities and may be inappropriate for dealing with complex time series.
DRNN are well known for their ability to handle temporal processing and nonlinear problems (see [1], [2], [5], [6], [7], [9]); they can capture the underlying law governing the system through a system of nonlinear differential equations, and a chaotic system can be approximated by a set of nonlinear differential equations (see [9]). A DRNN has more parameters than an SRNN; hence, in order to implement dynamical systems with chaotic behaviour, we must train the network using a proper algorithm. So we will train such a network using a modified version of the standard backpropagation algorithm called Time Dependent Recurrent Backpropagation.
2 Network Dynamics
We consider the dynamics of our network to be governed by the following equations:
T_i dy_i/dt = -y_i + F(x_i) + I_i          (a)

and

x_i = Σ_j w_ij y_j          (b)          (1)
where y_i represents the output or activation level of the ith neuron, F is the transfer function defined by F(x) = 1/(1 + e^{-x}), x_i represents the total input to neuron i, I_i is an external dynamical input to neuron i, w_ij is the weight between neurons j and i, n is the total number of neurons, and T_i is a time constant that acts like a relaxation process. Equations 1(a) and 1(b) define a general dynamic system. Using the appropriate training procedures, the system will exhibit asymptotic behaviour (i.e. fixed points, temporal behaviour); such behaviour is desired for handling chaotic systems. Few training procedures can reach this convergence; we have chosen the Time Dependent Recurrent Backpropagation algorithm, which is explained in the next section.
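As a concrete illustration, the dynamics of Equation 1 can be integrated numerically with a forward Euler scheme. This is a minimal sketch, not the authors' implementation: the network size, random weights, unit time constants and zero external input are all illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def simulate_drnn(w, T, I, y0, dt=0.01, steps=1000):
    """Forward Euler integration of Eq. 1: T_i dy_i/dt = -y_i + F(x_i) + I_i."""
    y = y0.copy()
    traj = [y.copy()]
    for _ in range(steps):
        x = w @ y                        # Eq. 1(b): x_i = sum_j w_ij y_j
        y = y + dt * (-y + sigmoid(x) + I) / T
        traj.append(y.copy())
    return np.array(traj)

rng = np.random.default_rng(0)
n = 20                                   # 20 neurons, as in Section 5
w = rng.normal(scale=0.5, size=(n, n))   # assumed random recurrent weights
T = np.ones(n)                           # assumed unit time constants
I = np.zeros(n)                          # no external input in this sketch
traj = simulate_drnn(w, T, I, y0=rng.uniform(-0.1, 0.1, size=n))
```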
3 Time Dependent Recurrent Backpropagation
The Time Dependent Recurrent Backpropagation (TDRB) algorithm is an extension of standard backpropagation, modified to yield a powerful algorithm that can be applied to dynamical systems in an efficient way (see [10]). This algorithm allows a DRNN to learn non-fixed-point attractors, produce desired temporal behaviour, and reach a stable state quickly.
Let us consider the following cost function which measures the deviation of the actual output y(t) from the desired output d(t):
E = (1/2) ∫_{t0}^{t1} (y(t) - d(t))² dt          (2)
where the values t0 and t1 limit the time interval during which the correction process takes place. Let us now consider
e_i(t) = ∂E/∂y_i(t)          (3)
measuring the influence of an infinitesimal change of the output y_i at time t on the cost function, if everything else is left unchanged. The training algorithm is obtained by differentiating Equation 1 with respect to the various parameters. First, however, we introduce additional variables z_i defined by the equation:
dz_i/dt = (1/T_i) z_i - e_i - Σ_j (1/T_j) w_ji F'(x_j) z_j          (4)
where we use the boundary condition z_i(t1) = 0. The parameter correction process is then carried out using the following equations:
Δw_ij = -p_o ∫_{t0}^{t1} (1/T_i) y_j F'(x_i) z_i dt   and   ΔT_i = -(p_1/T_i) ∫_{t0}^{t1} z_i (dy_i/dt) dt          (5)

such that

Δw_ij = -p_o ∂E/∂w_ij   and   ΔT_i = -p_1 ∂E/∂T_i          (6)
where p_o and p_1 are constants which act as learning rates. This algorithm modifies not only the weights (Δw_ij) but also the time constants (ΔT_i) associated with each neuron. The time constants improve the memory effect of the time delays and the non-linearity effect of the sigmoid function. To speed up convergence to the desired function, we have used a momentum term and associated a learning rate with each neuron. A complete presentation of this algorithm can be found in [6].
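The forward/backward structure of TDRB can be sketched in discretized form as follows. This is a hedged illustration, not the authors' code: the Euler discretization, target signal, sizes and learning rates are all assumptions, and the momentum term and per-neuron learning rates are omitted for brevity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dsigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tdrb_update(w, T, I, y0, d, dt, p_w, p_T):
    """One discretized TDRB update: forward pass, backward adjoint pass,
    then the weight and time-constant corrections of Equations 5-6."""
    steps, n = d.shape[0], y0.size
    ys, xs = [y0.copy()], []
    y = y0.copy()
    for k in range(steps):               # forward pass (Eq. 1)
        x = w @ y
        xs.append(x)
        y = y + dt * (-y + sigmoid(x) + I) / T
        ys.append(y.copy())
    ys, xs = np.array(ys), np.array(xs)
    e = ys[1:] - d                       # e_i(t) = dE/dy_i (Eq. 3)
    z = np.zeros(n)                      # boundary condition z_i(t1) = 0
    dw, dT = np.zeros_like(w), np.zeros_like(T)
    for k in range(steps - 1, -1, -1):   # backward pass (Eq. 4)
        fp = dsigmoid(xs[k])
        dw += dt * np.outer(fp * z / T, ys[k])   # integrand of dE/dw_ij
        dT += z * (ys[k + 1] - ys[k]) / T        # integrand of dE/dT_i
        z = z - dt * (z / T - e[k] - w.T @ (fp * z / T))
    return w - p_w * dw, T - p_T * dT            # corrections of Eq. 6

rng = np.random.default_rng(0)
n, steps, dt = 5, 50, 0.05
w0 = rng.normal(scale=0.5, size=(n, n))
T0, I = np.ones(n), np.zeros(n)
d = 0.5 + 0.1 * rng.standard_normal((steps, n))  # illustrative target signal
w1, T1 = tdrb_update(w0, T0, I, np.zeros(n), d, dt, p_w=0.01, p_T=0.01)
```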
Figure 1: 1000 time points of the chaotic laser data.
4 The Data
This series was obtained in a physics laboratory experiment and shows the chaotic intensity pulsations of an NH3 laser (see [12]). The series was distributed in 1991-1992 for the Santa Fe Institute Time Series Prediction and Analysis Competition. It presents a complicated behaviour and is regarded as a low-dimensional chaotic dynamical system. It exhibits pulsating cycles, during which the amplitude of the periods grows larger, and collapsing cycles, during which the amplitude of the periods shrinks. We do not know exactly when a collapsing cycle will occur (see Figure 1).
5 Experimental Results
We have used two fully connected recurrent neural networks, each with 20 neurons, and we have associated a time constant with each neuron. We use only one neuron as input and another as output. All the neurons receive the input signal y(t), except the output neuron. We have then carried out two different experiments. In the first case, we allow the algorithm to modify the time constants of the recurrent neural network (DRNN). In the second case, we do not allow the algorithm to change the time constant values (SRNN), keeping them all constant. In both cases we use the 500 initial points of the data set as training set, and a maximum of 2000 iterations (epochs) for the training process. The training process consists of adapting the network parameters to produce a signal that approximates y(t+Δ) (the best results were obtained using Δ=6). This means that we introduce the first value of the series as input, and the desired output is the 6th value of the series; then we introduce the second value of the series, and the desired output is the 7th value, and so on. To estimate the total error value, we use the sum of errors as cost function: E_T = Σ_n E_n, where E is given by Equation 2 and n ranges over the training set patterns (500 in our case). Thus, our goal is to minimize the value of E_T. We stop the training process either when we have reached the maximum number of epochs (2000), or when the error value (E_T) is small enough to already give good predictions. Afterwards, we freeze the weights and, for the DRNN, the time constant values of the network, and we use the next 500 points of the data set as the validation set. The results are summarized in Table 1. Figure 2 compares the desired output with the predicted output of the best trained network.
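The construction of the training patterns described above (input y(t), target y(t+Δ) with Δ = 6, over the first 500 points) can be sketched as follows; the series used here is a synthetic stand-in for the laser data.

```python
import numpy as np

def make_pairs(series, delta=6, n_train=500):
    """Pair each training input series[k] with the target series[k + delta]."""
    return [(series[k], series[k + delta]) for k in range(n_train - delta)]

series = np.sin(0.1 * np.arange(600))   # synthetic stand-in for the laser data
pairs = make_pairs(series)
# the total error E_T is then the sum of the per-pattern errors E_n (Eq. 2)
```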
Note that the ideal output in both cases should be a straight line with a slope of 45°. We can see that subfigure 2(a) (DRNN) matches this straight line better than subfigure 2(b) (SRNN).
Neural Network   Averaged Error   Best Error   Averaged Epochs
DRNN             3.84             3.42         1010
SRNN             4.73             4.11         1530
Table 1: The first column gives the kind of neural network used. The second column is the averaged total error value, and the third column the minimum total error value, over 50 runs (both errors are normalized). The fourth column gives the average number of epochs over the 50 experiments before attaining convergence.
Figure 2: Actual predicted values ŷ(t + Δ) versus the desired output y(t + Δ). Note that the ideal curve is a straight line with a slope of 45°.
6 Conclusions
In this paper we have shown that adapting the time constants increases the prediction ability of a Dynamical Recurrent Neural Network (DRNN), requiring less training time before attaining convergence without losing quality in the predicted values. Even though a Static Recurrent Neural Network (SRNN), which keeps its time constants fixed, reaches good results in our tests, adapting the time constants outperforms these results. Both DRNN and SRNN can store and reuse past outputs as processing information for the network. Nevertheless, adapting the time constants improves the internal memory of the network and enhances the non-linearity of the sigmoid functions. In this way we do not lose the information produced by the network, enhancing its ability to handle temporal tasks and allowing the network to better capture the dynamics of the chaotic series. Consequently, the network identifies the chaotic behaviour of the series faster, and the quality of prediction is better.
References
[1] A. Aussem, F. Murtagh and M. Sarazin. Dynamical Recurrent Neural Networks - Towards Environmental Time Series Prediction. Technical report, European Southern Observatory, 1994.
[2] Y. Bengio, M. Gori, G. Soda and P. Frasconi. Recurrent Neural Networks for Adaptive Temporal Processing. Proc. of the 6th Italian Workshop on Parallel Architectures and Neural Networks, pages 85-117, 1993.
[3] Y. Bengio, P. Simard and P. Frasconi. Learning Long-Term Dependencies with Gradient Descent is Difficult. IEEE Trans. on Neural Networks, 5(2):157-166, 1994. (Special Issue on Dynamic and Recurrent Neural Networks.)
[4] B. Widrow, D.E. Rumelhart and M.A. Lehr. Neural networks: Applications in industry, business and science. Communications of the ACM, 37(3), March 1994.
[5] J.T. Connor and L.E. Atlas. Recurrent Neural Networks and Time Series Prediction. IEEE Transactions on Neural Networks, 1994.
[6] J.-P. Draye and G. Libert. Dynamic Recurrent Neural Networks: Theoretical aspects and optimization. Neural Network World, June 1993.
[7] J.F. Kolen. Exploring the Computational Capabilities of Recurrent Neural Networks. PhD thesis, Ohio State University, 1994.
[8] A. Medio. Chaotic Dynamics. Cambridge University Press, 1992.
[9] H. Mori and T. Ogasawara. A Recurrent Neural Network for Short-Term Load Forecasting. IEEE Transactions on Neural Networks, 1993.
[10] B.A. Pearlmutter. Dynamic Recurrent Neural Networks. PhD thesis, Carnegie Mellon University, 1990.
[11] Z. Tang. Feed-forward Neural Nets as Models for Time Series Forecasting. Technical Report TR91-008, University of Florida, 1991.
[12] U. Huebner, N.B. Abraham and C.O. Weiss. Dimensions and entropies of chaotic intensity pulsations in a single-mode far-infrared NH3 laser. Physical Review A, 1989.
[13] E.A. Wan. Finite Impulse Response Neural Networks for Autoregressive Time Series Prediction. Proceedings of the NATO Advanced Workshop on Time Series Prediction and Analysis, May 1992.
[14] E.A. Wan. Time Series Prediction by Using a Connectionist Network with Internal Delays. In Time Series Prediction: Forecasting the Future and Understanding the Past, 1993.
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
A New Neural Network Structure For Modeling Non-linear Dynamical Systems
Amir Hussain, John J. Soraghan*, Tariq S. Durrani* and Douglas R. Campbell
Department of Electronic Engineering and Physics, University of Paisley, High St., Paisley, U.K. PA1 2BE
*Signal Processing Division, University of Strathclyde, George St., Glasgow, U.K. G1 1XW
Abstract: In this paper a new two-layer linear-in-the-parameters feedforward network is presented, termed the Functionally Expanded Neural Network (FENN). Its output error surface is shown to be uni-modal, allowing high-speed single-run learning. It employs a least-squares-based learning algorithm for updating its output layer weights, thereby alleviating the non-linear learning difficulties associated with conventional multi-layered neural networks. New non-linear basis functions emulating other universal approximators, namely the sigmoidal, Gaussian and polynomial-subset basis functions, are proposed for inclusion in the network's single hidden layer. Both simulated chaotic (Mackey-Glass time series) and real-world noisy, highly non-stationary (sunspot) time series data are used to illustrate the superior modeling and prediction performance of the FENN compared with other recently reported, more complex feedforward and recurrent neural network based predictor models.
1. Introduction
Over the past decade there has been increasing interest in the use of Artificial Neural Networks (ANNs) for solving complex real-world problems [1-11]. This is mainly due to their ability to deal effectively with non-linearity, non-stationarity and non-Gaussianity [1]. The modeling and analysis of so-called chaotic processes has also recently attracted the attention of many researchers [5-12]. Deterministic chaos is characterized by an exponential divergence of nearby trajectories [8].
For the problem of time series prediction, which is synonymous with modeling the underlying physical mechanism responsible for its generation, there are two related consequences. Firstly, since the uncertainty of the prediction increases exponentially with time, chaos precludes any long-term predictability. Secondly, it allows for short-term predictability: a random-looking time series might have been produced by a deterministic system and actually be predictable in the short run. A prediction algorithm for chaotic systems thus has to capture the short-term structure of the time series [9]. The short-range structure of chaotic behaviour can be captured by expressing the present value of the chaotic time series sample as a function of the previous d values of the time series: y(k) = f(y(k-1), ..., y(k-d)), where the vector (y(k-1), ..., y(k-d)) lies in the d-dimensional state space [9]. An efficient method of fitting the non-linear function f(.) is to use a feedforward neural network predictor with a single output [6-9][11], the inputs to the network being the observation vector (y(k-1), ..., y(k-d)). In real-world chaotic time-series processes, intrinsic noise will be present, and the task of the neural network predictor will be to reconstruct f(.) without modeling the noise. Two well-known feedforward ANNs are the MLP and the RBF networks, both of which have been shown to be capable of forming an arbitrarily close approximation to any continuous non-linear mapping [3]. Consequently, both have to date been successfully employed for approximating f(.) above [6-10][12]. However, the MLP has a highly non-linear-in-the-parameters structure, and requires computationally expensive non-linear learning algorithms (such as back-propagation) which are very slow and can converge to local minimum solutions [3][5].
On the other hand, the RBF network has a linear-in-the-parameters structure, giving the relative advantages of ease of analysis and rapid learning. However, it suffers the drawback of requiring a prohibitively large number of basis functions to cover high-dimensional input spaces [4]. The topology of the RBF network can be considered very similar to that of a two-layered MLP. The primary difference between the two structures lies in the nature of their basis functions: the hidden layer nodes in the MLP employ sigmoidal basis functions (which are non-zero over an infinitely large region of the input space), whereas the basis functions in the RBF network cover only small localized regions. Hush [3] has recently shown that some problems, such as functional approximation, can be solved more efficiently with sigmoidal basis functions, while others, such as classification problems, are more amenable to localized (e.g. Gaussian) basis functions. This paper describes an interesting case which combines both these types of basis functions within a single neural network layer, so that the distinct universal approximating capabilities of both the MLP and the RBF networks can be employed. The idea is developed to yield a new linear-in-the-parameters feedforward neural network termed the Functionally Expanded Neural Network (FENN). Its output error surface is shown to be uni-modal, allowing high-speed single-run learning. A least-squares-based learning algorithm is employed for updating its output layer weights, and a general design strategy is also presented for specifying the type and number of basis functions within the network's single hidden layer, for an arbitrary number of network inputs. The new structure is shown to be highly efficient in the modeling of both simulated chaotic and real-world noisy time-series processes, and its performance is compared with other recently reported neural network models [10].
A simple pruning strategy, based on an iterative pruning-retraining scheme coupled with model validity tests, has been used to optimise the size of the new network, and is shown to result in parsimonious FENN predictor models that consistently outperform the other techniques, both in terms of non-linear prediction ability and relative computational requirements. Two simulation examples are presented, using the chaotic Mackey-Glass equation and real-world sunspot data.
2. The FENN Structure
The complete two-layer FENN is shown in Figure 1. It comprises an input functional expander within its single hidden layer, and an output layer. The FENN functional expander performs a non-linear transformation which maps the input
space onto a new non-linear hidden space of increased dimension. The choice of basis functions to be employed in the functional expander has been discussed in [14] and is outlined in the design strategy below. The output layer of the FENN comprises a set of linear combiners. It is interesting to note that the RBF network with fixed non-linear hidden layer basis functions or centres (and widths) can be regarded as a FENN. The linear-in-the-parameters Volterra Neural Network (VNN), which employs a purely polynomial expansion of its inputs [16], can also be considered a special case of the FENN, in which the number of polynomial expansion terms grows exponentially with increasing input dimension. Hussain [14] has shown, for a variety of non-linear dynamical system modeling applications, that the non-linear approximation ability of the FENN is significantly enhanced by employing a combination of non-linear basis functions emulating other universal approximators, such as the squashing-type sigmoidal, Gaussian and polynomial-subset activation functions.
2.1 Design Strategy
For any number n of FENN inputs (all normalized to within the range (+1, -1)), expand the input vector [x1(k) ... xn(k)] using the following expansion model F(k), the sum of the following (linear and non-linear) N components:
1. zero-order (dc) term (resulting in 1 term).
2. original input terms x1, ..., xn (resulting in n terms). These terms also enable the modeling of linear systems.
3. sine expansion of the n inputs, comprising sin(xi), sin(2xi) and sin(3xi) terms for i = 1, ..., n (resulting in a total of 3n terms). These terms emulate the squashing-type sigmoidal basis functions.
4. cosine expansion of the n inputs, comprising cos(xi), cos(2xi) and cos(3xi) terms for i = 1, ..., n (resulting in a total of 3n terms). These functions emulate Gaussian-like basis functions of various widths.
5. product of each input with the sine and cosine functions of the other inputs, comprising xi sin(xj) and xi cos(xj) terms (for i ≠ j; i, j = 1, ..., n), giving a total of 2n(n-1) terms. These cross terms emulate quadratic and sigmoidal-type basis functions respectively.
6. outer-product expansion of the n inputs, resulting in a total of (P_2^n + P_3^n + ... + P_{n-1}^n + 1) terms for n greater than two inputs, with P_m^n = n!/((n-m)! m!), where ! denotes factorial. (Note that for n = 2 inputs the outer-product expansion results in 1 term, and for n = 1 there are no outer-product terms.) These higher-order outer-product terms can be considered a polynomial expansion of the inputs without the q-th powers of the individual inputs [16].
Hence, in general, for n inputs the FENN functional expansion model F(k) will comprise a total of N = 1 + 2n² + 4n + Σ_{i=1}^{n} P_i^n terms; that is, for n=1, N=8; n=2, N=20; n=3, N=38; n=4, N=64, and so on. Note that the design procedure presented above provides a useful starting point.
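The design strategy above can be checked mechanically. The sketch below builds the expansion vector F(k) for an input vector of length n and verifies the term counts N = 8, 20, 38 and 64 for n = 1, ..., 4; it is an illustrative reconstruction of the listed components, not a reference implementation.

```python
import math
import numpy as np
from itertools import combinations

def fenn_expand(x):
    """Build the expansion vector F(k) for inputs x (values in (-1, +1))."""
    n = len(x)
    terms = [1.0]                                   # 1. zero-order (dc) term
    terms += list(x)                                # 2. original input terms
    for m in (1, 2, 3):                             # 3. sine expansion
        terms += [math.sin(m * xi) for xi in x]
    for m in (1, 2, 3):                             # 4. cosine expansion
        terms += [math.cos(m * xi) for xi in x]
    for i in range(n):                              # 5. cross terms, i != j
        for j in range(n):
            if i != j:
                terms += [x[i] * math.sin(x[j]), x[i] * math.cos(x[j])]
    # 6. outer products of 2..n distinct inputs (no powers of a single input)
    for r in range(2, n + 1):
        for idx in combinations(range(n), r):
            terms.append(math.prod(x[i] for i in idx))
    return np.array(terms)

counts = [fenn_expand([0.1] * n).size for n in (1, 2, 3, 4)]
```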
Nevertheless, the input expansion model of the FENN is extremely flexible in that virtually any function of the input, such as tanh(.), exp(.), etc., could also be employed. In practice, some physical knowledge of the non-linear system to be identified can also be incorporated into the input expansion model. If no a priori system knowledge is available and a more enhanced FENN approximation to the underlying system is required (than that provided by the above expansion models), then additional higher-order polynomial terms from the Volterra series expansion of the inputs can be included in the FENN input expansion model. Thus, the overall FENN structure can perform non-linear approximation by virtue of the input non-linear functional expander, and yet learning of its output layer weights is a linear problem. It is this latter characteristic of the FENN that provides the real motivation for exploiting its use in complex real-world non-linear dynamical system modeling applications.
2.2 Learning Algorithm
(1) Compute the i = 1, ..., m FENN outputs at time k as y_i(k) = F^T(k) W_i(k-1), where F(k) is the [N×1] hidden layer vector comprising the non-linear functional terms, and W_i(k-1) is the [N×1] weight vector of the i-th output node.
(2) The output prediction error for each FENN output is e_i(k) = d_i(k) - y_i(k), where d_i(k) is the i-th desired output. The Mean Squared Error (MSE) is therefore (where E(.) denotes the expectation operator and T denotes matrix transpose):
E(e_i(k)²) = E(d_i(k)²) - 2 W_i(k-1)^T E(d_i(k)F(k)) + W_i(k-1)^T E(F(k)F(k)^T) W_i(k-1)
The corresponding minimum MSE (MMSE) for the FENN can thus be readily written as [15] (with superscript -1 denoting matrix inverse):
MMSE = E(d_i(k)²) - E(d_i(k)F(k))^T E(F(k)F(k)^T)^{-1} E(d_i(k)F(k))
which includes the best linear (Wiener) MMSE for the case F(k) = input vector [x1(k) ... xn(k)], that is, without a non-linear functional expansion of the inputs. The advantage of this particular FENN structure is that linear adaptive filter theory can be readily applied for on-line adaptation. The quadratic form of the above MSE expression shows that there will be no local minima, and so fast and certain convergence may be obtained in practice by use of the following recursive weight updates. Update the FENN weights for each of the m outputs using the exponentially weighted Recursive Least Squares (RLS) algorithm as follows:
(3) Update the inverse of the correlation matrix of the input expansion vector:
P(k) = (1/λ) [ P(k-1) - P(k-1)F(k)F(k)^T P(k-1) / { λ + F(k)^T P(k-1)F(k) } ]
where λ is the forgetting factor (< 1), which introduces exponential weighting of past data.
(4) Update the output layer weights for each output as: W_i(k) = W_i(k-1) + P(k) F(k) e_i(k)
Numerically robust versions of the RLS can be used instead of the above [13][15].
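Steps (1)-(4) can be sketched for a single output as follows. This is a generic exponentially weighted RLS illustration with an arbitrary random feature vector standing in for F(k); the sizes, forgetting factor and initialisation are assumptions.

```python
import numpy as np

def rls_step(P, W, F, d, lam=0.99):
    """One exponentially weighted RLS update of an output weight vector."""
    e = d - F @ W                    # steps (1)-(2): output and its error
    PF = P @ F
    k = PF / (lam + F @ PF)          # gain vector
    P = (P - np.outer(k, PF)) / lam  # step (3): inverse correlation update
    W = W + k * e                    # step (4): weight update
    return P, W, e

rng = np.random.default_rng(1)
N = 5
W_true = rng.normal(size=N)          # weights the combiner should recover
P = 1e3 * np.eye(N)                  # large initial P (weak prior)
W = np.zeros(N)
for _ in range(200):
    F = rng.normal(size=N)           # stand-in for the expansion vector F(k)
    P, W, _ = rls_step(P, W, F, F @ W_true)
```

On noiseless linearly generated data like this, the weights converge to the generating values, reflecting the unimodal (quadratic) error surface discussed above.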
The simpler Least Mean Squares (LMS) algorithm, which is a stochastic gradient algorithm, can also be used for updating the output layer weights as follows: W_i(k) = W_i(k-1) + μ e_i(k) F(k), where μ controls the convergence rate. However, the rate of convergence of
the LMS algorithm is dependent on the spread of the eigenvalues of the input expansion model's autocorrelation matrix E(F(k)F(k)^T), with a large eigenvalue spread dictating a significantly slower convergence rate [13]. On the other hand, the least-squares-criterion-based RLS algorithm will converge more rapidly than the LMS, but at the expense of an increased computational complexity, O(N²) compared to O(N). Various Fast RLS (FRLS) algorithms have also been proposed recently [15] to reduce the complexity of the RLS from O(N²) to O(N), and can also be readily applied to train the above FENN. Thus, once the full expansion model at the input layer of the FENN has been specified, the exponentially weighted RLS algorithm can be used to provide an efficient means of real-time adaptation of the network weights. This gives the FENN a significant advantage over multi-layered neural network structures such as the MLP in recursive identification applications.
3. Simulation Results
3.1 Modeling of Real Sunspots
Following Tong [11], Weigend [9], Svaver [12] and McDonnel [10], we first trained a fully expanded (2,20) FENN (two inputs expanded into twenty functional terms) on the sunspot series for the years 1700-1920, and then evaluated the one-step predictions of the evolved (pruned) and trained (2,14) FENN network model on the sunspot series for the test years 1921-1979. The FENN was pruned by employing an iterative pruning-retraining scheme to successively prune off the insignificant basis functions. Output error autocorrelation and chi-squared statistic based model validation tests were employed at each stage in order to validate the pruned FENN model. The average relative variance (arv) [10] achieved by the final non-linear FENN one-step predictor model on the test data is compared in Table 1 with other published results, where TAR denotes a Threshold Auto-Regressive model and SLP a Single-Layered Infinite Impulse Response (IIR) Perceptron.

Model                       No. of Parameters   arv (1921-79)
Our new FENN (pruned)       14                  0.288
Tong's [11] TAR model       16                  0.377
Weigend's [9] MLP model     43                  0.436
Svaver's [12] MLP model     16                  0.432
McDonnel's [10] SLP model   23                  0.467

Table 1: Test performance comparison of various single-step predictor models on the sunspot test data (1921-1979)

The optimally pruned and trained (2,14) FENN (14-term) one-step predictor model of the sunspot series (which satisfied all the correlation and chi-squared model validity tests) is illustrated below:
ŷ(k) = 1.01y(k-1) + 1.155y(k-2) + 1.231sin(y(k-2)) - 2.06sin(2y(k-1)) + 1.984sin(2y(k-2)) - 1.139cos(y(k-2)) - 1.49cos(2y(k-1)) + 1.539sin(3y(k-1)) - 1.096sin(3y(k-2)) - 0.662y(k-1)sin(y(k-2)) - 0.111y(k-1)cos(y(k-2)) - 2.778y(k-2)sin(y(k-1)) - 3.809y(k-2)cos(y(k-2)) + 2.693
where ŷ(k) is the one-step FENN prediction of the current sunspot sample y(k), based on just the previous two sunspot time series samples [y(k-1) y(k-2)].
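For reference, the arv figure quoted in Table 1 is assumed here to be the mean squared prediction error normalised by the variance of the target series, as in Weigend's work; a minimal sketch on synthetic data:

```python
import numpy as np

def arv(predicted, target):
    """Mean squared error normalised by the variance of the target series."""
    predicted = np.asarray(predicted, dtype=float)
    target = np.asarray(target, dtype=float)
    return np.mean((predicted - target) ** 2) / np.var(target)

y = np.sin(0.3 * np.arange(100.0))            # synthetic target series
arv_perfect = arv(y, y)                       # a perfect predictor
arv_mean = arv(np.full_like(y, y.mean()), y)  # the trivial mean predictor
```

Under this definition a perfect predictor scores 0 and predicting the series mean scores 1, so the values in Table 1 (all below 1) indicate genuine predictive structure being captured.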
The FENN predictor model above also illustrates the relative contributions of the various proposed non-linear basis functions, which are primarily responsible for the superior FENN performance over all the other recently reported, more complex neural network models (all of which required information from at least the last 6 sunspot observations).
3.2 Modeling of the simulated chaotic Mackey-Glass equation [10]:

dy(k)/dk = 0.2 y(k-30) / (1 + y^10(k-30)) - 0.1 y(k)
Model                      New pruned FENN   McDonnel's [10] Recurrent IIR Perceptron
Total no. of parameters    4                 17
arv (1-step predictions)   0.0012            0.0025
arv (2-step predictions)   0.0016            0.0070

Table 2: arv performance comparison of one-step and two-step non-linear predictor models on a 500-sample test set (both models trained on a different 500-sample set).

The final evolved one-step (2,4) FENN predictor model of the Mackey-Glass time series is illustrated below:
ŷ(k) = 1.99y(k-1) - 0.88sin(y(k-2)) + 0.74y(k-1)sin(y(k-2)) + 0.05y(k-2)sin(y(k-1))
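The Mackey-Glass series itself can be generated by crude Euler integration of the delay equation above; the unit time step and the constant initial history used here are illustrative assumptions, not the test conditions of [10].

```python
import numpy as np

def mackey_glass(n_points, tau=30, dt=1.0, y_init=1.2):
    """Euler integration of dy/dt = 0.2 y(t-tau)/(1 + y(t-tau)^10) - 0.1 y(t)."""
    delay = int(round(tau / dt))             # delay expressed in time steps
    history = [y_init] * (delay + 1)         # assumed constant initial history
    for _ in range(n_points):
        y_tau = history[-(delay + 1)]        # delayed value y(t - tau)
        y = history[-1]
        history.append(y + dt * (0.2 * y_tau / (1.0 + y_tau ** 10) - 0.1 * y))
    return np.array(history[delay + 1:])

series = mackey_glass(500)
```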
4. Conclusions
In this paper a new feedforward two-layer neural network, termed the FENN, was presented. It can be considered a hybrid neural network incorporating, to a variable extent, the combined modeling capabilities of the conventional MLP, RBF and VNN structures. A general design strategy was also presented. The linear-in-the-parameters structure of the FENN enables the use of fast least-squares-based learning in the output layer. The use of an iterative pruning-retraining strategy coupled with model validation tests resulted in parsimonious FENN predictor models (comprising the most significant of the proposed non-linear basis functions). The final evolved FENN models outperformed other recently reported, more complex neural network models in the modeling of simulated chaotic and real-world noisy, non-stationary time series data, both in terms of non-linear prediction ability and relative computational requirements.
The use of the design strategy presented above for the FENN structure, together with the least-squares-based learning algorithm and pruning strategy, has also consistently resulted [14] in highly efficient FENN predictor models for a variety of other complex, simulated and real-world non-linear time series processes, including the chaotic logistic map, the Henon map, Non-linear Auto-Regressive (NAR) time series, Single-Input Single-Output (SISO) and Multi-Input Multi-Output (MIMO) NAR with eXogenous inputs (NARX) processes, and real-world stock market data, real laser time series and actual speech signals. An added benefit of the new FENN is that the structures of the corresponding FENN predictor models may also provide useful insights into the physics of the underlying unknown non-linear system dynamics.
5. References
[1] S. Haykin, Neural Networks expand Signal Processing's Horizons, IEEE Signal Processing Magazine, pp.25-49, Mar. 1996.
[2] A. Hussain, J.J. Soraghan, T.S. Durrani, Artificial Neural Networks for Array Processing, Proc. of IEE-IEEE Intern. Workshop on Natural Algorithms in Signal Processing, Vol.1, Chelmsford, Essex, Nov. 1993.
[3] D.R. Hush, B.G. Horne, Progress in Supervised Neural Networks: What's new since Lippmann, IEEE Signal Processing Magazine, pp.9-39, Jan. 1993.
[4] D.A. White and D.A. Sofge (Eds.), Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, New York: Van Nostrand Reinhold, 1992.
[5] O.E.B. Nielsen, J.L. Jensen, W.S. Kendall (Eds.), Networks and Chaos - Statistical and Probabilistic Aspects, Chapman and Hall, 1994.
[6] A. Lapedes and R. Farber, Non-linear signal processing using neural networks: prediction and system modeling, Technical Report LA-UR-87-2662, Los Alamos National Laboratory, 1987.
[7] M. Casdagli, Non-linear Prediction of chaotic time series, Physica D, Vol.35, pp.335-356, 1989.
[8] D. Lowe and A.R. Webb, Time series prediction by adaptive networks: A dynamical systems perspective, IEE Proc. F, Vol.138, pp.17-25, 1991.
[9] A.S. Weigend, D.E. Rumelhart and B.A. Huberman, Predicting the future: A Connectionist approach, Technical Report Stanford-PDP-90-01, 1990.
[10] J.R. McDonnell and D. Waagen, Evolving Recurrent Perceptrons for Time-Series Modeling, IEEE Transactions on Neural Networks, Vol.5, No.1, pp.24-38, 1994.
[11] M.B. Priestley, Non-linear and non-stationary time series analysis, Academic Press, 1988.
[12] C. Svaver, L.K. Hansen, J. Larsen, On design and development of tapped delay neural network architectures, IEEE Int. Conf. on Neural Networks, San Francisco, 1992.
[13] B. Mulgrew, C.F.N. Cowan, Adaptive Filters and Equalizers, Kluwer Academic Pub., 1988.
[14] A. Hussain, New Artificial Neural Network Architectures and Algorithms for Non-linear Dynamical System Modeling and Digital Communications Applications, Research Report, Signal Processing Division, University of Strathclyde, Glasgow, 1996.
[15] H. Schutze, Z. Ren, Numerical characteristics of FRLS transversal adaptive algorithms - a comparative study, Signal Processing, pp.317-332, 1992.
[16] P.J.W. Rayner, M.R. Lynch, A new connectionist model based on a non-linear adaptive filter, Proc. ICASSP, Glasgow, 1989.
Figure 1: The Functionally Expanded Neural Network (FENN), showing the n inputs x1, ..., xn, the hidden-layer functional expansion model, the output-layer weights, and the m outputs y1, ..., ym.
A Neural Network for Moving Light Display Trajectory Prediction
H. M. Lakany and G. M. Hayes
Department of Artificial Intelligence, University of Edinburgh, 5 Forrest Hill, Edinburgh EH1 2QL, Scotland, U.K. {hebaal, gmh}@aifh.ed.ac.uk
Abstract
In this paper, a radial basis function (RBF) neural network is trained to estimate the positions of markers placed on the joints of a subject. The network is also used to predict the positions of markers that were self-occluded. Results show that using an RBF network in this application is successful.
1 Introduction
Moving Light Displays (MLDs) are image sequences containing only selected points of a 3-D object in each frame. This technique has been used for studying human motion by placing passive markers or light bulbs on the joints of a subject (see Fig. 1). MLDs are also used for studying human motion perception, subject recognition and human gait analysis; see for example [4], [6] and [1], respectively.
Figure 1: A single frame of a walker having MLDs on joints - front view
A computer system has been built for recognising walking persons based on MLD sequences [5]. As shown in Fig. 2, in some frames one or more markers may be hidden behind another part
of the body. Self-occlusion makes it difficult to correctly plot the trajectories of the motion of a joint and hence correctly analyse gait. In gait analysis labs, where one cannot afford to miss the location of a marker at any instant, the problem of self-occlusion is overcome by placing a number of cameras (four or five) in such a way that at each instant each marker is seen by at least two cameras, which is evidently an expensive way to solve the problem. In [7], Taylor et al. present an extrapolation technique based on a 2-D linear least-squares approximation to predict the position of each joint/marker in the frame, and use a nearest-neighbour criterion to choose among different possibilities if more than one marker is in the neighbourhood of the predicted position. In this paper, we propose a neural network trained to predict the position of the joint markers at different instants.
Figure 2: A sequence of frames of a walker having MLDs on joints - side view (sagittal plane) showing self-occlusions: (a) arm occludes the right hip marker, (b) hip and knee markers are visible, (c) left knee marker is occluded
2 Algorithm
In the algorithm presented in this paper, a radial basis function neural network [2], [3] is trained to interpolate between successive frames, and the trained network is used to predict the positions of self-occluded markers. A pair of networks is trained for each joint - one for the x- and one for the y-coordinate motion. The inputs to the network are the time and the corresponding relative coordinates of the joint in a sequence of frames (coordinates are measured relative to a fixed marker on the body of the subject, e.g. the shoulder marker). Several walks are recorded for the subject while walking at his/her normal speed. The training data of the network consists of a set of coordinates of the marker of a particular joint
and the corresponding time over several gait cycles from different walks. A gait cycle is defined as the time interval between two successive occurrences of one of the repetitive events of walking, e.g. between two right-heel strikes [8]. The testing data is a new set of coordinates from other gait cycles with missing joint coordinates in one or more of the frames; it is the network's job to predict the position of the joint under consideration.
3 Results
A radial basis function network was trained to find the best fit of the trajectory for each marker. The trained network was used to predict the coordinates of the joint marker. Some points were intentionally occluded to test the robustness of the algorithm. The mean squared error (MSE) between the actual locations and the predicted positions was calculated. Figure 3 shows the results of a typical experiment predicting the location of the hip marker, which was occasionally occluded due to the arm movement in the sagittal (side view) plane. The positions of the markers at ~70% and ~80% of the gait cycle were intentionally occluded, yet as shown in the figure the network managed to predict the x- and y-coordinate positions with only one pixel difference (average relative error for the x-coordinate = 7.56 × 10⁻³ and for the y-coordinate = 2.37 × 10⁻³). The network was trained on 15 cycles and tested on 10 cycles. Other networks for predicting the locations of the rest of the markers, for both side and frontal views, were also trained and tested and showed similar results.
4 Conclusion
In this paper, we have shown that the use of a radial basis function neural network is a promising method for predicting the positions of the joints of a walking subject, and hence it can be used to estimate the positions of joints that are self-occluded during motion.
Acknowledgement Thanks to Mark Orr at the Centre of Cognitive Science - University of Edinburgh for his useful comments.
References
[1] Wilhelm Braune and Otto Fischer. Der Gang des Menschen/The Human Gait. Springer Verlag, 1895-1904. Translated edition by P. Maquet and R. Furlong, 1987.
[2] Tomaso Poggio and Federico Girosi. A theory of networks for approximation and learning. A.I. Memo 1140, Massachusetts Institute of Technology - AI Laboratory, July 1989.
[3] Don R. Hush and Bill G. Horne. Progress in supervised neural networks. IEEE Signal Processing Magazine, pages 8-39, January 1993.
[4] Gunnar Johansson. Visual motion perception. Scientific American, pages 76-88, November 1976.
[Figure 3 (plots): Right hip x-motion and y-motion trajectories against gait cycle (%). Legend: o = actual marker position, + = missing marker position (due to occlusion), x = predicted marker position.]
Figure 3: Actual and predicted positions of the right hip marker of a walking subject

[5] H. M. Lakany, A. Birbilis, and G. M. Hayes. Recognising walkers using moving light displays. RP-811, Department of Artificial Intelligence, Univ. of Edinburgh, June 1996.
[6] Richard F. Rashid. Towards a system for the interpretation of moving light displays. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-2(6):576-581, November 1980.
[7] K. D. Taylor, F. M. Mottier, D. W. Simmons, W. Cohen, R. Pavlak, D. P. Cornell, and G. B. Hankins. An automated motion measurement system for clinical gait analysis. Journal of Biomechanics, 15(7):505-516, 1982.
[8] Michael Whittle. Gait Analysis: An Introduction. Butterworth-Heinemann Ltd, 1991.
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
Recognizing Flow Pattern of Gas/Liquid Two-component Flow Using Fuzzy Logical Neural Network

Peng Lihui, Zhang Baofen, Yao Danya, Xiong Zhijie
Department of Automation, Tsinghua University, Beijing, China, 100084

Abstract. This paper describes a new method based on a fuzzy neural network which is used to recognize two-component flow patterns. The paper discusses the structure of the fuzzy neural network, including the selection of the fuzzy logical rule and the training sets. An accelerated learning algorithm (adaptive backward propagation) is used to train the neural network to shorten its learning time. Computer simulations show that this new method can recognize the four typical flow patterns existing in gas/liquid two-component flow: stratified flow, annular flow, slug flow and bubble flow. Finally, some results useful for future work are also presented.
Keywords: two-component flow, process tomography, flow pattern recognition, fuzzy neural network

1. Introduction
Two-component flow is very common in many industrial areas such as power plants, steel factories and chemical manufacturing, and the measurement of its parameters is very important to the related research work. Because of its complex flow states and properties, it is very difficult to measure two-component flow using traditional detection methods, and the measurement accuracy is usually much lower. This is unfavorable to industrial practice. Since the 1980s, a new detection technique has been developed: process tomography, a technique for extracting spatial information about process parameters by using multiple sensors fixed around the process of interest. Process tomography can provide cross-section images which are valuable for the assessment of equipment designs and for the on-line monitoring of industrial processes, so it is very suitable for two-component flow measurement [1][2][3].
In the past few years, process tomography techniques have developed very rapidly. Many process tomography systems using different sensing techniques, such as capacitance sensors, γ-rays, X-rays and acoustic methods, have been built and tested for various applications. Among the many process tomography systems, electrical capacitance tomography (ECT) was the first to be developed because it is simple, non-intrusive, robust and cheap. Many process tomography research groups have now developed their own ECT systems. The ECT system built at UMIST has been used successfully for monitoring oil pipelines and pneumatic conveying pipelines [4]. Although much progress has been made in research on ECT systems, there are still many shortcomings, one of which is image distortion caused by the "soft-field" effect of the capacitance transducer. When applying an ECT system to two-component flow measurement, it is found that the flow patterns inside the pipelines affect image reconstruction, so knowing the flow patterns is very useful for improving image quality [5]. How to recognize the flow patterns of two-component flow has always been difficult. Many experts have done much work using methods from statistics and fuzzy mathematics, but each method has its limitations when applied to process tomography. A fuzzy logical neural network was chosen for its simplicity and high speed in our research work.

2. Principle
The neural network is a technique which is mainly used to solve nonlinear problems. Since their appearance, neural network techniques have developed significantly and have been applied in many areas such as signal processing, pattern recognition, control theory and so on. Recently, some experts have also tried to use neural network techniques in the research area of process tomography [6][7]. In our research project, an 8-electrode ECT system is used to monitor an oil/gas pipeline, and a fuzzy logical neural network is chosen to recognize the flow pattern inside the pipeline.
There are four typical flow patterns existing in oil/gas two-component flow: stratified flow, annular flow, slug flow and bubble flow. Figure 1 illustrates the cross-section images of these four flow patterns.
[Figure 1 (diagrams): cross sections (a)-(d); shaded regions denote the continuous phase, unshaded regions the discrete phase.]
Figure 1. The four typical flow patterns existing in an oil/gas pipeline: (a) stratified flow (b) annular flow (c) slug flow (d) bubble flow

Among the many neural network types, the feedforward type is chosen for its simple structure in our research work. Figure 2 shows the structure of our fuzzy logical neural network.
Figure 2. The structure of the fuzzy logical neural network

In Figure 2, D1 is the input data obtained from the transducer; for an 8-electrode capacitance transducer there are 28 (N(N-1)/2, where N is the electrode number) independent measured capacitance values. The function of the module P1 is to fuzzify the input data. The vector C = (c1, c2, ..., c28) is constructed from the 28 measured capacitance values, and the vector X = (x1, x2, ..., x28) stands for the fuzzy value of C; the fuzzy logical rule is as follows:
129
$$c_{ni} = \frac{c_i - c_{ie}}{c_{if} - c_{ie}}, \qquad
f(x) = \begin{cases}
0.1 & x < 0.1 \\
0.25 & 0.1 \le x < 0.4 \\
0.5 & 0.4 \le x < 0.6 \\
0.75 & 0.6 \le x < 0.9 \\
0.9 & x \ge 0.9
\end{cases} \qquad (1)$$
where $c_{ni}$ is the normalized capacitance, and $c_{if}$ and $c_{ie}$ denote the capacitance values of the full pipe (the pipe is full of oil) and the empty pipe (the pipe is full of gas), respectively, for the i-th capacitance measurement; f(x) is the fuzzy logical function. The purpose of fuzzifying the input data is to strengthen the pattern features. In general, the fuzzy logical function divides the input data into "very small", "small", "middle", "big" and "very big". The fuzzy data is input into a feedforward neural network. The node number of the input layer is equal to the number of independent measured data, and the node number of the output layer is the same as the number of flow patterns. If needed, a hidden layer can be added between the input layer and the output layer, and its node number can be chosen through experiment; in our network this number is equal to the electrode number of the transducer. The network output is sent to the procedure P2; according to the maximum likelihood criterion, the flow pattern whose corresponding node output is the largest is the recognition result D2. The mapping relationship between the input nodes and the output nodes of the neural network is shown in equation (2):
$$\mathrm{net}_j = \sum_{i=1}^{m} w_{ij}\, x_i, \qquad y_j = S(\mathrm{net}_j), \qquad S(\mathrm{net}_j) = \frac{1}{1 + e^{-\mathrm{net}_j}} \qquad (2)$$
where $w_{ij}$ is the linking weight between the i-th input node and the j-th output node. Before the neural network can work properly, it must be trained. For this purpose, a training set must be chosen. In general, the training set should include enough typical samples. If the number of samples is not sufficient, the recognition ability of the network after training is limited. On the other hand, if the number of samples is too large, the learning period will be very long. So, a suitable number of samples is necessary so that the network has good recognition ability and the learning procedure is not very long. Figure 3 shows the training set chosen in our research work. It includes 18 typical flow pattern samples.
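The fuzzification rule of equation (1) and the feedforward pass plus maximum-selection step P2 of equation (2) can be sketched together as follows. All numeric values here (capacitances, weights) are hypothetical placeholders; real weights would come from training.

```python
import numpy as np

def fuzzify(c, c_empty, c_full):
    """Normalise the 28 measured capacitances (eq. 1) and map them to the
    five fuzzy levels 0.1/0.25/0.5/0.75/0.9 ("very small" ... "very big")."""
    x = (c - c_empty) / (c_full - c_empty)
    # np.select tries the conditions in order, so each value gets the first
    # matching band; anything >= 0.9 falls through to the default level.
    return np.select([x < 0.1, x < 0.4, x < 0.6, x < 0.9],
                     [0.1, 0.25, 0.5, 0.75],
                     default=0.9)

def recognise(x, W):
    """Single-layer feedforward pass (eq. 2) followed by the P2 step:
    pick the flow pattern whose output node responds most strongly."""
    net = W @ x                        # net_j = sum_i w_ij * x_i
    y = 1.0 / (1.0 + np.exp(-net))     # logistic activation S(net_j)
    return int(np.argmax(y))           # index of the winning flow pattern

# Hypothetical example: 28 inputs, 4 outputs (stratified/annular/slug/bubble)
rng = np.random.default_rng(0)
c_empty, c_full = np.zeros(28), np.ones(28)
x = fuzzify(rng.uniform(0, 1, 28), c_empty, c_full)
W = rng.normal(size=(4, 28))           # stand-in for trained weights
pattern = recognise(x, W)
```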
Figure 3. The training set

Usually, the neural network learning procedure is very time-consuming, so a good learning algorithm, one which ensures that the network is convergent and at the same time has a high speed, must be chosen [8][9][10]. The BP (backward propagation) learning algorithm is frequently used for its simplicity. An accelerated algorithm, the adaptive backward propagation algorithm, is used to train the network in our research work in order to shorten the learning period [11]. Figure 4 indicates the learning procedure of our neural network.
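The paper does not reproduce the details of the adaptive algorithm of [11]. As an illustration of the general idea of adapting the learning rate during training, the sketch below uses a generic "bold driver" style rule on a single-layer sigmoid network: accepted steps grow the learning rate, rejected steps are undone and shrink it. The factors 1.05 and 0.5 and the toy task are assumptions, not taken from the paper.

```python
import numpy as np

def train_adaptive_bp(X, T, lr=0.5, up=1.05, down=0.5, epochs=300, seed=0):
    """Gradient descent with an adaptive step size: if a step lowers the
    error it is accepted and the rate is increased; otherwise the step is
    rejected and the rate is decreased."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(T.shape[1], X.shape[1]))

    def forward(W):
        Y = 1.0 / (1.0 + np.exp(-(X @ W.T)))       # sigmoid outputs
        return float(np.mean((Y - T) ** 2)), Y      # network output error

    err, Y = forward(W)
    history = [err]
    for _ in range(epochs):
        grad = ((Y - T) * Y * (1.0 - Y)).T @ X / len(X)  # delta-rule gradient
        W_new = W - lr * grad
        err_new, Y_new = forward(W_new)
        if err_new <= err:                 # accept the step and accelerate
            W, Y, err, lr = W_new, Y_new, err_new, lr * up
        else:                              # reject the step and back off
            lr *= down
        history.append(err)
    return W, history

# Toy task: map four one-hot input patterns to themselves
X = np.eye(4)
T = np.eye(4)
W, history = train_adaptive_bp(X, T)
```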
Figure 4. Learning procedure

In figure 4, the horizontal coordinate denotes the iteration count, with each epoch comprising 10 iterations. The vertical coordinate stands for the network output error.

3. Simulation results and discussion
After the network training is completed, several flow patterns are used to test the recognition ability of the fuzzy logical neural network. It is found that the network is capable of recognizing the four typical flow patterns existing in gas/liquid two-component flow: stratified flow, annular flow, slug flow and bubble flow. Tables 1-4 illustrate some results.
The fuzzy logical neural network succeeds in recognizing the flow patterns inside oil/gas two-component flow pipelines, and this is very helpful for parameter measurement of oil/gas two-component flow. If the recognition ability of the network is not satisfactory, more samples can be added to the training set.
References
1. M.S. Beck and R.A. Williams, "Process tomography: a European innovation and its applications", Meas. Sci. Technol., Vol. 7, No. 3, pp. 215-224, 1996.
2. T. Dyakowski, "Process tomography applied to multi-phase flow measurement", Meas. Sci. Technol., Vol. 7, No. 3, pp. 343-353, 1996.
3. R. Abdul Rahim, R.G. Green, et al., "Further development of a tomographic imaging system using optical fibers for pneumatic conveyors", Meas. Sci. Technol., Vol. 7, No. 3, pp. 419-422, 1996.
4. W.Q. Yang, A.L. Stott and M.S. Beck, "Development of capacitance tomographic imaging systems for oil pipeline measurements", Rev. Sci. Instrum., Vol. 66, No. 8, pp. 4326-4332, 1995.
5. Øyvind Isaksen, "A review of reconstruction techniques for capacitance tomography", Meas. Sci. Technol., Vol. 7, pp. 325-337, 1996.
6. A.Y. Nooralahiyan, B.S. Hoyle and N.J. Bailey, "Pattern Association and Feature Extraction in Electrical Capacitance Tomography", Proc. ECAPT, pp. 266-275, 1993.
7. D. Wetzlar, "Neural Network Solving the Inverse Problem of Electrical Impedance Tomography", Proc. ECAPT, pp. 275-284, 1993.
8. Y. Bengio, P. Simard and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult", IEEE Transactions on Neural Networks, Vol. 5, No. 2, pp. 157-166, 1994.
9. O. Nerrand, P. Rouseel-Ragot, D. Rubani, L. Personnaz and G. Dreyfus, "Training recurrent neural networks: Why and how? An illustration in dynamical process modeling", IEEE Transactions on Neural Networks, Vol. 5, No. 2, pp. 178-184, 1994.
10. O. Olurotimi, "Recurrent neural network training with feedforward complexity", IEEE Transactions on Neural Networks, Vol. 5, No. 2, pp. 185-197, 1994.
11. A.G. Parlos, B. Fernandez and A.F. Atiya, "An accelerated learning algorithm for multilayer perceptron networks", IEEE Transactions on Neural Networks, Vol. 5, No. 3, pp. 493-497, 1994.
ADAPTIVE ALGORITHM TO SOLVE THE MIXTURE PROBLEM WITH A NEURAL NETWORKS METHODOLOGY
Pérez R.M., Martinez P., Moreno J., Silva A., Aguilar P.L.
Dpto. de Informática, Escuela Politécnica, Universidad de Extremadura, Avenida de la Universidad s/n, 10071 Cáceres, SPAIN
Tel: +34-27-257183 Fax: +34-27-257202 Email:
[email protected]

ABSTRACT. In this paper we present the development of a robust method for determining and quantifying the components in a composite spectrum, assuming the patterns of the individual spectra belonging to it are known in advance. The general solution method proposed in the present work is supported by a linear recurrent neural network based on the Hopfield model (HRNN). The neural model guarantees the convergence of this problem using the gradient method for minimizing errors. The HRNN has a reasonable computational cost, and one of its most remarkable properties is its robustness against increasing levels of noise in the composite spectrum. Another interesting property is the possibility of implementing VLSI structures of low complexity amenable to supporting the algorithm, since only multiplications and additions are required for programming an inversion process.
1. INTRODUCTION. We consider the following problem (the Mixture Problem): assuming we are given the spectra of a number (K) of elements (Basic References), we must determine the unknown composition of a cocktail of the mentioned elements using a radiation spectrum obtained from this mixture. The goal of this paper is to explore the possibility of using a neural networks methodology to give a reliable, robust and efficient solution to this problem, based on the inherent parallelism of neural networks. Adams et al. [1] studied this problem in the context of Surface Mineralogy Prospection, and Lawton [2] proposed conventional digital algorithms for its solution. These approaches are fairly slow because serial computation is implied. A method based on an Optical Neural Network was presented by Barnard and Cassasent [3]: to find the composition of a mixture knowing its spectrum and the spectra of the possible components, a quadratic cost function is minimized to find the optimal composition. The snag with this solution is that it includes the constraint that all fractions sum to unity. The possibility of using Multiple Regression Theory, granting an optimal solution in terms of uniqueness, has been developed by Diaz et al. [4]. This approach also explores the possibilities for improving the robustness of the proposed method, which consists of the use of the Pseudo-Inverse Matrix, supported by a Linear Associative Memory built using Pyle's algorithm. The general solution method proposed here is based on the Hopfield Recurrent Neural Network (HRNN). It is a flexible, efficient and robust approach to solving the problem. The Gradient Method for minimizing errors is used to assure the convergence of the algorithm. The use of this model is fully justified since spectrum formation in the Mixture Problem is essentially a linear process [5].
Other interesting properties, deduced from the algorithmic structure of this method, are related to the possibility of implementing VLSI structures of low complexity amenable to supporting the algorithm, since only multiplications and additions are required for programming an inversion process.
2. ADAPTIVE ALGORITHM. In order to describe the adaptive algorithm suggested in the present work, it must be taken into account that a Composite Spectrum may be seen as an N-dimensional vector, built by sorting the emission intensities associated with each energy channel against the channel number, where N is the total number of energy channels:
$$\mathbf{x} = \left[x_0, x_1, \ldots, x_{N-1}\right]^T, \qquad x_n \ge 0, \quad 0 \le n \le N-1 \qquad (1)$$
being $x_n$ the intensity, measured as the number of photons whose energy is comprised in the n-th energy channel's interval. In this way, a Reference Spectrum is a spectrum of the same nature, but produced by an Individual Source. We denote these spectra as $\mathbf{r}_k$, with $0 \le k \le K-1$.
In a general sense, the set of Composite Spectra is composed of all possible spectra that may be produced by a linear combination of all elements belonging to the Reference Set. When the Reference Set is composed of K linearly independent vectors, this results in a K-dimensional vector space, integrated by all the vectors y given by:

$$\mathbf{y} = R\mathbf{c} = \sum_{i=0}^{K-1} c_i \mathbf{r}_i \qquad (3)$$

where c is the Contribution Vector, defined as:

$$\mathbf{c} = \left[c_0, c_1, \ldots, c_{K-1}\right]^T, \qquad c_k \ge 0, \quad 0 \le k \le K-1 \qquad (4)$$

and where every contribution $c_i$ is a function of the relative intensities of the Composite and Reference Spectra. Our goal is to estimate c assuming that R and y are known. For the mixture described by an estimation of c, called c', the difference between the measured spectrum vector y and its re-constructed version y' is:

$$\boldsymbol{\delta} = \mathbf{y} - \mathbf{y}' = \mathbf{y} - \sum_{i=0}^{K-1} c'_i \mathbf{r}_i \qquad (5)$$
$\boldsymbol{\delta}$ is called the Estimation Error, which gives us a measure of how well the estimation of c has been accomplished. We use this error to optimize the estimation procedure by means of a Least Mean Square (LMS) minimization procedure. In relation to c, this procedure amounts to minimizing the Measure Function F(c), defined as:

$$F(\mathbf{c}) = \|\boldsymbol{\delta}\|^2 = \|\mathbf{y} - R\mathbf{c}'\|^2 \qquad (6)$$
To solve this problem we use an iterative process, supported by the Linear Hopfield Minimization Procedure, that basically is a progressive refinement of the Contribution Vector:

$$\mathbf{c}_{j+1} = \mathbf{c}_j - \frac{\lambda}{2}\nabla F(\mathbf{c}_j) \qquad (7)$$

where we have used an estimation of the gradient of F(c) with respect to c, given as:

$$\nabla F(\mathbf{c}) = -2R^T\left[\mathbf{y} - R\mathbf{c}\right] \qquad (8)$$

In compact notation, we may express (7) as $\mathbf{c}_{j+1} = \lambda\mathbf{q} + P\mathbf{c}_j$, where:

$$P = I - \lambda R^T R, \qquad \mathbf{q} = R^T\mathbf{y}, \qquad
p_{ij} = -\lambda\sum_{p=0}^{N-1} r_{pi}\, r_{pj} \;\;(i \ne j), \qquad
p_{ii} = 1 - \lambda\sum_{p=0}^{N-1} r_{pi}^2, \qquad
q_i = \sum_{p=0}^{N-1} r_{pi}\, y_p \qquad (9)$$
being $\lambda$ a parameter dependent on the trace of $R^TR$, which controls the speed of convergence [6], and $p_{ij}$ the weight from the i-th node to the j-th one. This method requires only multiplication and addition operations to solve the Mixture Problem. The adaptive algorithm proposed may be described as follows:

1. Initialization step: read R; evaluate $R^TR$; adjust $\lambda$; evaluate $P = I - \lambda R^TR$.
2. New spectrum to be decomposed: read y; evaluate $\mathbf{q} = R^T\mathbf{y}$.
3. Repeat until $|\mathbf{c}_j - \mathbf{c}_{j+1}| < 10^{-5}$: evaluate $\mathbf{c}_{j+1} = P\mathbf{c}_j + \lambda\mathbf{q}$.
3. EXPERIMENTAL RESULTS. The algorithm stated above has been extensively simulated using a set of 1024-dimensional composite spectra obtained from 10 individual sources generated by Gaussian compositions. Simulation results show that the HRNN algorithm is able to resolve multicomponent mixtures at a reasonable computational cost [7,8]:

$$\text{Computational Cost} = (2K^2 + 2K)(N+1) + K^2 + N \qquad (10)$$
To evaluate the algorithm performance we use the Quadratic Relative Error (QRE), defined as:

$$\mathrm{QRE} = \frac{\sqrt{\sum_{i=0}^{K-1}\left(c_i - c'_i\right)^2}}{\min_{c_i \ne 0} c_i} \qquad (11)$$
where $c_i$ and $c'_i$ are the known and the calculated contributions of the i-th component, respectively. Three sets of experiments were performed to measure the influence of different parameters:
- Level of noise in the mixture spectra.
- Proportion of elements in the mixture.
- Correlation between components.
The first set of experiments concentrated on measuring the noise effects on the behaviour of the method. Each spectrum was contaminated with noise by adding a uniformly distributed random number to each energy channel. This produced a random variation in each energy bin value of at most n% of its original value (the Noise Level ratio). Values used for n were comprised within the 2.5%-35% interval. Fig. 1 shows that, in general, the QRE increases with increasing NL ratios, as could be expected. The iteration number used in each experiment does not show a clear dependency on the NL ratio. The results indicate a good performance in the presence of additive noise (when 35% noise was added to the spectra, no noticeable degradation in the precision occurred).
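The contamination procedure just described can be sketched as below. Whether the perturbation is signed (up to ±n%) or positive only is not stated in the text, so the symmetric version here is an assumption.

```python
import numpy as np

def add_channel_noise(spectrum, n_percent, rng):
    """Perturb each energy channel by a uniformly distributed random amount
    of at most n% of its original value (the Noise Level ratio)."""
    frac = n_percent / 100.0
    noise = rng.uniform(-frac, frac, size=spectrum.shape)
    return spectrum * (1.0 + noise)

rng = np.random.default_rng(1)
y = np.exp(-(np.arange(64) - 32) ** 2 / 40.0)   # toy composite spectrum
y_noisy = add_channel_noise(y, 35, rng)          # heaviest level tested (35%)
```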
[Fig. 1: Influence of the noise level (%) on the QRE and on the iteration number.]
The second set of experiments was intended to measure the behaviour of the algorithm in recognizing spectra of mixtures of two individual sources in different proportions. It may be seen that the procedure works reasonably well with ratios of 1:1000, completely irresolvable to the naked eye. Fig. 2 shows the degradation of the precision as a function of the relative proportions between the components of mixtures of two components. In the worst cases a QRE of 1.2% was detected at a ratio of 1:1000.
[Fig. 2: Influence of the proportion between components on the QRE.]
[Fig. 3: Influence of the correlation between components on the QRE.]
The third set of experiments studied the ability of the method to distinguish between two different spectra as a function of their relative correlation coefficient. The corresponding results may be seen in Fig. 3. It seems that the QRE tends to increase slightly with higher correlation coefficients, although this tendency is not completely uniform.
4. SUMMARY AND CONCLUSIONS. In this work, a recursive neural network based on the Hopfield Model has been introduced to solve the Mixture Problem. This model finds the composition of the mixture given the spectra of the Reference Set. The method seems to be more reliable and robust than other traditional methods based on peak analysis, which fail dramatically in these cases [1,2]. Simulation results show that the HRNN has a reasonable (cubic) computational cost. One of the most remarkable properties of the algorithm is its robustness against increasing levels of noise in the composite spectrum. This property is related to the iterative nature of the algorithm, which acts by accumulating average values for the magnitudes of interest, especially the decomposition error, thus filtering noisy components uncorrelated with the reference spectra. Another interesting property is its ability to resolve quite uneven mixtures, with disproportions as high as 1:1000, where the naked eye cannot detect any evidence of the spectrum present in a low proportion. A further interesting property, deduced from the algorithmic structure of the method, is the possibility of implementing VLSI structures of low complexity amenable to supporting the algorithm, since only multiplications and additions are required for programming an inversion process. The study leading to such an implementation is already under way. This method may be applied to many other problems in Spectroscopy, such as IR and Visible Spectrum Decomposition, with applications in Colorimetry, Remote Sensing of the Earth Surface, Environmental Control, and many others [9].
REFERENCES
[1] Adams J., Johnson P., Smith M. and George T. "A Semi-Empirical Method for Analysis of the Reflectance Spectra of Binary Mineral Mixtures". Journal of Geophysical Research, vol. 88, 1983, pp. 3557-3561.
[2] Lawton W. and Martin M. "The Advanced Mixture Problem - Principles and Algorithms". Technical Report 10M384, Jet Propulsion Laboratory, 1985.
[3] Barnard E. and Cassasent P. "Optical Neural Net for Classifying Imaging Spectrometer". Applied Optics, vol. 18, no. 15, pp. 3129-3133, August 1989.
[4] Diaz J.C., Aguayo P., Gómez P., Rodellar V. and Olmos P. "An Associative Memory to Solve the Mixture Problem in Composite Spectra". Proc. of the 35th Midwest Symposium on Circuits and Systems, Washington DC, pp. 422-428, August 1992.
[5] Pérez R.M., Martinez P., Silva A. and Aguilar P.L. "Influence of the Fixed Point Format on the Accuracy of the Neural Network Solution to the Mixture Problem". Proc. of the 3rd Advanced Training Course: Mixed Design of Integrated Circuits and Systems, pp. 413-418, Lodz (Poland), May 1996.
[6] Rodellar V., Hermida M., Diaz A., Lorenzo A., Gómez P., Aguayo P., Diaz J.C. and Newcomb R.W. "A VLSI Arithmetic Unit for a Signal Processing Neural Network". Proc. of the 35th Midwest Symposium on Circuits and Systems, Washington DC, pp. 891-894, August 1992.
[7] Pérez R.M., Martinez P., Martinez L., Diaz J.C., Rodellar V. and Gómez P. "A Hopfield Neural Network to Solve the Mixture Problem". Proc. of the VI Spanish Symposium on Pattern Recognition and Image Analysis, Córdoba, pp. 744-745, April 1995.
[8] Pérez R.M. and Martinez P. "Validación de una Metodología Neuronal para una Cuantificación de Firmas Espectrales" [Validation of a Neural Methodology for Quantification of Spectral Signatures]. Proc. of the V Reunión Científica de la Asociación Española de Teledetección, Valladolid, pp. 146-147, September 1995.
[9] Pérez R.M., Martinez P., Silva A. and Aguilar P.L. "An Adaptive Solution to the Mixture Problem with Drift Spectra". Proc. of the Third International Conference on Signal Processing, Beijing, China, October 1996.
PROCESS TREND ANALYSIS AND FUZZY REASONING IN FERMENTATION CONTROL
S. Kivikunnas, K. Ibatici and E. Juuso
Department of Process Engineering, University of Oulu, PO Box 444, FIN-90571 Oulu, Finland
Physics Department, University of Bologna, Via Irnerio 46, I-40126 Bologna, Italy
ABSTRACT Manual supervision of fermentation processes relies heavily on visual detection of characteristic patterns of temporal changes in process variables. Automating this experience-based reasoning conducted by operators would allow the redirection of their contribution to more profitable tasks, e.g. planning and scheduling of plant operations. In this paper, two different prototype applications for analysing temporal patterns in the trends of continuous fermentation process variables are described. The first is designed to operate with tuneable time windows, utilising fuzzy reasoning to produce an early indication of changes in a process variable. The second operates by comparing the first and second derivatives of the filtered process variable to patterns saved in an extendible shape library describing the trends at a symbolic level. Tests suggested that both methods could be applicable to bioprocess control. However, the capabilities of present reasoning tools were found to be inadequate for combining symbolic and numeric information.
INTRODUCTION The lack of appropriate on-line measurements for some important variables is a serious practical problem in controlling fermentation processes. The normal solution to this class of problems is to estimate the interesting variables from the on-line measurements and analysis data available. Point estimates in a fermentation environment, produced by any feasible technique, are normally noisy and very uncertain in nature. For slow processes like fermentation, temporal reasoning could become a very valuable tool to diagnose and control the process. Manual supervision relies heavily on visual monitoring of characteristic shapes of changes in process variables, especially their trends. Systems that are able to detect and analyse temporal shapes of trends from measured or estimated data could boost the performance of fermentation control systems remarkably. The pattern recognition approach to biotechnological processes has been seen as an emulation of the human view of processes [3] and is thereby suitable to serve in expert systems applied to process control. Of special interest are methods that are capable of reasoning about the recent process history [2]. Although computational efficiency and practical implementation aspects are considered in papers dealing with trend analysis, we want to stress that methodological research and industrial case studies are still needed to bring pattern recognition-based methods into industrial practice. A good starting point for applications could be simpler approaches where linear regression and moving
averages [4], or a trend indicator based on the difference of two moving averages of a process variable [5], are used for reasoning. In this paper, two applications for analysing temporal shapes of fermentation process variables are described. The first is designed to operate with tuneable time windows producing two index values as inputs to a fuzzy logic system, which gives the type of change in the process variable as output, e.g. "started to wind up" or "is constantly increasing". The second operates with the first and second derivatives of the FIR-Median Hybrid filtered [1] process variable and compares the pattern of the recent process history to given templates.
PROTOTYPE DESIGN AND IMPLEMENTATION
The first application was designed to serve as a trend operator in fuzzy control blocks. The method consists of calculating two index values from process data to be applied as inputs to a fuzzy logic block. The index values are calculated as the difference between short and long moving averages of recent process measurements and as the slope of the long time window. The principle of the time window division is shown in Fig. 1. Different lengths for the short and long time windows can be selected depending on the process characteristics and the usage of the system. The output of the rule-based system indicates whether the process variable is constant, changing linearly or exhibiting accelerating behaviour. Five rules and a conventional fuzzy logic construction with trapezoidal membership functions for input and output variables were implemented for testing.
Fig. 1. The division of the recent process measurement history into two time windows of different lengths.
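A minimal sketch of the two index values described above (our illustration, not the authors' code; the window lengths of 5 and 20 samples are arbitrary assumptions):

```python
import numpy as np

def trend_indices(x, short_len=5, long_len=20):
    """Two index values from the recent history of a process variable:
    the difference between the short and long moving averages, and the
    slope of a least-squares line fitted over the long time window."""
    x = np.asarray(x, dtype=float)
    long_win = x[-long_len:]
    index1 = x[-short_len:].mean() - long_win.mean()
    slope = np.polyfit(np.arange(long_len), long_win, 1)[0]
    return index1, slope
```

On a steadily increasing variable the slope index equals the rate of increase and the moving-average difference is positive; on a constant variable both indexes are near zero, matching the "constant / changing linearly / accelerating" categories of the rule base.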
The second, more complicated procedure is based on a priori knowledge about meaningful shapes in recent process measurement history. In experimenting with the method proposed by [2], we found the behaviour of its function approximation stage unsatisfactory with noisy or corrupted data. In the search for a more robust smoothing method we found the FIR-Median Hybrid filtering (FMH) technique introduced by [1] a good candidate for pretreating process measurements before the template matching
procedure. Median filtering has the benefits of preserving sharp changes in signals and being effective in removing impulsive noise. In FMH filtering, the sorting intrinsic to median search routines is replaced with linear averaging substructures, which gives a reduction in computational load. Both systems were implemented in the MATLAB environment and tested off-line with data obtained from continuous lactic acid fermentation experiments. The variable studied was the pH-control base consumption of the continuous fermentation process. This variable is very rich in information because it promptly indicates changes in the productivity of the process.
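A minimal sketch of the basic FMH idea (our illustration of the scheme in [1], not the authors' implementation; the sub-filter length k is an assumed parameter): each output sample is the median of a backward FIR average, the current sample and a forward FIR average.

```python
import numpy as np

def fmh_filter(x, k=3):
    """Basic FIR-Median Hybrid filter: the median of a backward average,
    the current sample and a forward average replaces a full median sort."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    for n in range(k, len(x) - k):
        backward = x[n - k:n].mean()          # FIR average of the k past samples
        forward = x[n + 1:n + k + 1].mean()   # FIR average of the k future samples
        y[n] = np.median([backward, x[n], forward])
    return y
```

An isolated impulse is removed while an ideal step edge passes unchanged, which is exactly the edge-preserving behaviour exploited for the base-consumption trends.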
RESULTS AND DISCUSSION The performance of the rule-based trend operator was evaluated by running the system off-line against real fermentation data and letting the process expert judge the reasoning results. By tuning the lengths of the two time windows applied and by manipulating the fuzzy sets defined for the indexes, it was possible to find an appropriate detection sensitivity for the process in question. Because the system is not a stand-alone controller but an operator that could replace the derivative term of a PI-type fuzzy controller in certain cases, this simple test gave only insight into tuning possibilities and difficulties. For the final evaluation the operator will be implemented on-line, connected to a conversion control system of the process. In the shape-analysing procedure the FMH filter was used in a two-pass manner, and the filtered data was used directly for shape analysis. Over a time window of approximately sixteen hours, nine symbols proved to be a reasonable number of derivative signs to describe the real data without losing relevant shape information. However, other time windows can be used, too. An example of the trend analysis procedure is shown in Fig. 2. The resulting shape information with its degree of compatibility was utilised, together with the estimated conversion of the process, in a fuzzy rule-based diagnostic system. The use of a plain fuzzy logic system for diagnostic purposes, when both symbolic and numeric information served as antecedents, was quite cumbersome.
Fig. 2. Example of the prototype system in use. Original data ('o'), two-pass FIR-Median Hybrid filtered data ('x'), and the temporal pattern found with its degree of compatibility (dc) are shown. The variable considered here is the base consumption of a continuous fermentation process.
CONCLUSIONS In this paper we have presented two prototype applications for analysing temporal shapes of process variables. The methods were implemented with real-time control requirements in mind and tested off-line with data obtained from continuous fermentation experiments. Test runs suggested that both could be applied to slow processes where sophisticated trend information processing is needed. The first application is a simple rule-based trend-indicating procedure. This procedure is aimed to be used as a trend operator substructure in fuzzy logic controllers. Because additional inputs to fuzzy logic controllers make them more difficult to maintain, there has to be a specified benefit from utilising trend information. The shape-analysing procedure looks especially promising for diagnosing abnormal process situations for which a priori knowledge is available from process specialists. In future studies, the performance of both systems will be demonstrated on-line in conjunction with a PC-based control system of a fermentation pilot plant. Great emphasis will be put on developing intelligent control methods that can combine heterogeneous information in a feasible and maintainable way.
REFERENCES
[1] P. Heinonen and Y. Neuvo, "FIR-median hybrid filters," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-35, pp. 832-838, 1987. [2] K. B. Konstantinov and T. Yoshida, "Real-time qualitative analysis of the temporal shapes of (bio)process variables," AIChE Journal, vol. 38, pp. 1703-1715, 1992. [3] G. Locher, B. Sonnleitner, and A. Fiechter, "Pattern recognition: A useful tool in technological processes," Bioprocess Engineering, vol. 5, pp. 181-187, 1990. [4] P. J. Poirier and J. A. Meech, "Using fuzzy logic for on-line trend analysis," in Proc. 2nd IEEE Conf. on Control Applications, Vancouver, B.C., 1993, pp. 83-86. [5] F. Y. Thomasson, "Improved control of drum level for boilers with 'shrink' and 'swell' problems," Tappi Journal, vol. 71, pp. 65-71, 1988.
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
Higher Order Cumulant Maximisation using Non-linear Hebbian and Anti-Hebbian Learning for Adaptive Blind Separation of Source Signals
Mark Girolami and Colin Fyfe Department of Computing and Information Systems, University of Paisley, United Kingdom. email:
[email protected],
[email protected] Abstract
We propose a novel nonlinear self-organising network which employs only computationally simple hebbian and anti-hebbian learning in approximating a linear independent component analysis (ICA). The learning algorithms diagonalise the input data covariance matrix and approximate an orthogonal rotation which maximises the sum of fourth order cumulants, thus providing invariant separation of the input into its individual sub-components. We apply this network to linear mixtures of speech data, which is inherently non-stationary and positively kurtotic; there is no prior requirement for spatially whitened data. We show that the proposed network is capable of separating mixtures of speech, noise and signals with both platykurtic and leptokurtic distributions. Simulations are run on mixtures of three signals with differing higher order statistics, and on mixtures of five voice traces; complete source separation is seen in all cases. 1. Introduction
The problem of multi-channel blind separation of sources and blind deconvolution occurs in many application areas of signal processing such as speech, radar, medical instrumentation, mobile telephone communication, and hearing aid devices. The problem is defined as the recovery of the original source signals from a sensor output when the sensor receives an unknown mixture of the source signals. The 'Cocktail Party' problem is an example of blind separation, where a person can single out a specific speaker from a group speaking simultaneously. Another biologically inspired example is the ability of the olfactory bulb to discriminate a single scent from a received mixture. From a signal processing viewpoint, the reconstruction of the original signal when the received signal is the output of an unknown filter is an example of blind deconvolution. Blind separation of sources is an underdetermined problem, and as such traditional adaptive techniques are unsuitable because the source signal statistics, as well as the mixing and transfer channels, are unknown. Techniques have been developed based on information theoretic criteria and higher order statistics (HOS); if a signal has independent components then the product of the marginal probability densities is equal to the signal probability density. Using the Kullback-Leibler divergence as a measure of independence, Comon [1] develops a series of contrast functions based on an Edgeworth expansion of the marginal densities, and batch methods are used in their maximisation. Cardoso [2] utilises the invariant properties of cumulants under orthogonal transformations; he develops series updating algorithms based on the maximisation of the sum of the squares of fourth order cumulants. Jutten and Herault [3] were the first to develop a neural architecture and learning algorithm for blind separation; since then a number of variants on this architecture have appeared in the literature, Cichocki et al [4].
Bell and Sejnowski [5] developed a feedforward network and learning rule which minimises the mutual information at the output nodes; this yields excellent results for leptokurtic signals such as speech. However, the matrix inversion required is a computational bottleneck and unrealistic from a DSP implementation viewpoint. Recently, Cichocki et al [4] have used the natural gradient descent algorithm, which removes the matrix inversion requirement of Bell and Sejnowski's algorithm. The simple multiply and accumulate operations of hebbian and anti-hebbian learning are attractive for DSP hardware implementations; Karhunen and Joutsensalo [6] develop a number of nonlinear variants of principal component analysis (PCA) learning and show their utility in sinusoidal frequency estimation. All the networks and learning paradigms listed, with the exception of Bell and Sejnowski's network, require the input data to be spatially whitened, that is, normalisation of the second order statistics, which requires initial pre-processing [1], [2], [3], [4], [6]. Bell and Sejnowski [9] report improved speed of convergence of their algorithm when the incoming data has identity covariance. Suppression of lower order statistics is required for the algorithms to respond to HOS; data whitening is discussed in [7] within the context of exploratory projection pursuit (EPP).
2. Independent Component Analysis Let x be a variable in R^N with a probability density function (pdf) p_x(u). If the vector x has mutually independent components we can then write

p_x(u) = Π_{i=1}^{N} p_{x_i}(u_i)    (1)
The Kullback-Leibler divergence gives a measure of the mutual information between the components of x. Approximating the marginal densities using an Edgeworth expansion (up to cumulant order 4) yields a measure of the mutual information or contrast between the components [1].

I(p_z) ≈ J(p_y) - (1/48) Σ_{i=1}^{N} {4K_{i3}² + K_{i4}² + 7K_{i3}⁴ - 6K_{i3}²K_{i4}}    (2)
The term on the left-hand side is the mutual information of the components of the vector z, where y = Mz, M being an orthogonal rotation. The first term on the right-hand side is the negentropy of y, and is fully defined in [1]. The second term is the sum of squares of third and fourth order cumulants of y; it is clear that maximisation of this term will minimise the mutual information between the vector components and as such can be used as a contrast. Further simplifying assumptions [1], based on the pdf symmetry and the multilinearity of cumulants, reduce the contrast to the sum of squares of the fourth order marginal cumulants

Ψ(y) = Σ_{i=1}^{N} (K_{iiii})²    (3)

It is noted that the sum of squares of all fourth order cumulants is invariant under a linear orthogonal rotation, and so for a whitened two dimensional vector (N = 2) we can write

Σ_{ijkl} (K_{ijkl})² = Σ_{i=1}^{2} (K_{iiii})² + Σ_{ijkl≠iiii} (K_{ijkl})² = const.
By then maximising (3), under orthogonal constraints, we can see that this will minimise the cross cumulant terms and so yield an approximation to independent components.
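A small numerical illustration of this point (ours, not from the paper): for a whitened orthogonal mixture of two independent unit-variance sources, the sum of squared fourth order marginal cumulants, as in (3), peaks at the rotation that recovers the sources. The 30-degree mixing angle and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def kurt(v):
    # fourth order marginal cumulant of a zero-mean, unit-variance signal
    return np.mean(v**4) - 3.0

def contrast(y):
    # sum of squares of the fourth order marginal cumulants, as in (3)
    return sum(kurt(row)**2 for row in y)

def rot(a):
    return np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])

# two independent unit-variance sources: uniform (negative kurtosis)
# and Laplacian (positive kurtosis)
n = 200_000
s = np.vstack([rng.uniform(-np.sqrt(3), np.sqrt(3), n),
               rng.laplace(0.0, 1.0 / np.sqrt(2), n)])

x = rot(np.deg2rad(30)) @ s   # orthogonal mixing keeps the data whitened

# scan candidate un-mixing rotations; the contrast peaks near 30 degrees
angles = np.arange(0, 90)
vals = [contrast(rot(np.deg2rad(a)).T @ x) for a in angles]
best = angles[int(np.argmax(vals))]
```

Mixing pushes each marginal cumulant towards zero (the mixtures look more Gaussian), so the contrast at 0 degrees is well below its value at the separating rotation.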
3. Network Architecture and Learning Figure 1 shows the network under consideration; it has full lateral and feedforward/feedback connections. The input lateral connections are a variant of Foldiak's second linear model [10]; the output lateral connections are similar, however the neuron activation is nonlinear. The feedforward section is an exploratory projection pursuit network [7] and has been used with limited success for ICA in [8].
W
V
x~
Y~
X~
Y2
X3
i
-
Y3
Figure 1: Fully Connected Network
Consider zero mean source signals s mixed by an unknown linear matrix A; the received signals are then, in matrix form, x = As. The output of the first layer of neurons is given as z, and so with the linear lateral connections at the input

z = [I + U]x = U_I x    (4)
We wish the input z to have an identity covariance, which is in line with the derivation of ICA in [1] and also allows the feedforward section to respond to the higher order moments.
We derive the following learning rule for the input weights by minimising the distance of the data covariance matrix from identity:

ΔU ∝ I - zz^T    (5)

(I + U) = C_x^{-1/2}    (6)
It can be shown that (5) will cause the weight matrix U to converge to the expression given in (6), which requires a positive definite input data covariance matrix C_x. The linear summation of the feedforward weights is r = WU_I x and the final output of the network is given as

y = V_I f(r) = V_I W U_I x - V_I φ(W U_I x)    (7)

where f(r) = r - φ(r) and φ(r) = tanh(r). (7) can be approximated by y = V_I f(r) ≈ k V_I W U_I x, as the tanh nonlinearity will saturate to a value of 1 outwith the approximate linear region. The hebbian learning for the
feedforward weights has been detailed fully in [7] and can be considered as the approximate stochastic maximisation of an objective function under orthogonal constraints. The expression for the learning algorithm is given in (8).
ΔW = η_t ψ'(s){U_I x - WW^T U_I x} = η_t ψ'(s){z - WW^T z}    (8)
The term ψ'(s) is the derivative of the objective function to be maximised, which in this case will be the value of the fourth order marginal cumulant of the network output. Girolami and Fyfe [8] use an EPP network for ICA; however, the stochastic maximisation (8) does not generate sufficient HOS to ensure mutual information minimisation at the outputs. The addition of anti-hebbian lateral connections at the outputs of the network will yield the following, using the learning of (5):

ΔV = μ(I - yy^T) = μ(I - (V_I[C_rr + φ(r)φ(r)^T]V_I^T - V_I[φ(r)r^T + rφ(r)^T]V_I^T))

As ⟨ΔV⟩ → 0 and V_I ≈ I, then

⟨φ(r)r^T + rφ(r)^T - φ(r)φ(r)^T⟩ ≈ 0

as C_rr = I due to (5) and WW^T = I. Tanh is an odd function, and so taking a Taylor expansion with coefficients φ_{2k+1}

Σ_k Σ_m φ_{2k+1} φ_{2m+1} ⟨r_i^{2k+1} r_j^{2m+1}⟩ = 0    (9)

as C_rr = I ⇒ ⟨r_i r_j⟩ = 0 ∀ i ≠ j,
which is simply the minimisation of all cross cumulants of order four. The stochastic marginal cumulant maximisation of (8), under the cross cumulant minimisation constraint, yields a more powerful ICA than standard EPP learning. 4. Simulations
The first simulation considers an unknown linear mixture of five speakers; the signals are presented to a 5x5 network. We show the development of the contrast during learning, along with the original, mixed and recovered signals. The mixing causes the value of the normalised fourth order cumulant of each signal to be reduced.
Figure 2: Contrast Development During Learning and Signal Traces The input weight learning removes all second order correlations; this continues for ten passes through the data. Once the second order terms are removed, the feedforward weights start responding to the higher order statistics in the data, as can be seen from Figure 2. The feedforward weight changes are effectively rotating the input data in weight space to maximise the contrast defined in (3), with the maximal converged value being 98% of the original unmixed signal contrast. It is interesting to note that the output lateral weights change little during learning; however, the additional constraint of this higher order de-correlation at the output ensures a high degree of contrast maximisation and therefore independent separation.
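As a toy counterpart to the input-weight stage of this simulation, the following batch sketch (ours, not the authors' code; the 2x2 dimensions, learning rate and iteration count are illustrative) shows rule (5) driving the covariance of z = (I + U)x to identity:

```python
import numpy as np

rng = np.random.default_rng(1)

# a 2x2 toy version of the mixing stage: x = As with zero-mean sources
n = 50_000
s = np.vstack([rng.uniform(-1, 1, n), rng.laplace(0.0, 1.0, n)])
A = np.array([[1.0, 0.6], [0.4, 1.0]])
x = A @ s

# batch form of the input lateral rule (5): dU proportional to I - <zz^T>
U = np.zeros((2, 2))
mu = 0.01
for _ in range(3000):
    z = (np.eye(2) + U) @ x
    C = z @ z.T / n
    U += mu * (np.eye(2) - C)

z = (np.eye(2) + U) @ x
C = z @ z.T / n   # should now be close to the identity matrix
```

In the full network the update is applied stochastically per sample; the batch average above is just the expected step, included to make the second-order decorrelation visible.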
The second simulation considers a mixture of signals which have fourth order cumulants of differing sign, that is, signals with uni-modal and bi-modal distributions. As the cumulants of these distributions will have positive and negative values, the network nonlinearity used will be r_i ± tanh(r_i), with the choice of sign depending on the HOS of the original signals. We use the a priori knowledge that one of the sources is naturally occurring speech, which will have positive kurtosis. The other signals are white noise and a fundamental tone, which both have negative kurtosis. This suggests that the network activation should be chosen to match the sign of the original signals' kurtosis. The results are similar, with the contrast of (3) being stochastically maximised. Figure 3 shows the original distributions of the signals and the distributions of the mixed signals. It is apparent that the mixing causes the signal distributions to become more normal and as such removes the structure in the data. Complete separation of the outputs is achieved.
Figure 3: Signal Distributions and Traces 5. Conclusions and Further Work We have introduced a neural implementation of Comon's ICA algorithm; the algorithm has been successfully applied to linear mixtures of speech as well as to mixtures of speech, noise and low kurtosis signals. We are currently working on the separation of convolved mixtures of signals and on developing a method of dispensing with the a priori knowledge of the signal statistics required for the choice of nonlinearity. References
[1] Comon, P. Independent component analysis, a new concept? Signal Processing, 36, 287-314, 1994. [2] Cardoso, J.-F. On the performance of orthogonal source separation algorithms. EUSIPCO-94, Edinburgh, 1994. [3] Jutten, C. and Herault, J. Blind separation of sources, Part 1: An adaptive algorithm based on neuromimetic architecture. Signal Processing, 24, 1991. [4] Cichocki, A., Amari, S. and Yang, H. Recurrent neural networks for blind separation of sources. International Symposium on Nonlinear Theory and Applications, Vol. 1, 37-42, 1995. [5] Bell, A. and Sejnowski, T. An information maximisation approach to blind separation and blind deconvolution. Neural Computation, 7, 1129-1159, 1995. [6] Karhunen, J., Wang, L. and Joutsensalo, J. Neural estimation of basis vectors in independent component analysis. International Conference on Artificial Neural Networks, Vol. 1, 317-322, 1995. [7] Fyfe, C. and Baddeley, R. Non-linear data structure extraction using simple hebbian networks. Biological Cybernetics, 72(6):533-541, 1995. [8] Girolami, M. and Fyfe, C. Blind separation of sources using exploratory projection pursuit networks. International Conference on the Engineering Applications of Neural Networks, ISBN 952-90-7517-0, 249-252, 1996. [9] Bell, A. and Sejnowski, T. Fast blind separation based on information theory. International Symposium on Nonlinear Theory and Applications, Vol. 1, 43-47, 1995. [10] Foldiak, P. Adaptive network for optimal linear feature extraction. IEEE/INNS International Joint Conference on Neural Networks, 1, 401-405, 1989.
Session E:
PATTERN/OBJECT RECOGNITION
A Robot Vision System for Object Recognition and Work Piece Location* Wang Min, Dai Qizhi, Wang Jun Department of Automatic Control Engineering Huazhong University of Science & Technology Wuhan, Hubei 430074 P. R. China
Abstract This paper presents a practical robot vision system which gathers TV signals for robot assembly systems. Using 2D binary-state image processing, the invariant features of the object are calculated and the object is recognised and located. The experiment results show that this vision system can provide the necessary information for industrial robot assembly tasks. 1. Introduction With the development of robot techniques, robot vision has improved very fast. It is an especially important part of industrial robot assembly applications. A typical practical task is part recognition and location. For example, grasping randomly placed parts from the conveyer and assembling them with robot manipulators requires telling the robot the position, orientation and species of the objects and controlling the robot hand to finish these tasks. In order to realise object recognition and work piece location we present a robot scene vision system. In this system the video signals are taken from an industrial camera device first; the TV signals are then converted into binary signals by sampling and into digital signals by an A/D converter. A single-chip microcomputer system performs the pre-processing, including extracting the invariant features from the object image and taking the feature parameters. The object recognition, using a model feature parameter library, and the work piece location calculation are performed in a PC system which can provide the necessary information on the object position and orientation to the robot control system.
2. Principle and Structure In industrial applications an object is usually recognised by its features, such as the edges, the geometrical centre position and the configuration of the object. The first step is usually to classify the objects. We select a set of typical objects as models. The feature variables of the selected objects compose an n-dimensional vector space, where n is the number of feature variables. A set of feature variable data of an object composes a feature vector which represents a point in the feature space [1]. Based on the model feature vector positions located in the vector space, the feature space can be divided into several areas corresponding to different species. Each area is called a sub-space and represents one species of objects. The features of an object are extracted, its feature vector position is calculated, and it is judged in which sub-space the vector is located; the species represented by this area is the species of the object. This recognition method depends on the selection of the models, so the models should be representative and the number of models should be big enough to distribute the sub-spaces equally over the whole feature space. The total structure of the system is shown in Figure 1. The complete TV signals are converted to binary-state signals first. Then noise elimination is applied and the feature parameters are extracted. Finally, pattern matching is carried out. A single-chip microcomputer performs the image sampling and signal preprocessing, which reduces the load on the higher level computer, so that the total speed of the system is higher. The higher level computer system, the PC system, then accomplishes recognition and location of the objects.
Figure 1 The total structure chart of the vision system
3. Image Pre-processing The image pre-processing contains two parts: one is to improve the image quality; the other is to calculate the geometric parameters of the work piece. Both of these parts are the basis of the following recognition. * This paper is supported by the Chinese National Science Fund (CNSF).
3.1 Image Smoothing There are three kinds of noises in the object image. The first is the stochastic discrete noise, or so-called salt-like noise; the second is the spot-noise due to unequal luminance or reflection; the third is line-noise caused by power disturbance. There are several ways to eliminate these noises, such as the super-quadrant smoothing method, the multi-image equal method, the Boolean algorithm and so on. In this system we adopted the mathematical morphology method [2] and used erosion and expansion algorithms to eliminate noises. As to a binary-state image A, selecting a structure element B, if A is first eroded and then expanded by B, that is an open algorithm operating on A by B. The result is called A opened by B and we define the open algorithm as

A ∘ B = (A ⊖ B) ⊕ B    (1)

By the open algorithm, we can eliminate the noises of the distributed points and burrs. If A is first expanded and then eroded by B, that is a close algorithm operating on A by B. The result is called A closed by B and we define the close algorithm as

A • B = (A ⊕ B) ⊖ B    (2)

By the close algorithm, we can connect two adjacent objects and fill the white spots in a "black" object. The final result can be written as

C = (((A ⊖ B1) ⊕ B1) ⊕ B2) ⊖ B2    (3)

where A is the original image, C is the smoothed image, and B1 and B2 are the structure elements of the open and close algorithms shown in Figure 2.
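A pure-NumPy sketch of the open-close smoothing of equation (3) (our illustration, not the paper's implementation; a simple 3 x 3 cross stands in for both B1 and B2, whereas the paper's elements are larger cross-over structures):

```python
import numpy as np

def erode(img, se):
    """Binary erosion: output is 1 where every se==1 position covers a 1 in img."""
    h, w = se.shape
    p = np.pad(img, ((h // 2, h // 2), (w // 2, w // 2)))
    out = np.zeros_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.all(p[i:i + h, j:j + w][se == 1] == 1)
    return out

def dilate(img, se):
    """Binary dilation: output is 1 where any se==1 position covers a 1 in img."""
    h, w = se.shape
    p = np.pad(img, ((h // 2, h // 2), (w // 2, w // 2)))
    out = np.zeros_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.any(p[i:i + h, j:j + w][se == 1] == 1)
    return out

def open_close(img, b1, b2):
    """Equation (3): open by B1 (remove specks and burrs), close by B2 (fill spots)."""
    opened = dilate(erode(img, b1), b1)
    return erode(dilate(opened, b2), b2)
```

An isolated noise pixel disappears under the opening, while the body of the object survives the whole open-close sequence.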
Figure 2 The structure elements Using the cross-over structure, the four directions of up, down, left and right of the image can be smoothed, burrs with sizes up to 5 x 5 image elements can be eliminated and white spots with sizes up to 9 x 7 image elements can be filled. 3.2 Feature parameter extracting In a binary-state image the pixel value f(x,y) is "0" or "1"; "1" represents the object and "0" the background. The smoothed image can then be edge-tracked and the image contour extracted, and we calculate the following parameters: (1) the circumference P, that is, the number of points in the image contour; (2) the area S, that is, the number of points whose pixel value is "1"; (3) the quadrature of every rank, mpq; (4) the shape centre (x̄, ȳ); (5)
the maximum radius Rmax, the minimum radius Rmin, and the average radius Ravg; (6) recognising whether a hole exists or not. In the last calculation we shrink the object image first. If the final result is a point it means that there is no hole in the object. If the final result is a circle it means that there is a hole. In the case of an existing hole, we can obtain the hole parameters by steps (1) to (5) as mentioned above. 4. Model Recognition, Position and Calibration
4.1 Recognition We need to solve three questions for the model recognition: (1) selecting features suitable for classifying the object; (2) acquiring the features of the models in advance; (3) making an effective method by which we can judge the area of the feature vector in the space. According to references [3] and [4], we select the following features in our system: (1) the first invariant quadrature, H1 = η20 + η02; (2) the second invariant quadrature, H2 = (η20 - η02)² + 4η11²; (3) the third invariant quadrature, H3 = (η30 - 3η12)² + (3η21 - η03)²; (4) the fourth invariant quadrature, H4 = (η30 + η12)² + (η21 + η03)²; (5) the complexity C = P² / S; (6) the ratio of the maximum and minimum radius, R1 = Rmax / Rmin; (7) the ratio of the average and minimum radius, R2 = Ravg / Rmin; (8) the flag representing the hole state, K = 1 (a hole exists). In fact,
149
ηpq = μpq / μ00^γ    (4)

γ = (p + q) / 2 + 1    (5)

μpq = ∫∫ (x - x̄)^p (y - ȳ)^q f(x, y) dx dy    (6)

mpq = ∫∫ x^p y^q f(x, y) dx dy

x̄ = m10 / m00;  ȳ = m01 / m00    (7)

where mpq represents the origin quadrature of rank p+q, μpq represents the centre quadrature of rank p+q, and ηpq represents the normalised centre quadrature of rank p+q. After discretisation we have

μpq = Σ_x Σ_y (x - x̄)^p (y - ȳ)^q f(x, y)    (8)

mpq = Σ_x Σ_y x^p y^q f(x, y)    (9)
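The normalised quadratures and the features H1-H4 can be computed directly from a binary image. The sketch below (ours, using the discrete forms (8) and (9)) also checks their invariance under translation and 90-degree rotation:

```python
import numpy as np

def features(img):
    """H1..H4 from the normalised central quadratures eta_pq of a binary image."""
    ys, xs = np.nonzero(img)
    m00 = float(len(xs))               # area: equation (9) with p = q = 0
    xb, yb = xs.mean(), ys.mean()      # shape centre, equation (7)

    def eta(p, q):                     # equations (8), then (4)-(5)
        mu = ((xs - xb)**p * (ys - yb)**q).sum()
        return mu / m00**((p + q) / 2.0 + 1.0)

    h1 = eta(2, 0) + eta(0, 2)
    h2 = (eta(2, 0) - eta(0, 2))**2 + 4.0 * eta(1, 1)**2
    h3 = (eta(3, 0) - 3 * eta(1, 2))**2 + (3 * eta(2, 1) - eta(0, 3))**2
    h4 = (eta(3, 0) + eta(1, 2))**2 + (eta(2, 1) + eta(0, 3))**2
    return np.array([h1, h2, h3, h4])
```

Translating the object leaves the central quadratures unchanged, and a 90-degree grid rotation permutes the η terms in a way that leaves each H feature fixed.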
x y We suppose that there is one hole at most in the object to be recognised. If the hole exists the complexity C, the radius ratio R 1 and R2 are all relative to the hole panel. In this case we add an extra feature to the system: (9) the distance from the hole centre to the shape centre D. Selecting the above features is based on the consideration of the invariance of the rotation, translation and proportion and can be able to distinguish itself from others. The process acquiring the features of the model is the exercising process. The visual system extracts several groups of features when placing the model on different positions and orientations. The feature vector consisting of the average values of these group features can represent the model. A set of the selected model feature vectors compose the model feature lab. The feature vector of the object is compared with the model features first and then calculate the weightsummed Euclidean distances between the object and the models. The species of the model which has the smallest distance is the object species according to the nearest-classified principle. The distance is dm2 =
~~ =
P,,,j - P oj
wj
(10)
P mj
where Pmj is the jth feature of the model, Poj is the jth feature of the object, n is the dimension of the feature vector, wj is the weight of the jth feature and dm represents the distance between the object and the mth model.
4.2 Position Position includes the determination of the centre position and the rotary orientation of the object. The centre position can be calculated according to equation (7). The rotary orientation can be represented by the angle θ between the inertial main axis and the x axis, as shown in Figure 3 (the counter clockwise rotation is positive). The value of the angle θ can be solved as follows:

r² + ((μ20 - μ02) / μ11) r - 1 = 0    (11)

where r = tg θ. When μ11 ≠ 0, the above equation has two roots:

r1 = (-b + √(b² + 4)) / 2,  r2 = (-b - √(b² + 4)) / 2

where b = (μ20 - μ02) / μ11. When μ11 > 0, let tg θ = r1; when μ11 < 0, let tg θ = r2.

Fig.3 The rotary orientation of the object

When μ11 = 0, equation (11) is not applicable, which means that there are two or more symmetric axes in the object: (1) When there are more than two axes, μ02 = μ20; the shapes are the square, circle, polygon and so on. Considering that the directive information should be applied to the robot hand, we choose the normal direction of the shortest radius vector as the robot grasping direction. (2) When there are two axes, μ02 ≠ μ20; we choose the direction of the longest radius vector as the rotary orientation.
4.3 Calibration The calibration task is to determine the geometric relation between the camera and the robot coordinates. For a 2D image the calibration is performed in 2D coordinates. Suppose the vision sensor frame is xv-ov-yv and the robot frame is xr-or-yr; a point (xv,yv) in the xv-ov-yv frame can be represented by (xr,yr) in the xr-or-yr frame. If the origin of the vision frame is (x0,y0) in the robot frame, then we have
[xr]   [cos φ  -sin φ] [px xv]   [x0]
[yr] = [sin φ   cos φ] [py yv] + [y0]

where φ is the rotation angle of the sensor frame relative to the robot frame in the counter clockwise direction, and px and py represent the unit length of a pixel in the xv and yv orientations. We can get three matrix equations by substituting three different points; then the parameters px, py, x0, y0 and φ can be solved from the equations. 5. Experiment Results The smoothing effects of the two different smoothing methods, the super-quadrant smoothing and the open-close algorithm, are compared in Figure 4. As shown in Figure 4, (a) is the original digital image, on which there are a lot of salt-like noises and spot-noises because of the unequal reflection. (b) is the result processed by the super-quadrant smoothing method. The small random noises are eliminated, but the bigger spot noise still exists. At the same time the image edges near the bigger spot are destroyed and a gap appears. (c) is the result processed by the open-close algorithm smoothing method. All kinds of noises are eliminated while the details of the image are kept well. From the above we can see that the effect of the open-close algorithm method is better than that of the super-quadrant method in binary-state image smoothing. However, the former scans the image four times, while the latter needs to scan the image just once. The experiment results showed that the correct rate of recognition is over 95%, the accuracy of location is ±2 mm and the accuracy of the orientation angle is ±2 degrees.
Figure 4: Comparison of the image smoothing results.
6. Conclusion
This paper presents an object recognition method based on feature parameters, determines the invariant features of the models, and composes a robot vision system which integrates binary-state image sampling, recognition and location. The system can be used for scene vision in robot assembly tasks. The experiment results show that the system accomplishes object recognition and work piece location reliably, and that it is low-cost, simple in structure and easy to realize. The system still has much to be improved; for example, the image smoothing and feature extraction methods need to be investigated more deeply. It is also possible to adopt dedicated hardware or an image processing chip to speed up the system, which may satisfy the real-time control requirement of high-speed assembly tasks.
References:
[1] B.K.P. Horn. Robot Vision. The MIT Press, McGraw-Hill Book Company, 1986.
[2] Tang Chengqing. The Method and Application of the Mathematical Morphology. The Science Press, 1990.
[3] Yang Jingan, Zhang Daincheng. The Vision System Based on the Model Recognising the Complex Object. Pattern Recognition and Artificial Intelligence, Vol. 3, No. 2, 1990.
[4] Zhou Ruiyu, Wang Dapei, Li Quanyi. A Simple Robot Assembly Experiment System Guided by Vision. The Robot, Vol. 3, No. 2, 1989.
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
Recognition of Objects and Their Direction of Moving Based on Sequence of Two-Dimensional Frames
Božidar Potočnik, Damjan Zazula
Faculty of Electrical Engineering and Computer Science, Smetanova 17, 2000 Maribor, Slovenia
{bozo.potocnik, zazula}@uni-mb.si
Abstract We developed a new algorithm that belongs to the so-called differential methods; however, it represents a significant extension of the known analysis approaches. Our algorithm can perceive four types of object shifts: translational shifts, object shifts along the optical axis inwards and outwards, object rotations about the optical axis and, eventually, the appearance of additive noise. We introduce a new approach to the analysis of objects which are in occlusion (analytical optimization with respect to the MSE). The algorithm is very fast (its time complexity is of order O(n²)). It is a framework that can easily be adapted to the needs of real applications.
1. Introduction In our work, we deal with digital processing of a sequence of images from which we try to determine a moving object and the trajectory of its movement. Recently, a few methods for movement analysis have been published. Sonka [5] described basic steps for movement analysis on an optical flow basis and on a significant point basis, Jähne [3] attempted movement analysis with the assistance of space-time images, etc. Because the result of these methods is a vector or matrix (movement field or displacement vector field), there is no possibility for accurate reconstruction of the trajectory of the moving object. Various methods of movement analysis have been gathered in [5] and classified into different groups according to the algorithms used. Basic steps of the algorithms may also be employed in the determination of the movement trajectory. We developed a new algorithm that belongs to the so-called differential methods; however, it represents a significant extension of the known analysis approaches. The paper is organized as follows. In Section 2, we describe the algorithm developed for movement analysis in detail, while the results and discussion follow in Section 3. Section 4 concludes the paper.
2. Analysis algorithm With our algorithm, we can analyse the movement of one moving object in a sequence of gray value images. It can perceive four types of object shifts: translational shifts, object shifts along the optical axis inwards and outwards, object rotations about the optical axis and, eventually, the appearance of additive noise. Our algorithm consists of the following steps: 1. The first step of the algorithm is binarisation of the sequence of images. Every image from the sequence is binarized with a threshold operation using a global threshold. We determine the threshold for every image from the sequence separately, i.e., as the mean between the minimum and maximum gray value in that image. The type of binarisation (or other preprocessing operations) can also be selected with respect to the image sequence to be analysed (ultrasound, MR, CT or SAR images, etc.). 2. Given a sequence of n binary images, the static background background(i,j) is established as follows:
\[
\mathrm{background}(i, j) = \Big( \sum_{k=1}^{n} b_k(i, j) \Big) \;\mathrm{div}\; n, \qquad (1)
\]
where b_k is the k-th binary image from the sequence and div stands for integer division. Equation (1) gives only an estimate of the real background. It is evident that longer sequences produce better estimates. In sequences where the object is, in at least one image, in no occlusion with any static part of the scene, this estimate corresponds to the actual (binary) image of the background. 3. Then, the static background obtained (Equation (1)) is subtracted from every image, producing a sequence of dynamic images. Dynamic images comprise white areas where changes in gray values appear along subsequent images in the sequence. This feature is used as a criterion for recognition of the moving object in the following steps. 4. The moving object is defined as the object with the largest surface area in the dynamic images. This one is afterwards used as a praform (template). A polar histogram is constructed for subsequent comparisons. This criterion proved robust; nevertheless, it fails in the case of only slight movement throughout the entire sequence. 5. Now, all the frames with dynamic images are processed in order to find successive appearances of the moving object. This matching or searching is divided with respect to whether the object is partially hidden by another object or not. When there is no occlusion in the images, the procedure is straightforward. However, occlusions introduce several problems [1, 6], like incomplete or faulty object identification. We divided the search for the moving object position into two variants, each composed of several steps: a. The object is in no occlusion with any static part of the scene: A polar histogram is constructed for it (the number of elements of the polar histogram is selected in advance). Individual components are taken into quotients with components of the praform's histogram:
\[
\mathrm{quot}[i] = \frac{\mathrm{histogram}_{\mathrm{praform}}[(i + \mathrm{rotation}) \bmod m]}{\mathrm{histogram}_{\mathrm{object}}[i]} \qquad (2)
\]
where m is the number of elements of the polar histogram, rotation is the shift index, and mod stands for the modulus operation. From Equation (2), it is obvious that the vector of quotients has to be calculated for every single rotation (the number of rotations is equal to m). For the vector of quotients obtained, the mean and the variance are calculated. These two values play an important role in the determination of the type of shift and the rotation of the object. Rotating the praform, the position with minimum variance points out the most probable rotation of the processed object. At the same time, the mean of the quotients corresponds to the object's scaling.
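The quotient matching of Equation (2) and the variance criterion of step 5a can be sketched as follows, assuming NumPy; the function name and the histogram representation are our assumptions, and ties in the variance are not handled:

```python
import numpy as np

def match_rotation_and_scale(hist_praform, hist_object):
    """Evaluate Equation (2) for every cyclic shift of the praform's
    polar histogram.  The shift with minimum variance of the quotient
    vector is the most probable rotation; the mean of that quotient
    vector reflects the object's scaling."""
    m = len(hist_praform)
    best_var, best_rot, best_mean = None, None, None
    for rotation in range(m):
        quot = np.array([hist_praform[(i + rotation) % m] / hist_object[i]
                         for i in range(m)])
        if best_var is None or quot.var() < best_var:
            best_var, best_rot, best_mean = quot.var(), rotation, quot.mean()
    return best_rot, best_mean
```

If the object is exactly the praform rotated by r bins and scaled by s, the quotient vector at rotation r is constant, so its variance is zero and its mean encodes the scale.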
b. The object is partially hidden: An extended area is formed in a separate frame containing the visible part of the object (from the dynamic image) and the static component, i.e., the occlusion. This newly composed region (composed object) is the basis for the analysis in the following steps. The centre of gravity is found for this composed area, and a partial polar histogram is constructed for the uncovered part of the object. The calculated centre of gravity is the first estimate for our partially hidden object; in the case of very high occlusions, this estimate becomes rather unreliable. A partial vector of quotients is also computed for every single rotation of the praform (the number of elements of the polar histogram is not m anymore, but correspondingly lower). In every rotational position, an analytical optimization with respect to the MSE for the differences of successive quotients is applied in order to reposition the centre of gravity. With this optimization we determine the final centre of gravity of the moving object. The centre of gravity calculated in the previous step is the basis for the subsequent analysis. The partial variance of the quotients, recalculated with the new centre of gravity, is minimum at the most probable orientation of the object under the occlusion. The quotient mean is equivalent to the object scaling. 6. The centres of gravity discovered either way are finally bound into a trajectory of the moving object. The data on the object rotation and shifts along the optical axis are also available. Besides, if the minimum variance of the quotients at a certain frame exceeds a preselected threshold, the object is declared corrupted by additive noise.
3. Results and discussion
In Section 2, we described a new algorithm for the analysis of movement in a sequence of gray value images. This algorithm was also implemented in C++ for Windows and tested. An example is shown in Figure 1. In what follows, the processing results for the image sequence from Figure 1 are shown as generated by our algorithm. First, we binarise every image to get a binary-image sequence (Figure 2). This sequence is used in the determination of the static background (Figure 3), obtained with Equation (1). Then, the static background is subtracted from every image, producing a sequence of dynamic images (Figure 4). From this sequence we recognize the moving object with the heuristic criterion (Figure 5). In Figure 6, we can see the final result of the processing: an image of the trajectory reconstructed for the moving object.
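The preprocessing chain just described (per-image mid-range thresholding, the background of Equation (1), and background subtraction) might be sketched as follows, assuming NumPy; the function names are ours and the subtraction is realised as a pixelwise comparison:

```python
import numpy as np

def binarise(img):
    # Global threshold chosen per image as the mean of its minimum
    # and maximum gray value.
    t = (int(img.min()) + int(img.max())) // 2
    return (img > t).astype(np.uint8)

def static_background(binary_seq):
    # Equation (1): pixelwise sum over the n binary images, then
    # integer division by n.
    return np.sum(binary_seq, axis=0) // len(binary_seq)

def dynamic_image(binary_img, background):
    # Changed pixels appear white after background subtraction.
    return (binary_img != background).astype(np.uint8)
```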
Figure 1: Test gray-value image sequence. The images, of dimensions 256x256 pixels, have 256 gray-value levels. In this sequence, all object shifts which the algorithm can perceive (translational shifts, object shifts along the optical axis inwards and outwards, object rotations about the optical axis and the appearance of additive noise) are present.
Figure 2: Binary-image sequence.
In the above example, we analysed a synthetic image sequence for which our algorithm gives completely correct results. But this is not the case for every image sequence. Our algorithm has particularly big problems with sequences where the occlusion is very high. Experimenting, we also realized that if the first estimate of the centre of gravity (step 5b in Section 2) was not close enough to the right value, then the optimization with respect to the MSE did correct the position of the centre of gravity, but the position was still faulty. A completely different problem arises in sequences where the object moves very slowly through the subsequent images. In these cases we misidentify the moving object (step 4 in Section 2). This problem can be solved in many ways, e.g., by a coarse-to-fine strategy: we consider only every fifth image from the image sequence.
Figure 3: Static background image.
Figure 4: Dynamic-image sequence.
Figure 5: Image of moving object.
Figure 6: Image of reconstructed trajectory.
4. Conclusion In our work we presented a new algorithm for movement analysis in image sequences. The algorithm is an extension of the differential methods of movement analysis. In its basic version, the algorithm is very simple and thus very fast (its time complexity is of order O(n²)). It can easily be extended for concrete real applications.
References
[1] E. Charniak and D. McDermott, Introduction to artificial intelligence. Massachusetts: Addison Wesley, 1985, pp. 87-167.
[2] F. van der Heijden, Image based measurement systems. London: J. Wiley and Sons, 1994.
[3] B. Jähne, Digital image processing. Berlin: Springer-Verlag, 1993.
[4] J.C. Russ, The image processing handbook. London: CRC Press, 1995.
[5] M. Sonka, V. Hlavac, R. Boyle, Image processing, analysis and machine vision. London: Chapman and Hall, 1994.
[6] P.H. Winston, Artificial intelligence. Massachusetts: Addison Wesley, 1984, pp. 335-384.
Innovative Techniques for Recognition of Faces Based on Multiresolution Analysis and Morphological Filtering
Anastasios Doulamis, Nicolas Tsapatsoulis, Nikolaos Doulamis and Stefanos Kollias
Department of Electrical and Computer Engineering, National Technical University of Athens, Heroon Polytechneiou 9, Zographou, Greece
Tel.: +301 772-2491
e-mail: [email protected]
Abstract In this paper, we introduce two new methods for face recognition from frontal images. The methods combine the well-known Karhunen-Loeve (KL) transform with morphological and subband analysis, respectively. The use of this kind of analysis contributes to better discrimination between different images. The morphological and subband approaches are compared, the former being a nonlinear method and the latter a linear one. The results, obtained using 100 test images, show that both approaches are quite efficient. However, the morphological technique seems to lead to slightly better results (5% and 12% error, respectively), while the subband technique has the advantage of decreasing the complexity of the task.
1. Introduction The main purpose of a face recognition system is to find a person within a large database of faces (e.g. in a police database). Such a system typically returns a list of the most likely people in the database. However, there are applications in which we want to identify a particular person (e.g. in a security monitoring system) or to allow access to a group of people and deny access to all others (e.g. access to a computer). Some other applications, like speech recognition, better man-machine interaction or visual communication over telephone and other low-bandwidth lines, use face identification as an auxiliary tool. So far, the best results for two-dimensional face recognition have been obtained by techniques based on either template matching [1] or matching eigenfaces [2]. The latter uses the KL-transform and has the advantage of not requiring specialised hardware. Since this transform achieves the optimal energy compression, faces can be represented in a low-dimensional space as a weighted linear combination of the eigenvectors of the autocorrelation matrix of face images. This enforces the mean square error between the representation of the face image and the original image to be minimal. This representation, although optimal in discriminating physical categories of faces, e.g. sex and race, is not optimal for recognising faces, because the details which are necessary to discriminate different faces are lost [3]. In addition, there is no accurate method that verifies the results of the identification algorithm in order to avoid false alarms (see Section 4). Two alternative techniques are proposed in this article, so as to increase the efficiency of the discrimination task and to obtain more reliable results. These two techniques combine the KL-transform with subband decomposition and morphological filtering, respectively. Subband decomposition separates the original images into complementary frequency bands (e.g. Low-Low (LL), Low-High (LH) and so on), for each of which we create a different KL base. Since the LL band contains the largest amount of information, we use the projection of a test image on this band to find a list of the most likely face images in the database. The higher bands are used for verification if the confidence of the decision made on the LL base is poor. Thus it is feasible to achieve a correct identification using the details kept in the higher bands. Using morphological filtering we are able to transform an image into another one with lower frequencies than the original (for example by morphological opening or closing). Therefore we use the result of these filters in the same way that we use the lower band of the subband analysis. The structuring element of the morphological operator was chosen after measurements on various test images. The difference between the original image and the filtered one, projected on the respective base, is used to verify our results.
2. Subband decomposition In this section we describe the first approach, which is based on a multiresolution scheme proposed in [4] (Fig. 1). An image of resolution (M×N) is decomposed into four frequency-complementary images of resolution (M/2 × N/2). Using this scheme we can create four different databases from the original face database. The KL transform on each of these databases is used to produce four different instances for each face image in the database. Actually, in our approach, only two instances of each image are used: the instances related to the LL KL-base and the LH KL-base. In the XLL image, which is the image containing the Low-Low spatial frequencies of the original X(m,n) image, most of the energy is accumulated. The respective LL KL-transform converges faster than the KL-transform taken on the original images. In addition, the complexity of the computation is lower, since the autocorrelation matrix of the LL images is of dimensions (M/2 · N/2) × (M/2 · N/2) instead of (M·N) × (M·N) for the original images. The LH KL-transform converges more slowly than the original one, so more KL coefficients must be kept. Since the LH images are images of details, they are used only in the verification step. The proposed algorithm is described below.
Decomposition step
Given an image Y(m,n) of dimensions (M×N), we create the images YLL, YLH, YHL, YHH using the subband decomposition scheme shown in Fig. 1.
Projection step
The YLL and YLH images are projected on the LL KL-base and LH KL-base respectively, and k and l coefficients are retained in each case. The numbers k, l were chosen after many simulations (see Section 4.1, Table II). As a result of this step, two vectors related to image Y(m,n), of sizes k and l, are created: y_l, y_h.
MSE calculation step
For each LL instance x_{il} in the database we calculate the MSE
\[
e_i = (x_{il} - y_l)^T (x_{il} - y_l)
\]
and e_{min} = \min_i (e_i).
As potential instances of the image Y(m,n) in the database, we consider the instances whose MSE lies in the interval [e_min, 2·e_min]. If the MSE of only one instance lies in this interval, the confidence of the decision is high, and the instance with the minimal MSE is considered to be the prototype of the image Y(m,n) in the database. On the other hand, if more than 10% of the total instances in the database have an MSE which lies within the interval, the confidence of the decision is considered inadequate and the image Y(m,n) is discarded without verification. If neither of these extreme cases occurs, the verification step is needed.
Verification step
For the instances selected in the previous stage, the error m_i = (x_{ih} - y_h)^T (x_{ih} - y_h) is calculated (x_{ih} are the LH instances of the database). If the minimal error is lower than a threshold T, equal to 0.9·max(error of images which have a prototype in the database), then the instance with the minimal error is considered as the prototype of the image Y(m,n) in the database; otherwise the image Y(m,n) is considered to have no prototype.
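The decision rule on the LL base (the interval [e_min, 2·e_min] and the 10% cutoff) could be coded as follows; a sketch with hypothetical names, not the authors' implementation:

```python
import numpy as np

def classify_confidence(errors, frac_cutoff=0.10):
    """Apply the interval rule: collect instances whose MSE lies in
    [e_min, 2*e_min].  A single candidate means a high-confidence
    decision, too many candidates mean inadequate confidence, and
    anything in between is passed on to the verification step."""
    errors = np.asarray(errors, dtype=float)
    candidates = np.where(errors <= 2.0 * errors.min())[0]
    if len(candidates) == 1:
        return 'high', int(candidates[0])
    if len(candidates) > frac_cutoff * len(errors):
        return 'inadequate', None
    return 'verify', candidates.tolist()
```

In the 'verify' case, the candidates would then be re-ranked by the LH-base errors m_i and accepted only below the threshold T described above.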
Figure 1. Subband decomposition scheme used to split an image X(m,n) into four frequency-complementary images XLL, XLH, XHL, XHH (G, H: perfect reconstruction mirror filters; after each filtering, one row or one column out of two is kept).
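One level of the four-band split of Figure 1 can be sketched as follows. Note that we substitute the simple Haar pair for the perfect-reconstruction mirror filters G, H of [4], so this is only an illustration of the structure, assuming NumPy and even image dimensions:

```python
import numpy as np

def analysis_1d(x, axis):
    # Haar analysis pair: low-pass (a+b)/2 and high-pass (a-b)/2,
    # combined with the "keep one sample out of two" subsampling.
    a = np.take(x, range(0, x.shape[axis] - 1, 2), axis=axis)
    b = np.take(x, range(1, x.shape[axis], 2), axis=axis)
    return (a + b) / 2.0, (a - b) / 2.0

def subband_decompose(X):
    # One level of the four-band split: rows first, then columns.
    L, H = analysis_1d(X, axis=0)
    LL, LH = analysis_1d(L, axis=1)
    HL, HH = analysis_1d(H, axis=1)
    return LL, LH, HL, HH
```

A smooth image concentrates its energy in XLL, which is why the LL KL-base converges faster, while the detail bands carry the high-frequency information used for verification.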
3. Morphological Analysis The goal of this section is to briefly describe the morphological tools of interest for the face identification algorithm. A complete description of mathematical morphology can be found in [7]. Let f(x) denote an input signal and M_n a window or flat structuring element of size n. The erosion and dilation by this flat element are given by
\[
\varepsilon_n(f)(x) = \min\{ f(x + y),\; y \in M_n \} \quad \text{and} \quad \delta_n(f)(x) = \max\{ f(x - y),\; y \in M_n \}.
\]
Two morphological filters can be defined from the above operators, namely opening and closing. A morphological opening (closing) simplifies the original signal by removing the bright (dark) components that do not fit within the structuring element [7]. If it is required to remove both bright and dark elements, an opening-closing or closing-opening should be used. We also define the difference between the original signal and the signal after the morphological opening (closing). This difference should not be confused with the morphological gradient, which is given by subtracting the erosion from the dilation with a structuring element M_n of size 1. Based on the above morphological filters, we propose an innovative algorithm both for identification and verification. Fig. 2 illustrates the mechanism used. As can be seen in Fig. 2, we first apply a morphological operator on each image of the database. Thus a new database is created which contains the filtered images. From this database we calculate the "opening KL-base", on which the filtered images are projected. Since the new images consist of lower frequencies, it is expected that more energy will accumulate in the first coefficients of the KL transform. Moreover, for each face we calculate the difference between the original and the filtered image, and we also create the "difference KL-base". However, the images of this database contain higher frequencies, and thus more coefficients are needed to accumulate the same energy as for the original one. As a result, this database can be used only for verification purposes. If the confidence of the decision is poor (there are many faces in the list after the projection of the test image on the opening KL-base), we use the verification based on the difference base.
Figure 2. Face recognition scheme based on morphological filtering.
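The flat erosion, dilation, opening and closing above can be illustrated on a 1-D signal. This is a minimal sketch assuming NumPy, with the window simply clipped at the signal borders (a boundary choice the paper does not discuss):

```python
import numpy as np

def erode(f, n):
    # Flat erosion: minimum of f over a centred window of size 2n+1.
    return np.array([f[max(0, i - n):i + n + 1].min() for i in range(len(f))])

def dilate(f, n):
    # Flat dilation: maximum of f over the same window.
    return np.array([f[max(0, i - n):i + n + 1].max() for i in range(len(f))])

def opening(f, n):
    # Erosion followed by dilation: removes bright components that
    # do not fit within the structuring element.
    return dilate(erode(f, n), n)

def closing(f, n):
    # Dilation followed by erosion: removes dark components.
    return erode(dilate(f, n), n)
```

A bright peak narrower than the structuring element disappears under opening, while wider structures are preserved; this is exactly the low-pass behaviour exploited by the "opening KL-base".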
4. Results We have used the male face database of the University of Essex in our experiments. To build the KL bases, we have chosen 100 different frontal faces, with no facial expressions, centred in the image and with small scale and decline variations (let us call these images prototypes). As test images we have selected 90 face images which have a prototype in the database, with variations in scale, decline, orientation and facial expressions. We have also used 10 face images with no prototype in the database. Given a test image, the task was to recognise the respective prototype, if there was one, or to discard the image because there was no prototype. Two kinds of errors emerge: false alarms (a face which has no prototype in the base is recognised as one which has) and false discriminations (a face which has a prototype is discarded or is recognised as a false one).
4.1 Results obtained by the subband algorithm
[Table I reports the total percentage error for each combination of retained LL KL-base coefficients (16, 25, 36, 49) and LH KL-base coefficients (16, 25, 36, 49); several combinations were not simulated. The cell layout could not be recovered.]
Table I: Total percentage error for various simulations of the subband based algorithm.
In Table I, the total percentage error (discrimination error + false alarms) is shown for various simulations. For example, retaining 16 coefficients from the projection on the LL KL-base and 25 from the projection on the LH KL-base, the total error is 8%. Increasing the number of retained coefficients of the LL KL-base decreases the total error. However, increasing the number of retained coefficients of the LH KL-base does not decrease the total error substantially. Note also that the total error consists mainly of false alarms. These cannot easily be reduced, because they depend on the chosen threshold T.
Table II: Performance of the subband based algorithm retaining 9 and 16 coefficients of the projection on the LL KL-base and LH KL-base respectively. Of the 10 faces with no prototype in the database, 0 had a high confidence of decision, 5 inadequate confidence and 5 low confidence, giving 3 false alarms. [The corresponding row for the 90 faces with a prototype in the database, including the discrimination errors at high and low confidence, could not be recovered.]
In Table II, the results of a simulation retaining 9 and 16 coefficients of the projection on the LL KL-base and LH KL-base, respectively, are shown. Comparisons with the results of the morphological algorithm, shown in Table III, can be drawn.
4.2 Results obtained by the morphological algorithm
Fig. 3 presents the discrimination error obtained with the above test images. The results were taken for different structuring element sizes (5, 10, 15, 20, 25) and for different numbers of coefficients for each base. The number of coefficients kept for the opening base is the same as for the difference base (9 coefficients in these results). It is observed that a structuring element of size 15 gives the best results. This is quite logical, since a small structuring element yields good recognition on the opening base but poor verification on the difference base, while a large structuring element yields good verification on the difference base but poor recognition on the opening base. It should also be mentioned that the opening base keeps the significant information well and, as a result, gives very good identification, despite the fact that the filtered images of the database (prototypes) are not easily recognisable by humans.
Figure 3: Discrimination error for different structuring element sizes.
Figure 4: Number of coefficients of the KL transform for each base and the verification used.
Fig. 4 shows the discrimination error for each base and for the verification (in this case we have kept the same number of coefficients for the opening and difference bases). As the number of coefficients increases, the total error decreases significantly, especially for the verification and the opening base. We choose the same number of coefficients for verification because the results turn out to be very satisfactory without keeping a large number of coefficients for the difference base. One exception is presented in Table III, in order to allow a comparison with the subband based algorithm. It should also be mentioned that in the verification procedure the major proportion of faces (about 70%) gives the right result without the use of the difference base, and as a result the computational time is reduced significantly.
Table III: Performance of the morphological algorithm retaining 9 and 16 coefficients of the projection on the opening KL-base and difference KL-base respectively (size of structuring element 15). Of the 90 faces with a prototype in the database, 69 had a high confidence of decision, 0 inadequate confidence and 21 low confidence; the discrimination error of the faces with a high confidence of decision was 0. Of the 10 faces with no prototype in the database, 0 had a high confidence of decision, 6 inadequate confidence and 4 low confidence, giving 2 false alarms.
5. Conclusions In this paper we have presented two innovative techniques for face recognition. With the morphology based approach the results are more promising, since the verification step increases the efficiency of the algorithm. On the other hand, the subband based approach is more attractive computationally. Due to the perfect reconstruction filters used in this approach, the LH KL-base converges slowly, and consequently the verification step does not improve the efficiency of the algorithm significantly.
References
[1] R. Brunelli and T. Poggio, "Face Recognition: Features versus Templates," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 15, no. 10, pp. 1042-1052, Oct. 1993.
[2] M. Turk and A. Pentland, "Eigenfaces for Recognition," J. Cognitive Neuroscience, vol. 3, no. 1, pp. 71-86, 1991.
[3] A. O'Toole, H. Abdi, K.A. Deffenbacher and D. Valentin, "Low-dimensional representation of faces in higher dimensions of the face space," J. Opt. Soc. Am. A, vol. 10, no. 3, pp. 405-411, March 1993.
[4] S.G. Mallat, "A Theory for Multiresolution Signal Decomposition: The Wavelet Representation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 11, no. 7, pp. 674-693, July 1989.
[5] A. Tirakis, A. Delopoulos and S. Kollias, "Two-dimensional filter bank design for optimal reconstruction using limited subband information," IEEE Trans. on Image Processing, vol. 4, no. 8, August 1995.
[6] L. Vincent, "Morphological grayscale reconstruction in image analysis: Applications and efficient algorithms," IEEE Trans. on Image Processing, vol. 2, no. 2, pp. 176-200, April 1993.
[7] J. Serra, Image Analysis and Mathematical Morphology, New York: Academic Press, 1982.
PARTIAL CURVE IDENTIFICATION IN 2-D SPACE AND ITS APPLICATION TO ROBOT ASSEMBLY
Feng-Hui Yao*, Gui-Feng Shao**, Akikazu Tamaki*, Kiyoshi Kato*
*Dept. of Electric, Electronic and Computer Engineering, Faculty of Engineering, Kyushu Institute of Technology, 1-1 Sensui-cho, Tobata-ku, Kitakyushu 804, Japan. Phone (+81)093-884-3255 (Direct), Fax (+81)093-871-5835, E-mail: [email protected]
**Dept. of Commercial Science, Seinan Gakuin University, 6-2-92 Nishijin, Sawara-ku, Fukuoka 814, Japan. Phone (+81)092-841-1311 (Ext. 262), E-mail: [email protected]
ABSTRACT
This paper describes an algorithm to identify the partial curves of planar objects in 2-D space and its application to robot assembly. For the given boundary curves of objects, the dominant points of every boundary curve are detected. Then, by considering the dominant points as separation points, the corresponding boundary curve is segmented into partial boundary curves, which are called curve segments. The curve segments belonging to the boundary curve of one object are then translated and rotated to match those of another object, to obtain the matched curve segments. From these matched curve segments, the longest consecutive matched curve is detected. Finally, the effectiveness of this algorithm is shown by the experiment results.
1. Introduction The shape of an object plays a very important role in object recognition, analysis and classification. Research in this field can be roughly classified into (1) edge detection; (2) dominant point detection on the boundary curve; and (3) shape recognition. Research on edge detection focuses on edges or contours [1]-[2]; research on dominant point detection focuses on points of high curvature [3]-[4]; and research on shape recognition pays attention to the entire shape of the boundary curve and identifies the objects [5]. These studies seldom address the problem of object connection relationships, i.e., determining whether a part of one object can be connected with a part of another. This problem is very important in robot assembly systems and can be thought of as a partial curve identification problem. This paper focuses on this problem and proposes an algorithm to identify the partial curves of planar objects. In this algorithm, firstly, the boundary curves of the objects are extracted from the input image after binarization, and dominant points with high curvature are detected. Then, each boundary curve is segmented into partial boundary curves, called curve segments, by taking the dominant points as separation points. Then curve segment matching is performed, and the partial curve is identified based on the matching errors. In the following, Section 2 describes the algorithm for partial curve identification; Section 3 relates its digital implementation; Section 4 shows its application and experiment results. Finally, the effectiveness of this algorithm is discussed and future work is given.
2. Algorithm to Identify the Partial Curve of a Planar Object
In the following explanation, a boundary curve is simply called a curve unless otherwise specified.
2.1 Dominant Point Extraction
For a given object, let γ(s) represent its boundary curve. γ(s) is expressed parametrically by its coordinate functions x(s) and y(s), where s is a path length variable along the curve. If the second derivatives of x(s) and y(s) exist, the curvature at (x, y) is computed by

C(x, y) = (x'y'' − y'x'') / (x'² + y'²)^(3/2)    (1)

To express the curvature at varying levels of detail, both boundary coordinate functions x(s) and y(s) are convolved with the Gaussian function g(s, σ) defined by

g(s, σ) = exp(−s²/(2σ²)) / (σ√(2π))    (2)

where σ is the standard deviation of the distribution. The Gaussian function decreases smoothly with distance and is differentiable and integrable. Let us assume that σ of the Gaussian function is small compared with the total length of the curve γ(s). The Gaussian-smoothed coordinate functions X(s, σ) and Y(s, σ) are defined as x(s) ⊗ g(s, σ) and y(s) ⊗ g(s, σ), respectively, where "⊗" denotes convolution. Because both X(s, σ) and Y(s, σ) are smooth functions and their first and second derivatives exist, the curvature C(s, σ) of the curve γ(s) smoothed by the Gaussian function is readily given by applying X'(s, σ), Y'(s, σ), X''(s, σ) and Y''(s, σ) to equation (1). For a given scale σ, the corresponding curvature C(s, σ) can be obtained according to the procedure related above. A searching process is applied to detect the local maximum of absolute curvature within the region of support given
by the sequence {|Cl|, ..., |Ci−1|, |Ci|, |Ci+1|, ..., |Cr|}, where Ci is the curvature of the point in question, and Cl and Cr are the curvatures at the leftmost and rightmost points of the local region of support, respectively. The region of support for each point i is the largest possible window containing i in which |C| to both the left and right of i is strictly decreasing. The points with locally maximal absolute curvature are considered the dominant points.
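As a rough sketch (our illustration, not the authors' code), the dominant point test can be written as below; it simplifies the region-of-support criterion to a strict local maximum of |C| against the immediate neighbours, which is what the full region-of-support search reduces to on a strictly unimodal window. The function name `dominant_points` is ours:

```python
def dominant_points(curvature):
    """Detect dominant points as strict local maxima of |C| on a closed
    curve (indices modulo n).  This simplifies the region-of-support test
    of the paper to the immediate neighbours of each point."""
    n = len(curvature)
    absc = [abs(c) for c in curvature]
    return [i for i in range(n)
            if absc[(i - 1) % n] < absc[i] and absc[(i + 1) % n] < absc[i]]
```

In practice this would be run on the Gaussian-smoothed curvature C(s, σ) of equation (1) rather than on raw differences.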
2.2 Curve Segmentation
For any two objects A and B, let us assume that their boundary curves are represented by α(s) and β(s), respectively, and that their dominant points are denoted by P(α) = {pα0, pα1, ..., pα,M−1} and P(β) = {pβ0, pβ1, ..., pβ,N−1}, correspondingly, where M is the number of dominant points of the curve α(s) and N that of the curve β(s). Dominant points are numbered clockwise and are considered the separation points. Therefore, the curves α(s) and β(s) can be split up into curve segments. Let Sα and Sβ denote these two sets of curve segments, i.e.
Sα = {α0,1, α1,2, ..., αM−1,0} (modulo M),
Sβ = {β0,1, β1,2, ..., βN−1,0} (modulo N)    (3)
where αi,j (i, j = 0, 1, ..., M−1, modulo M) and βu,v (u, v = 0, 1, ..., N−1, modulo N) are the curve segments of the curves α(s) and β(s), respectively. In this notation, dominant point i is the start of αi,j and j its end, and the dominant points u, v have the same meanings for βu,v.
Fig.1 Partial curve βj+1,j−1 is translated so that the dominant points i and j overlap.
Fig.2 Input image after binarization, which includes two objects.
2.3 Partial Curve Matching
The partial curve matching includes the extraction of candidates of the longest consecutive matched curve (abbreviated as LCMC) and the decision of the LCMC.
2.3.1 LCMC Candidates Extraction
For the dominant point i on curve α(s), the curve segment αi−1,i terminates at i and αi,i+1 starts from i, clockwise, where αi−1,i, αi,i+1 ∈ Sα (i = 0, 1, ..., M−1). Similarly, for the dominant point j on curve β(s), the curve segment βj+1,j terminates at j and βj,j−1 starts from j, counterclockwise, where βj+1,j, βj,j−1 ∈ Sβ (j = 0, 1, ..., N−1). For simplicity, these two pairs of curve segments are denoted as αi−1,i+1 and βj+1,j−1 and are called partial curves. Then let us consider the matching of αi−1,i+1 and βj+1,j−1. To perform the matching of these two partial curves, βj+1,j−1 is translated so that the dominant point j included in βj+1,j−1 overlaps the dominant point i included in αi−1,i+1 (see Fig. 1). The displacement along the X-axis is the difference of the x-coordinates of the dominant point i on α(s) and j on β(s). Likewise, the displacement along the Y-axis is obtained from their y-coordinates. Next, βj+1,j−1 is rotated around the dominant point j, clockwise, from 0° to 360° by 1° per step. Let E(αi−1,i, βj+1,j)θ express the matching error when βj+1,j−1 is rotated θ°, which is defined by:
E(αi−1,i, βj+1,j)θ = ∬D1 dx dy + ∬D2 dx dy    (4)
where D1 is the region enclosed between the arcs αi,i+1 and βj,j−1, and D2 is the region enclosed between the arcs αi−1,i and βj+1,j, as shown in Fig. 1. When βj+1,j−1 is rotated from 0° to 360°, the minimal value of E(αi−1,i, βj+1,j)θ is called the minimal matching error between αi−1,i+1 and βj+1,j−1, and is denoted by E(αi−1,i+1, βj+1,j−1)min. The corresponding rotation angle is denoted by θ(αi−1,i+1, βj+1,j−1)min. In the following, if there is no confusion, they are simply written as Emin and θmin. Emin is simply obtained as follows
Emin = min{E0, E1, ..., E359}    (5)
If Emin is small compared with the threshold value TE1, the partial curves αi−1,i+1 and βj+1,j−1 are said to be "matched". Then the clockwise neighbor of αi−1,i+1, i.e., the curve segment αi+1,i+2, is added to the end of αi−1,i+1, the counterclockwise neighbor of βj+1,j−1, i.e., βj−1,j−2, is added to the end of βj+1,j−1, and the matching procedure related
above is performed again. Note here that the threshold value is dynamically increased by TE1, i.e., the threshold value is set at 2TE1. If E(αi−1,i+2, βj+1,j−2)min is smaller than 2TE1, and the absolute value of the difference of θ(αi−1,i+1, βj+1,j−1)min and θ(αi−1,i+2, βj+1,j−2)min is smaller than the threshold value Tθ/2, the partial curves αi+1,i+2 and βj−1,j−2 are said to be "matched". This repetition continues until "unmatched" curve segments are encountered. Likewise, this procedure is also applied to the counterclockwise neighbors of αi−1,i+1 and the clockwise neighbors of βj+1,j−1. Here, it is worth noting that these new curve segments are added to the beginning of αi−1,i+1 and βj+1,j−1. The repetition stops when "unmatched" curve segments are encountered. These consecutive curve segments form an LCMC candidate. The above procedure is applied to all curve segments in Sα and Sβ. LCMC candidates whose numbers of curve segments are greater than the threshold value TL are passed to the next step for the decision of the LCMC.
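The exhaustive 1°-step rotation search of equation (5) can be sketched as follows. Here `error_at` stands for any routine evaluating the area-based matching error of equation (4) at a given rotation angle; it is a hypothetical callback of our own, not something defined in the paper:

```python
def min_matching_error(error_at):
    """E_min = min{E_0, E_1, ..., E_359} (eq. 5): rotate the partial curve
    clockwise from 0 to 359 degrees in 1-degree steps, evaluate the matching
    error at each angle, and return the smallest error with its angle."""
    errors = [error_at(theta) for theta in range(360)]
    e_min = min(errors)
    return e_min, errors.index(e_min)
```

The returned angle plays the role of θmin when the candidate is later grown segment by segment.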
2.3.2 LCMC Decision
For the k-th LCMC candidate (k = 0, 1, ..., K, where K is the total number of LCMC candidates), its minimal matching error is recalculated by overlapping the centers of the corresponding consecutive curve segments and rotating the curve segments belonging to Sβ from θmin − Tθ to θmin + Tθ by 1° per step. The LCMC candidate whose minimal matching error is smallest is considered the LCMC, at which the two curves match optimally.
3. Digital Implementation
To implement the above algorithm, it is necessary to define the digital curve, digital curvature and digital matching error. In Cartesian coordinates, the coordinate functions x(s) and y(s) of a closed curve are digitally expressed by a set of Cartesian grid samples {xi, yi} for i = 1, 2, ..., N (modulo N). The digital curvature at point i on the curve can be calculated by
ci = Δxi Δ²yi − Δyi Δ²xi
(6)
where Δ is the difference operator and Δ² is the second-order difference [3]. The digital Gaussian function in [6] with a window size of K = 3 is employed here to generate smoothing functions at various values of σ, and it is given by
h[0] = 0.2261
h[1] = 0.5478
h[2] = 0.2261
(7)
where h[1] is the center value and Σk h[k] = 1 (k = 0, 1, 2). This digital function has been mentioned in [7] and [8] as the best approximation of the Gaussian distribution. For digital functions with higher values of σ, the above K = 3 function is used in a repeated convolution process: a digital smoothing function with 2(j+1)+1 taps is created by repeating the self-convolution j times. Note here that the digital Gaussian smoothing function for the largest σ must have a window size no larger than the perimeter arc length N of the curve. A multiscale representation of the digital boundary curve from σ = 0 to σmax can be constructed by the digital function defined above. Therefore, the multiscale digital curvature can be obtained according to equation (6). Then, for each point i, a searching procedure is applied to detect the local maximum of absolute curvature. Points on the curve with local maxima of absolute curvature are considered dominant points. For any two objects A and B, let α and β represent their digital boundary curves. Then α and β can be expressed by the sets of digital points on the boundary curves, i.e., α = {(x0, y0), (x1, y1), ..., (xM−1, yM−1)} and β = {(x0, y0), (x1, y1), ..., (xN−1, yN−1)}. Their dominant points can be obtained by the method related just above. Their segmentation can be performed according to the method related in section 2.2, and the digital curve segments are also expressed by equation (3). Hereafter, if there is no confusion, the digital curve segments are also simply called curve segments. Next, the matching procedure is applied to these digital curve segments. The matching error shown in equation (4) is digitally computed by
E(αi−1,i, βj+1,j)θ = Σ (p=0, q=0 to max{P,Q}) [SΔ(p,p+1,q) + SΔ(q,q+1,p+1)] + Σ (u=0, v=0 to max{U,V}) [SΔ(u,u+1,v) + SΔ(v,v+1,u+1)]    (8)
where P, Q, U and V are the numbers of digital points of the curve segments βj+1,j, αi−1,i, βj,j−1 and αi,i+1, respectively. As shown in Fig. 1, SΔ(p,p+1,q) is the area of the triangle formed by the points p, p+1 and q. Similarly, SΔ(q,q+1,p+1), SΔ(u,u+1,v) and SΔ(v,v+1,u+1) have the same meanings. Here, it is worth noting that if the number of digital points included in a curve segment is less than that of the curve segment it is compared with, its start point or terminal point is employed to correspond to the remaining points of the other curve segment, so that the calculation of equation (8) can continue. Which of them is used is decided by the tracing direction along the curve segment (clockwise or counterclockwise). For example, in the region D2 of Fig. 1, the digital matching error is calculated, starting from the overlapped dominant point i (or j), by taking out one point from each of the curve segments αi−1,i and βj+1,j and putting them into the first term of
equation (8). Because the number of points included in the curve segment αi−1,i is less than that in βj+1,j, the start point of αi−1,i, i.e., the point i−1, is employed to continue the calculation for the remaining points of βj+1,j. This calculation stops at the terminal point of βj+1,j, i.e., the point j+1. The same procedure is also applied to the region D1. The partial digital curve βj+1,j−1 is rotated from 0° to 360° by 1° per step. After each rotation, the matching error is computed. The minimal matching error can then be obtained according to equation (5).
Table 1. LCMC candidates obtained
No.   Curve segments of the object on left   Curve segments of the object on right   Overlapped dominant points
0     6-5-4-3-2                              17-18-19-0-1                            3 left and 1 right
1     6-5-4-3-2-1-0                          0-1-2-3-4-5-6                           2 left and 5 right
2     6-5-4-3-2-1-0                          0-1-2-3-4-5-6                           1 left and 6 right
3     9-8-7-6-5-4-3-2                        1-2-3-4-5-6-7-8                         3 left and 2 right

4. Application and Experiment
The application model assumed for this algorithm is that of a robot mounted with a camera assembling machine parts. The experiment is performed with a real image.
Fig.3 Extracted boundary curves, detected dominant points and the final LCMC.
Fig.2 shows the input image after binarization, which includes two objects. Fig.3 shows the extracted digital boundary curves and the detected dominant points (marked by small symbols), numbered clockwise. The four LCMC candidates are listed in Table 1. The first LCMC candidate is decided as the LCMC and is shown in Fig.3 by the thicker lines. Fig.4 shows the assembled result after the object on the left is translated 162 dots along the X-axis and −32 dots along the Y-axis, and rotated 90° clockwise. The values of TE1, Tθ and TL are 80, 30° and 4, respectively.
5. Conclusions and Future Works
This paper proposed an algorithm for partial curve identification in 2-D space. The application model assumes that a robot mounted with a camera assembles machine parts, for which the connection relationships among the parts are necessary. The problem of object connection relationships can be simplified to the problem of partial curve identification. Real images were employed to test this algorithm. From the experimental results, it is clear that the algorithm is effective.
Fig. 4 Assembled result.
This experiment employed images of objects without texture. If the objects have some texture, boundary curve detection will become more difficult. Moreover, if the input image includes more than three objects, a partial curve of one object may match the partial curves of multiple objects. In this case, it is necessary to employ the image values near the matched curves to decide the optimally matched partial curve. Further, in a vision-based robot assembly system, this alone is not enough; it must be combined with other 3-D information. All of these are left for future work.
REFERENCES
[1] R. M. Haralick, "Digital step edges from zero crossing of second directional derivatives," IEEE Trans. Patt. Anal. Mach. Intell., vol. PAMI-6, no. 1, pp. 58-68, Jan. 1984.
[2] R. Mehrotra, K. R. Namuduri and N. Ranganathan, "Gabor filter-based edge detection," Pattern Recognition, vol. 25, no. 12, pp. 1479-1494, 1992.
[3] A. Rattarangsi and R. T. Chin, "Scale-based detection of corners of planar curves," IEEE Trans. Patt. Anal. Mach. Intell., vol. PAMI-14, no. 4, pp. 432-449, Apr. 1992.
[4] P. Zhu and P. M. Chirlian, "On critical point detection of digital shapes," IEEE Trans. Patt. Anal. Mach. Intell., vol. PAMI-17, no. 8, pp. 737-748, Aug. 1995.
[5] I. Sekita, T. Kurita and N. Otsu, "Complex autoregressive model for shape recognition," IEEE Trans. Patt. Anal. Mach. Intell., vol. PAMI-14, no. 4, pp. 489-496, Apr. 1992.
[6] P. J. Burt, "Fast filter transforms for image processing," Comput. Vision, Graphics & Image Processing, vol. 16, pp. 20-51, 1981.
[7] P. J. Burt and E. H. Adelson, "The Laplacian pyramid as a compact image code," IEEE Trans. Commun., vol. COM-31, no. 4, Apr. 1983.
[8] P. Meer, E. S. Baugher and A. Rosenfeld, "Frequency domain analysis and synthesis of image pyramid generating kernels," IEEE Trans. Patt. Anal. Mach. Intell., vol. PAMI-9, no. 4, pp. 512-522, Apr. 1988.
A fast active contour algorithm for object tracking in complex background
Chun Leung Lam, Shiu Yin Yuen
E-mail: [email protected], [email protected]
Department of Electronic Engineering, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong
Abstract- The active contour is a powerful tool for object tracking. However, the existing models are only applicable to tracking on simple images. Based on the idea of the original greedy algorithm, we present a fast greedy tracking algorithm to face the problem of tracking on complex real images. We demonstrate the algorithm on tracking complex-shaped objects on complex backgrounds.
1. Introduction
2-D object tracking is a hot research topic in dynamic scene analysis. Different methods can be used: (1) image region based tracking algorithms [1]; (2) feature point based tracking algorithms [2]; and (3) line segment based tracking algorithms [3]. In general, these methods require an explicit definition of a dynamic model of the moving objects. Many objects cannot be described by simple geometric shapes (e.g. circle, ellipse) but need to be represented with complex contours. In order to model complex natural shape contours, Kass et al. [4] introduced the idea of the active contour (deformable contour). Active contour models have been successfully applied in computer vision problems such as optimal contour detection [5,6,7] and simple shape object tracking on a uniform background [7,8]. D. J. Williams and M. Shah [9] proposed a greedy active contour algorithm which is fast and stable. In section 2, tracking results using their greedy algorithm are shown, which is useful in summarizing the difficulties of tracking by active contour. In section 3, a new "greedy tracking algorithm" is proposed and the results of using the proposed algorithm to track objects with complex shapes in complex background are given. Finally, a conclusion is given in section 4.
2.
Object tracking by greedy algorithm
Suppose the contour Ct of a moving object M at time t is known. Ct can be used as an approximate contour of the target object at time t+1, provided that there is only a slight change in the target object. In order to find the best description Ct+1 from Ct, an adjustment process is necessary to fine-tune the shape of the contour using the information available in image frame It+1.
2.1. Classical active contour approach
The snake equations provide flexible tracking mechanisms that are driven by simulated forces derived from time-varying images. Let the contour be represented by v(s) = (x(s), y(s)). The classical active contour approach involves minimizing an energy function defined by
Esnake = ∫01 [Eint(v(s)) + Eext(v(s))] ds    (1)
for the active contour to move onto the object border. The internal energy is written as

Eint = (1/2) (α(s)|v'(s)|² + β(s)|v''(s)|²)    (2)

which serves as a smoothness constraint. The external energy Eext consists of the external constraints and the image force.
2.2. Greedy algorithm
The greedy algorithm is a fast active contour algorithm proposed by D. J. Williams and M. Shah in 1991 [9]. Since the algorithm is both stable and fast, it is suitable as an adjustment process for object tracking. The quantity being minimized by the greedy algorithm is
E = ∫ [α(s) nor(Econt) + β(s) nor(Ecurv) + γ(s) nor(Eimage)] ds    (3)
and the energy terms are defined by
Econt,i = | d̄ − |vi − vi−1| |    (4)
Ecurv,i = |vi−1 − 2vi + vi+1|²    (5)
where d̄ represents the average distance between contour points in the previous iteration cycle, and nor(E) represents a normalizing function with respect to the energy values of the neighboring pixels. The values of Econt, Ecurv and Eimage are all normalized to values between 0 and 1, and α = 1, β = 0 or 1 (depending on whether a corner is assumed at that location) and γ = 1.2 in the greedy algorithm.
(a) original image (b) result of greedy algorithm, requiring 0.16 s (c) result of greedy tracking algorithm, requiring 0.21 s
Figure 1. Translated square and circle. (20 points used, with window size 3×3)
Figure 2. 6 degree/frame rotating cup. (31 points used, with window size 3×3)
The complexity of the greedy algorithm is O(nm²) for a contour having n points, which allows the active contour to move to any point in a neighborhood window of size m×m at each iteration. (The full greedy algorithm will not be listed in this paper; for more information, please refer to [9].) The results of using the greedy algorithm for object tracking are given in Fig. 1b, 2b and 3b. In Fig. 1b, the circle and the square are both moved slightly to the right and downward. We find that two of the contour points on the right edge of the square are attracted by the border of the circle when the contour in Fig. 1a is used as the initial contour of Fig. 1b. In Fig. 2b, the cup has been rotated. The upper and lower portions of the arm contour are attracted by the internal structure and the background structural noise, respectively. Although the rough shape of the body of the cup can be successfully extracted, the extracted border on the arm of the cup is not satisfactory. Fig. 3b is the result of tracking the human body silhouette in two consecutive images using the greedy algorithm. We can see that only the regions near the shoulders and the left foot are extracted correctly. The results show that the active contour model is sensitive to both internal structure and background structural noise. Therefore, the active contour model can only be applied to track a simple object moving on a uniform background.
Figure 3. 0.5 frame/s walking man. (68 points used, with window size 5×5)
3. Greedy tracking algorithm
To incorporate more shape information into the model and to reduce both the influence of internal structure and background structural noise when the method is applied to object tracking problems, we propose a "greedy tracking algorithm". The proposed model aims to maintain the shape and gray level values along the contour, in order to minimize the influence of both the internal structure and the background structural noise, due to the complexity of the target object and the background. The structure of the proposed algorithm is similar to the original greedy algorithm.
The algorithm is an iterative process. In each turn, each contour point is allowed to move to the neighboring location which has the lowest energy level, and the computational complexity is O(nm²). However, the definition of the energy function being minimized is different from that of the greedy algorithm. This gives the model more desirable behavior when applied to the object tracking problem. The form of the energy function being minimized in the proposed algorithm is similar to equation (3). The internal energy of the contour is contributed by the sum of a continuity force and a curvature force. Let the contour be represented by {vi} = {x(i), y(i)}, where i = 0, 1, ..., n−1 and x(i), y(i) are pixel coordinates. The continuity energy is redefined as

Econt,i = | |ui| / |ui+1| − |uti| / |uti+1| |,  where ui = vi − vi−1 in image It+1 and uti = vti − vti−1 in image It    (6)

(Note that all index arithmetic is modulo n.) The internal continuity energy is so defined since we allow the points to be unevenly distributed on the contour and we only want to maintain the approximate distribution of the contour points on a newer image frame. The internal curvature energy is redefined as

Ecurv,i = | Ci − Cti |    (7)

with Ci defined from the unit vectors ûi and ûi+1 of ui and ui+1.
Again, the curvature at each point is maintained by minimizing the curvature energy. The curvature vector Ci at point i has a magnitude equal to the square of the difference of the unit vectors ûi+1 and ûi, and a direction parallel to the vector ûi × ûi+1. The continuity and curvature energies are so defined since it is assumed that the shape of the contour does not change much in a short time gap; the result of minimizing the continuity and the curvature energy together is that the approximate shape of the contour across any two consecutive frames is maintained. Note that the originally assigned contour points can be of any shape (including low and high curvature points); this is a desirable property since many real objects have sharp curvature points, like corners. In the original active contour model, Econt,i (equation (4)) and Ecurv,i will be zero when the contour points are equally spaced and the curvature is zero. Thus the original active contour model is biased towards i) equally spaced contour points and ii) low curvature. Moreover, corners have to be specified using the special method of setting β = 0. This is undesirable since during a motion, i) maintaining equally spaced feature points may not be the best strategy to represent a shape most faithfully (compactly); ii) an a priori assumption of low curvature is not particularly realistic in a shape representation; iii) this problem is even more pronounced since the motion and view changes may continuously produce points of sharp curvature as new occluding contours come into view. On the contrary, our method does not suffer from such anomalies. Econt,i (equation (6)) and Ecurv,i (equation (7)) will be zero merely when i) the spacing ratio of consecutive contour points and ii) the curvature do not change between frames.
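The frame-to-frame internal energies can be sketched as below. This is our reading of equations (6) and (7); in particular, reducing Ecurv to the change of the curvature magnitude |Ci| between frames is an assumption, and the helper names are ours:

```python
import math

def _unit(p, q):
    # Unit vector from point p to point q.
    dx, dy = q[0] - p[0], q[1] - p[1]
    d = math.hypot(dx, dy)
    return (dx / d, dy / d)

def cont_energy(prev, cur, i):
    """E_cont,i (eq. 6): change of the spacing ratio |u_i|/|u_{i+1}|
    between the previous and the current contour (indices modulo n)."""
    def ratio(pts):
        n = len(pts)
        a = math.dist(pts[(i - 1) % n], pts[i % n])
        b = math.dist(pts[i % n], pts[(i + 1) % n])
        return a / b
    return abs(ratio(cur) - ratio(prev))

def curv_energy(prev, cur, i):
    """E_curv,i (eq. 7), taken here as the change of |C_i|, where |C_i| is
    the squared difference of the unit vectors û_i and û_{i+1}."""
    def mag(pts):
        n = len(pts)
        u = _unit(pts[(i - 1) % n], pts[i % n])
        w = _unit(pts[i % n], pts[(i + 1) % n])
        return (w[0] - u[0]) ** 2 + (w[1] - u[1]) ** 2
    return abs(mag(cur) - mag(prev))
```

Both terms vanish for a rigid (or uniformly scaled) motion of the contour, which is exactly the invariance argued for above.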
Also, from equation (7), it is clear that the method does not have to take special care of corner contour points, and the appearance or disappearance of a corner point can be gradually accounted for by the equation. On the other hand, the external energy is defined as

Eext = |GbIt(v) − GbIt+1(v)| − |∇It+1(v)|²    (8)

where GbI is the Gaussian-blurred image of I. Minimizing the external energy causes the contour point to move to a new location where the approximate gray level value is maintained and the contrast is high. The proposed algorithm is listed below:
Greedy tracking algorithm
Input:   images It, It+1, contour Ct of image It
Output:  adjusted contour Ct+1 of image It+1

α = β = 1, γ = 1.2, ptsmoved = 0;
do {
    for i = 0 to n {                    // note: all index arithmetic is modulo n;
                                        // the first point is processed twice
        for j = 0 to m−1
            for k = 0 to m−1 {
                calculate Econt,i(j,k), Ecurv,i(j,k), Eimage,i(j,k);
                nor(Econt,i(j,k)) = Econt,i(j,k) / MAX(Econt,i(j,k));
                nor(Ecurv,i(j,k)) = Ecurv,i(j,k) / MAX(Ecurv,i(j,k));
                nor(Eimage,i(j,k)) = Eimage,i(j,k) / MAX(Eimage,i(j,k));
            }
        for j = 0 to m−1                // m×m = size of neighborhood
            for k = 0 to m−1
                Ei(j,k) = α nor(Econt,i(j,k)) + β nor(Ecurv,i(j,k)) + γ nor(Eimage,i(j,k));
        locate smallest Ei(j,k);
        move vi to the location with smallest Ei(j,k);
        ptsmoved += 1;
    }
} while ptsmoved < threshold;

Note that the first contour point v0 is processed twice (as in the greedy algorithm), since the point vn−1 has not yet been updated when v0 is processed. Reprocessing the point v0 helps to make its behavior more like that of the other points. Results of using the greedy tracking algorithm for object tracking are given in Fig. 1c, 2c, 3c and 4. (Note that we use the same weight settings as in the greedy algorithm, α = β = 1, γ = 1.2. In contrast to the original greedy algorithm, we have no need to set β = 0 for corner points.) We use gray level images of size 256×256 pixels. The processing time, the number of points and the window size used for each image (using a PC 486DX33) are listed under each picture. Fig. 2c, 4a and 4b are the results of tracking a rotating cup at time frames 2, 5 and 10, respectively, which demonstrates that the proposed algorithm is successful in tracking rigid objects against a complex background provided that the motion is slow. Fig. 3c is the result of tracking the human body silhouette in two consecutive images. The upper portion of the body is correctly extracted, which shows that the model is applicable to tracking a complex-shaped non-rigid body. However, the right foot is lost, because the proposed algorithm intends to maintain the shape of the contour across two consecutive image frames. This demonstrates that the algorithm only allows a small change of the shape of the contour across different frames.
Figure 4. 9° rotating cup image sequence: Fig. 2a(I1) → 2c(I2) → 4a(I5) → 4b(I10). (31 points used, with window size 3×3)
4. Conclusions
A fast "greedy tracking algorithm" is proposed in this paper. The proposed model aims to maintain the shape and gray level values along the contour, in order to minimize the influence of the internal structure and background structural noise due to either the surface texture complexity of the target object or the background. The proposed algorithm has been applied to tracking objects in complex real images, and the results show that the model is quite successful in tracking rigid or non-rigid objects provided the changes are slight. Also, the tracking results are satisfactory even when the shape of the object is complex. On the other hand, although maintaining the shape of the contour is helpful in tracking complex objects, it limits the flexibility of the model since it only allows slight changes to occur. Equivalently, the method requires that successive frames be closely spaced in time. This is a compromise which we have to make in our approach.
References
1. D. S. Kalivas, A. Sawchuk, "A Region Matching Motion Estimation Algorithm", CVGIP: Image Understanding, Vol. 54(2), 275-288, 1991.
2. S. K. Sethi, R. Jain, "Finding Trajectories of Feature Points in a Monocular Image Sequence", IEEE Trans. PAMI, Vol. 9(1), 56-73, 1987.
3. R. Deriche, O. Faugeras, "Tracking Line Segments", Image and Vision Computing, Vol. 8(4), 261-270, 1990.
4. M. Kass, A. Witkin, D. Terzopoulos, "Snakes: Active Contour Models", Proc. Int. Conf. Comp. Vis., 259-268, 1987.
5. A. A. Amini, T. E. Weymouth, R. C. Jain, "Using Dynamic Programming for Solving Variational Problems in Vision", IEEE Trans. PAMI, Vol. 12(9), 855-867, 1990.
6. C. A. Davatzikos, J. L. Prince, "An Active Contour Model for Mapping the Cortex", IEEE Trans. Medical Imaging, Vol. 14(1), 65-80, 1995.
7. D. Geiger, A. Gupta, L. A. Costa, J. Vlontzos, "Dynamic Programming for Detecting, Tracking, and Matching Deformable Contours", IEEE Trans. PAMI, Vol. 17(3), 294-302, 1995.
8. F. Leymarie, M. D. Levine, "Tracking Deformable Objects in the Plane Using an Active Contour Model", IEEE Trans. PAMI, Vol. 15(6), 1993.
9. D. J. Williams, M. Shah, "A Fast Algorithm for Active Contours and Curvature Estimation", CVGIP: Image Understanding, Vol. 55(1), 14-26, 1992.
The Two-Point Combinatorial Probabilistic Hough Transform for Circle Detection (C2PHT)
J. Y. Goulermas and P. Liatsis
Control Systems Centre, Dept. of EE&E, UMIST, PO Box 88, Manchester M60 1QD, UK
e-mail: {goulerma/panos}@csc.umist.ac.uk
A novel Hough Transform (HT) for circle detection, the C2PHT, is presented. While other Combinatorial Probabilistic HTs reduce the generation of redundant evidence by sampling point-triples, the C2PHT achieves a much higher reduction in two ways. Firstly, by using the edge gradient information, it allows point-tuples to define circles and consequently decreases the sampling complexity from O(N³) to O(N²). Secondly, the transformation is conditional, that is, not all tuples are eligible to vote. The evidence is gathered in a very sparse parameter space, so that peak recovery is readily despatched. The result is high speed, increased accuracy and very low memory requirements.
INTRODUCTION
The Hough Transform (1,2,3) (HT) is a well-known robust generic method for detecting patterns of points in binary image data whose number of instances, size and spatial positions are unknown. It exhibits considerable immunity to problematic object boundaries, such as ones partially obliterated by occlusion, overlapping effects and breakages, as well as boundaries distorted by interference noise, ambient illumination and object motion. Many circle-HT variations have been proposed in the past (4,5,6,7,8,9) to extend the standard scheme. However, all generate substantial redundant evidence, as every single feature point is assumed to (potentially) belong to a circle instance. This results in slowing down the transformation process and obstructing peak recovery, without a significant reduction of the memory resources.
The Combinatorial Probabilistic HTs (CPHTs) (10,11,12,13,14,15,16) are a distinct class of HTs which attempt to reduce the generation of redundant evidence via transforming minimal subsets of points, that is, the least number of points (3 in the case of circle detection) required to define a shape instance. In this way, they force feature data to vote for the most probable, as opposed to all possible, shape instances, so that the cast votes accumulate densely in areas associated with the highest probabilities of instances. Nevertheless, the probabilistic nature of the CPHTs, in combination with the high combinatorial complexity of sampling, may cause a severe detection inefficiency. From all the possible triples of points, if valid ones (i.e. triples that define a real instance) are not quickly sampled, then the algorithm falters and accumulates ineffectual evidence that does not reflect objectively the instances depicted in the feature space F. This effect becomes more dramatic when the amount of noise in F or the number of points of the non-circular objects in the scene is high. Also, because no gradient information is used, it is very likely that false evidence is accumulated by coincidental co-circularities of points that do not reside on true circular arcs. In an attempt towards preserving the advantages of the standard CPHTs, while reducing dramatically their combinatorial sampling requirements, we have developed the C2PHT (2-point CPHT). This algorithm employs the gradient orientation to define potential circles using point-tuples. In this way it samples F², instead of F³, and transforms the sampled tuples to very small sets of parameter vectors in the parameter space P.
THE TUPLE-BASED TRANSFORMATION SCHEME
Let A ≡ (xA, yA) be a point in F with gradient orientation φA and trigonometric measures cos(φA) and sin(φA), denoted by cA and sA respectively.
Assuming that A belongs to a circle, then its centre resides on the line segment L_A, as shown in Fig. 1, defined parametrically by:

L_A ≡ L_A(r_j) = (x_A − r_j·c_A, y_A − r_j·s_A),  ∀ r_j ∈ [r_min, r_max]  (1)
where [r_min, r_max] is the predefined radii range of the sought circle instances. For another point B (defined similarly to A) to belong to the same circle as A, L_A and L_B have to intersect at the same centre point. Therefore, there should be a value r_j = R that simultaneously satisfies the parametric constraints of the two line segments. However, in order to take into account inaccuracies in the estimation of the gradient orientation, we assume that φ_A and φ_B are subject to an error of ±ε. Then, candidate centres suggested by A and B reside within the two triangular shaded areas of Fig. 1 and are restricted to lie along the perpendicular bisector P of AB. Each point suggests centres corresponding to different segments of P. Specifically, A suggests centres comprising the
segment A1A2, while B suggests centres on B1B2. The common part of the two triangular areas is the polygon STUV and its intersection with P is the line segment A1B1. Therefore, the centres of probable circles that contain both A and B are the points of A1B1. This segment may need to be truncated, so that the related radii are within the predefined range [r_min, r_max].
Figure 1: The tuple-based transformation scheme.

The endpoints of A1A2 and B1B2 can be calculated as follows. P is described by:

y − y_M = m · (x − x_M)  (2)

where M = (x_M, y_M) = ((x_A + x_B)/2, (y_A + y_B)/2) is the foot of P and m = −Δx/Δy its slope. We define L_A± and L_B± as the four line segments passing through A and B with orientations φ_A ± ε and φ_B ± ε, where c_A±, s_A± (and c_B±, s_B±) are their trigonometric measures, respectively, as shown in Fig. 1. L_A± and L_B± are parametrically defined similarly to L_A in Eq. 1, with r_j being the free variable. The next step is to find the four intersection points A1, A2, B1 and B2 of the four line segments L_A± and L_B± with P. A1 is, for instance, the intersection point L_A+(r_j) ∩ P. Then A1 is expressed in terms of a radius value R_A+, being its distance from A at angle φ_A + ε. Hence, by substituting Eq. 1 into Eq. 2, we obtain:

y_A − R_A+ · s_A+ − y_M = m · (x_A − R_A+ · c_A+ − x_M)  ⇒  R_A+ = (m · Δx − Δy) / (2 · (m · c_A+ − s_A+))  (3)
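For concreteness, Eqs. 2-3 can be transcribed directly. The sketch below is ours, not the paper's: the helper name and interface are assumptions, and we assume y_A ≠ y_B so that the bisector slope m is finite.

```python
import math

def candidate_radii(A, B, phi, eps):
    """Radii R (Eq. 3) at which the two perturbed gradient lines through
    point A (orientations phi - eps and phi + eps) intersect the
    perpendicular bisector P of the chord AB (Eq. 2)."""
    (xA, yA), (xB, yB) = A, B
    dx, dy = xA - xB, yA - yB
    m = -dx / dy                         # slope of the bisector P (Eq. 2)
    radii = []
    for s in (-eps, +eps):
        c, sn = math.cos(phi + s), math.sin(phi + s)
        # R = (m*dx - dy) / (2*(m*c - sn)), from substituting Eq. 1 into Eq. 2
        radii.append((m * dx - dy) / (2.0 * (m * c - sn)))
    return tuple(radii)
```

Applying the same helper to B with φ_B gives R_B±; ordering the four radii locates the common segment A1B1.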
Thus, the coordinates of A1, A2, B1 and B2 can be calculated from their corresponding radius values R_A± and R_B±. By ordering these values, we can readily locate the common segment A1B1 of A1A2 and B1B2, if any. A predicate Γ is defined to test the validity of a tuple (A,B) as:

Γ(A,B) = true ⇔ {A1B1 = A1A2 ∩ B1B2 ∩ P1P2 ≠ ∅}  (4)

where P1, P2 ∈ P, d(A,P1) = max(d(A,P), r_min), d(A,P2) = r_max, with d(·) being the Euclidean metric distance. It is obvious that there is no need to select points such that d(A,B) > 2·r_max. The HT of a tuple (A,B) is then conditionally defined as the set of cells T(A,B), for which Γ(A,B) is true, as:

T(A,B) = {(a,b,r) ∈ P, ∀(a,b) ∈ A1B1 : r = d(A,(a,b))}  (5)

where (a,b) and r are the centre and radius of each suggested circle. The number of circles voted for by a valid tuple (A,B) equals the discrete length of A1B1, which in turn depends on the predefined angular error ε.

The C2PHT Algorithm
The accumulator S is implemented as a dynamically allocated 3-D a-b-r sparse array [17], where a set of linked lists for the three parameters stores the cell coordinates and their vote counts. The incrementation strategy employed is adaptive and depends on the gradient magnitude [18]. This enables strong edge primitives to outbalance noisy ones, which usually have lower edge magnitudes, thus resulting in a reduction of noise
in S. Let v(A,B) be the vote value cast by (A,B), bounded by the predefined values V_min and V_max in order to avoid extreme vote counts. To prevent weaker edges from being completely masked by stronger ones, the scaling of the votes is done exponentially (with exponent 1/2). Then, the gradient magnitudes of the participating points are combined as:

v(A,B) = V_min + (V_max − V_min) · [ (√(G(A)·G(B)) − G_min) / (G_max − G_min) ]^(1/2)  (6)
where G_min and G_max are the minimum and maximum gradient magnitude values in F, and G(A) and G(B) are the gradient magnitude values of A and B. The implemented C2PHT circle detection mechanism works as follows. Initially, a small percentage of n (= 5%·N) points is uniformly randomly selected from the N points of F and stored in a list L. Next, all possible K = n·(n−1)/2 tuple combinations from L are generated and, from these, all K' tuples (A,B) which enable the predicate Γ, together with their produced votes T(A,B), are recorded in S. Once the K' valid tuples are transformed, the instance suggested by the highest peak in S is template-matched against F. If there is a "hit", i.e. a circle instance is detected, the corresponding peak cell is set to zero. Then the points constituting the detected instance are removed from F and L. Following that, the algorithm template-matches the instance suggested by the second highest peak in S and proceeds accordingly. In the case of a "miss", i.e. the peak is false, a predefined number of points t (= 20%·n) are removed from L. These are selected to be the ones which participated most frequently in invalid tuples; then L is refreshed with an equal number of t new points from F. This heuristic operates efficiently as a penalising mechanism against any likely deficiencies in the initial sampling. To economise processing time, the entire L does not have to be reaccumulated. Instead, the removed points are "de-transformed" from S. This means that each removed point is re-evaluated via Γ and T with each of the other points in L and negative votes are cast to S. Following refreshing, only the tuples that contain at least one newly inserted point (potentially) record evidence in S, which is added to the already existing evidence. The algorithm reiterates in the same way until three consecutive misses occur; it is then assumed that no more circle instances exist in the image.
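Equation 6 and the detection loop above can be sketched together as follows. This is a simplified skeleton under stated assumptions: all names are ours, `transform` stands in for the conditional tuple transform of Eqs. 4-5, `template_match` for the template-matching step, and the miss-handling is reduced to discarding the false peak rather than refreshing L.

```python
import math
import random
from collections import Counter
from itertools import combinations

def vote_value(gA, gB, g_min, g_max, v_min=0.0, v_max=500.0):
    """Adaptive vote of Eq. 6: the geometric mean of the two gradient
    magnitudes, normalised to [0, 1] and square-rooted so that weaker
    edges are not completely masked by stronger ones."""
    t = (math.sqrt(gA * gB) - g_min) / (g_max - g_min)
    return v_min + (v_max - v_min) * math.sqrt(max(t, 0.0))

def c2pht_detect(points, transform, template_match,
                 sample_frac=0.05, max_misses=3, seed=0):
    """Skeleton of the C2PHT loop. `transform(A, B)` returns the (a, b, r)
    cells voted for by a valid tuple, or [] when the predicate rejects it;
    `template_match(peak, F)` verifies a candidate circle against F."""
    rng = random.Random(seed)
    F = list(points)
    n = max(2, int(sample_frac * len(F)))
    L = rng.sample(F, n)                 # uniform random sampling list
    S = Counter()                        # sparse 3-D accumulator
    for A, B in combinations(L, 2):
        for cell in transform(A, B):
            S[cell] += 1                 # conditional voting
    detected, misses = [], 0
    while misses < max_misses and S:
        peak, count = S.most_common(1)[0]
        if count <= 0:
            break                        # no evidence left
        del S[peak]
        if template_match(peak, F):
            detected.append(peak)        # "hit": record the instance
        else:
            misses += 1                  # "miss": penalise and continue
    return detected
```

In the full algorithm each vote would be weighted by `vote_value` and a miss would also refresh 20% of L, de-transforming the removed points from S.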
Figure 2: Artificial image (A) and underwater bubble image (B), with detected circular objects.

RESULTS
For all experiments we used parametrisations of radius range r_min = 5 and r_max = 45 (inclusive), a predefined angular error ε, and vote boundaries of V_min = 0 and V_max = 500. Fig. 2.A illustrates a 300×240 artificial grey-scale test image (which produces a feature space F of size N = 8,374 pixels, via edge detection and thresholding) with the 19 detected circles superimposed on the original scene. The dramatic reduction in evidence generation achieved by the C2PHT is manifested by the following. The total number of tuples in F is D = 35,057,751 and out of these the predicate Γ enables only D_Γ = 299,443. This clearly shows that the proposed conditional tuple/gradient-based voting generates votes from at most a fraction of 0.85% of the D total elements in F^2. Comparing this to the complexity of a no-gradient CPHT, we can see that the total number of triples would be 97,834,497,124, which far exceeds the transformation elements of the C2PHT; in addition, every single triple is potentially capable of generating votes. As described before, a complete enumeration of all D tuples is unnecessary. The sampling list L is of size n = 418 and this produces K = 87,153 tuples in total. Since the sampling is spatially uniform, we assume
that L generates votes from K' ≈ 0.85%·K ≈ 740 valid tuples at each transformation cycle. The average length of the voting pattern T(A,B) of a valid tuple (A,B), for the given r_min, r_max and ε, was found to be ≈3.6 cells, which gives at most ≈2,664 non-zero accumulator cells (in practice this number is smaller, since votes tend to overlap and converge to give rise to accumulator peaks). Fig. 2.B shows the detection results in a real-world, sharply illuminated underwater bubble image, with 29 detected bubbles.

CONCLUSIONS
A new combinatorial probabilistic Hough Transform, the C2PHT, is proposed for circular object detection. The novelty of the algorithm is that it employs gradient information and makes use of point-tuples as minimal point subsets for defining circles, thus reducing the combinatorial complexity from O(N^3) to O(N^2). In addition, it introduces the concept of conditional voting, whereby a higher proportion of relevant evidence is generated by allowing only the "valid" elements of the feature space to vote. The C2PHT easily incorporates gradient error estimates and is based on computationally simple equations to perform fast evaluation and vote generation of the sampled tuples. The produced transform space is very sparse and hence only simple accumulator structures with very small memory requirements are needed. The algorithm was tested with synthetic and real-world scenes of circular objects and yielded fast and accurate results. Overall, the C2PHT manages to balance very well the trade-off between memory demands, simultaneous circle detection, high reduction of generated evidence and detection speed.
REFERENCES
1 P. V. C. Hough, "Methods and means for recognising complex patterns", US Patent 3069654, 1962.
2 J. Illingworth and J. Kittler, "A survey of the Hough transform", Computer Vision Graphics and Image Processing, pp. 87-116, vol. 44, 1988.
3 D. H. Ballard, "Generalising the Hough transform to detect arbitrary shapes", Pattern Recognition, pp. 111-122, vol. 13, no. 2, 1981.
4 C. Kimme, D. H. Ballard and J. Slansky, "Finding circles by an array of accumulators", Communications of the ACM, pp. 120-122, vol. 18, no. 2, 1975.
5 G. Gerig, "Linking image-space and accumulator space: a new approach for object-recognition", 1st Int. Conf. Computer Vision, London, pp. 112-117, 1987.
6 H. K. Yuen, J. Princen, J. Illingworth and J. Kittler, "Comparative study of Hough transform methods for circle finding", Image and Vision Computing, pp. 71-77, vol. 8, no. 1, 1990.
7 A. N. Jain and D. B. Krig, "A robust Hough technique for machine vision", Proc. Vision 86, Detroit, Michigan, pp. 475-487, 1986.
8 J. Illingworth, J. Kittler and J. Princen, "Shape detection using the adaptive Hough transform", NATO ASI Series, Real-time Object Measurement and Classification, ed. A. K. Jain, Springer-Verlag, Berlin Heidelberg, pp. 119-142, vol. F42, 1988.
9 Z. Li, M. A. Lavin and R. J. LeMaster, "Fast Hough transform: a hierarchical approach", Computer Vision Graphics and Image Processing, pp. 139-161, vol. 36, 1986.
10 L. Xu, E. Oja and P. Kultanen, "A new curve detection method: Randomised Hough transform (RHT)", Pattern Recognition Letters, pp. 331-338, vol. 11, 1990.
11 V. F. Leavers, D. Ben-Tzvi and M. B. Sandler, "A dynamic combinatorial Hough transform for straight lines and circles", 5th Alvey Vision Conf., Reading, pp. 163-168, 1989.
12 V. F. Leavers, "The dynamic generalised Hough transform", 1st ECCV Conf., Antibes, France, 1990.
13 V. F. Leavers, "The dynamic generalised Hough transform: its relationship to the probabilistic Hough transforms and an application to the concurrent detection of circles and ellipses", Computer Vision Graphics and Image Processing, pp. 381-398, vol. 56, no. 3, 1992.
14 N. Kiryati, Y. Eldar and A. Bruckstein, "A probabilistic Hough transform", Pattern Recognition, vol. 24, no. 4, pp. 303-316, 1991.
15 J. R. Bergen and H. Shvaytser, "A probabilistic algorithm for computing Hough transforms", Journal of Algorithms, pp. 639-656, vol. 12, no. 4, 1991.
16 A. Califano, R. M. Bolle and R. W. Taylor, "Generalised neighbourhoods: a new approach to complex parameter feature extraction", Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 192-199, 1989.
17 M. Brown, "Peak-finding with limited hierarchical memory", 7th Int. Conf. Pattern Recognition, pp. 246-249, Montreal, Canada, 1984.
18 J. Y. Goulermas, P. Liatsis and M. Johnson, "Real-time intelligent vision systems for process control", Proc. 4th IChemE Conf. Advances in Process Control, York, pp. 69-76, Sep. 27-28, 1995.
Modified Rapid Transform Features in an Information Symbols Recognition System
J. Turan, *K. Fazekas, L. Kovesi and M. Kovesi
Department of Radioelectronics, Technical University of Kosice, Park Komenskeho 13, 04021 Kosice, Slovakia
Tel./Fax: +42 95 6335692, E-mail: TURAN@CCSUN.TUKE.SK
*Department of Microwave Telecommunications, Technical University of Budapest, Goldmann Ter 3, 1111 Budapest, Hungary
Tel./Fax: +36 12 043289, E-mail: T-FAZEKAS@NOV.MHT.BME.HU
Abstract
Various transformations have been suggested as a solution to the problem of the high dimensionality of the feature vector and long computation time. Transforms which do not change with cyclic shifts in the sequence are called translation invariant. Fast translation invariant transforms are a valuable tool for pure shape-specific feature extraction in pattern recognition problems [1]. In the field of pattern recognition, and also in scene analysis, the class of fast translation invariant transforms known as certain transforms (CT) [2], based on the original rapid transform (RT) [3], is well known. More recently, the modified rapid transform (MRT) [5] was introduced, which can distinguish many more patterns from one another than the original RT can. The MRT was presented to break undesired invariances of the RT which lead to a loss of information about the original pattern. In this paper, the application of the fast translation invariant modified rapid transform (MRT) in the feature extraction stage of an Information Symbols recognition system is described. Experimental results are given of applying the proposed recognition system to the recognition of Airport Passenger Orientation Symbols and Meteorological Symbols, including the dependence of the recognition efficiency on the number of selected features and on noise.
1. Introduction
Transformation methods can be used to obtain alternative descriptions of signals. These alternative descriptions have many uses, such as classification, redundancy reduction, coding, etc., because some of these tasks can be better performed in the transform domain [1]. Various transformations have been suggested as a solution to the problem of the high dimensionality of the feature vector and long computation time. More recently, the modified rapid transform (MRT) [5] was presented to break undesired invariances of the rapid transform (RT) [3]. In this paper, a new method of recognising Information Symbols using the MRT will be presented. We apply the MRT in the feature extraction stage of the Information Symbols recognition process. Some properties of the RT and MRT will first be reviewed, then the new method of recognition of Information Symbols will be presented. Finally, experimental results will be given on applying the proposed pattern recognition method to the recognition of Airport Passenger Orientation Symbols and Meteorological Symbols, including the dependence of the recognition efficiency on the number of selected features and on noise.
2. Modified rapid transform
Transforms which do not change with cyclic shifts in the sequence are called translation invariant. Fast translation invariant transforms are a valuable tool for pure shape-specific feature extraction in pattern recognition problems. The transforms may be used to extract features of one- or two-dimensional patterns which are invariant under cyclic permutations, to characterise objects independently of their position. In the field of pattern recognition, and also in scene analysis, the class of fast translation invariant transforms known as certain transforms (CT) [2] is well known; they are based on the original rapid transform (RT) [3] but with other choices of pairs of simple commutative operators. The RT results from a minor modification of the Walsh-Hadamard transform (WHT). The signal flow graph for the RT is identical to that of the WHT, except that the absolute value of the output of each stage of the iteration is taken before feeding it to the next stage. This is not an orthogonal transform, as no inverse exists. With the help of additional data, however, the signal can be recovered from the transform sequence, i.e. an inverse rapid transform can be defined [4]. The RT has some interesting properties, such as invariance to cyclic shift, to reflection of the data sequence, and to slight rotation of a two-dimensional pattern. It is applicable to both binary and analogue inputs and it can be extended to multiple dimensions. More recently, the modified rapid transform (MRT) [5] was introduced, which can distinguish many more patterns from one another than the original RT can. The MRT was presented to break undesired invariances of the RT which lead to a loss of information about the original pattern. This is achieved by combining the RT with preprocessing steps using an
asymmetric neighbour operator ϑ. This operator is used to break the undesirable invariances while keeping the shift invariance of the MRT. Using symbolic notation, the MRT can be introduced as shown in Fig. 1.

Fig. 1: Signal graph of the MRT (k preprocessing steps f₀(X(i), X(i+1), X(i+2)) followed by the RT butterfly stages with operators f₁, f₂).

The signal graph of the MRT (Fig. 1) results from the signal graph of the RT by adding, in general, k preprocessing steps x' = ϑx. Each step maps the element x(i) of the input vector x to the element x'(i) of the vector x' by working on the elements x(i), x(i+1) and x(i+2):
x'(i) = f₀(x(i), x(i+1), x(i+2))  (1)
It is important that the operator f₀ be asymmetric, because we want to destroy the invariance of the RT under reflection. The operator f₀ may be realised in the following simple manner:
x'(i) = f₀(x(i), x(i+1), x(i+2)) = x(i) + |x(i+1) − x(i+2)|  (2)
The transform process of the MRT (Fig. 1), identical to the transform process of the RT, requires N = 2^n input pixels, where n is a positive integer. Each column of the transform process in Fig. 1 corresponds to a particular computational step; n steps are required. In general, the variables x^(r) in any column (r) are calculated from the variables x^(r−1) in the preceding column (r−1) by
x^(r)(i + 2jS) = f₁(x^(r−1)(i + 2jS), x^(r−1)(i + (2j+1)S))
x^(r)(i + (2j+1)S) = f₂(x^(r−1)(i + 2jS), x^(r−1)(i + (2j+1)S))  (3)
where the operators f₁, f₂ for the MRT (or RT) are

f₁(a,b) = a + b;  f₂(a,b) = |a − b|  (4)
and S = 2^(n−r); t = 2^(r−1); i = 0, …, S−1; j = 0, …, t−1, where x ≡ x^(0) are the input data (pixels) and x^(n) ≡ ξ = MRT{x} are the spectral coefficients of the MRT. The MRT can be applied in all areas where the RT (or any transform from the class CT) can be used. Some undesired invariances of the RT can be destroyed by applying only one preprocessing step. Experiments with the use of the MRT [5,6] in character recognition showed that the MRT can distinguish many more patterns from one another than the RT or the Fourier power spectrum.
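The complete MRT of Eqs. 1-4 is small enough to state directly. Below is a sketch in Python with our own naming (one cyclic preprocessing step by default, per Eq. 2; the butterfly indexing follows Eq. 3):

```python
def preprocess(x):
    """One asymmetric preprocessing step (Eq. 2), indices taken cyclically:
    x'(i) = x(i) + |x(i+1) - x(i+2)|."""
    n = len(x)
    return [x[i] + abs(x[(i + 1) % n] - x[(i + 2) % n]) for i in range(n)]

def rapid_transform(x):
    """Rapid transform (RT): the WHT butterfly of Eqs. 3-4 with the
    absolute value taken after each stage; len(x) must be a power of 2."""
    x = list(x)
    N = len(x)
    assert N and N & (N - 1) == 0, "input length must be 2**n"
    span = N // 2                          # S halves at every stage
    while span >= 1:
        y = [0] * N
        for j in range(0, N, 2 * span):    # blocks of 2*S elements
            for i in range(span):
                a, b = x[j + i], x[j + i + span]
                y[j + i] = a + b              # f1(a, b) = a + b
                y[j + i + span] = abs(a - b)  # f2(a, b) = |a - b|
        x = y
        span //= 2
    return x

def mrt(x, k=1):
    """Modified rapid transform: k preprocessing steps, then the RT."""
    for _ in range(k):
        x = preprocess(x)
    return rapid_transform(x)
```

On x = [1, 2, 3, 4, 5, 0, 0, 0], the MRT of any cyclic shift of x is identical, while the reflection of x yields a different spectrum — exactly the invariance the preprocessing step is designed to break.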
3. The Information Symbols Recognition System Model
The recognition system is simulated on a digital computer using the program package CT-CAD [7]. It contains the following sub-systems (Fig. 2):
1. The original digital picture preprocessing system CSPO-III is used to accept the physical input picture and transduce it into a measurable matrix. CSPO-III divides a visual pattern into small elements and, after suitable preprocessing, produces an N×N matrix over the binary field; an element becomes 1 or 0 depending upon whether it is black or white.
2. The MRT processor, according to its function, may also be called a feature extractor. A 2-D MRT of all binary prototypes is taken in this stage. Then feature selection is carried out in the MRT "spectral" domain on various bases (maximum value of spectral coefficients, variance zonal sampling and interclass standard deviation).
3. The selected MRT features of the binary pictures (symbols) are fed into the memory during the teaching process. Thus the memory unit learns the a priori knowledge of each class before the system can be used to make any decision. In the recognition process the selected MRT features are fed into the classifier, which discriminates each pattern
(symbol) and assigns a category (a class) to it by some decision rule. We use a simple classifier based on the cross responses d_kl between two different patterns from classes k and l, defined in the next section.
Fig. 2: The MRT recognition system

4. Recognition of Information Symbols
The proposed Information Symbols recognition system was tested on two classes of selected symbols:
1. Airport Passenger Orientation Symbols (the class consists of M = 11 independent symbols) (Fig. 3).
Fig. 3: The Airport Passenger Orientation Symbols (Z1-Z11)

2. Meteorological Symbols (the class consists of M = 16 independent symbols) (Fig. 4).
Fig. 4: The Meteorological Symbols (M1-M16)

We implemented feature extraction with the MRT for both sets of Information Symbols. In general, the efficiency of the feature extraction can be assessed by the system confusion matrix D = {d_kl; k, l = 1, …, M}, where the d_kl are cross responses (the distances between any two different symbols k, l in the feature space) and M is the number of classes, i.e. the number of different symbols. The confusion matrix can be calculated in two steps, as follows:
A. All M prototypes of Information Symbols, each represented by a binary N×N matrix x_k(i,j), with i, j = 1, …, N; k = 1, …, M and M = 11 or M = 16, are transformed to the MRT transform domain:

ξ_k(i,j) = τ{x_k(i,j)},  where τ ≡ MRT  (5)

B. The cross response d_kl^(1) between two different symbols from classes k and l is defined as follows:
d_kl^(1) = Σ_{i,j=1}^{N} |ξ_k(i,j) − ξ_l(i,j)|  (6)
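Equation 6 transcribes directly over precomputed MRT spectra. The sketch below uses our own function names and represents each 2-D spectrum as nested lists:

```python
def cross_response(xi_k, xi_l):
    """Cross response d_kl of Eq. 6: city-block distance between the
    2-D MRT spectra of two symbol prototypes (nested lists)."""
    return sum(abs(a - b)
               for row_k, row_l in zip(xi_k, xi_l)
               for a, b in zip(row_k, row_l))

def confusion_matrix(spectra):
    """System confusion matrix D = {d_kl} over all M prototype spectra."""
    M = len(spectra)
    return [[cross_response(spectra[k], spectra[l]) for l in range(M)]
            for k in range(M)]
```

The diagonal of D is zero; large off-diagonal entries indicate well-separated symbol classes.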
The results of the experiments on the dependence of the recognition efficiency on the number of selected features and on the influence of noise are shown in Tab. 1. A set of 165 symbols was used for testing and teaching purposes for the Airport Passenger Orientation Symbols and a set of 240 symbols was used for testing and teaching purposes for the Meteorological Symbols; the testing set used in Tab. 1 contains 5 noised symbols for each Airport Passenger Orientation Symbol and for each Meteorological Symbol.

Tab. 1: Recognition of Airport Passenger Orientation Symbols and Meteorological Symbols
The results of the experiments may be summarised as follows:
A. Only one preprocessing step in the MRT signal graph is sufficient to destroy the undesired invariances and to significantly improve the capability of the MRT to distinguish many more patterns from one another than the original RT can.
B. Even though a very simple classifier was used, a recognition efficiency of 96%-100% can be obtained by selecting only a couple of features (0.1%-5% of the number of MRT coefficients) in the MRT spectral domain, even if the symbols are corrupted by (1%-4%) noise.
5. Conclusion
We applied the MRT in the feature extraction stage of an Information Symbols recognition system. Experiments with the recognition of two classes of symbols (Airport Passenger Orientation Symbols and Meteorological Symbols) demonstrate that, even though a very simple classifier was used, a very high recognition efficiency can be obtained by selecting only a couple of features in the MRT spectral domain, even if the symbols are corrupted by noise.
References
[1] Chmurny, J. - Turan, J.: Two-dimensional Fast Translation Invariant Transforms and Their Use in Robotics. Electronic Horizon, Vol. 15, No. 5, 1984, 211-220.
[2] Wagh, M. D. - Kanetkar, S. V.: A Class of Translation Invariant Transforms. IEEE Trans. on Acoustics, Speech and Signal Proc., Vol. ASSP-25, No. 3, 1977, 203-205.
[3] Reitboeck, H. - Brody, T. P.: A Transformation with Invariance Under Cyclic Permutation for Application in Pattern Recognition. Inf. and Control, Vol. 15, 1969, 130-154.
[4] Turan, J. - Chmurny, J.: Two-dimensional Inverse Rapid Transform. Computers and Art. Intelligence, Vol. 2, No. 5, 1983, 473-477.
[5] Fang, M. - Hausler, G.: Modified Rapid Transform. Applied Optics, Vol. 28, No. 6, 1989, 1257-1262.
[6] Turan, J.: Recognition of Printed Berber Characters Using Modified Rapid Transform. Journal on Communications, Vol. XLV, 1994, 24-27.
[7] Turan, J. - Kovesi, L. - Kovesi, M.: CAD System for Pattern Recognition and DSP with Use of Fast Translation Invariant Transform. Journal on Communications, Vol. XLV, 1994, 85-89.
Image Data Processing in a Flying Object Velocity Optoelectronic Measuring Device
Jan MIKULEC, Vaclav RICNY, Technical University Brno, Czech Republic
Abstract: This paper aims at the simulative verification of a new optoelectronic method for the measurement of flying objects' track velocity. The paper also shows the block diagram of the adapter which enables simulations using the mentioned algorithms on PC computers.
1. Introduction
The trajectory of an aircraft, or of other flying means, in space is not a simple one. Due to the influence of the motion of air masses with velocity v_W, the direction of the flight is not identical with the aircraft axis. Sufficiently accurate determination of the so-called track velocity v_T and of the track angle is a very demanding procedure. At present it is performed by methods exploiting terrestrial or orbital devices. The advantages of the described optoelectronic method should be the relatively low price of the measuring device with respect to the attainable measuring accuracy, as well as the fact that it is an autonomous and radio-passive method.
2. Measurement of the track velocity vector (TVV) using light-sensitive CCD sensors
The total time derivative of the two-dimensional implicit brightness function of the earth's surface B(x,y,t), projected onto the image plane of the sensor (see Fig. 1), is expressed by the relation
dB(x,y,t)/dt = [∂B(x,y,t)/∂x] · dx/dt + [∂B(x,y,t)/∂y] · dy/dt = [∂B(x,y,t)/∂x] · v_Tx + [∂B(x,y,t)/∂y] · v_Ty  (1)

It can be seen from Fig. 1 that the TVV components v_Tx and v_Ty are in some relation (due to the geometry of the applied optical system) with the components v_Ox and v_Oy in the image plane, according to the relations

v_Ox = −(F_o/h) · v_Tx  and  v_Oy = −(F_o/h) · v_Ty,  (2)
where the meaning of the symbols h and F_o is evident from Fig. 1.

Fig. 1: Principle of the method

If we succeed in determining the time derivative dB/dt and both the directional derivatives ∂B/∂x and ∂B/∂y, there remain two unknown variables in the equation of the total differential, namely the searched TVV components v_Ox and v_Oy. It is evident therefrom that these components can be found by solving a system of two independent equations of total differentials of the earth's surface brightness function, i.e. of the brightness function at two different places of the earth's surface. A measuring point (MP) represents an arbitrary geometrical arrangement of the photosensitive layer (of pixels) of the CCD sensor (a two-line sensor or part of an area sensor),
from which it is possible to approximate, by a suitable algorithm, the directional derivatives in two different directions. It is possible to determine the approximation of the time derivative from the change of the magnitude of the voltage signal samples of the pixels at different time instants. Generally, the position of an MP in space is totally arbitrary, but it is necessary to know its rotation angle with respect to the aircraft's axis or to another relative coordinate. For accurate determination of the TVV, two strategies are offered. The first one consists of an as accurate as possible approximation of the discretised brightness function, followed by calculation of the time and two directional derivatives of the continuous total differentials by means of the discrete values of the pixel signals in the measuring point. Then it could be sufficient, for the determination of the TVV components at a given instant, to measure in only two MPs. The second variant supposes a simple approximation of the time and directional derivatives in a great number of MPs with a simple arrangement of pixels. By a great number of combinations of equation-system solutions in different pairs of MPs, it is possible to obtain sets of not very accurate values of the TVV components, which are consequently processed by a suitable statistical method. The first variant seems to be less suitable due to the stochastic character of the brightness function and with respect to the machine time of computation of sufficiently accurate algorithms. Moreover, the resulting accuracy is influenced by the quantisation error of the A/D conversion. Therefore, the second variant of TVV determination has been chosen and it was verified by computer simulation. Fig. 2 shows a representation of the part of the CCD sensor moving over the discretised brightness function B(x,y). In correspondence with the chosen strategy of TVV determination, the time and both directional derivatives have been approximated by the simplest relations
dB(x,y)/dt ≈ (U₁(n+1,m) − U₁(n,m)) / τ,  (3)

∂B(x,y)/∂x ≈ (U₁(n,m+1) − U₁(n,m)) / δ,  (4)

∂B(x,y)/∂y ≈ (U₂(n,m) − U₁(n,m)) / δ,  (5)

Fig. 2: Shift of the MP after period τ
where n = 0, 1, 2, … is the serial number of the measurement, m = 1, 2, 3, … is the serial number of the pixel in the CCD structure, and δ is the size of the quadratic pixel. The time and directional derivatives in relation (1) are replaced by the time and directional differences (3) to (5). An MP in this case contains only three pixels, with output signals U₁(n,m), U₁(n,m+1) and U₂(n,m).
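Differences (3)-(5) and the two-MP system from Eq. 1 can be sketched as follows. The helper names are ours; the image-plane components v_Ox and v_Oy are obtained with Cramer's rule, assuming the two MPs yield linearly independent gradients:

```python
def mp_derivatives(U1_t0, U1_t1, U2_t0, m, tau, delta):
    """Finite-difference approximations at pixel m of one measuring point:
    U1_t0/U1_t1 are the first CCD line at consecutive instants, U2_t0 the
    second line, tau the sampling period, delta the quadratic pixel size."""
    dB_dt = (U1_t1[m] - U1_t0[m]) / tau        # Eq. 3
    dB_dx = (U1_t0[m + 1] - U1_t0[m]) / delta  # Eq. 4
    dB_dy = (U2_t0[m] - U1_t0[m]) / delta      # Eq. 5
    return dB_dt, dB_dx, dB_dy

def solve_tvv(d1, d2):
    """Solve dB/dt = (dB/dx)*vOx + (dB/dy)*vOy for two measuring points;
    d1, d2 are (dB_dt, dB_dx, dB_dy) triples."""
    t1, x1, y1 = d1
    t2, x2, y2 = d2
    det = x1 * y2 - x2 * y1                    # must be non-zero
    vOx = (t1 * y2 - t2 * y1) / det
    vOy = (x1 * t2 - x2 * t1) / det
    return vOx, vOy
```

In the second strategy described above, `solve_tvv` would be evaluated over many MP pairs and the resulting set of component velocities passed to the statistical (dynamic filtration) stage.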
3. Computer simulation of the measuring system activity
For the verification of the features of the designed system and for the determination of the achievable accuracy of the measurement, computer simulation has been used, which models the optoelectronic transformation of the brightness distribution B(x,y) in the CCD sensors into the voltage samples U₁(n,m) and U₂(n,m). After the simulation of the A/D conversion of these samples, all the calculations are performed in numerical form in such a manner that the computing algorithms used could be exploited in a real measuring device.
Especially important is the choice of computing algorithms for the elimination of results of partial measurements with great deviation, caused by the choice of the second strategy of TVV determination, according to chapter 2. The application of dynamic filtration by means of state estimation of the measured object was shown to be the best.

4. Dynamic filtration of the set of component velocities
The chosen filtration exploits the fact that the aircraft's motion is inertial due to its great motive power and therefore, despite ignorance of all the influences acting on its motion, it is possible to estimate its next state. The kernel of the applied dynamic filter consists of two models, namely the aircraft model and that of the measuring system. The aircraft model performs its own inertial motion and it is possible to compute, by the known equation of that motion, a more or less accurate estimation of the instantaneous quantities of the aircraft's state vectors S_Ox and S_Oy. The task of the model of the measuring system lies in the transformation of the aircraft's state estimation into an estimation of the measured quantities (the TVV components), which is consequently compared with the values measured by the measuring device on the real object (aircraft). The obtained deviation is then used, after amplification, for the correction of the motion of the aircraft model. If the amplification factor is chosen appropriately, the accuracy of the aircraft's state estimation will be improved. However, the parameters of the aircraft model depend upon the estimated values and they have to approximate the values of the real object (aircraft). The filtration is performed on the set of component velocities v_Oxi and v_Oyi. All the elements of that set have been obtained by the same system of measurement using the evaluation of the brightness distribution of different parts of the earth's surface.
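The paper gives no explicit filter equations, so the sketch below reduces the described predict/compare/correct cycle to a constant-gain scalar filter over one component velocity. This is an assumption standing in for the authors' implementation, not a transcription of it:

```python
def dynamic_filter(measured, gain=0.2, tau=1.0):
    """Constant-gain predictor-corrector over one TVV component: the model
    performs inertial (constant-velocity) motion, the amplified measured
    deviation corrects the model velocity, and the position estimate s
    accumulates the filtered velocity."""
    s, v = 0.0, measured[0]           # initial state estimate
    estimates = []
    for z in measured:
        s += v * tau                  # predict: inertial motion of the model
        v += gain * (z - v)           # correct: amplified measured deviation
        estimates.append(v)
    return estimates, s
```

With the gain chosen appropriately, the filtered velocities scatter far less than the raw partial measurements, which is the behaviour reported in Fig. 3.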
Fig. 3: Dependence of relative errors of measured values upon the serial number of measurement n

Fig. 3 represents the dependences of the relative error δ(Voy) and of the relative error of the estimate of the position vector component after the dynamic filtration,

δ(Sox) = (1/N) Σ_{n=0}^{N} (δ_n(Vox) + 1) − 1.   (6)
5. Hardware for the simulation of the measuring system function
Computer simulation for verification of the developed measuring system and its attainable results has been performed not only with image data generated by computer, but also using digitized video signals of a camera with a special double-line CCD sensor (2 x 128 pixels). The camera is able to scan moving photographs of the real earth's surface.

Fig. 4: Block diagram of the sensing unit and PC add-on card
The block diagram of the sensing unit and PC add-on card is shown in Fig. 4. This adapter enables the amplification, A/D conversion and storage of both output video signals (video 1 and video 2) of the sensor (sample frequency approx. 1 MHz, 8-bit representation) into the RAM or hard disk of a standard PC-compatible computer. The data can then be processed using the algorithms mentioned above.

6. Conclusion
Computer simulation of the designed optoelectronic method for measuring the aircraft's TVV demonstrated that high measurement accuracy can be obtained and that a real measuring device can feasibly be designed. The simulation enables the parameters of the algorithms applied for TVV determination to be optimized and the machine time for the computations to be quantified and minimized.

References
[1] RICNY, V., MIKULEC, J.: Measuring Flying Object Velocity with CCD Sensors. IEEE Aerospace and Electronic Systems, Vol. 9, No. 6, June 1994, pp. 3-6.
[2] JURIK, R.: PC Add-on Card for the Double-line CCD Sensor. Proceedings of the 6th National Scientific Conference "Radioelektronika 96". Faculty of Electrical Engineering and Computer Science, TU Brno, 1996, pp. 95-96.
Session F: TEXTURE ANALYSIS
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K.
B.G. Mertzios and P. Liatsis (Editors)
© 1996 Elsevier Science B.V. All rights reserved.
Rotation Invariant Texture Classification Schemes using GMRFs and Wavelets
Robert Porter* and Nishan Canagarajah*
Image Communications Group, Centre for Communications Research, University of Bristol, UK.

Abstract
Many texture classification schemes suffer from a number of drawbacks. They require an excessively large image area for texture analysis, use a large number of features to represent each texture and are often computationally very demanding. Furthermore, few classification schemes have the ability to maintain a high classification rate for textures that have undergone a rotation. In this paper, we present two new rotation invariant texture classification schemes based on Gaussian Markov random fields and the wavelet transform. These schemes offer a high classification performance on textures at any orientation using significantly fewer features and a smaller area of analysis than most existing schemes.
1. Introduction
Texture classification is a difficult but important area of image analysis with a wide variety of applications ranging from remote sensing and crop classification to medical diagnosis. A number of approaches to this problem have been proposed over recent years including stochastic models such as Gaussian Markov Random Fields (GMRFs) [1] and autoregression [2, 3], statistical analysis methods [4] and spatial frequency based techniques [5, 6] amongst many others. However, many of the existing methods require a large number of features to describe each texture which can lead to an unmanageable size of feature space [4]. Furthermore, the feature extraction techniques employed are often computationally very demanding [4] and require an excessively large image area for the analysis [4, 6]. This is clearly undesirable if only small texture samples are available or if the features are to be applied to a segmentation problem requiring high resolution. Another drawback of the majority of classification schemes is their inability to maintain a high classification rate when the textures for classification have undergone a rotation [5]. Here, two new classification schemes are proposed, employing features extracted using either wavelet analysis or Gaussian Markov random field modelling on a small area of the image. It is shown that these schemes require significantly fewer features than most others and provide high performance rotation invariant texture classification.
2. Proposed Schemes
2.1 The Wavelet Transform
The first approach derives features from a 3-level wavelet decomposition of a small area (16x16) of the image. Fig. 1(a) shows the 10 main wavelet channels resulting from such a decomposition. A feature vector made up of the average energies within these channels was successfully employed in segmenting textured images in [7]. However, the HH channels in each level of decomposition tend to contain the majority of noise in the image and were found to degrade the performance when used for texture classification. Therefore, only the remaining seven channels were chosen to provide features for texture classification (the numbered channels in Fig. 1(b)). The energy in each of the chosen wavelet channels is calculated to create a seven-dimensional feature vector for texture classification. The energy of a wavelet channel is given simply by the mean magnitude of its wavelet coefficients, i.e. e_cn, the energy in the nth channel, is given by:

e_cn = (1/MN) Σ_{i=1}^{M} Σ_{j=1}^{N} |x(i,j)|,   (1)
where the channel is of dimensions M by N, i and j are the rows and columns of the channel and x is a wavelet coefficient within the channel. Unfortunately, these features are not rotation invariant, since different features are used to represent the texture's horizontal and vertical frequency components. Rotation invariance can be achieved by combining the horizontal and vertical frequency components to form single features. Hence, the pairs of diagonally opposite LH and HL channels in each level of decomposition are grouped together to produce four main frequency or scale bands in the proposed scheme, as illustrated in Fig. 1(c). The energy in each of the four chosen bands is calculated (using equation 1) to create a four-dimensional feature vector which is then used in the classification algorithm. This approach is thus based entirely on the composition of spatial frequencies within the texture and is not heavily dependent on the texture's directionality. Although this can have disadvantages in distinguishing between textures of very similar spatial frequency, it provides a robust rotation invariant set of features for texture classification.

* e-mail: [email protected]
* e-mail: [email protected]
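The four-band feature extraction described above can be sketched as follows; the unnormalized averaging Haar filter and the pooling of the LH/HL energies by averaging are assumptions made for illustration (the paper does not state which wavelet basis is used):

```python
import numpy as np

def haar2(x):
    """One level of a 2-D Haar-style decomposition (illustrative sketch)."""
    a = (x[0::2, :] + x[1::2, :]) / 2.0   # row averages
    d = (x[0::2, :] - x[1::2, :]) / 2.0   # row differences
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0  # horizontal detail
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0  # vertical detail
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0  # diagonal detail (discarded as noisy)
    return ll, lh, hl, hh

def energy(c):
    return float(np.mean(np.abs(c)))      # equation (1): mean coefficient magnitude

def rotation_invariant_features(patch):
    """Four-band feature vector from a 3-level decomposition of a 16x16 patch:
    LH and HL energies are pooled at each level, HH is discarded, and the
    final low-pass band supplies the fourth feature."""
    feats = []
    ll = np.asarray(patch, dtype=float)
    for _ in range(3):
        ll, lh, hl, _hh = haar2(ll)
        feats.append((energy(lh) + energy(hl)) / 2.0)
    feats.append(energy(ll))
    return feats
```

A constant patch yields zero detail energies and only the low-pass feature, as expected for a texture with no spatial-frequency content.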
Figure 1 - (a) Ten main channels of a 3-level wavelet decomposition of an image; (b) Wavelet channels used to produce features for texture classification; (c) Grouping of wavelet channels to form the 4 bands used to produce rotation invariant features.
2.2 Gaussian Markov Random Fields
GMRFs have been shown to perform well both in texture classification [1] and image segmentation. Here, the texture can be represented as a set of zero mean observations,

y(s), s ∈ Ω, Ω = {s = (i, j): 0 ≤ i, j ≤ M − 1},   (2)

for an M x M lattice. The GMRF model assumes the observations obey the following equation [1]:

y(s) = Σ_{r ∈ Ns} θ_r y(s + r) + e(s),   (3)

where Ns is the neighbour set, θ_r is the GMRF parameter for neighbour r and e(s) is a stationary Gaussian noise sequence. The neighbour set is assumed to be symmetric:

θ_r = θ_{−r}, for all r ∈ Ns.   (4)
The GMRF parameters and the variance, v, of the noise source can be estimated for a given texture using the least squares approach [1] and are often successfully employed as features for texture classification. However, these features are not rotation invariant, since each pair of neighbours can only represent the texture in a single direction. It was found that in order to achieve rotation invariance, the neighbour set should be circularly symmetric, so that each GMRF parameter depends on neighbours in all directions. The neighbour sets for the 1st, 2nd and 3rd order circular GMRFs are shown in Fig. 2. The grey levels of neighbours which do not fall exactly in the centre of pixels can be estimated by interpolation. This model is the GMRF equivalent of the autoregressive models in [2] and [3], but was found to give a high classification performance without the need for multiresolution analysis [3] and is thus more computationally efficient. For the third order circular GMRF, just three parameters exist for the three sets of circularly symmetric neighbours. The features used for texture classification comprise these three parameters and the variance parameter, extracted using the least squares approach from a 16x16 area of the image. The third order GMRF is chosen to balance a high performance with a small number of features.
Figure 2 - Neighbour sets for 1st, 2nd and 3rd order circular GMRFs.
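A least-squares fit of the circular model can be sketched as below; approximating the three circular rings by integer-grid neighbour sets, instead of the interpolated off-pixel positions the paper describes, is a simplification made for illustration:

```python
import numpy as np

# Illustrative least-squares estimation of a 3rd order circular GMRF.
# Each regressor is the sum over one circularly symmetric ring of neighbours;
# integer-grid rings stand in for the interpolated circular rings.
RINGS = [
    [(-1, 0), (1, 0), (0, -1), (0, 1)],     # 1st ring (nearest neighbours)
    [(-1, -1), (-1, 1), (1, -1), (1, 1)],   # 2nd ring (diagonals)
    [(-2, 0), (2, 0), (0, -2), (0, 2)],     # 3rd ring
]

def circular_gmrf_features(patch):
    """Return [theta_1, theta_2, theta_3, v] for an image patch."""
    y = np.asarray(patch, dtype=float)
    y = y - y.mean()                         # zero-mean observations, eq. (2)
    h, w = y.shape
    rows, targets = [], []
    for i in range(2, h - 2):                # interior pixels only
        for j in range(2, w - 2):
            rows.append([sum(y[i + di, j + dj] for di, dj in ring)
                         for ring in RINGS])
            targets.append(y[i, j])
    x = np.array(rows)
    t = np.array(targets)
    theta, *_ = np.linalg.lstsq(x, t, rcond=None)  # least squares fit of eq. (3)
    v = float(np.mean((t - x @ theta) ** 2))       # noise variance estimate
    return [*theta, v]
```

Applied to a 16x16 area this yields the four-dimensional feature vector (three ring parameters plus the variance) used by the proposed scheme.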
3. Classification Results
Sixteen 256x256 Brodatz textures [8] were used to test the performance of the features. One sample image of each texture was used to provide several 16x16 sub-images with which to train the classification algorithm. A further 7 sample images of each texture were presented to the algorithm in a random order as unknown textures for classification. A minimum distance classifier was employed (using the Mahalanobis distance [6]) to perform the actual classification. Training and classification were first performed on the original textures, producing the first column of results in Table 1. The training set was then presented at angles of 0, 30, 45 and 60 degrees and the textures for classification at 20, 70, 90, 120, 135 and 150 degrees, yielding the second column of results in Table 1. The classification results for the two proposed rotation invariant schemes were compared to those using features from the traditional 3rd order GMRF and from the wavelet transform without the combination of channels. Table 1 summarises the results. Although the third order GMRF parameters give 100% correct classification when the textures are presented at their original orientation, they perform very poorly on the rotated textures, classifying only 45.8% of the samples correctly (see confusion matrix in Fig. 3a). This is due to the strong directional dependence of the parameters in the traditional GMRF model. The proposed circular GMRF model uses a circularly symmetric neighbour set to remove this directional dependence, resulting in a high classification performance both for the textures at their original orientations (93.8%) and for the rotated textures (95.1%). The confusion matrix in Fig. 3(b) illustrates this performance for the rotated textures. Misclassifications tend to occur either for visually very similar textures (e.g. paper and sand) or for textures with a high level of directionality which cannot be identified using a circular model (e.g. wood).
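The minimum distance classification step can be sketched as follows; summarising each class by its training mean and using a single pooled covariance matrix for the Mahalanobis distance is an assumption (the paper does not state how the covariance is estimated):

```python
import numpy as np

# Sketch of a minimum-distance classifier with the Mahalanobis metric.
# Each class is represented by the mean of its training feature vectors;
# a pooled covariance over all classes is an illustrative assumption.

def train(features_by_class):
    """features_by_class: {label: (num_samples, num_features) array}."""
    means = {c: np.mean(f, axis=0) for c, f in features_by_class.items()}
    centred = np.vstack([f - means[c] for c, f in features_by_class.items()])
    pooled = np.atleast_2d(np.cov(centred.T))
    return means, np.linalg.inv(pooled)

def classify(x, means, inv_cov):
    """Assign x to the class with the smallest Mahalanobis distance."""
    def d2(m):
        diff = np.asarray(x, dtype=float) - m
        return float(diff @ inv_cov @ diff)
    return min(means, key=lambda c: d2(means[c]))
```

In the experiments above, x would be the four- or seven-dimensional feature vector extracted from a 16x16 sub-image.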
The wavelet-based features using seven channels of the wavelet transform also have a strong directional dependence. These features give a high classification performance for the original textures (99.1%), but a mediocre performance for the rotated textures (86.5%, see Fig. 3c). By combining the directionally dependent wavelet channels, as in the proposed scheme, a high level of rotation invariance is achieved giving a correct classification rate of 95.5% for the original textures and 95.8% for the rotated textures. The scheme's performance for the rotated textures is illustrated in the confusion matrix in Fig. 3(d). The misclassifications occur only on the highly directional textures such as wood and raffia. This is because the directional information is lost when the wavelet channels are combined. For each of the proposed schemes, there is a slight degradation in their performance on the original textures compared to the non-rotation invariant approaches. This is due to the loss in directional information on making the schemes rotation invariant.
4. Conclusion
Two novel texture classification schemes have been proposed, the first using the wavelet transform and the second using Gaussian Markov random fields. These schemes exhibit comparable performances to existing methods but both use a significantly smaller feature space. Furthermore, the features are robust and computationally inexpensive (both methods are amenable to fast implementation) and only a small analysis area for feature extraction is required, as desirable for texture segmentation applications. In addition, unlike most existing techniques, the proposed schemes are invariant to rotations of the textures to be classified, attaining the same high classification performance on the textures at all orientations. The traditional GMRF approach or the non-rotation invariant wavelet method are obviously preferable if the textures are guaranteed to occur only at the orientation they have been trained at. However, the proposed schemes are far superior when the rotation of the texture is not known a priori, as is often the case in real applications. The wavelet-based approach is especially favourable, since it gives a higher performance, is computationally more efficient and its features are easily derivable from its non-rotation invariant counterpart.
Method                                                 | Original Textures | Rotated Textures
3rd order GMRFs (7 features)                           | 100.0%            | 45.8%
3rd order Circular GMRFs (4 features)                  | 93.8%             | 95.1%
Wavelet-Based Features (7 features)                    | 99.1%             | 86.5%
Rotation Invariant Wavelet-Based Features (4 features) | 95.5%             | 95.8%

Table 1 - Texture Classification Performance Results
References
[1] R. Chellappa and S. Chatterjee, "Classification of Textures Using Gaussian Markov Random Fields," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 33, no. 4, pp. 959-963, Aug. 1985.
[2] R.L. Kashyap and A. Khotanzad, "A Model-Based Method for Rotation Invariant Texture Classification," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 8, no. 4, July 1986.
[3] J. Mao and A.K. Jain, "Texture Classification and Segmentation Using Multiresolution Simultaneous Autoregressive Models," Pattern Recognition, vol. 25, no. 2, pp. 173-188, Feb. 1992.
[4] Y.Q. Chen, M.S. Nixon and D.W. Thomas, "Statistical Geometrical Features for Texture Classification," Pattern Recognition, vol. 28, no. 4, pp. 537-552, Apr. 1995.
[5] K. Etemad and R. Chellappa, "Separability Based Tree Structured Local Basis Selection for Texture Classification," Proc. International Conference on Image Processing 1995, pp. 441-445.
[6] T. Chang and C.-C.J. Kuo, "Texture Analysis and Classification with Tree-Structured Wavelet Transform," IEEE Trans. Image Processing, vol. 2, no. 4, pp. 429-441, Oct. 1993.
[7] R. Porter and C.N. Canagarajah, "A Robust Automatic Clustering Scheme for Image Segmentation using Wavelets," IEEE Trans. Image Processing, vol. 5, no. 4, pp. 662-665, Apr. 1996.
[8] P. Brodatz, Textures: A Photographic Album for Artists and Designers. Dover: New York, 1966.
Figure 3 - Confusion matrices for classification results of rotated textures using: (a) GMRF features; (b) circular GMRF features; (c) wavelet-based features; (d) rotation invariant wavelet-based features.
A NEW METHOD FOR DESCRIBING TEXTURE
D. T. Pham* and B. G. Çetiner+
*Intelligent Systems Laboratory, School of Engineering, University of Wales, Cardiff, PO Box 917, Newport Road, Cardiff, CF2 1XH, UK.
+Istanbul Technical University, Faculty of Aeronautics Engineering, Maslak, Istanbul, Turkey.
ABSTRACT A new method is presented for obtaining feature vectors for describing texture. The method uses grey level difference matrices that are reminiscent of co-occurrence matrices but are much simpler to compute. Textural feature vectors are classified using artificial neural networks (ANNs). Comparative results for the new method and the standard Spatial Grey Level Dependence (SGLD) method are provided. Key words: Texture Analysis, Texture Classification, Neural Networks.
1 INTRODUCTION Texture is a fundamental stimulus for visual perception. Natural image analysis systems, such as the human visual system, use texture as an aid in segmentation and interpretation of scenes. Despite its importance, there is no generally accepted definition of texture and no agreement on how to measure it. This paper describes a new second-order statistics method for computing textural features and provides the results of using neural networks to recognise textures based on those features. Comparative results for the Spatial Grey Level Dependence (SGLD) or co-occurrence matrix method [Haralick et al., 1973] are also presented.
2 GREY LEVEL DIFFERENCE (GLD) METHOD FOR TEXTURE ANALYSIS
The method involves computing GLD matrices, each element of which is the sum of scaled grey level differences between neighbouring pixels. Grey levels are quantised into groups to reduce the dimensions of the matrix, the number of groups being the number of rows/columns in the matrix. For each interpixel distance d and direction θ, a matrix can be computed. The concepts of interpixel distance and direction are similar to those adopted in the SGLD method. For example, with d=1, pixels that are immediately next to the pixel of interest are considered, and with d=2, pixels that are separated by one pixel from the pixel of interest are used. There is a maximum of 8 directions, namely θ = 0°, 45°, 90°, 135°, 180°, 225°, 270° and 315°. These define the position of a neighbouring pixel relative to the pixel of interest. For instance, the 0° and 180° neighbours of a pixel are the pixels to its right and to its left respectively. The GLD matrix for a given d and θ is computed as follows:
(i) Quantise the grey levels into n groups. This fixes the dimensions of the GLD matrices to n x n.
(ii) Initialise all elements of the GLD matrix to zero.
(iii) Select the pixel to be processed in the image window. Call this pixel 1.
(iv) Find the neighbour of pixel 1 at the specified interpixel distance and in the specified direction. Call this pixel 2.
(v) Calculate the scaled grey levels of pixels 1 and 2, namely:
P1 = p1 / Ng,    P2 = p2 / Ng,

where p1 and p2 are the raw grey levels of pixels 1 and 2, P1 and P2 range between 0 and 1 and Ng is the number of grey levels in the image.
(vi) Calculate the scaled grey level difference between pixels 1 and 2:

GLD = |P1 − P2| + 1.
Thus, GLD is a number between 1 and 2. GLD is equal to 1 when P1 and P2 are the same, and to 2 when P1 is 1 and P2 is 0 or vice versa. GLD is arranged to be between 1 and 2 so that elements representing zero grey level differences are distinguished from ordinary (initialised) zero elements in the GLD matrix.
(vii) Determine the GLD matrix element, corresponding to the scaled grey levels P1 and P2, that is to be updated. The position (i, j) of the element is calculated as follows: i = INT[n * P1]; j = INT[n * P2], where INT is a function that converts the real numbers n * P1 and n * P2 into the nearest integers.
(viii) Update the GLD matrix element found in the previous step by adding to it the GLD value obtained in step (vi), that is: new_GLD(i, j) = old_GLD(i, j) + GLD.
(ix) If all neighbouring pixels of pixel 1 have been processed then go to step (x). Otherwise, go to step (iv).
(x) If all pixels in the image window have been processed then STOP. Otherwise, go to step (iii).
As an example, consider the image window in Figure 1(a). The numbers of grey levels and grey level groups are 64 and 5 respectively. Let pixel 1 (with grey level equal to 48) be element (3, 3) and pixel 2 (with grey level equal to 35) be element (3, 4) in the image window. The scaled grey levels for these pixels are P1 = 48/64 = 0.75 and P2 = 35/64 = 0.547. The GLD value for these pixels is |P1 − P2| + 1 = 1.203. The GLD matrix element corresponding to P1 and P2 is element (4, 3). Thus, it is updated from its initial zero value to 1.203. Similarly, let pixel 1 be element (4, 1) and pixel 2 be element (4, 2). The scaled grey levels for these pixels are P1 = 0.75 and P2 = 0.5. The GLD value for these pixels is 1.25. Again, the GLD matrix element corresponding to P1 and P2 is element (4, 3). Thus, that element now becomes 2.453. The GLD matrix for the entire image window corresponding to an interpixel distance of 1 and a neighbouring pixel direction of 0° is shown in Figure 1(b).
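Steps (i) to (viii) can be sketched for d=1 and the 0° direction as below; the (n+1) x (n+1) matrix size and the round-half-up integer conversion are assumptions made so that the indices match the worked example:

```python
import numpy as np

# Sketch of GLD matrix construction for one interpixel distance and direction
# (d=1, 0 degrees, i.e. the right-hand neighbour). Indices can reach n under
# nearest-integer scaling, so an (n+1) x (n+1) matrix is used here; halves are
# rounded up, matching the worked example in the text.

def gld_matrix(window, n_groups, n_grey):
    w = np.asarray(window, dtype=float)
    m = np.zeros((n_groups + 1, n_groups + 1))
    for i in range(w.shape[0]):
        for j in range(w.shape[1] - 1):       # pixel 2 is the 0-degree neighbour
            p1 = w[i, j] / n_grey             # scaled grey levels, step (v)
            p2 = w[i, j + 1] / n_grey
            gld = abs(p1 - p2) + 1.0          # scaled difference in [1, 2], step (vi)
            r = int(n_groups * p1 + 0.5)      # nearest-integer bins, step (vii)
            c = int(n_groups * p2 + 0.5)
            m[r, c] += gld                    # accumulate, step (viii)
    return m
```

Running this on the pixel pairs (48, 35) and (48, 32) from the example, with 64 grey levels and 5 groups, reproduces the contributions 1.203 and 1.25 to element (4, 3).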
Figure 1. (a) Image window; (b) GLD matrix calculated from the image window

3 CLASSIFICATION OF GLD MATRICES
GLD matrices were constructed for the 16 texture images from the Brodatz album [Brodatz, 1968]. These were 128x128 images of natural objects or scenes (for instance, reptile skin, grass lawn and beach pebbles). Each image was divided into 32x32 non-overlapping windows. This yielded a total of 256 patterns. The number of grey levels was 256. Eight grey level groups were employed, giving GLD matrices of size 8x8. In addition to individual GLD matrices for the eight directions, direction invariant matrices were also computed by adding the corresponding elements in the individual matrices. Interpixel distances of 1 to 5 were adopted. This gave a total of 45 data sets, each with 256 patterns. A GLD matrix was obtained for each pattern. The matrix elements were used directly as features. Half of the feature vectors were selected randomly and employed as training examples. The remainder were used to test the classification accuracy of the trained classifiers. Thus, there were 45x128 feature vectors for training and the same number for testing.
The LVQ2 neural network with a conscience mechanism [Pham and Oztemel, 1994, 1996] was adopted as the tool for classifying the feature vectors into the correct texture class. That network was chosen after comparing its performance with the popular Multi-layer Perceptron classifier [Pham and Liu, 1995] on an experimental group of 9 data sets. The network had 64 inputs (the elements of the GLD matrix), 16 outputs (the texture classes) and 96 hidden Kohonen neurons. The number of Kohonen neurons was chosen empirically. To compare the proposed texture description method against the popular SGLD method, SGLD features were obtained for the same directions as for the proposed method. A feature vector of five components (energy, entropy, correlation, local homogeneity and inertia) was computed for each direction. An LVQ2 network was also employed for classifying the feature vectors. The network had 5 inputs (the elements of a feature vector), 16 outputs (the texture classes) and 32 Kohonen neurons. Again, the number of Kohonen neurons was found empirically.

4 RESULTS AND DISCUSSION
Table 1 gives the results for all 45 data sets. It can be observed that the classification accuracy using GLD matrices is superior to that using SGLD features for all interpixel distances and directions. The table also shows that with both methods the best accuracies were obtained for an interpixel distance of 1.
Note that, although the dimension of the feature vectors in the SGLD method is smaller than that for the proposed method, the computation required to obtain the SGLD feature vectors [Haralick et al., 1973] is much more demanding. Additionally, the time required to train the LVQ classifiers to recognise the information-rich GLD feature vectors was comparable to that for the SGLD feature vectors.
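The prototype-update principle behind the LVQ classifiers used above can be sketched with the basic LVQ1 rule; the LVQ2 variant with a conscience mechanism refines this rule, so the code below is only an illustration, not the authors' classifier:

```python
import numpy as np

# Minimal LVQ sketch (basic LVQ1 rule): the nearest prototype is pulled
# towards a correctly labelled sample and pushed away otherwise. LVQ2 with
# a conscience mechanism, as used in the paper, elaborates on this update.

def train_lvq(samples, labels, prototypes, proto_labels, lr=0.1, epochs=10):
    protos = np.array(prototypes, dtype=float)
    for _ in range(epochs):
        for x, y in zip(np.asarray(samples, dtype=float), labels):
            k = int(np.argmin(((protos - x) ** 2).sum(axis=1)))  # nearest prototype
            if proto_labels[k] == y:
                protos[k] += lr * (x - protos[k])   # pull towards the sample
            else:
                protos[k] -= lr * (x - protos[k])   # push away from the sample
    return protos

def predict(x, protos, proto_labels):
    k = int(np.argmin(((protos - np.asarray(x, dtype=float)) ** 2).sum(axis=1)))
    return proto_labels[k]
```

In the experiments described above, the samples would be the 64-element GLD matrices (or the 5-element SGLD vectors) and the prototypes the Kohonen neurons.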
Table 1. Number of misclassifications for each data set and average classification accuracies.

5 CONCLUSION
A new texture analysis method based on grey level difference statistics has been described and its results have been compared with those of the SGLD method. The new method gave much better texture discrimination accuracies than the SGLD method on the natural texture images chosen from the Brodatz album.
References
Brodatz P. (1968) "Textures: A Photographic Album for Artists and Designers", Van Nostrand Reinhold, New York.
Haralick R. M., Shanmugam K. and Dinstein I. (1973) "Textural Features for Image Classification", IEEE Trans. Syst., Man, Cybern., Vol. SMC-3, No. 6, November, pp. 610-621.
Pham D. T. and Liu X. (1995) "Neural Networks for Identification, Prediction and Control", Springer-Verlag, London and Berlin, pp. 4-7.
Pham D. T. and Oztemel E. (1994) "Control Chart Pattern Recognition Using Learning Vector Quantization Networks", Int. J. Production Research, 32(3), pp. 721-729.
Pham D. T. and Oztemel E. (1996) "Intelligent Quality Systems", Springer-Verlag, London and Berlin.
Texture Discrimination for Quality Control Using Wavelet and Neural Network Techniques
D.A. Karras 1 and S.A. Karkanis 2 and B.G. Mertzios 3
1University of Ioannina, Department of Informatics, Ioannina 45110, Greece, [email protected]
2NRCPS "Democritos", Inst. of Nuclear Technology, Aghia Paraskevi, 15310 Athens, Greece, [email protected]
3Democritus Univ. of Thrace, Dept. of Electr. and Comp. Eng., 67 100 Xanthi, Greece, [email protected]
Abstract
This paper investigates a novel solution to the problem of defect recognition from images, which can find applications in building robust vision-based quality control systems. Such applications can be found in the production lines of textiles, integrated circuits, machinery, etc. The proposed solution focuses on detecting defects from their textural properties. More specifically, a novel methodology is investigated for discriminating defects in textile images by applying a supervised neural classification technique, employing a multilayer perceptron (MLP) trained with the online backpropagation algorithm, to innovative wavelet based feature vectors. These vectors are extracted from the original image using the cooccurrence matrices framework and SVD analysis. The results of the proposed methodology are illustrated on a defective textile image, where the defective area is recognized with 98.48% accuracy.
I. Introduction
Defect recognition from images is becoming increasingly significant in a variety of applications, since quality control plays a very important role in contemporary manufacturing of virtually every product. Despite this considerable interest, little work has been done in this field, since the classification problem presents many difficulties. However, the resurgence of interest in neural network research has revealed the existence of powerful classifiers. In addition, the emergence of the 2-D wavelet transform [5],[6] as a popular tool in image processing offers the ability of robust feature extraction in images. Combinations of both techniques have been used with success in various applications [10]. Therefore, it is worth investigating whether they can jointly offer a viable solution to the defect recognition problem. To this end, we propose a novel methodology for detecting defective areas in images by examining the discrimination abilities of their textural properties. Besides neural network classifiers and the 2-D wavelet transform, the tools utilized in such an analysis are cooccurrence matrix based textural feature extraction [4] and SVD analysis. The problem at hand can be viewed as an image segmentation one where, unlike the conventional formulation, the image should be segmented into defective and non-defective areas only. Concerning the classical segmentation problem, that is, dividing an image into homogeneous regions, the discovery of a generally effective scheme remains a challenge. To this end, many interesting techniques have been suggested so far, including spatial frequency techniques [9] and related ones such as texture clustering in the wavelet domain [9]. Most of these methodologies use very simple features, like the energy of the wavelet channels [9] or the variance of the wavelet coefficients [3]. Our approach stems from this line of research.
However, much more sophisticated feature extraction methods are needed if one wants to solve the segmentation problem in its defect recognition incarnation, taking into account the high accuracy required. Following this reasoning, we propose to incorporate cooccurrence matrix analysis into these research efforts, since it offers a very accurate tool for describing image characteristics and especially texture [4]. It provides second order information about pixel intensities, which the majority of other feature extraction techniques do not exploit at all. The suggested system has two main stages, namely, optimal feature selection in the wavelet domain (optimal in terms of the information these features carry) and neural network based classification. The viability of the concepts and methods employed in the proposed approach is illustrated in the experimental section of the paper, where it is clearly shown that, by achieving a 98.48% defective area classification accuracy, our methodology is very promising for use in the quality control field.
II. Stage A: Optimal feature selection in the wavelet domain
The problem of texture discrimination, aiming at segmenting the defective areas in images, is considered in the wavelet domain, since it has been demonstrated that the discrete wavelet transform (DWT) can lead to better texture modeling [1]. Also, in this way we can better exploit the well known local information extraction properties of wavelet signal decomposition, as well as the well known features of wavelet denoising procedures [7]. We use the popular 2-D discrete wavelet transform scheme ([5],[6] etc.) in order to obtain the wavelet analysis of the original images containing defects. The images considered in the wavelet domain are expected to be smooth, but due to the well known time-frequency localization properties of the wavelet transform, the defective areas, whose statistics vary from those of the image background, should more or less clearly emerge from the background. We have experimented with the standard 2-D wavelet transform using nearly all the well known wavelet bases, like Haar, Daubechies, Coiflet, Symmlet etc., as well as with Meyer's and Kolaczyk's 2-D wavelet transforms [6]. However, and this is very interesting, only the 2-D Haar wavelet transform has exhibited the expected and desired properties. All the
other orthonormal, continuous and compactly supported wavelet bases smoothed the images so much that the defective areas did not appear in the subbands. We have performed a one-level wavelet decomposition of the images, thus resulting in four main wavelet channels. Among the three channels 2, 3, 4 (frequency index) we have selected for further processing the one whose histogram presents the maximum variance. A lot of experimentation has shown that this is the channel corresponding to the clearest appearance of the defective areas. The subsequent step in the proposed methodology is to raster scan the image obtained from the selected wavelet channel with sliding windows of M x M dimensions. We have experimented with 256 x 256 images and have found that M=8 is a good size for the sliding window. For each such window we perform two types of analysis in order to obtain features optimal in terms of information content. First, we use the information that comes from the cooccurrence matrices [4]. These matrices represent the spatial distribution and the dependence of the gray levels within a local area. Each (i,j)-th entry of the matrices represents the probability of going from one pixel with gray level (i) to another with gray level (j) under a predefined distance and angle. Matrices are formed for specific spatial distances and predefined angles. From these matrices, sets of statistical measures (called feature vectors) are computed for building different texture models. We have considered four angles, namely 0°, 45°, 90° and 135°, as well as a predefined distance of one pixel, in the formation of the cooccurrence matrices. Therefore, we have formed four cooccurrence matrices. Due to computational complexity issues regarding cooccurrence matrix analysis, we have quantized the image obtained from the selected wavelet channel into 16 gray levels instead of the usual 256 levels, without adverse effects on defective area recognition accuracy.
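The one-level decomposition and channel-selection step can be sketched as follows; using the variance of the channel coefficients directly as a stand-in for the histogram variance is a simplification:

```python
import numpy as np

# Sketch of the channel-selection step: a one-level 2-D Haar decomposition
# yields four channels, and among the three detail channels the one with the
# largest variance is kept for further processing. The coefficient variance
# is used here as a stand-in for the histogram variance.

def select_detail_channel(img):
    x = np.asarray(img, dtype=float)
    a = (x[0::2] + x[1::2]) / 2.0             # row averages
    d = (x[0::2] - x[1::2]) / 2.0             # row differences
    channels = [                              # the three Haar detail channels
        (a[:, 0::2] - a[:, 1::2]) / 2.0,      # horizontal detail
        (d[:, 0::2] + d[:, 1::2]) / 2.0,      # vertical detail
        (d[:, 0::2] - d[:, 1::2]) / 2.0,      # diagonal detail
    ]
    variances = [c.var() for c in channels]
    return channels[int(np.argmax(variances))]
```

The selected half-resolution channel is then raster scanned with the 8 x 8 sliding windows described above.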
This quantization also renders the on-line implementation of the proposed system highly feasible. Among the 14 statistical measures originally proposed by Haralick [4] that are derived from each cooccurrence matrix, we have considered only four: angular second moment, correlation, inverse difference moment and entropy.
• Energy (Angular Second Moment):  f1 = Σ_i Σ_j p(i, j)²

• Correlation:  f2 = [ Σ_{i=1..Ng} Σ_{j=1..Ng} (i · j) p(i, j) − μx μy ] / (σx σy)

• Inverse Difference Moment:  f3 = Σ_i Σ_j p(i, j) / (1 + (i − j)²)

• Entropy:  f4 = − Σ_i Σ_j p(i, j) log p(i, j)

Here p(i, j) is the (i, j)-th entry of the normalized cooccurrence matrix, Ng is the number of gray levels, and μx, μy, σx, σy are the means and standard deviations of the row and column marginal distributions of p.
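The cooccurrence matrices themselves can be formed as in the following sketch (ours, in Python/NumPy; function names are illustrative). It quantizes a window to 16 gray levels and counts gray-level transitions at distance one for a given angle, as described in the text; the paper does not say whether the matrix is symmetrized, so this version counts one direction only.

```python
import numpy as np

# Offsets (drow, dcol) for angles 0, 45, 90, 135 degrees at distance 1.
OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}

def quantize(window, levels=16):
    """Map a window onto `levels` gray levels (16 instead of 256)."""
    w = np.asarray(window, dtype=float)
    w = w - w.min()
    rng = w.max()
    if rng == 0:
        return np.zeros(w.shape, dtype=int)
    return np.minimum((w / rng * levels).astype(int), levels - 1)

def cooccurrence(window, angle, levels=16):
    """Normalized cooccurrence matrix for one angle at distance 1."""
    q = quantize(window, levels)
    dr, dc = OFFSETS[angle]
    P = np.zeros((levels, levels))
    rows, cols = q.shape
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                P[q[r, c], q[r2, c2]] += 1
    total = P.sum()
    return P / total if total else P
```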
We have experimentally found that these measures provide high discrimination accuracy, which can be only marginally increased by adding more measures to the feature vector. Thus, using the above-mentioned four cooccurrence matrices we have obtained 16 features describing the spatial distribution in each 8 x 8 sliding window in the wavelet domain. In addition, we have formed another set of 8 features for each such window by extracting the singular values of the matrix corresponding to this window. SVD analysis has recently been successfully related to invariant pattern recognition [8]. It is therefore reasonable to expect that it provides a meaningful means of characterizing each sliding window, preserving first-order information about the window, while the cooccurrence matrix analysis extracts second-order information. We have thus formed, for each sliding window, a feature vector containing 24 features that uniquely characterizes it. These feature vectors feed the neural classifier of the subsequent stage of the suggested methodology, described next.
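A sketch of the 24-element feature vector (the four Haralick measures from each of the four cooccurrence matrices, plus the 8 singular values of the window). This is our Python/NumPy illustration, assuming the normalized cooccurrence matrices have already been computed; function names are our own.

```python
import numpy as np

def haralick4(P):
    """The four measures kept in the text: energy (ASM), correlation,
    inverse difference moment and entropy, for a normalized
    cooccurrence matrix P."""
    n = P.shape[0]
    i, j = np.indices((n, n))
    asm = np.sum(P ** 2)                                     # f1
    px, py = P.sum(axis=1), P.sum(axis=0)
    g = np.arange(n)
    mx, my = np.sum(g * px), np.sum(g * py)
    sx = np.sqrt(np.sum((g - mx) ** 2 * px))
    sy = np.sqrt(np.sum((g - my) ** 2 * py))
    corr = (np.sum(i * j * P) - mx * my) / (sx * sy) if sx * sy else 0.0  # f2
    idm = np.sum(P / (1.0 + (i - j) ** 2))                   # f3
    ent = -np.sum(P[P > 0] * np.log(P[P > 0]))               # f4
    return np.array([asm, corr, idm, ent])

def window_features(window, coocc_mats):
    """24-feature vector: 4 measures x 4 cooccurrence matrices (16)
    plus the 8 singular values of the 8 x 8 window itself."""
    texture = np.concatenate([haralick4(P) for P in coocc_mats])
    sv = np.linalg.svd(np.asarray(window, dtype=float), compute_uv=False)
    return np.concatenate([texture, sv])
```

For a uniform 16 x 16 matrix, energy is 1/256, correlation is zero and entropy is log 256, which gives a quick sanity check of the formulas.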
III. Stage B: Neural network based segmentation of defective areas

After obtaining information about the textural structure and other characteristics of each image using the methodology described above, we employ a supervised neural network of the multilayer feedforward type (MLP), trained with the online backpropagation algorithm, whose goal is to decide whether a texture region belongs to a defective part or not. The inputs to the network are the 24 features of the feature vector extracted from each sliding window. The best network architecture tested in our experiments is 24-35-35-1. The desired outputs during training are determined by the corresponding sliding window location: if a sliding window belongs to a defective area, the desired output of the network is one; otherwise it is zero. During the MLP training phase, we have defined a sliding window to belong to a defective area if any of the pixels in the 4 x 4 central window inside the original 8 x 8 sliding window belongs to the defect. The reasoning underlying this definition is that the decision about whether a window belongs to a defective area should come from neighborhood information, thus preserving the 2-D structure of the problem, and not from information associated with only one pixel (e.g., the central pixel). In addition, and probably more significantly, by defining the two classes in such a way we obtain many more training patterns for the class corresponding to the defective area, since defects normally cover only a small portion of the original image. For effective neural network classifier learning it is important to have enough training patterns for each of the two classes but, on the other hand, to preserve as much as possible the a priori probability distribution of the problem. We have experimentally found that a proportion of 1:3 for the training patterns belonging to defective and non-defective areas, respectively, achieves both goals.
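The training-label rule above (a window is "defective" if any pixel of its 4 x 4 central area lies on a defect) can be written as a small helper. This is our illustration; `defect_mask` and the function name are hypothetical.

```python
import numpy as np

def window_label(defect_mask, top, left, win=8, core=4):
    """Desired MLP output for the sliding window whose top-left corner
    is (top, left): 1 if any pixel of the central core x core area
    belongs to the defect mask, else 0."""
    off = (win - core) // 2                       # 2 for an 8x8 window
    center = defect_mask[top + off: top + off + core,
                         left + off: left + off + core]
    return 1 if center.any() else 0
```

Note that a defect pixel lying on the window border but outside the 4 x 4 core does not make the window positive, which is exactly the behavior described in the text.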
IV. Results and Discussion
The efficiency of our approach in recognizing defects in automated inspection images, based on utilizing texture information, is illustrated with the textile image shown in fig. 1, which contains a very thin and long defect on its upper side as well as some smaller defects elsewhere. This image is 256 x 256, while the four wavelet channels obtained by applying the 2-D Haar wavelet transform are 128 x 128. These wavelet channels are shown in fig. 2. Fig. 3 shows the selected wavelet channel 3, which has the maximum histogram variance. There are 14641 sliding windows of 8 x 8 size in this wavelet channel. The neural network has been trained with a training set containing 1009 patterns extracted from these sliding windows as described above; 280 of the 1009 patterns belong to the long and thin defective area on the upper side only, while the rest belong to the class of non-defective areas. The learning rate coefficient was 0.3 and the momentum coefficient 0.4. The neural network has been tested on all 14641 patterns coming from the sliding windows of the third wavelet channel. The results are shown in fig. 4. Note that the network based on the suggested methodology was able to generalize and also find some other minor defects, while another network of the same type, trained with the 64 pixel values of the sliding windows under exactly the same conditions, was able to find only the long and thin defect. This demonstrates the efficiency of our feature extraction methodology based on textural and SVD features. Finally, in terms of classification accuracy we have achieved an overall 98.48%. The evolution of the training error and of the generalization ability for the class corresponding to defects is shown in figs. 5 and 6, respectively.
Figure 1. Original textile image containing a defect
Figure 3. QMF Channel No.3
Figure 2. Wavelet transformation of the original image
Figure 4. Resulting image - white regions represent the defects
Figure 5. Learning Error Evolution
Figure 6. Generalization Performance Evolution
V. Conclusions
We have proposed a novel methodology for detecting defects in automated inspection images, based on wavelet and neural network segmentation methods, which exploits information coming from textural analysis and SVD in the wavelet channels of the 2-D Haar wavelet transform of the original images. The efficiency of this approach is illustrated on textile images, and the classification accuracy obtained is 98.48%. Our methodology clearly deserves further evaluation in vision-based quality control systems.
References
[1] Ryan, T. W., Sanders, D., Fisher, H. D. and Iverson, A. E., "Image Compression by Texture Modeling in the Wavelet Domain", IEEE Trans. Image Processing, Vol. 5, No. 1, pp. 26-36, 1996.
[2] Antonini, M., Barlaud, M., Mathieu, P. and Daubechies, I., "Image Coding Using Wavelet Transform", IEEE Trans. Image Processing, Vol. 1, pp. 205-220, 1992.
[3] Unser, M., "Texture Classification and Segmentation Using Wavelet Frames", IEEE Trans. Image Processing, Vol. 4, No. 11, pp. 1549-1560, 1995.
[4] Haralick, R. M., Shanmugam, K. and Dinstein, I., "Textural Features for Image Classification", IEEE Trans. Systems, Man and Cybernetics, Vol. SMC-3, No. 6, pp. 610-621, 1973.
[5] Meyer, Y., "Wavelets: Algorithms and Applications", SIAM, Philadelphia, 1993.
[6] Kolaczyk, E., "WVD Solution of Inverse Problems", Doctoral Dissertation, Dept. of Statistics, Stanford University, 1994.
[7] Donoho, D. L. and Johnstone, I. M., "Ideal Time-Frequency Denoising", Technical Report, Dept. of Statistics, Stanford University.
[8] Al-Shaykh, O. K. and Doherty, J. E., "Invariant Image Analysis based on Radon Transform and SVD", IEEE Trans. Circuits and Systems, Vol. 43, No. 2, pp. 123-133, Feb. 1996.
[9] Porter, R. and Canagarajah, N., "A Robust Automatic Clustering Scheme for Image Segmentation Using Wavelets", IEEE Trans. Image Processing, Vol. 5, No. 4, pp. 662-665, April 1996.
[10] Lee, C. S., et al., "Feature Extraction Algorithm based on Adaptive Wavelet Packet for Surface Defect Classification", to be presented at ICIP 96, 16-19 Sept. 1996, Lausanne, Switzerland.
A Region Oriented CFAR Approach to the Detection of Extensive Targets in Textured Images

Carlos Alberola-López, José Ramón Casar-Corredera* and Juan Ruiz-Alzola**

Depto. Teoría de la Señal y Comunicaciones e Ingeniería Telemática, ETSI Telecomunicación, Universidad de Valladolid, Spain. C/ Real de Burgos s/n, 47011 Valladolid. e-mail: carlos@tel.uva.es
* Depto. Señales, Sistemas y Radiocomunicaciones, ETSI Telecomunicación - UPM, Ciudad Universitaria s/n, 28040 Madrid, Spain
** Depto. de Señal y Comunicaciones, EUIT Telecomunicación, Campus de Tafira s/n, 35017 Las Palmas de Gran Canaria, Spain

Abstract

In this contribution we address the problem of locating arbitrarily-shaped extensive objects in textured images. To that end, we propose to introduce spatial constraints into the detection framework by means of a recursive search of connected components of the target to be extracted. With this procedure, every target within the image is ideally detected with a single threshold, so the problem of placing the reference of estimation of the detector parameters with respect to the pixel under test is bypassed. Our experiments show that extensive targets are properly detected, regardless of their shape and extension. In addition, false alarms are easily cancelled, since they show up as isolated point-like random detections.
1 Introduction
Well-known CFAR approaches [5] to target detection in images strive to maximize the probability of detection while keeping the false alarm rate low and constant throughout a non-stationary background, by estimating its local statistics to calculate the appropriate threshold at every pixel. However, they are either directed at detecting very small targets [4] or they make use of some a priori knowledge about the target to be extracted, for instance a search template from which the target features can be estimated [6]. On the other hand, if a general-purpose extensive-target detection scheme is sought, template matching is not the solution, since it would require a large number of candidate templates, unnecessarily increasing the computational complexity of the detector. Additionally, because targets typically encountered in real-world applications are extensive at practical resolutions, pixel-level detectors might not be the most efficient solution: decisions are made independently of each other, so the raw output of the detector will often have no spatial coherence. This makes a postprocessing stage compulsory, in which detections from the target boundaries must be connected and false alarms cancelled. These pixel-oriented detectors are quite easy to implement in a real-time scheme, but the postprocessing might overload the processor. Additionally, when using a CFAR detector for extensive target extraction, care must be taken to properly place the reference of estimation of the detector parameters; if this point is not taken into account, some parts of the target can easily lead the detector to miss other portions of it, since the parameters will be biased by the presence of target pixels within the reference of estimation. In this contribution we propose a CFAR detection scheme that incorporates region constraints within the detection framework.
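As a concrete instance of the pixel-level CFAR principle just described (local statistics estimated from a reference window set the per-pixel threshold), here is a minimal cell-averaging sketch. It is a generic CA-CFAR of our own, not the specific detector of [1], and all names and parameter values are illustrative.

```python
import numpy as np

def ca_cfar_threshold(image, r, c, guard=1, ref=3, scale=3.0):
    """Cell-averaging CFAR threshold at pixel (r, c): the mean of a ring
    of reference cells around a guard area, times a scale factor that
    would be fixed by the desired false-alarm rate."""
    rows, cols = image.shape
    lo_r, hi_r = max(0, r - ref), min(rows, r + ref + 1)
    lo_c, hi_c = max(0, c - ref), min(cols, c + ref + 1)
    block = image[lo_r:hi_r, lo_c:hi_c]
    # Exclude the guard area (and the cell under test) from the estimate.
    mask = np.ones(block.shape, dtype=bool)
    mask[max(0, r - guard) - lo_r: min(rows, r + guard + 1) - lo_r,
         max(0, c - guard) - lo_c: min(cols, c + guard + 1) - lo_c] = False
    return scale * block[mask].mean()

def cfar_detect(image, r, c, **kw):
    """Pixel-level CFAR test: compare the cell under test to its
    locally estimated threshold."""
    return image[r, c] > ca_cfar_threshold(image, r, c, **kw)
```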
The potential of our procedure stems from the fact that, in the target area, the image statistics will be quite different from those of the background, and will also have a certain homogeneity, even though the target is fluctuating, that allows us to extract the target as a whole by means of a single local threshold. This way, we benefit from using pixel-level and region-level information simultaneously in the detection stage, and since ideally a single threshold is needed for a given target, we also minimize the above-mentioned effect of target shadowing by its own pixels.
2 CFAR Detection of Extensive Targets

2.1 A Pixel-Oriented Approach
As mentioned in the introduction, few proposals of CFAR detectors in images address the problem of locating arbitrarily shaped and extensive objects; the solutions most often encountered incorporate some knowledge of the object to be extracted. We have developed [1] a pixel-oriented CFAR detector that extracts the outer edges of an extensive target, regardless of its extension and shape, in a gamma distributed textured background. The key of the proposal lies in the use of the phase of the estimated gradient at the pixel under test: the reference of estimation of the detector parameters is placed orthogonally to the gradient vector, which reduces the possibility of pixels from the target falling into the cells of the reference of estimation. However, this philosophy and, generally speaking, all techniques that make decisions on a pixel-by-pixel basis without taking into account the decisions in their surroundings, bring about spotty results, in which a number of unconnected edge elements are extracted together with a number of false alarms. Thus, a second stage is needed in which edge elements are connected and false alarms cancelled. To that end, optimization techniques have proved useful, although computationally involved [2][3].

2.2 A Region-Oriented Approach
If an extensive target is sought, and the image statistics remain approximately constant through the body of the target, a single threshold might be sufficient to properly detect and extract the target as a whole. That is, regardless of the shape of the object, it could be detected by a guided recursive search of its components, using as starting point in the recursion a detection obtained by means of a pixel-level detector. We have applied this idea to build a detection algorithm in which decisions are dependent on each other, so the detector can be regarded as a region-level detector. We proceed as follows: the detection process is started at pixel level but, if a detection is encountered, a region-level detection procedure is triggered, which initiates a recursive search in the 8-neighborhood of this pixel; every neighbor is compared to the threshold that triggered the first detection. All the neighbors that result in detections are then recursively examined, using the only threshold calculated so far, expanding the tree of neighbors one more level. The process continues until the search reaches the opposite boundary of the target (opposite with respect to the direction of the search), since all the decisions that do not exceed the threshold are labelled as 'background' and no further search is invoked in undetected pixels. This process can be expressed in pseudocode as follows:

1. Label all pixels as unvisited
2. For every unvisited pixel:
   (a) Decide pixel as target/background by any CFAR detector (see, for instance, [1])
   (b) If the pixel is detected:
       i. For every undetected neighbor:
          A. Decide the neighbor as target/background with the threshold from (a)
          B. If the neighbor is detected, label it as detected and go to i; otherwise label it as visited
       ii. Otherwise label the pixel as visited

This algorithm benefits from the fact that the recursive procedure captures the whole body of the target accurately: both the outer boundary and the inner details are captured, since the detection threshold has been calculated from data outside the target area, and no further threshold calculations are needed. Additionally, recursive algorithms are fast and efficient, and the code that implements this algorithm is surprisingly short and thus easy to store. The main drawback of this procedure is the condition for halting the search: at the present stage we conclude the expansion of the tree of neighbors when no more detections are encountered. Therefore, in cases where the target lies in a rapidly changing background, the threshold on the opposite side of the target might not be able to stop the search, and noisy results would be obtained.
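The pseudocode above can be sketched as follows (our Python/NumPy reading, with the recursion unrolled into a queue to avoid stack limits; `pixel_threshold` stands in for any CFAR threshold estimator such as the one in [1]):

```python
from collections import deque
import numpy as np

def region_cfar_detect(image, pixel_threshold):
    """Region-oriented CFAR detection.

    pixel_threshold(r, c) returns the CFAR threshold for the pixel
    under test. When a pixel exceeds its threshold, its 8-neighbors
    are searched using the SAME threshold that triggered the first
    detection, so ideally one threshold extracts the whole target.
    """
    rows, cols = image.shape
    detected = np.zeros((rows, cols), dtype=bool)
    visited = np.zeros((rows, cols), dtype=bool)
    for r in range(rows):
        for c in range(cols):
            if visited[r, c]:
                continue
            visited[r, c] = True
            thr = pixel_threshold(r, c)        # pixel-level CFAR test
            if image[r, c] <= thr:
                continue
            detected[r, c] = True
            queue = deque([(r, c)])            # region-level search
            while queue:
                rr, cc = queue.popleft()
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = rr + dr, cc + dc
                        if (0 <= nr < rows and 0 <= nc < cols
                                and not visited[nr, nc]):
                            visited[nr, nc] = True
                            if image[nr, nc] > thr:   # single threshold
                                detected[nr, nc] = True
                                queue.append((nr, nc))
    return detected
```

Undetected neighbors are labelled visited, which is what halts the expansion at the far boundary of the target, exactly as described in the text.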
3 Results
In this section we show two examples of our detector's capabilities. First, an artificial non-stationary background is represented in figure 1a), which has been synthesized by a 2-dimensional autoregressive filter driven by white Gaussian noise, whose output has been warped to obtain a gamma probability density function. We have let the parameters of the distribution vary during the synthesis process to obtain a non-uniform illumination pattern, as can be seen in this figure. Three targets have been superimposed on the texture, whose brightness content overlaps considerably with that of the background (especially in the two lower circles), but whose textural pattern is different; therefore the detection process has been carried out at the output of an adaptive whitening filter (with an assumed quarter-plane support). We show this output process in figure 1b). Note the evident presence of the three targets in the background (three noisy spikes along the diagonal of the figure). The pixel-oriented detector output is shown in figure 1c) for Pfa = 10^-3. Note that the target boundaries are visible, but detections are mainly isolated; the reason is that the output of the whitening filter fluctuates strongly in the surroundings of the targets, and therefore the estimation of the gradient (for the placement of the reference of estimation) is noisy as well. This leads to an inaccurate placement of the reference of estimation and, as a consequence, to a low detection performance. However, as figure 1d) shows, the proposed detection philosophy, due to its inherent functionality, is able to extract much of the body of the target, which makes any further processing directed at target recognition much easier. This figure also highlights that, due to the filtering process, part of the target power is smeared out of its boundaries, and therefore the detections extend beyond the original target in the filtering direction.
Figure 1: Detection in a whitened domain, Pfa = 10^-3. a) Original image b) Squared output of an adaptive whitening filter with QP support c) Boundary detection in b) d) Region-oriented CFAR detection in b).

The second example is an image of a jacket on which four pins have been superimposed (figure 2a). The Pfa is set to 10^-3 in each band (the original is a three-band image; only one band is shown here), and decisions are fused according to the logical OR function. Figure 2b) shows the result of the iterative search: the four pins are correctly detected, and most of the details in them are also visible. False alarms can be easily removed by a very simple postprocessing step, since their extension is much smaller than that of the real targets.
Figure 2: Detection in a natural background, Pfa = 10^-3 in each band, fused by logical OR. a) Original image b) Region-oriented CFAR detection in a).
4 Conclusions
In this contribution we have proposed an algorithm for incorporating region constraints into the operation of a CFAR detector for object extraction in a textured background. Our procedure scans the image under analysis on a pixel-by-pixel basis until a detection is encountered; the detection triggers a recursive search of target components among the neighbors of the detection. This search is continued until the object is compactly extracted. Our results show that the algorithm performs satisfactorily in slowly changing backgrounds, since targets are properly detected and false alarms are controlled according to the level of the detector. However, we have highlighted the fact that this procedure is sensitive to sudden changes in the image statistics. Our future efforts will be directed at diminishing this sensitivity by conceiving more robust stopping criteria.
References

[1] C. Alberola, J. R. Casar, J. Ruiz, "A Comparison of CFAR Strategies for Blob Detection in Textured Images", Proc. of the VIII European Signal Processing Conf., EUSIPCO-96, September 1996 (to be held).
[2] A. Martelli, "An Application of Heuristic Search Methods to Edge and Contour Detection", Communications of the ACM, Vol. 19, No. 2, pp. 73-83, February 1976.
[3] U. Montanari, "On the Optimal Detection of Curves in Noisy Pictures", Communications of the ACM, Vol. 14, No. 5, pp. 335-345, May 1971.
[4] T. Soni, J. R. Zeidler, W. H. Ku, "Performance Evaluation of 2-D Adaptive Prediction Filters for Detection of Small Objects in Textured Backgrounds", IEEE Trans. Image Processing, Vol. 2, No. 3, pp. 327-339, July 1993.
[5] C. W. Therrien, T. F. Quatieri, D. E. Dudgeon, "Statistical Model-Based Algorithms for Image Analysis", Proceedings of the IEEE, Vol. 74, No. 4, pp. 532-551, April 1986.
[6] X. Yu, I. S. Reed, A. D. Stocker, "Comparative Performance Analysis of Adaptive Multispectral Detectors", IEEE Trans. Signal Processing, Vol. 41, No. 8, pp. 2639-2656, August 1993.
Generating the Stable Structure of a Color Texture Image using Scale-space Analysis with Non-uniform Gaussian Kernels

Satoru MORITA and Minoru TANAKA
Faculty of Engineering, Yamaguchi University, Ube 755, Japan
Abstract
Coarseness and directionality provide important sources of information for color texture image recognition. In particular, it is important to distinguish between textures and to understand the characteristics of similar color textures. We therefore propose a new scale-space analysis generated by non-uniform Gaussian kernels, in order to find an image that is stable with respect to coarseness and directionality. We analyze zero-crossing surfaces to generate a non-uniform Gaussian scale-space from a limited number of observations. Singular points, where the topology of the zero-crossing surfaces changes, are plotted in the new scale-space. The filter parameter of the largest chunk enclosed by a topology change surface is selected as the optimal parameter for a pixel. The optimal filter and the image description are calculated by this approach for natural color images. We show that this method is well suited to color texture image recognition.
2 Introduction
Recently, many researchers have studied color images in the field of computer vision. The segmentation of color images using competitive learning has been studied [1]. On the other hand, the segmentation of a color image using multiresolution analysis has been proposed [2], but no consideration was given to the texture in a color image. Coarseness and directionality provide important sources of information for texture image recognition; in particular, it is important to distinguish between textures and to understand the characteristics of similar textures. The importance of interpreting an image at various scales was noted by Marr [7]. Scale-space analysis has been proposed using the zero-crossing points of a signal observed at various scales [6]. The uniqueness of scale-space based on uniform Gaussian kernels has been analyzed [10]. Scale-space analysis using non-uniform kernels is useful for texture analysis and edge detection [8][9]. Image segmentation using a Gabor filter [4] with various directions has been studied for texture analysis [5]. Witkin proposed a method that selects the optimal scale corresponding to the maximum width of an interval in order to generate a stable one-dimensional signal [6]. We therefore extend the interval tree of a one-dimensional signal to a two-dimensional color image, using non-uniform Gaussian kernels, in order to select filter parameters with consideration of coarseness and directionality. In section 2 we define scale-space filtering with non-uniform Gaussian kernels; in particular, we classify the zero-crossing surfaces of a color image and clarify their properties. In section 3, using non-uniform Gaussian scale-space analysis, we present the algorithm generating a stable color image free of the effects of noise, coarseness and directionality. We extract stable color images from some real images and show the effectiveness of the approach through matching experiments using the structure of the stable images.
3 Scale-space Analysis with Non-uniform Gaussian Kernels for a Color Texture Image

3.1 Scale-space Filtering with Non-uniform Gaussian Kernels
In order to generate an image that is stable with respect to the coarseness and directionality of texture, we propose scale-space analysis with non-uniform Gaussian kernels and an algorithm generating the structure of a stable image. In this section, traditional scale-space analysis with uniform Gaussian kernels is extended to scale-space analysis with non-uniform Gaussian kernels.
∂L/∂t = (1/2) ∇²L = (1/2) (∂²/∂x² + ∂²/∂y²) L

L satisfies the above diffusion equation and is obtained by convolution with the Gaussian kernel g:

L(x; t) = ∫_{Rⁿ} g(a; t) f(x − a) da
The non-uniform Gaussian kernel used in the scale-space analysis is defined as

g(x, y; σx, σy) = (1 / (2π σx σy)) exp{ −(1/2) (x²/σx² + y²/σy²) }

This equation is rewritten as

g(x, y; Ψ, Γ, θ) = (1 / (2π |M|)) exp( −(x̃² + ỹ²) / 2 ),

where (x̃, ỹ)ᵀ = M⁻¹ R(θ) (x, y)ᵀ, R(θ) = [cos θ, sin θ; −sin θ, cos θ] is a rotation matrix and M = diag(σx, σy), with the distortion Ψ = σx/σy and the size Γ = σx σy (so that σx = √(ΨΓ) and σy = √(Γ/Ψ)).
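Under this parametrization (distortion Ψ = σx/σy, size Γ = σxσy, direction θ, which is our reading of the partly unreadable original), a discrete non-uniform Gaussian kernel can be built as follows. This is an illustrative Python/NumPy sketch, not the authors' implementation.

```python
import numpy as np

def nonuniform_gaussian(size, psi, gamma, theta):
    """Oriented non-uniform Gaussian kernel on a size x size grid.

    Assumes sx = sqrt(psi * gamma) and sy = sqrt(gamma / psi), i.e.
    distortion psi = sx / sy and size gamma = sx * sy.
    """
    sx, sy = np.sqrt(psi * gamma), np.sqrt(gamma / psi)
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Rotate coordinates by theta, then scale by the two deviations.
    xr = (x * np.cos(theta) + y * np.sin(theta)) / sx
    yr = (-x * np.sin(theta) + y * np.cos(theta)) / sy
    g = np.exp(-0.5 * (xr ** 2 + yr ** 2)) / (2.0 * np.pi * sx * sy)
    return g / g.sum()        # renormalize after discrete truncation
```

With psi = 1 the kernel is isotropic; psi > 1 elongates it along the direction theta, which is how the filter bank encodes directionality.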
3.2 Zero-crossing Surfaces
With the directional vectors that maximize and minimize the curvature at a point p denoted by (u, v) = (ξ1, η1) and (ξ2, η2), the maximum curvature κ1, the minimum curvature κ2, the mean curvature H and the Gaussian curvature K are defined as follows:
a) the maximum curvature at p: κ1 = λ(ξ1, η1)
b) the minimum curvature at p: κ2 = λ(ξ2, η2)
c) the mean curvature at p: H = (κ1 + κ2)/2
d) the Gaussian curvature at p: K = κ1 κ2
e) H0 contours: H = 0
f) K0 contours: K = 0
An image is divided into elements according to the signs of the Gaussian curvature K and the mean curvature H, and the relationships between the elements are described. In this paper, K = 0 and H = 0 are called zero-crossing contours, and the surfaces composed of zero-crossing contours in (x, y, t) space are called zero-crossing surfaces, where x and y are the image coordinates and t is the scale. An image divided into elements by the signs of K and H is called a KH-image.
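The KH-image classification can be sketched as follows; the paper gives no explicit formulas for H and K, so this uses the standard curvatures of the graph z = L(x, y) computed by finite differences (our assumption, in Python/NumPy).

```python
import numpy as np

def kh_image(L):
    """Classify each pixel by the signs of the mean curvature H and the
    Gaussian curvature K of the intensity surface z = L(x, y).
    Returns integer labels 0..3 encoding the four sign combinations,
    together with H and K."""
    Ly, Lx = np.gradient(np.asarray(L, dtype=float))
    Lxy, Lxx = np.gradient(Lx)          # derivatives of Lx along y, x
    Lyy, _ = np.gradient(Ly)            # derivative of Ly along y
    denom = 1.0 + Lx ** 2 + Ly ** 2
    # Standard curvature formulas for a graph surface.
    H = ((1 + Lx ** 2) * Lyy - 2 * Lx * Ly * Lxy + (1 + Ly ** 2) * Lxx) \
        / (2.0 * denom ** 1.5)
    K = (Lxx * Lyy - Lxy ** 2) / denom ** 2
    labels = (H > 0).astype(int) * 2 + (K > 0).astype(int)
    return labels, H, K
```

A bowl-shaped surface has H > 0 and K > 0 at its center, while a saddle has K < 0, which is the kind of sign pattern the KH-image records.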
3.3 Scale-space with Non-uniform Gaussian Kernels for a Color Texture Image
A color image is described by three color planes: the red plane (R), the green plane (G) and the blue plane (B). A pixel in a color image has 24-bit data; a pixel in a plane has 8-bit data and 256 densities. Thus a color image I(x, y) is described by three planes IR(x, y), IG(x, y) and IB(x, y). Next, we define a non-uniform Gaussian scale-space for a color texture image. The coordinates of the zero-crossing contours of IR(x, y)*G(x, y; Ψ, θ, Γ), IG(x, y)*G(x, y; Ψ, θ, Γ) and IB(x, y)*G(x, y; Ψ, θ, Γ) are plotted in a five-dimensional space (x, y, Ψ, θ, Γ). The properties of the filter G(Ψ, θ, Γ) are decided by the distortion Ψ, the direction θ and the size Γ. The zero-crossing surfaces in the non-uniform Gaussian scale-space are three kinds of manifold, S(x, y, Ψ, θ, Γ)_IR, S(x, y, Ψ, θ, Γ)_IG and S(x, y, Ψ, θ, Γ)_IB, in the five-dimensional space (x, y, Ψ, θ, Γ).
3.4 Three Kinds of Non-uniform Gaussian Scale-space
Three kinds of zero-crossing surfaces are extracted from these three manifolds. Suppose (Γ, Ψ) are constant; the coordinates of the zero-crossing points of I(x, y)*G(Ψ, θ, Γ) are plotted in a three-dimensional space (x, y, θ). This scale-space has cylindrical coordinates in which x and y lie in a plane and θ extends circularly. Zero-crossing surfaces S(Γ, Ψ; θ, x, y) are plotted in this scale-space. Suppose (Γ, θ) are constant; the coordinates of the zero-crossing points of I(x, y)*G(Ψ, θ, Γ) are plotted in a three-dimensional space (x, y, Ψ). This scale-space has rectangular coordinates with three axes x, y and Ψ. Zero-crossing surfaces S(Γ, θ; Ψ, x, y) are plotted in this scale-space. Suppose (Ψ, θ) are constant; the coordinates of the zero-crossing points of I(x, y)*G(Ψ, θ, Γ) are plotted in a three-dimensional space (x, y, Γ). This scale-space has rectangular coordinates with three axes x, y and Γ. Zero-crossing surfaces S(Ψ, θ; Γ, x, y) are plotted in this scale-space. The singular points where the three kinds of zero-crossing contour topologies change as θ, Ψ and Γ increase are plotted in three kinds of scale-spaces, one each for the red, green and blue planes.
Figure 1: A sample color image (left) and its color planes (right; R, G, B from left to right).
Figure 2: Filters (top: Ψ = 0.015, θ = 2nπ/8 (n = 1, ..., 8), Γ = 0.015; bottom: Ψ = 0.125, θ = 2nπ/8 (n = 1, ..., 8), Γ = 0.015)

3.5 Topology Change Surfaces
We analyze the scale-space with non-uniform Gaussian kernels at a fixed point (x1, y1) to decide the optimal filter for that point in the image. Suppose (x, y) are constant; the singular points where the topology of the zero-crossing surfaces changes in the three kinds of scale-space are plotted in a three-dimensional space (Γ, θ, Ψ). This scale-space has conical coordinates in which Γ and θ lie in a plane and Ψ extends perpendicularly upwards, tapering to a cone whose intersection at constant Ψ is a circle. Topology change surfaces W(x, y; Γ, θ, Ψ)_IR, W(x, y; Γ, θ, Ψ)_IG and W(x, y; Γ, θ, Ψ)_IB are composed of the sets of topology change points obtained from the three color planes R, G and B. We try to find the maximum-size chunk enclosed by a topology change surface; the topology of the image does not change within such a region. We use log2|Ψ| instead of Ψ in the calculation. Three kinds of optimal filter parameters at a point (x1, y1) in an image, corresponding to the color planes R, G and B, are thus decided. These processes are executed for all pixels of the image. This approach is the extension of the interval tree for a one-dimensional signal.
4 The Algorithm Generating a Stable Color Texture Image
We now give the algorithm generating a stable color texture image.
• Color image I(x, y) is described using three planes IR(x, y), IG(x, y) and IB(x, y), each with 8-bit data. (2.3)
• Convolve the three color planes IR(x, y), IG(x, y) and IB(x, y) with the filters of parameters Γ (n = 1, ..., 5), Ψ (n = 1, ..., 5) and θ = 2nπ/8 (n = 1, ..., 8). (2.1)
• Classify each filtered plane into regions by the K and H parameters; execute the same process for the planes IG(x, y) and IB(x, y). (2.2)
• Generate the three kinds of scale-space, in which (Γ, Ψ) are constant, (θ, Γ) are constant, and (Ψ, θ) are constant, using the three color planes IR(x, y), IG(x, y) and IB(x, y). (2.4)
• Interpolate between the limited number of zero-crossing points in the scale-space based on x, y and θ; execute the same process for the scale-spaces based on x, y and Ψ and on x, y and Γ. Find the singular points where the topology of the zero-crossing contours changes, and plot them in a scale-space based on θ, Ψ and Γ. The set of singular points for a plane is called a topology change surface. (2.5)
Figure 3: Filtered images (filter Ψ = 0.015, θ = 2nπ/8 (n = 1, ..., 8), Γ = 0.015) (left: top R, middle G, bottom B) and KH-images (same filter parameters) (right: top R, middle G, bottom B)
Figure 4: A segment image.

• Select the maximum-size chunk enclosed by the topology change surfaces generated from each plane; its parameters are the optimal filter parameters. (2.5)
• Plot the limited number of optimal filter parameters (Ψ, Γ, θ) in scale-spaces based on the Ψ, Γ and θ parameters for the three color planes. An optimal filter surface is composed of the set of optimal filter parameters. Extract the discontinuities from the optimal filter surfaces using the technique of cluster analysis [3].
• Describe the neighbor relations between image elements using a graph representation: the discontinuities correspond to arcs and the image elements correspond to nodes of the graph.
• Convolve each plane with the Gaussian filter of the optimal parameter obtained for each pixel; the value of the filtered image becomes the pixel value of the plane. Execute these processes for all planes and all pixels. Thus all pixel values of the stable image are decided.

This algorithm has been applied to some real color images. Figure 1 shows a sample color image and its three color planes. Figure 2 shows non-uniform Gaussian kernels with filter parameters Ψ = 0.015, 0.125, θ = 2nπ/8 (n = 1, ..., 8), Γ = 0.015. Figure 3 shows the filtered images and KH-images for the three color planes with filter parameters Ψ = 0.015, θ = 2nπ/8 (n = 1, ..., 8), Γ = 0.015. Figure 4 shows the segment images generated using the algorithm; the boundaries between different gray values mark the discontinuities of the optimal filter surfaces. It is confirmed that a stable color image, free of the effects of noise, coarseness and directionality, is generated.
5 Conclusions
We extended the interval-tree approach for one-dimensional signals to two-dimensional color images using scale-space analysis with a non-uniform Gaussian kernel, in order to select filter parameters with consideration of coarseness and directionality. Both the selection of optimal filters and the segmentation of an image are executed at the same time by analyzing optimal filter parameter surfaces. The proposed algorithm was applied to some real color images, and matching experiments using the structure of a stable image confirmed that this approach is useful for noisy color images.
References
[1] T. Uchiyama and M. A. Arbib, "Color Image Segmentation Using Competitive Learning," IEEE Trans. Pattern Anal. & Machine Intell., vol. 16, no. 12, pp. 1197-1206, 1994.
[2] J. Liu and Y. Yang, "Multiresolution Color Image Segmentation," IEEE Trans. Pattern Anal. & Machine Intell., vol. 16, no. 7, pp. 689-699, 1994.
[3] D. E. Rumelhart and D. Zipser, "Feature discovery by competitive learning," Cognitive Sci., vol. 9, pp. 75-112, 1985.
[4] D. Gabor, "Theory of communication," J. Inst. Elect. Engr., vol. 93, pt. III, pp. 429-459, 1946.
[5] A. K. Jain and F. Farrokhnia, "Unsupervised texture segmentation using Gabor filters," Pattern Recognition, vol. 24, pp. 1167-1186, 1991.
[6] A. Witkin, "Scale-space filtering," Proc. Int. Joint Conf. Artificial Intelligence, Karlsruhe, West Germany, pp. 1019-1022, 1983.
[7] D. Marr, "Vision," W. H. Freeman, San Francisco, 1982.
[8] P. Perona and J. Malik, "Steerable-scalable kernels for edge detection and junction analysis," in Proc. 2nd European Conf. on Computer Vision, pp. 3-18, 1992.
[9] M. Michaelis and G. Sommer, "Junction classification by multiple orientation detection," in Proc. 3rd European Conf. on Computer Vision, pp. 101-108, 1994.
[10] J. Babaud, A. P. Witkin, M. Baudin, and R. O. Duda, "Uniqueness of the Gaussian kernel for scale-space filtering," IEEE Trans. Pattern Anal. & Machine Intell., vol. 8, no. 1, pp. 26-33, 1986.
Session G: IMAGE CODING II: TRANSFORM, SUBBAND AND WAVELET CODING
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
APPROXIMATION OF BIDIMENSIONAL KARHUNEN LOEVE EXPANSIONS BY MEANS OF MONODIMENSIONAL KARHUNEN LOEVE EXPANSIONS, APPLIED TO IMAGE COMPRESSION
Nello Balossino and Davide Cavagnino
Dipartimento di Informatica - Università di Torino
C.so Svizzera 185 - 10149 TORINO - Italy
E-mail: {nello, davide}@di.unito.it
Abstract
The paper treats image compression based on Karhunen Loeve expansions approximated by monodimensional expansions. The results prove that the described method leads to a huge reduction of computational complexity and required time. A comparison with the Discrete Cosine Transform is also reported.
Introduction
In many applications a capability to compress images is required, so compression algorithms are frequently embedded in software. In order to evaluate an algorithm used to compress images, the compression ratio C is defined as C = n_o/n_c, where n_c is the number of bits that encode the compressed image and n_o is the number of bits in the original image. As is well known, compression algorithms are classed as reversible or irreversible, depending on whether the decompressed image is, or is not, identical to the original one. A class of reversible compression algorithms is based on bidimensional transformations that perform a spectral analysis of parts of the image (subimages) by means of an orthonormal basis:
F(u, v) = Σ_{x,y} A(u, v, x, y) f(x, y)
where f(x,y) represents the original bidimensional image, F(u,v) are the transformed coefficients and A is the kernel of the transformation (A is often called the set of basis images). In order to reproduce the original image it is sufficient to use the following transformation:
f(x, y) = Σ_{u,v} B(u, v, x, y) F(u, v)
where B is the inverse of the kernel. A bidimensional transformation is said to be separable if and only if we can write
A(u, v, x, y) = A1(u, x) A2(v, y)

If we quantize the coefficients F(u,v) or discard some of them before applying the inverse transformation, we expect an information loss in the reconstructed image (in our work we only discard coefficients and round the remaining ones to two-byte integers); in this way, the compression algorithms become irreversible. In this paper we concentrate on the Karhunen Loeve (KL) expansion (used also with hybrid encodings in recent works [2]) and the Discrete Cosine Transform (DCT), the latter being used as the core of the JPEG standard (see [3, 6]). Given an image of size N×N, we partition it into non-overlapping subimages of size n×n, which we interpret as a random field [7] with mean m; the autocorrelation matrix K (of size n²×n²) is computed from the centered subimages (given a subimage x, the centered subimage is x−m). The kernel of the KL transform is made up of the eigenvectors of the matrix K. The eigenvalue associated with each eigenvector is the variance of the spectral coefficients belonging to the eigenvector; we can then sort the eigenvectors in descending order with respect to their eigenvalues. If we arrange the eigenvectors by rows in a matrix A, then we can write the KL transform in the following way: y = A(x−m), where x, y and m are n×n subimages in column form (see [1, 4]) and, given that the eigenvectors constitute an orthonormal basis, we have the inverse transformation x = A′y + m (the symbol ′ meaning matrix transposition). To effect a compression we can discard the coefficients with smaller variances, keeping only the first l eigenvectors, from which we obtain (A_l are the first l rows of A)

y_l = A_l (x − m)
(1)
and

x̂ = A_l′ y_l + m
(2)
where x̂ is an approximation of x. The KL transform has the property of being optimal, with respect to all others, in the least-square-error sense when considering the same number of coefficients. KL is thus adapted to the image from which the eigenvectors are computed, and this is the informal proof of its coding efficiency. This method has the drawback that with subimages of n×n pixels we need to calculate eigenvectors and eigenvalues of a symmetric matrix of dimension n²×n², so the complexity of the problem grows very rapidly (see, for example, [5]) with increasing size of the subimages. However, this increase should allow discarding a relatively greater number of coefficients to obtain larger compression ratios; this advantage has to be balanced with the increased length of the eigenvectors to be transferred to the decompression phase.
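As a concrete illustration, the bidimensional KL transform of equations (1) and (2) can be sketched in a few lines. This is a minimal sketch with our own function names; the covariance of the centered subimages is used as the matrix K.

```python
import numpy as np

def kl_basis(subimages):
    """KL basis from a set of flattened n*n subimages.

    Returns the mean subimage m and the matrix A whose rows are the
    eigenvectors of the covariance of the centered subimages, sorted
    by decreasing eigenvalue.
    """
    X = np.asarray(subimages, dtype=float)      # shape (N_sub, n*n)
    m = X.mean(axis=0)
    K = np.cov(X - m, rowvar=False)             # n^2 x n^2 symmetric matrix
    w, V = np.linalg.eigh(K)                    # eigenvalues in ascending order
    order = np.argsort(w)[::-1]
    return m, V[:, order].T                     # rows = eigenvectors

def compress(x, A, m, l):
    """Equation (1): y_l = A_l (x - m), keeping the first l coefficients."""
    return A[:l] @ (x - m)

def reconstruct(y, A, m, l):
    """Equation (2): x_hat = A_l' y_l + m."""
    return A[:l].T @ y + m
```

With l = n² the reconstruction is exact (up to rounding); smaller l trades error for compression.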
Method
Our goal was a set of basis images (of extension n×n) having the desirable characteristics of the KL ones, but lighter in computational complexity. Thus we considered row and column vectors of dimensionality n by subdividing the image into row and column vectors; we calculated separately a KL orthonormal basis (of size n) for the rows {r_1, ..., r_n}, with mean of all the rows r_M, and for the columns {c_1, ..., c_n}, with mean of all the columns c_M. Computing the eigenvectors involved the inversion of two n×n matrices, one for rows and one for columns. Afterwards, to obtain an orthonormal basis of size n² of basis images n×n, we multiplied every column vector by every row vector (tensor product): c_i r_j′. What is obtained is an orthonormal basis of n² subimages; in fact, by hypothesis
c_i′ c_j = δ_ij,    r_k r_l′ = δ_kl

If ℛ is the operator that produces a row vector starting from a matrix, we can write

ℛ(c_i r_j) [ℛ(c_k r_l)]′ = (c_i′ c_k)(r_j r_l′) = δ_ik δ_jl

while

ℛ(c_i r_j) [ℛ(c_i r_j)]′ = (c_i′ c_i)(r_j r_j′) = δ_ii δ_jj = 1
where δ_ij is the Kronecker delta. To obtain an ordering for the significance of the obtained basis images, we multiply the corresponding eigenvalues, obtaining a fictitious eigenvalue for each basis image. The mean to use when applying equations (1) and (2) can be either the mean of the n×n subimages or the mean of the mean vectors r_M and c_M, calculated in this way:
m_ij = (r_Mi + c_Mj) / 2    (3)
where r_Mi is the i-th pixel of r_M and c_Mj is the j-th pixel of c_M. We obtain a new separable transformation, derived from KL, that requires less overhead information transfer (only 2n vectors of dimensionality n plus their eigenvalues) and has a slower complexity growth when increasing the subimage dimensions with respect to bidimensional KL, but has the drawback of lower accuracy when using the same number of coefficients. We compared this method with the DCT, and we noted (in our preliminary tests) that when we used only 8% of the coefficients for subimages of size 8×8, the proposed method performed better than the DCT with respect to the mean square error (4) and the relative m.s.e. (5):
mean square error = ( Σ_{all pixels} (f′(x, y) − f(x, y))² ) / #all_pixels    (4)

relative mean square error = ( Σ_{all pixels} [ (f′(x, y) − f(x, y)) / f(x, y) ]² ) / #all_pixels    (5)
where f′(x, y) is the reconstructed and quantized (pixel values converted to integers) image. To compare both methods, one should determine the distortion functions (m.s.e. and relative m.s.e.) for equal bit rates. This comparison is not possible in a precise sense, since the Huffman source coding of the same number of coefficients can vary in run length, and therefore in bit rate. We thus base our comparison on an equal number of coefficients, all of which should however be sufficiently well represented in the two-byte integer format we used. Moreover, when the image was oversampled (i.e. a pixel was set equal to three of its neighbours), the proposed method performed better than the DCT whatever number of coefficients was used when n=8, and in almost all cases when n=16. This can be explained by noting that the DCT exploits general characteristics of images (our eyes are not very sensitive to high-frequency distortions), while the proposed method is optimized for high performance on the image under examination: what is needed and obtained is a lower complexity, with respect to bidimensional KL, in calculating eigenvalues and eigenvectors. In addition, if images with high spectral components are examined, the proposed method will perform better than the DCT because of its adaptivity to the image it examines. Another important aspect is that to obtain higher compression ratios it is necessary to use larger subimages (i.e. increasing n), and the proposed method is faster than bidimensional KL, especially with large n (n = 16, 24, ...). The testing of the method was performed using MATLAB® [8], a software package that allows fast prototyping of mathematical models.
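The tensor-product construction described above can be sketched as follows; the splitting of the image into n-pixel row and column segments and the significance ordering by fictitious eigenvalues follow the text, while the function names are ours.

```python
import numpy as np

def kl_1d(vectors):
    """1-D KL basis: eigen-decomposition of the covariance of n-vectors."""
    X = np.asarray(vectors, dtype=float)
    mean = X.mean(axis=0)
    w, V = np.linalg.eigh(np.cov(X - mean, rowvar=False))
    order = np.argsort(w)[::-1]
    return mean, w[order], V[:, order].T        # rows = eigenvectors

def separable_basis(image, n):
    """Basis images c_i r_j' built from the row and column KL bases."""
    rows = image.reshape(-1, n)                 # n-pixel row segments
    cols = image.T.reshape(-1, n)               # n-pixel column segments
    r_mean, r_vals, R = kl_1d(rows)
    c_mean, c_vals, C = kl_1d(cols)
    basis, fictitious = [], []
    for i in range(n):
        for j in range(n):
            basis.append(np.outer(C[i], R[j]))  # n x n basis image (tensor product)
            fictitious.append(c_vals[i] * r_vals[j])
    order = np.argsort(fictitious)[::-1]        # significance ordering
    return [basis[k] for k in order], r_mean, c_mean
```

Since the row and column bases are each orthonormal, the n² outer products form an orthonormal basis of the space of n×n subimages, as proved above.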
Results
We present some results obtained applying the proposed method and the DCT to images of size 512×512 with 256 grey levels. The subimages are of size 8×8 and 16×16. In Figure 1(a) and (b) the behaviour of the m.s.e. versus the number of retained coefficients is reported when the transformations are based on subimages of size 8×8 (i.e. n=8). In Figure 2 the same variables are shown for subimages of dimension 16×16 (i.e. n=16). Note that in these figures errors were computed without rounding the coefficients, in order to analyze the capability of the methods to compact the energy into few coefficients. If the coefficients were rounded, the error would be slightly increased and the corresponding compression ratios would be those reported in Table 1 and Table 2. Obviously the compression ratio is the same both for the KL-based method and the DCT method (not taking into account, for KL, the little overhead due to the eigenvectors, eigenvalues and mean subimage). If we fix the error, then the KL method (in Figures 1(b) and 2(b), for example) will use fewer coefficients and so will have a higher compression ratio.

Table 1: Compression ratio with n=8
No. of coefficients:   2    4    8   10   16
Compression ratio:    16    8    4  3.2    2

Table 2: Compression ratio with n=16
No. of coefficients:   2    4    8   10   16
Compression ratio:    64   32   16 12.8    8
The first image considered is the classical Boat. The second image is a Nuclear Magnetic Resonance image of size 256×256 enlarged to 512×512 by means of pixel replication. We note in the graphs that the behaviour of the errors of the two transformations is similar (in Figures 1(a) and 2(a)), and better for the KL-based transform (in Figures 1(b) and 2(b)). Compatible qualitative results are obtained by visually inspecting the reconstructed images as the number of retained coefficients is reduced. We performed a time test of the classical KL transform versus the monodimensional KL transform using the tic & toc functions of MATLAB®. The test was performed on a 120 MHz Pentium running Windows 95. For 8×8 subimages the classical method computed the basis images in 10.93 seconds (average value) while the new method computed them in 5.1 seconds (average value). For 16×16 subimages the classical method computed the basis images in 55.91 seconds (average value) while the new method computed them in 7.9 seconds (average value).
Figure 1: The m.s.e. values in reconstructing the Boat (a) and NMR (b) images using subimages of dimension 8×8.
Figure 2: The m.s.e. values in reconstructing the Boat (a) and NMR (b) images using subimages of dimension 16×16.
Acknowledgements This work has been supported by the national project of MURST "Sviluppo di una workstation multimediale ad architettura parallela". The authors thank prof. A. Werbrouck for critical comments and textual suggestions.
References
[1] R. C. Gonzalez and P. Wintz. Digital Image Processing. Addison-Wesley, 1987.
[2] F. G. Horowitz, D. Bone and P. Veldkamp. Karhunen-Loeve based Iterated Function System encodings. In International Picture Coding Symposium, Melbourne, March 1996.
[3] K. R. Rao and P. Yip. Discrete Cosine Transform: Algorithms, Advantages, Applications. Academic Press, Inc., San Diego, 1990.
[4] A. Rosenfeld and A. C. Kak. Digital Picture Processing, vol. 1, 2nd ed. Academic Press, New York, 1982.
[5] C. A. L. Szuberla. Discrete Karhunen-Loève Transform. http://foo.gi.alaska.edu/-cas, DRAFT.
[6] G. K. Wallace. The JPEG still picture compression standard. Communications of the ACM, 34(4), 1991.
[7] A. M. Yaglom. An Introduction to the Theory of Stationary Random Functions. Prentice Hall, 1962.
[8] The MathWorks. MATLAB Reference Guide. The MathWorks, Inc., Natick, MA, 1992.
BLOCKNESS DISTORTION EVALUATION IN BLOCK-CODED PICTURES
M. Cireddu, F.G.B. De Natale, D.D. Giusto, and P. Pes
Department of Electrical and Electronic Engineering, University of Cagliari, Piazza d'Armi, Cagliari 09123, Italy. [email protected]

Abstract
In this paper, some of the most significant image quality indexes are reviewed and compared with a new method for block distortion evaluation. At first, a survey is given of classical measures based on numerical differences between original and reconstructed image data (e.g., MSE and SNR), as well as advanced methods aiming at considering the perceptive aspects of image degradation (e.g., Hosaka plots, HVS-based methods). Then, four innovative methods for blockness distortion evaluation are described, based on DCT analysis or on the use of gradient operators.

1. Objective Distortion Measures
The most classical distortion measure is the Mean Square Error (MSE) between the original image and the decoded one. It measures punctual variations of the image intensity by averaging the squared differences between couples of corresponding pixels:

MSE = (1/(mn)) Σ_{i=1}^{m} Σ_{j=1}^{n} [f(i, j) − f_r(i, j)]²
The Signal-to-Noise Ratio (SNR) and the Peak Signal-to-Noise Ratio (PSNR) can be directly derived from the MSE by using the following equations, which treat the distortion introduced by the coding-decoding operation as a kind of noise:

SNR = σ_x² / MSE,    PSNR = (2^b)² / MSE

with

σ_x² = (1/(mn)) Σ_{i=1}^{m} Σ_{j=1}^{n} (f(i, j) − f̄)²,    f̄ = (1/(mn)) Σ_{i=1}^{m} Σ_{j=1}^{n} f(i, j)
where f(i, j) is the original grey level of the (i, j)-th pixel, f_r(i, j) is the reconstructed grey level, b is the number of bits per pixel, and m, n are the image dimensions. These measures provide a global estimation of the image distortion after the coding-decoding process.

2. Advanced Methods
In this section, three of the most interesting image distortion measures are briefly reviewed, which differ from the above in the sense that human perception parameters are taken into account.
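The objective measures of Section 1 are straightforward to compute; the following sketch expresses PSNR in decibels against the usual peak value 2^b − 1, a common variant of the ratio defined above.

```python
import numpy as np

def mse(f, fr):
    """Mean square error between original f and reconstruction fr."""
    f, fr = np.asarray(f, float), np.asarray(fr, float)
    return np.mean((f - fr) ** 2)

def snr(f, fr):
    """SNR = variance of the original signal over the MSE."""
    f = np.asarray(f, float)
    return f.var() / mse(f, fr)

def psnr(f, fr, b=8):
    """PSNR in dB against the peak of a b-bit image (2^b - 1)."""
    return 10 * np.log10((2 ** b - 1) ** 2 / mse(f, fr))
```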
2.1 Hosaka Plots
The evaluation process consists of first segmenting (splitting) the N×N blocks of the original image into k classes. The initial block size N is usually chosen as 16, thus leading to 5 classes: all blocks of size k = 1, 2, 4, 8, 16 form the k-th class. From each class, two feature vectors are calculated, respectively based on the average standard deviation and on the weighted mean
where the elements marked with '*' refer to the reconstructed images. The error diagram, or H-plot, is constructed by plotting the corresponding features dS_k and dM_k in polar coordinates. The area of the H-plot is proportional to the image degradation; in particular, the presence of noise and blurring effects is put in evidence by looking at the left and right sides of the plot.

2.2 Information Content (IC)
This method is based on the evaluation of the perceptual distortion and therefore takes into account the characteristics of the human visual system (HVS) model. It consists of five stages: (i) the original image is re-mapped by a non-linear transformation; (ii) a linear transformation in the DCT domain is
applied to 8×8 image blocks; (iii) a matrix of coefficients is calculated at fixed resolution; (iv) the DCT coefficients are multiplied by the weights; (v) IC is determined by summing the coefficient magnitudes.

2.3 Perceptual distortion measure
The perceptual distortion measure is based on an empirical model of the human perception of spatial patterns. The model consists of four stages: (i) front-end linear filtering, (ii) squaring, (iii) normalization, and (iv) detection. A steerable pyramid transform decomposes the image locally into several spatial frequency levels; each level is further subdivided into a set of orientation bands θ ∈ (0, 45, 90, 135) degrees. The front-end linear transform yields a set of coefficients A_θ for every image region. The squared normalized output is computed, and a simple squared-error norm is adopted as detection mechanism:

R_θ = k A_θ² / ( Σ_{φ ∈ (0, 45, 90, 135)} A_φ² + σ² )

where k is a scaling constant, σ a saturation value, and R_ref, R_dist are the response vectors of the original and distorted images.
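The squaring-plus-normalization stage and the squared-error detection can be sketched minimally as follows; function names and default constants are illustrative, not the authors'.

```python
import numpy as np

def normalized_responses(A, k=1.0, sigma=0.1):
    """Divisive normalization: R_theta = k * A_theta^2 / (sum_phi A_phi^2 + sigma^2)."""
    A = np.asarray(A, float)
    return k * A ** 2 / (np.sum(A ** 2) + sigma ** 2)

def detection(R_ref, R_dist):
    """Squared-error detection mechanism between response vectors."""
    return np.sum((np.asarray(R_ref) - np.asarray(R_dist)) ** 2)
```

The saturation constant σ prevents division by zero in flat regions and models response saturation.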
3. Blockness distortion measures
Block distortion, or tiling effect, is typical of any kind of block-based coding system. It consists of an annoying visual mosaic effect produced by the imperfect matching of neighboring approximated blocks. Some image coding approaches reduce this drawback by using appropriate overlapping or interleaving techniques, but most of the common methods (including the current standards) prefer to ignore the problem for the sake of simplicity. The methods presented hereafter evaluate the amount of this particular but very common image degradation.
3.1 Methods based on DCT analysis
Two block distortion measures based on DCT analysis are considered here. Both are targeted at a particular kind of distortion appearing as a step of the luminance function in the horizontal or vertical direction, and consequently analyse the DCT features looking for this phenomenon. In our tests we considered blocks of size 8×8 at 8 bpp and their DCT coefficient matrices. A block characterised by a horizontal or vertical luminance step presents, in the corresponding coefficient matrix, a predominance in the first column or row. A block that has a double step, horizontal and vertical, has in the corresponding DCT matrix null elements (magnitude < 10⁻⁶, thus negligible) on the odd rows and columns (excluding the first ones). A particular case is a block that presents a double step (horizontal and vertical) given by the sum of single steps: because of the linearity of the transformation, the corresponding DCT matrix is the sum of the single-step DCT matrices. To apply these measures, we divide the image into blocks of size 8×8, called reference blocks. Then, the blocks are shifted by half the block dimension from the reference position in the two spatial directions separately. First, we consider the horizontally shifted blocks and their DCT matrices: since the goal is to detect vertical discontinuities, we calculate the sum of the first row's squared entries except the DC coefficient, weighted by a factor proportional to the coefficient position. Then, we consider the vertically shifted blocks, searching for horizontal steps with the same approach. The previous results are integrated in the following expression:

Σ1 = Σ_{(i,j)∈Ξh} (c_ij^DCT)² (i⁴ + j⁴) + Σ_{(i,j)∈Ξv} (c_ij^DCT)² (i⁴ + j⁴)
where Ξh denotes the first row of the horizontally shifted blocks and Ξv the first column of the vertically shifted blocks (both excluding the DC element). The reference blocks are then considered, and the sum of the first row's and column's squared elements is computed, weighted by a factor proportional to the coefficient position, obtaining a second term:
Σ2 = Σ_{(i,j)∈Ξr} (c_ij^DCT)² (i⁴ + j⁴)
where Ξr includes the first row and column of the reference block, excluding the DC element. Finally, the first DCT-based quality measure can be calculated as:

D1 = Σ1 / (Σ1 + Σ2)
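A sketch of this first DCT-based measure follows. The garbled source leaves the exact bookkeeping ambiguous, so two assumptions are made explicit here: coefficient indices are 0-based with the DC term at (0, 0), and a vertical step (seen in the horizontally shifted blocks) excites the first row of the DCT matrix.

```python
import numpy as np
from scipy.fft import dctn

def _blocks(img, n, dy=0, dx=0):
    """Non-overlapping n x n blocks of img, optionally shifted by (dy, dx)."""
    h, w = img.shape
    for y in range(dy, h - n + 1, n):
        for x in range(dx, w - n + 1, n):
            yield img[y:y + n, x:x + n]

def d1_index(img, n=8):
    """First DCT-based blockness measure, D1 = S1 / (S1 + S2)."""
    img = np.asarray(img, float)
    wgt = np.arange(n) ** 4                 # weight ~ (coefficient position)^4
    s1 = s2 = 0.0
    for b in _blocks(img, n, dx=n // 2):    # horizontally shifted: vertical steps
        c = dctn(b, norm='ortho')
        s1 += np.sum(c[0, 1:] ** 2 * wgt[1:])
    for b in _blocks(img, n, dy=n // 2):    # vertically shifted: horizontal steps
        c = dctn(b, norm='ortho')
        s1 += np.sum(c[1:, 0] ** 2 * wgt[1:])
    for b in _blocks(img, n):               # reference blocks: first row + column
        c = dctn(b, norm='ortho')
        s2 += np.sum(c[0, 1:] ** 2 * wgt[1:]) + np.sum(c[1:, 0] ** 2 * wgt[1:])
    return s1 / (s1 + s2) if s1 + s2 > 0 else 0.0
```

On a heavily tiled image the shifted blocks straddle the block boundaries, Σ1 dominates and D1 approaches 1; on a smooth image the shifted and reference blocks look alike and D1 stays near 0.5 or below.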
The second quality measure considers the 8×8 blocks shifted both horizontally and vertically. For each DCT-transformed block, the difference between the mean of the squared coefficients in odd positions Ξo (first row and column excluded) and the mean of the remaining squared coefficients Ξe is computed, both weighted by a factor proportional to the coefficient position, namely:

Σ3 = (1/|Ξo|) Σ_{(i,j)∈Ξo} (c_ij^DCT)² (i⁵ + j⁵) − (1/|Ξe|) Σ_{(i,j)∈Ξe} (c_ij^DCT)² (i⁵ + j⁵)
The second quality measure D2 is obtained by dividing Σ3 by the second term of the previous difference.

3.2 Methods based on Sobel operator
Blockness distortion in block-coded pictures is visually perceivable as a tiling effect. It can also be viewed as the superposition of false contours on the reconstructed image. To identify and measure such an effect, the Sobel operator can be successfully adopted. It is a quadratic filter based on the convolution of the image with two directional gradient masks:

Dx = [ -1  0  1 ]        Dy = [ -1 -2 -1 ]
     [ -2  0  2 ]             [  0  0  0 ]
     [ -1  0  1 ]             [  1  2  1 ]
The first experimented method consists of three steps: (i) subdivide the image into N×N blocks (N typically ranges from 4 to 32), (ii) apply the Sobel operator to the whole picture, and (iii) apply the Sobel operator only to the block boundary pixels. Fig. 1a shows a matrix representation of an image to which the method is to be applied.
Fig. 1. (a) Image considered, where M is the number of columns and N the number of rows of the matrix representation of the image. (b) Representation of the border pels of the blocks of dimension 4×4 contained in the image.
If we denote with Dsx(i, j) and Dsy(i, j) the convolutions of the picture with the two directional gradients, the Sobel operator's global magnitude can be computed as:

Σ_{(i,j)∈P} √( Dsx²(i, j) + Dsy²(i, j) )

where P is the set of all the image pels. Then, we consider a generic block of dimension 4×4 (see Fig. 1b) and apply the Sobel operator only to the pixels belonging to the block boundaries. The result of step (iii) is expressed as

Σ_{(i,j)∈PB} √( Dsx²(i, j) + Dsy²(i, j) )

where PB contains all the block border pixels. By dividing the latter result by the former, we obtain the first block distortion index SD:

SD = Σ_{(i,j)∈PB} √( Dsx²(i, j) + Dsy²(i, j) ) / Σ_{(i,j)∈P} √( Dsx²(i, j) + Dsy²(i, j) )

The value of SD is in the range [0, 1], being 1 when all the blocks are uniform. A variation of the previous method lies in treating separately the pixels of the vertical block borders and those of the horizontal block borders. The new blockness distortion index is:
SDD = ( Σ_{(i,j)∈Pv} Dsx(i, j) + Σ_{(i,j)∈Po} Dsy(i, j) ) / Σ_{(i,j)∈P} √( Dsx²(i, j) + Dsy²(i, j) )

where Pv contains all the pixels of the vertical block borders, while Po contains all the pixels of the horizontal block borders.
4. Results and Discussion
Several tests were performed on various test images. The following Table I refers to the Lenna image (512×512 pels, 8 bpp). We considered 12 decoded images at different quality factors (the relevant bitrate is given in the table), a low-pass filtered version (5×5 moving average), and a median filtered version (3×3 kernel); the noisy version was obtained by adding Gaussian noise (μ = 0, σ = 5). The remaining three versions were obtained by using a mosaic filter. The PDM and IC indexes are successful in evaluating the subjective distortion but cannot distinguish among different kinds of distortion (blurring, noise, tiling, etc.). The Sobel-based and DCT-based indexes are far better in block distortion evaluation; they are sufficiently low in pictures that do not present blockness distortion (low-pass filtered, median filtered, noise-added image) but have PSNR values lower than those of JPEG-coded pictures. This confirms how these indexes are efficient for the evaluation of the presence of tiling without being sensitive to other types of degradation.

Table I (entries that could not be recovered from the original are marked '—')

Process        RMSE    PSNR(dB)  IC     PDM   SD    SDD   D1    D2    Quality level
Original       0       ∞         23.73  0     0.75  0.63  0.50  0.54  reference
JPEG 1.75bpp   0.760   50.51     23.81  0.17  0.75  0.63  0.50  0.57  very good
JPEG 1.11bpp   1.020   47.97     24.64  0.28  0.75  0.63  0.51  0.58  very good
JPEG 0.85bpp   1.250   46.19     24.95  0.28  0.75  0.63  0.53  0.60  very good
JPEG 0.70bpp   2.520   40.10     25.20  0.54  0.75  0.63  0.55  0.62  very good
JPEG 0.60bpp   3.830   36.47     25.23  1.53  0.75  0.64  0.57  0.65  good
JPEG 0.52bpp   3.930   36.24     25.80  2.06  0.76  0.64  0.58  0.71  good
JPEG 0.43bpp   4.370   35.32     26.28  2.16  0.76  0.65  0.62  0.75  fairly good
JPEG 0.38bpp   4.920   34.30     26.16  2.34  0.76  0.66  0.65  0.77  acceptable
JPEG 0.32bpp   5.260   33.70     25.87  2.49  0.77  0.66  0.67  0.83  acceptable
JPEG 0.27bpp   6.030   32.52     26.36  2.69  0.78  0.68  0.72  0.86  poor
JPEG 0.20bpp   7.330   30.83     27.17  2.91  0.79  0.70  0.80  0.92  very poor
JPEG 0.16bpp   10.53   27.68     27.43  3.29  0.81  0.75  0.91  0.98  —
LP filter      17.99   23.03     34.03  2.41  0.70  0.52  0.04  0.53  fairly good
Add-noise      5.000   34.14     15.26  1.80  0.75  0.63  0.50  0.51  very good
Median         17.45   23.29     31.29  2.23  0.70  0.58  0.16  0.52  good
Mosaic(2×2)    7.150   31.04     30.20  1.81  0.75  0.63  0.51  0.75  acceptable
Mosaic(4×4)    10.50   27.71     35.56  2.88  0.99  1.08  0.51  0.99  poor
Mosaic(8×8)    14.62   24.83     —      —     —     —     —     —     very poor

References
[1] K. Hosaka, "A new picture quality evaluation method", PCS'86, Tokyo, Japan, April 1986.
[2] S. A. Karunasekera, N. Kingsbury, "A distortion measure for blocking artifacts in images based on human visual sensitivity", IEEE Trans. Image Processing, vol. 4, no. 6, pp. 713-724, June 1995.
[3] X. Ran, N. Farvardin, "A perceptually motivated three-component image model - Part I: Description of the model", IEEE Trans. Image Processing, vol. 4, no. 4, pp. 401-415, April 1995.
[4] J. A. Saghri, P. S. Cheatham, A. Habibi, "Image quality measure based on a human visual system model", Optical Engineering, vol. 28, pp. 813-818, 1989.
[5] P. Teo, D. Heeger, "Perceptual image distortion", Proc. 1st IEEE Conference on Image Processing, vol. 2, pp. 982-986, November 1994.
A NEW DISTORTION MEASURE FOR THE ASSESSMENT OF DECODED IMAGES ADAPTED TO HUMAN PERCEPTION
F. Bock, H. Walter, M. Wilde
Darmstadt University of Technology
Institute of Network- and Signal Theory
Merckstraße 25, D-64283 Darmstadt, Germany
E-mail: [email protected]
Abstract
The development of a new distortion measure for the assessment of decoded gray-scale images adapted to human perception is described. The errors are categorized into classes and locally assessed according to a human visual system model. Additionally, the summed-up errors of each class are globally weighted according to the significance of the distortion for a human observer. The combination of these weighted error sums leads to the image distortion measure adapted to human perception (DMHP).
Psychological phenomena of human perception like knowledge-based recognition and image understanding are not incorporated in this quality metric.

1 INTRODUCTION
The growth of data-intensive digital image applications, as in multimedia or video-telephony, makes image and video compression a central problem in digital communication and signal-storage technology. This has led to a wealth of lossy signal compression algorithms based on all sorts of data processing. The primary objective is high compression while maintaining sufficient signal quality. In order to compare different compression algorithms, a quality metric for decoded images is required. The most common quantitative measure is the mean square error (MSE), but it treats all spatial frequencies and brightness levels in the image uniformly, which is not necessarily meaningful, especially when there is a human receiver. A qualitative measure is the rating by trained photo interpreters, obviously a very costly possibility. Desirable would be a tool for automatic quality assessment of imagery which comes to the same results as a human observer.

2 THE HUMAN VISUAL SYSTEM MODEL
For an image distortion measure adapted to human perception we first developed a model of the human visual system [1]. Therefore, various physiological phenomena of the human visual system have been considered like the sensitivity to a background illumination level and to spatial frequencies [2], [3].
Figure 1: Block diagram of the DMHP system.

The physiological phenomena of the human visual system cause different impressions of errors in an image depending on the contents of the image [4]. This is considered in our model by separating the image into three characteristic classes. Figure 1 shows a block diagram of the system. The decoded image is subtracted from the original image, yielding an error image. The original image is separated into three characteristic classes, namely edges, textures and flat regions, to design masks for subdividing the error image. Then the errors within these classes can be assessed individually, depending on physiological aspects of the human visual system. In addition, the errors of each class can be globally weighted according to human perception. The mean square of all assessed and weighted errors results in the new distortion measure DMHP (distortion measure adapted to human perception). The process can be split up into three problems: separation of an image into characteristic classes, assessment of errors within these classes, and globally
weighting of the error sums of each class. The following paragraphs will briefly introduce our solutions.

2.1 Separation of an Image into Characteristic Classes
The level of objection for a human observer caused by errors in a decoded image depends on the region in the image where the error occurs [4]. It is easily understood that an error placed within a wide region of the same gray-scale values will give a different impression than the same error occurring at an edge. Therefore, in our distortion measure system three different masks are made out of the original image for assigning the errors to the respective classes. For the edge mask, the image is first filtered with a 3×3 median filter to emphasize significant edges. Afterwards, difference recursive filtering¹ [5] and thresholding are applied for edge detection. These edges are widened with morphologic operations to make sure that all relevant pixels are included. Detection of texture is gained by Laplacian operators and thresholding [6]. 9×9 median filtering is applied to determine connected regions and to suppress noise. From the texture mask, all pixels which already belong to the edge mask are subtracted in order to consider every pixel just once. All remaining pixels are assigned to the mask for flat regions. For a better understanding, the masks for the image "Lena" are shown in figure 2.
Figure 2: Masks for image "Lena". Top left: original image. Top right: mask for edges. Lower left: mask for textures. Lower right: remaining flat regions.
¹The algorithm was written by the team of Professor Serge Castan. Copyright © 1993, 1994, 1995, Khoral Research, Inc.
2.2 Assessment of Errors According to the Classes
After separating an image into the characteristic classes, the errors occurring within the classes must be assessed according to human perception. All assessment values (multiplication factors applied to the error values) were calculated on the original image and lie within 0...2 (0 = does not disturb at all, 2 = very disturbing). Edges are the most important features of an image because they represent the visual separation of objects. Obviously, the kind of error at edges plays an important role; e.g. smoothed edges may be disturbing, but they will not change the impression as severely as newly arising edges or even new structures. In order to achieve a differentiated assessment adapted to the human visual system, the edge detection is also performed on the decoded image. By concatenating the two edge masks, errors at edges can be additionally categorized into lost edges, i.e. edges which were present in the original image and are missing in the decoded image, changed edges and new edges. Consequently, we can sensitize our measure to new and highly disturbing edges, e.g. blocking structures which arise in JPEG coded pictures at high compression ratios. Finally, a limit of tolerance is set for distinguishing between perceivable and non-perceivable errors in order to neglect the non-perceivable ones [8]. In textures, the correlation between neighbouring pixels is more important than the absolute pixel value. For example, errors which occur in the feather of the "Lena" image will not cause any objection as long as the overall impression of the feather remains the same. Obviously, this effect also depends on the kind of texture: if the same error values that were invisible in the feather occurred in parts of the likewise textured hat, they would be recognized by everyone. A simple but sensible parameter for the assessment of an error value is the local variance².
To deal with a wide range of possible textures, the local variance has to be thresholded and inversely scaled to obtain assessment values between 0.5 (high variance) and 1 (low variance), i.e. errors in parts of an image with a high local variance count less than those in parts with a low local variance. In flat regions, human perception is sensitive to changes in gray scales, but a threshold for visibility can be found. Again, the local variance is used for assessment, i.e. errors which are smaller than the standard deviation³ will be neglected. Furthermore, the local variance increases for a region containing edges, which leads to a lower assessment or even neglect, i.e. the assessment with the local variance evokes the spatial masking of human perception⁴ [8]. It is additionally taken into account that the human sensitivity to noise depends on the background illumination level [9]. In figure 3, the final error assessment values for flat regions of the image "peppers" can be seen.
²The local variance is calculated over a region of 9 × 9 pixels of the original image.
³Standard deviation: σ = √variance.
⁴This means reduced visibility of, e.g., noise on both sides
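The variance-based assessment can be sketched as below; the 9 × 9 window follows footnote 2, while the texture-variance thresholds `lo` and `hi` are hypothetical values chosen only to map onto the stated range 0.5...1:

```python
import numpy as np

def local_variance(img, win=9):
    """Local variance over a win x win neighbourhood (footnote 2 uses 9 x 9)."""
    pad = win // 2
    p = np.pad(img.astype(float), pad, mode="edge")
    out = np.empty(img.shape, dtype=float)
    # A box filter via cumulative sums would be faster; a loop keeps it clear.
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = p[i:i + win, j:j + win].var()
    return out

def texture_assessment(var, lo=100.0, hi=2000.0):
    """Map local variance inversely onto [0.5, 1]: high variance -> 0.5.

    lo and hi are invented example thresholds, not the paper's values."""
    v = np.clip(var, lo, hi)
    return 1.0 - 0.5 * (v - lo) / (hi - lo)

def flat_assessment(err, var):
    """Neglect errors below the local standard deviation in flat regions."""
    return np.where(np.abs(err) < np.sqrt(var), 0.0, 1.0)
```

Errors below the local standard deviation thus receive assessment value 0 (neglected), and high-variance neighbourhoods count only half as much as low-variance ones.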
Figure 3: The error assessment values for flat regions of the image "peppers" (bright: high assessment value, black: error ignored).
2.3 Global Weighting of the Assessed Error Sums
In figure 4 the percental unvalued error distribution⁵ over the three characteristic classes of a JPEG series of the image "Golden Hill" is depicted. For an increasing compression ratio, the percental amount of detected errors belonging to edges increases exponentially. This corresponds to the arising blocking structures, which are indeed clearly visible and disturbing. Even a simple linear combination of these three unvalued error sums, in which the edges are weighted more strongly (weighting factor 2) than the other two classes (weighting factor 1), yields a clearly better distortion measure than the traditional MSE⁶. The global weighting factors are normalized in such a way that the effective number of pixel values remains unchanged. This normalization assures that the results for different weighting factors are still related to the MSE, so that the DMHP remains comparable with the MSE.
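A compact sketch of the normalized global weighting; the reduction to the MSE for unit weights follows footnote 6, and the per-class inputs are assumed to be the already-assessed squared-error sums:

```python
import numpy as np

def dmhp(err_sq_sums, pixel_counts, weights=(2.0, 1.0, 1.0)):
    """Globally weighted mean of per-class squared-error sums.

    err_sq_sums / pixel_counts hold, per class (edges, textures, flat
    regions), the sum of assessed squared errors and the number of pixels.
    The weights are normalised so that the effective pixel count stays
    unchanged; with unit weights and unit assessment this reduces to the
    ordinary MSE (cf. footnote 6).
    """
    w = np.asarray(weights, float)
    s = np.asarray(err_sq_sums, float)
    n = np.asarray(pixel_counts, float)
    return float((w * s).sum() / (w * n).sum())
```

With weights (1, 1, 1) the value equals the plain MSE of all pixels; with the edge weight 2 used in the text, edge errors dominate without changing the effective pixel count.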
Figure 4: Percental unvalued error distribution of the image "Golden Hill" belonging to the different classes, depending on the JPEG compression ratio.

3 RESULTS AND CONCLUSIONS
The image distortion measure was tested on various images including "Lena", "Golden Hill", "Mandrill" and "Peppers" (figure 5), which were compressed with different coding techniques (e.g. JPEG, wavelet coder). In addition, artificially generated test images were used to verify the assessment adapted to human perception. The separation of images into three characteristic classes worked successfully even for extremely different images. In figure 6 the amount of pixels allocated
of a large change in the background luminance.
⁵Error distribution without assessment within the classes.
⁶The MSE is equal to the sum of the three unvalued and equally weighted error sums!
to the three characteristic classes is given for some of the original test images.
Figure 5: The images "Golden Hill", "Mandrill" and "Peppers".
Figure 6: Pixel allocation for the three characteristic classes: edges, texture (pre_texture: texture before subtracting the edge pixels) and flat regions.

To compare the results of the new DMHP with the MSE, a JPEG compression series of the image "Golden Hill" is shown in figure 7. For a small compression ratio, when the quality of the image is not visibly decreased, the DMHP yields even smaller values than the MSE, while it yields clearly higher values when the distortion becomes visible. In conclusion, one can say that a clear adaptation to the subjective impression of a human observer has already been achieved, even without an optimization of the assessment and weighting parameters. In figure 8
edges and edges within textures should result in a still higher performance.
References
[1] M. B. Barlow, "Understanding natural vision", in Physical and Biological Processing of Images, O. J. Braddick and A. C. Sleigh, eds., Springer-Verlag, 1983.
Figure 7: DMHP and MSE for the JPEG compressed image "Golden Hill".
two images with the same MSE are shown: at the top the image degraded by JPEG compression, and at the bottom the image degraded by added white Gaussian noise. Obviously, the results of the DMHP are better adapted to a human observer. In the case of the JPEG compressed image, the highly disturbing block structures lead to a high value of the DMHP (even higher than the normal MSE). In contrast, for errors hardly perceivable even to the alert eye of a human observer, the value of the DMHP is low and remains clearly below the MSE.
[2] N. Jayant, J. Johnston and R. Safranek, "Signal Compression Based on Models of Human Perception", Proceedings of the IEEE, Vol. 81, No. 10, pp. 1383-1422, 1993.
[3] J. A. Saghri, "Image quality measure based on a human visual system model", Optical Engineering, Vol. 28, No. 7, July 1989.
[4] W. Xu and G. Hauske, "Perceptually relevant error classification in the context of picture coding", Image Processing and Its Applications, Conference Publication No. 410, pp. 589-593, 1995.
[5] Retro-Manual of Khoros 2.0.2, Khoral Research Inc., 1995.
[6] H. Ernst: Die digitale Bildverarbeitung. Franzis-Verlag: München, 1991.
[7] J. S. Goodman and D. E. Pearson, "Multidimensional scaling of multiply-impaired television pictures", IEEE Trans. Syst., Man, Cybern., Vol. 9, 1979.
[8] A. N. Netravali and B. G. Haskell: Digital Pictures. New York: Plenum Press, 1995.
[9] H.-M. Hang and J. W. Woods: Handbook of Visual Communications. Academic Press: San Diego, 1995.
[10] J. L. Mannos and D. J. Sakrison, "The Effects of a Visual Fidelity Criterion on the Encoding of Images", IEEE Transactions on Information Theory, Vol. IT-20, No. 4, April 1974.
Figure 8: Two images with the same MSE of 59. Top: JPEG compressed; here the DMHP yields 72. Bottom: image with added white Gaussian noise and a DMHP of 45.

For future work, an optimization of the assessment and weighting parameters has to be performed by comparing the results with those of different human observers. In addition, a more detailed analysis of texture and a better distinction between significant
[11] N. B. Nill, "A Visual Model Weighted Cosine Transform for Image Compression and Quality Assessment", IEEE Transactions on Communications, Vol. COM-33, No. 6, June 1985.
[12] X. Ran and N. Farvardin, "A Perceptually Motivated Three-Component Image Model: Parts I & II", IEEE Trans. Image Proc., Vol. 4, No. 4, pp. 401-415 and 430-447, April 1995.
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
Image Compression with Interpolation and Coding of Interpolation Errors
Jian Yi and Ferdinand Arp
Department of Electrical Engineering, University of Wuppertal, D-42097 Wuppertal, Germany

Abstract
This paper presents a new coding scheme for still image compression which works in the spatial domain with consideration of the human visual system (HVS). The method transmits a subsampled image, restores it by interpolation and corrects residual visible interpolation errors by transmission of additional information. Further redundancy and irrelevancy reduction is achieved by combining the above method with the BIGCHAIR technique. The results achieve reduction rates which are comparable to those of transform coding, but the implementation is much simpler.
1 Introduction
Image coding with classical DPCM has the disadvantage that the HVS is insufficiently considered. Thus its data reduction rate remains far behind that of transform coding techniques, although these do not consider redundancy reduction. On the other hand, DPCM has the inherent advantage of a very simple implementation. The coding scheme presented in this paper is designed on the basis of DPCM techniques, including an extensive consideration of the HVS. This scheme works on the following principles: Image areas in which the intensity is constant or varies only slowly can be sufficiently represented by locally sparsely distributed pixels. These pixels can be uniformly subsampled from the image matrix. The omitted pixels can subsequently be interpolated by the receiver. The generated interpolation errors remain invisible. Around edges and other areas where the image intensity changes rapidly, interpolation introduces large and partly visible errors. These should be corrected to achieve satisfactory reconstructions. Fortunately, the HVS is much less sensitive to errors in high activity areas than in smooth areas. This makes it possible to correct visible errors in high activity areas with only limited accuracy, and thus with few bits, in order to make them invisible. Thus we can represent an image with fewer bits in slowly varying areas, where interpolation does a good job, and correct the residual visible interpolation errors in high activity areas, where the human eye is less sensitive, with a few additional bits. An improved irrelevancy reduction can be introduced to DPCM by using the BIGCHAIR technique extension [1][2][3]. This technique has been successfully used for data reduction of still as well as moving images [4][5]. It consists of a blurring filter before and an inversely deblurring one after a standard closed-loop DPCM codec. The blurring filter is a non-ideal low-pass which attenuates small details of the image but does not remove them completely.
The deblurring filter works inversely to its blurring counterpart. The function of these filters is to raise the high spatial frequencies of the DPCM quantization errors in the decoded image. In this way the quantization noise is better matched to the HVS and becomes irrelevant. Moreover, this procedure allows access to the blurred image, which enables extended visually relevant processing. We adopt this pair of filters in our system so that all reconstruction errors generated in the coding process can be spectrally shaped.
2 System Description
These principles lead to the image coding system depicted in Fig. 1. The original image is first filtered by the multidimensionally acting blurring filter so that the details of the image content are attenuated. Subsequently, it is uniformly subsampled with ratio D = 2 in both the horizontal and vertical direction. A DPCM encoder is applied to the subsampled and blurred image in order to reduce redundancy as well as irrelevancy. The DPCM encoded differential signal is transmitted to the receiver, which DPCM decodes the received signal and interpolates the omitted pixels by means of a linear filter. The procedure of uniform subsampling and interpolation of the whole blurred image does not
distinguish between its smooth and high activity areas.

Fig. 1 Block diagram of the coding system (encoder: blurring filter, subsampler D:1 and DPCM encoder, with a local DPCM decoder, interpolator 1:D and encoder for the blurred error image; decoder: DPCM decoder, interpolator 1:D, addition of the correction information and deblurring filter).

The described scheme is easily implemented, but some visible interpolation errors remain in high activity areas of the inversely filtered reconstruction. This vestige of visible interpolation errors must be corrected in order to make them visually irrelevant in the final reconstruction. To do so, we first generate a blurred error image at the transmitter side. It is the difference between the blurred original image and its blurred interpolated replica. The correction information is obtained from this error image and is also transmitted to the receiver. The receiver adds this correction information to the interpolated image. Finally, it is inversely filtered to obtain the deblurred and visually relevant reconstruction. Several methods can be used to deduce the correction information from the error image. In this paper we studied the following coding methods of the error image for the elimination of the residual relevant reconstruction errors:
• DPCM coding. We studied the correlation characteristics of the blurred error image with respect to its matched DPCM coding, designed the predictor of the codec and experimentally adjusted the quantization characteristic of the encoder.
• Subsampling and interpolation. The blurred error image is also uniformly subsampled and non-linearly quantized before it is transmitted. The receiver interpolates the blurred and subsampled error image by means of a linear filter. This method halves the total amount of sampling points.
• Simple quantization. The blurred error image is directly quantized with an experimentally chosen quantization characteristic, without any further processing.
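A toy sketch of the subsample/interpolate/error-image chain (blurring and DPCM omitted); zero-order-hold upsampling is a crude stand-in for the optimised linear interpolator derived in section 2.2:

```python
import numpy as np

D = 2  # subsampling ratio in both directions, as in the paper

def subsample(img, d=D):
    """Uniform subsampling of the (blurred) image."""
    return img[::d, ::d]

def interpolate_zoh(sub, d=D):
    """Zero-order-hold upsampling; a stand-in for the Wiener interpolator."""
    return np.repeat(np.repeat(sub, d, axis=0), d, axis=1)

rng = np.random.default_rng(1)
img = rng.uniform(0, 255, (64, 64))
rec = interpolate_zoh(subsample(img))
error_image = img - rec       # correction information is derived from this
```

At the transmitted sample positions the error image is exactly zero; the remaining (interpolated) positions carry the residual errors that the correction step must make visually irrelevant.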
2.1 Blurring and deblurring filters
The blurring and deblurring filters are identical to those in the BIGCHAIR-DPCM system described in [1]. The blurring filter is a recursive one with the transfer function, written in one dimension for simplicity,

$$U(z) = K\,(1 - a(z))^{-1},$$

where $a(z)$ is a transversal filter with identical coefficients $A_k = A$ for all $k$ ($0 < k \le M$), so that its transfer function becomes $a(z) = \sum_{k=1}^{M} A z^{-k}$. This so-called equiweighting aperture is the result of a derivation to achieve the minimum reconstruction error power [1]. The factor $K$ is chosen in such a way that any constant input signal will not be attenuated by the blurring filter. This requires the condition $A = (1 - K)/M$. On the other hand, the high
frequency amplitudes of the input signal are approximately attenuated by the factor $K$ ($0 < K < 1$). The deblurring filter exactly realises the reciprocal transfer function of the blurring filter, so that the original signal can be completely recovered when the error-free blurred signal is passed through the deblurring filter.

2.2 Subsampling and interpolation of the blurred image
The subsampling scheme used for the blurred image is depicted in Fig. 2. The linear interpolator for recovering the omitted samples is determined by using the correlation functions of both the blurred image and the superimposed quantization noise of the DPCM encoder as a priori information. The optimisation of the interpolator is carried out using the LMS error criterion. It turns out to be a classical Wiener filter extended to the interpolation of a signal from noisy samples. The following normal equation system can be derived for the determination of the interpolator coefficients $\{l_{kD+n}\}$:

$$\sum_{k=-\infty}^{\infty} l_{kD+n}\left(R_{(l-k)D} + Q_{(l-k)D}\right) = R_{lD+n}, \qquad 0 < n < D, \text{ for all } l,$$

where $R_{lD+n}$ is the covariance function of the blurred signal, while $R_{(l-k)D}$ and $Q_{(l-k)D}$ are the covariance functions of the subsampled signal and the superimposed quantization noise, respectively. The z-transform of the above equation results in

$$i(z) = \frac{r(z)}{r_D(z^D) + q_D(z^D)},$$

where $i(z)$ is the transfer function of the interpolator, $r(z)$ is the spectral power density of the blurred image, and $r_D(z^D)$ and $q_D(z^D)$ are the spectral power densities of the subsampled signal and the additive noise, respectively.

Fig. 2 Subsampling of the blurred image (• subsampling positions of the blurred original image).
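In one dimension, the blurring/deblurring pair of section 2.1 can be sketched as follows; K = 0.5 and M = 4 are example values, not the authors' settings:

```python
import numpy as np

def blur(x, K=0.5, M=4):
    """Recursive blurring filter U(z) = K (1 - a(z))^-1 with the
    equiweighting aperture a(z) = sum_{k=1..M} A z^-k, A = (1 - K)/M,
    and zero initial state."""
    A = (1.0 - K) / M
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = K * x[n] + A * sum(y[n - k] for k in range(1, M + 1) if n - k >= 0)
    return y

def deblur(y, K=0.5, M=4):
    """Exact reciprocal (1 - a(z)) / K of the blurring filter."""
    A = (1.0 - K) / M
    x = np.empty(len(y))
    for n in range(len(y)):
        x[n] = (y[n] - A * sum(y[n - k] for k in range(1, M + 1) if n - k >= 0)) / K
    return x

step = np.concatenate([np.zeros(8), np.ones(32)])  # a step edge
blurred = blur(step)
recovered = deblur(blurred)   # error-free case: exact recovery
```

The cascade is the identity in the error-free case, and a constant input settles to gain 1, which is exactly the condition A = (1 − K)/M; the step transition itself is smeared, i.e. high frequencies are attenuated by roughly the factor K.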
2.3 Cyclostationarity of the interpolation error image
On the assumption that the input signal is a stationary process, the interpolation error will be a cyclostationary process. This means that its covariance function depends on the shift $n$ of the starting point of the interpolated subsequence:

$$E\{\Delta S_{iD+n}\,\Delta S_{(i-j)D+n-m}\} \neq E\{\Delta S_{iD+n'}\,\Delta S_{(i-j)D+n'-m}\}, \quad \text{for } n \neq n',$$

where $\{\Delta S_{iD+n}\}$ is the interpolation error sequence. On the other hand, this covariance is a periodic function of the starting point $n$ with period $D$:

$$E\{\Delta S_{iD+n}\,\Delta S_{(i-j)D+n-m}\} = E\{\Delta S_{(i+k)D+n}\,\Delta S_{(i+k-j)D+n-m}\}, \quad \text{for any } k.$$

Each subsequence $\{\Delta S_{iD+n}\}$ for any fixed starting point $n$ ($0 < n < D$) is stationary:

$$E\{\Delta S_{iD+n}\,\Delta S_{(i-j)D+n}\} = E\{\Delta S_{(i+k)D+n}\,\Delta S_{(i+k-j)D+n}\}.$$

The covariance function of the interpolation error sequence can be determined from the covariance function $R_{jD+m}$ of the blurred signal and the interpolator coefficients $\{l_{kD+n}\}$ by application of the normal equations:

$$E\{\Delta S_{iD+n}\,\Delta S_{(i-j)D+n-m}\} = R_{jD+m} - \sum_{k=-\infty}^{\infty} l_{kD+n}\,R_{(k-j)D+n-m}, \qquad 0 \le m < n.$$
2.4 DPCM coding of the blurred interpolation error image
In the following sections we discuss operations applied to the difference image between the blurred original image and the DPCM decoded and subsequently interpolated blurred image. First we describe DPCM coding of the interpolation error image, which is used to reduce its redundancy as well as its irrelevancy. Due to the cyclostationary nature of the interpolation error, different predictors for DPCM coding of the interpolation error image must be designed for each of the interpolation error subsequences. There are four such subsequences in our case, with subsampling ratios D = 2 in both the horizontal and vertical direction. Three of them are essential concerning the correlation of the interpolated pixel values. The predictors for these three subsequences are designed according to the LMS error criterion using the knowledge of the covariance functions of the interpolation error as well as of the DPCM quantization noise.
2.5 Subsampling and interpolation of the blurred interpolation error image
In this section we discuss subsampling, quantization and interpolation of the interpolation error image. The subsampling scheme of the error image is depicted in Fig. 3. Different interpolators must be designed for different subsequences to match the cyclostationary interpolation error sequence. The error subsequence at the existing subsampling positions of the blurred image is not needed. The remaining two interpolators are designed according to the LMS error criterion with the covariance functions of the interpolation error and its quantization noise. This interpolation only incompletely restores the whole error image. However, experimental results show that this erroneously interpolated error image suffices for the visually relevant correction of the interpolation error in the interpolated blurred image. The reason is that the HVS is not sensitive to errors in high activity areas.
Fig. 3 Subsampling of the error image (• subsampling positions of the blurred original image, ▲ subsampling positions of the blurred error image).
3 Results and Conclusions
Computer simulations of the described coding procedures have been carried out. The smallest transmission rate is achieved when applying the method of subsampling and interpolation to the error image (section 2.5). The test image PLAYBOY can be encoded with a transmission rate of 0.93 bits/pel without visible errors. Applying DPCM coding to the error image (section 2.4) results in a transmission rate of 1.1 bits/pel. When using simple quantization of the error image, the whole image is encoded with 1.2 bits/pel without visible errors. Generally, the achieved coding efficiency is comparable to that of transform coding techniques. Our implementation, however, is much simpler.
References:
[1] F. Arp, "BIGCHAIR-DPCM, a new method for visually irrelevant coding of pictorial information", in Proceedings of the 1988 IEEE International Symposium on Circuits and Systems, Espoo (Finland), June 7-9, 1988, pp. 231-234.
[2] F. Arp, "System properties of BIGCHAIR-DPCM compared with other coding schemes for data reduction of visual information", in Proceedings of the IEEE Workshop on Visual Signal Processing and Communications, Raleigh (NC, USA), Sept. 2-3, 1995, pp. 183-187.
[3] F. Arp, "DPCM extension considering the human visual system", in Proceedings of the IEEE Workshop on Visual Signal Processing and Communications, Rutgers University, Piscataway (NJ, USA), Sept. 19-20, 1994, pp. 120-125.
[4] F. Arp and J. Wassermann, "Irrelevant data reduction by BIGCHAIR-DPCM", in Proceedings of the 1991 Picture Coding Symposium, Tokyo, Sept. 2-4, 1991, pp. 147-150.
[5] J. Wassermann, DPCM-Kodierung von Bildsequenzen mit Irrelevanzreduktion durch Unscharffilterung. VDI Fortschrittberichte, Reihe 10, Informatik/Kommunikationstechnik Nr. 219, VDI Verlag, Düsseldorf, 1992.
Matrix to vector transformation for image compression

Djamel Ait-Boudaoud
Department of Electronics, Bournemouth University, Talbot Campus, Fern Barrow, Poole, Dorset BH12 5BB, United Kingdom
Email: dj [email protected]

Abstract
This paper presents a new algorithm for approximating a rectangular matrix (Z) with a new matrix (P) derived from the product of a column vector (X) and a row vector (Y). The result of this transformation enables the compression of images from N² to 2N values, where N is the size of the sub-image block. The problem is also solved using neural networks, and a comparative analysis of both solutions is provided.
1 Introduction
The large amount of data used for the representation of digital images is still a significant concern in many applications as large channel bandwidths and high capacity storage devices are required for their transmission or storage in their original form (raw data). Techniques for reducing inherent data redundancy existing in many digital pictures are being continuously researched and developed. The proposed method is based on a well-known mathematical concept of constructing a rectangular matrix from the product of equal length column and row vectors. The approach introduced in this paper considers the reverse process, i.e. starting from a rectangular matrix (Z), construct the approximate column vector (X) and row vector (Y) to closely restore the original matrix (X.Y~Z). This method, in many instances, will lead to an over-determined problem as the number of known data far exceeds the number of unknown variables. Hence approximation techniques are used to solve the problem. Section 2 presents a brief overview of image compression techniques. Section 3 details the mathematical approximation solution. Results and experiments using the proposed algorithm are provided in section 4. Section 5 contains neural nets solution together with results and comparative analysis. Conclusions are provided in section 6.
2 Image Compression Techniques
A number of image compression schemes have been developed; examples include transform coding [1], Differential Pulse Code Modulation [2], hierarchical image decomposition and vector quantisation [3], etc. The primary objective of compression techniques is to reduce the average bits per pixel for transmission or storage, whilst minimising the distortion of information and preserving the quality of the reconstructed images. In general, the techniques used in compression are performed in two phases. The first phase consists of mathematically transforming images into new representations suitable for compression. The second phase is concerned with encoding the new representations. Typical examples of mathematical transformation include the conversion from the spatial domain to the frequency domain (the DCT is a typical example), whilst encoding methods include Huffman and run-length encoding.
3 Proposed mathematical algorithm
The proposed algorithm falls into the first phase of the compression process. Essentially, the algorithm is based on the assumption that each square sub-block in an image represents a square matrix (Z) whose values can be computed from the product of a column vector (X) and a row vector (Y) such that,
$$\begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} (y_1 \ \cdots \ y_m) = \begin{pmatrix} z_{11} & \cdots & z_{1m} \\ \vdots & & \vdots \\ z_{n1} & \cdots & z_{nm} \end{pmatrix} \qquad (1)$$
Assume that the best approximate fit is measured by minimising the sum of the linear separations for all elements of XY − Z. For this, we compute the sums of the columns and rows,

$$c_i = \sum_{j=1}^{n} z_{ji}, \qquad r_i = \sum_{j=1}^{m} z_{ij}. \qquad (2)$$

Then the $c_i$ and $r_i$ are made to be in the same ratio to each other as the corresponding $y_i$ and $x_i$, such that

$$\frac{c_1}{c_2} = \frac{y_1}{y_2},\ \frac{c_1}{c_3} = \frac{y_1}{y_3},\ \ldots,\ \frac{c_1}{c_m} = \frac{y_1}{y_m}, \qquad (3)$$

$$\frac{r_1}{r_2} = \frac{x_1}{x_2},\ \frac{r_1}{r_3} = \frac{x_1}{x_3},\ \ldots,\ \frac{r_1}{r_n} = \frac{x_1}{x_n}. \qquad (4)$$
The next stage consists of selecting any non-zero $c_i$ and $r_i$ (say $c_1$ and $r_1$) and expressing all members of X and Y as multiples of $x_1$ and $y_1$ as follows,

$$X^t = (x_1, x_2, \ldots, x_n) = \left(x_1,\ x_1\frac{r_2}{r_1},\ \ldots,\ x_1\frac{r_n}{r_1}\right), \qquad (5)$$

$$Y = (y_1, y_2, \ldots, y_m) = \left(y_1,\ y_1\frac{c_2}{c_1},\ \ldots,\ y_1\frac{c_m}{c_1}\right). \qquad (6)$$
Then the error on each matrix element can be expressed by

$$e_{ij} = w\,\frac{r_i c_j}{r_1 c_1} - z_{ij}, \qquad (7)$$

where $w$ is the value of $x_1 y_1$ that makes $e_{ij} = 0$. For $n = m = N$, the total error for each $w$ is evaluated by

$$MSE(w) = \frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} (e_{ij})^2. \qquad (8)$$

It can be seen that there will be $N^2$ values of $w$ and $MSE(w)$. The value of $w$ that gives the minimum MSE is the best solution to the problem. Once $w$ is chosen, $x_1$ and $y_1$ are derived such that $w = x_1 y_1$, and the remaining terms of the vectors X and Y are obtained using equations (5) and (6).
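The algorithm of equations (2)-(8) can be sketched for one block as follows (a sketch under the assumption of strictly positive pixel values, so that all row and column sums are non-zero):

```python
import numpy as np

def decompose_block(Z):
    """Approximate an N x N block Z by an outer product x.y^T.

    Column/row sums fix the ratios within Y and X (eqs. 3-6); the scalar
    w = x1*y1 is chosen among the N^2 candidates that each zero one
    element's error (eq. 7), minimising the MSE of eq. (8).
    """
    Z = np.asarray(Z, float)
    c = Z.sum(axis=0)                         # column sums c_j
    r = Z.sum(axis=1)                         # row sums r_i
    rc = np.outer(r, c) / (r[0] * c[0])       # (r_i c_j) / (r_1 c_1)
    candidates = (Z / rc).ravel()             # w that makes e_ij = 0
    mses = [np.mean((w * rc - Z) ** 2) for w in candidates]
    w = candidates[int(np.argmin(mses))]
    y1 = np.sqrt(abs(w))                      # any split of w into x1*y1 works
    x1 = w / y1
    return x1 * r / r[0], y1 * c / c[0]       # eqs. (5) and (6)

# For a genuinely rank-1 block the reconstruction is exact.
Z = np.outer([1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 0.5, 1.5])
x, y = decompose_block(Z)
```

For general blocks the chosen w minimises eq. (8) over the N² candidates, and the 2N vector entries replace the N² matrix entries for transmission.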
4 Experiments and results
This method has been applied to the compression of grey scale images. Several experiments were carried out to analyse performance criteria such as the best block size, compression ratio, Peak Signal to Noise Ratio (PSNR) and subjective quality. A typical test image is presented in Figure (1). The reconstructed image at a ratio of 8:2 is shown in Figure (2). It should be noted that the ratio given is only concerned with phase 1 of the compression process. Subjective analysis of this reconstructed image indicates a good reconstruction using the proposed method.
Figure 1: Original image (256×256, 8 bpp)
Figure 2: Reconstructed image (cr = 8:2)
The compression ratio is highly dependent on the block size. Increasing the block size, however, results in too many equations for few variables (an over-determined problem). Consequently, the reconstructed image begins to show blocking effects near the edges (see Figure 3). The Peak Signal to Noise Ratios (PSNR) of the reconstructed images for block sizes varying from 2 to 32 were analysed, and the best block size with respect to the PSNR was found to be 4 (see Figure 4). This is also confirmed by the subjective analysis of the reconstructed images. This block size achieves a compression ratio of 8:2 with a high PSNR of 41.84 dB. Note that the vector elements are coded with half the number of bits used for the matrix elements.
Figure 3: Reconstructed image (cr=8:1)
Figure 4: Selection of the block size

5 Neural nets solution
Neural networks have been used extensively in image processing. It is also known that, given a sufficient number of hidden layers, a multilayer feedforward neural net can be used as a universal approximator [4]. Consequently, an alternative approach to the one developed above, based on neural networks, is proposed to solve our problem. A simple multilayer perceptron architecture is adopted with a single hidden layer. The input and output layers of the network consist of a fixed number of neurons to reflect the chosen block size, while the number of neurons in the hidden layer is flexible and determined by an iterative process. The network was trained on known vectors, using backpropagation as the training algorithm. These vectors were arranged to represent the output training image, and the matrix formed by the product of the vectors is used as the training input image. Figure 5 illustrates the architecture of the neural net and the training images.
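The described network can be imitated with a tiny single-hidden-layer perceptron trained by backpropagation; the hidden-layer width, learning rate and synthetic rank-1 training blocks below are assumptions for illustration, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4                                # block size (the paper's best choice)
n_in, n_hid, n_out = N * N, 32, 2 * N   # hidden width is an assumed value

def make_batch(size):
    """Synthetic training data: random rank-1 blocks and their vectors."""
    x = rng.uniform(0.2, 1.0, (size, N))
    y = rng.uniform(0.2, 1.0, (size, N))
    blocks = (x[:, :, None] * y[:, None, :]).reshape(size, n_in)
    return blocks, np.concatenate([x, y], axis=1)

W1 = rng.normal(0, 0.1, (n_in, n_hid)); b1 = np.zeros(n_hid)
W2 = rng.normal(0, 0.1, (n_hid, n_out)); b2 = np.zeros(n_out)

blocks, targets = make_batch(256)
lr, losses = 0.05, []
for epoch in range(200):             # the paper reports convergence at 200 epochs
    h = np.tanh(blocks @ W1 + b1)    # hidden layer
    out = h @ W2 + b2                # linear output layer
    err = out - targets
    losses.append(float(np.mean(err ** 2)))
    # Backpropagation: plain batch gradient descent on the MSE loss.
    g_out = 2 * err / len(blocks)
    g_h = (g_out @ W2.T) * (1 - h ** 2)
    W2 -= lr * h.T @ g_out; b2 -= lr * g_out.sum(0)
    W1 -= lr * blocks.T @ g_h; b1 -= lr * g_h.sum(0)
```

On this synthetic set the mean squared error drops below its initial value within the 200 epochs, mirroring the convergence behaviour described in the text.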
Figure 5: Multilayer Perceptron Network with the training images

Using a single training set, the network converged to a steady state after 200 epochs. This network was then simulated using real images. Figure 6 highlights the encoding process. Upon completion of encoding, the resulting vectors are rearranged in the correct order. The decoding process simply performs a vector multiplication operation. The result of the decoding process is depicted in Figure 7. A comparison of both the analytical and neural solutions was undertaken to evaluate the efficiency of the encoding process and the quality of the reconstructed images. This comparison is summarised by the following points:
• The analytical solution does not require training.
• The quality of the reconstructed images is perceivably better using the analytical process.
• Reconstructed images using the neural network solution showed some artefacts at the edges.
• Encoding using neural networks is much faster than the analytical method.
Figure 6: Encoding process of 'Lenna'
Figure 7: Reconstructed 'Lenna' image
6 Conclusions
In this paper, a new method for image compression based on the decomposition of matrices into column and row vectors has been presented. The proposed solution is based on approximation techniques, because the problem lends itself to solving over-determined simultaneous equations. The assumption of preserving the ratios of the column and row sums has proved to be efficient in speeding up the computation. The second solution, using a simple neural network architecture, also proved adequate. The promise shown by this method suggests further investigation of the neural approach with respect to optimising the network architecture and improving the performance of the method. Factors such as the number of hidden layers and their associated number of nodes are critical to avoid loss of memory as well as overtraining, and these will be considered together with other neural network architectures. The major advantage of both proposed solutions is the simplicity of the decoding stage, as the decoding process consists simply of a vector multiplication. This is an important factor particularly for consumer electronics, where the size and the cost of the decoding stage must be kept to a minimum, and it is also critical for real-time encoding and decoding schemes. A further improvement of the compression ratio has been achieved using a pyramid encoding/decoding scheme. However, subjective analysis must be performed to identify the appropriate depth of the pyramid, as the accumulation of errors could lead to unacceptable quality of the reconstructed images. Needless to say, any reduction in the approximation errors may have an effect on the pyramid scheme.
References
[1] Clarke, R.J. (1985): 'Transform coding of images', Academic Press.
[2] Jayant, N.S., Noll, P. (1984): 'Digital coding of waveforms: Principles and applications to speech and video', Prentice Hall.
[3] Gersho, A. and Gray, R.M. (1992): 'Vector quantization and signal compression', Kluwer Academic Publishers, Boston.
[4] Hornik, K., Stinchcombe, M., White, H. (1989): 'Multi-layer feedforward networks are universal approximators', Neural Networks, 2, pp. 359-368.
A SPEECH CODING ALGORITHM BASED ON WAVELET TRANSFORM
Xiaodong Wu, Yongming Li, Hongyi Chen
Institute of Microelectronics, Tsinghua University, Beijing 100084, China
Abstract: Though the wavelet transform has begun to be used in audio compression and speech parameter extraction, it has not been used in speech compression. In this paper, we give a speech coding algorithm based on the wavelet transform. It has the features of high compression gain, good quality, wide applicability and no artifacts. It is effective not only for speech signals but also for music signals, so it will help to implement low bit rate audio compression.
1. Introduction
In recent years, the wavelet transform has been widely used as a new analytical tool in many research areas [1][2]. Having a good time-frequency resolution, the wavelet transform is especially suitable for processing time-varying signals. Research in image compression and audio compression shows that the wavelet transform does help with these tasks [3][4]. In this paper we give a speech coding algorithm based on the wavelet transform. The paper is organized as follows. First we introduce the wavelet transform. Then a speech signal is decomposed into several bands with the wavelet transform, centering on how to choose a wavelet function. To compress the signal, a dynamic bit allocation algorithm is given to assign bits to each band. At last, some discussions on the results are given.
2. Wavelet transform
Wavelets are a new family of basis functions for the space of square integrable signals [1][2]. In this paper we consider only orthonormal wavelets. Given a wavelet function ψ(x), its dyadic dilates and integer translates
generate a Hilbert basis for L²(R). Decomposing a given signal f(x) ∈ L²(R) on this basis, we get the wavelet coefficients

W_{j,k} = ∫_{-∞}^{+∞} f(x) ψ_{j,k}(x) dx,  j, k ∈ Z,

which provide a multiresolution analysis of the signal, and from these coefficients we can reconstruct the original signal:

f(x) = Σ_j Σ_k W_{j,k} ψ_{j,k}(x).

ψ(x) is not arbitrarily chosen; it must satisfy some conditions. For details, see [1][2].
3. Speech coding algorithm based on wavelet transform
A speech signal is often sampled at no more than 10kHz, with a bandwidth of about 4kHz. Using the wavelet transform one can decompose the speech signal into L+1 bands, where L is the number of levels used in the wavelet decomposition. We use the generalized wavelet transform, the wavelet packet transform, to decompose the signal into up to 2^L bands. Since there are about 16 critical bands in the 4kHz bandwidth, L need not be large, and we take L to be 5. The width of each band is then 125Hz, which approaches the narrowest width of the critical bands. One problem in the process is which kind of wavelet function should be chosen, because wavelet theory gives us too many bases. At first we used the adapted wavelets with a finite support length as in [4], but in experiments we found that wavelets of a given support and a maximal number of vanishing moments give about the same results. This is mainly because of the coding method we use in this algorithm. To get a high frequency resolution, we should increase the support length of the wavelets, but this will result in a large amount of computation and a low time resolution. So it is a good trade-off to use a Daubechies wavelet with a support length of more than 16, and we take 20 in our algorithm. In order to compress the signal, there are two ways. One is to use the coefficients in some bands while throwing away the others, as A. H. Tewfik et al. did in [4]. Obviously most information will be lost this way. The other way is to use a bit allocation algorithm, with quantization done according to the bits allocated. This technique is more often used in audio compression [5][6]. We use the average energy in each band as a criterion, and calculate the number of bits assigned to each band dynamically. The block diagram of the encoder is shown in Fig. 1.
Fig. 1. The block diagram of the encoder (PCM samples, framing, bit allocation, bit-stream formatting, encoded bit-stream).
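The dynamic bit allocation from per-band average energies can be sketched as below. The paper does not give its exact formula, so this uses the classic rule b_i = B/M + 0.5·log2(E_i / geometric mean(E)) as an illustrative assumption:

```python
import numpy as np

# Sketch of energy-driven dynamic bit allocation: bands with more
# average energy receive more bits.  The specific formula below is a
# standard textbook rule, assumed here for illustration.
def allocate_bits(energies, total_bits):
    energies = np.asarray(energies, dtype=float)
    m = energies.size
    gm = np.exp(np.mean(np.log(energies)))          # geometric mean
    b = total_bits / m + 0.5 * np.log2(energies / gm)
    return np.maximum(np.round(b), 0).astype(int)   # no negative bits

bits = allocate_bits([16.0, 4.0, 1.0, 1.0], total_bits=16)
```

Rounding means the grand total can differ slightly from the bit budget; a real coder would redistribute the remainder.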
4. Conclusions
We use the algorithm to compress different kinds of speech and music signals, with compression gains of 4, 8, 16 and 32 respectively. Some of the results are given in Fig. 2.
Fig. 2. (a)(b) A segment of speech signal and its spectrum. (c)(d) The reconstructed signal of (a) and its spectrum with this algorithm. (e)(f) The reconstructed signal of (a) and its spectrum with LPC-10e. The compression gain is 32.
Subjective tests show that it is very promising to compress speech signals with the wavelet transform. Of course, the quality of the reconstructed signal degrades with the increase of the compression gain. But even at a low bit rate of less than 3kbps (including the side information), the reconstructed signals still have intelligibility and naturalness, and their quality is better than that of the signal reconstructed with LPC-10e. The algorithm is robust to the speech of young and old, men and women, and is effective for music signals. There are no artifacts in the reconstructed signals. The shortcoming of the algorithm is that, because of the high compression gain, too much of the high frequency part of the signal is lost. Therefore, the reconstructed speech sounds low and deep.
From the frequency spectra of the signals, we can draw the same conclusions. The spectrum of the reconstructed speech matches that of the original speech better than that of LPC-10e at places where the amplitude of the spectrum is high (often the low frequency part). At places where the amplitude of the spectrum is low (often the high frequency part), too much of the spectrum is thrown away, which results in the reconstructed speech sounding low. Because this algorithm belongs in fact to the waveform coding methods of the transform domain, the use of perceptual effects is necessary, and this is sure to improve the quality of the reconstructed signals greatly. More research work is being done.
REFERENCES:
[1] I. Daubechies, "Orthonormal bases of compactly supported wavelets", Commun. Pure Appl. Math., vol. 41, Nov. 1988, pp. 909-996.
[2] S. Mallat, "Multiresolution approximations and wavelet orthonormal bases of L²(R)", Trans. Amer. Math. Soc., vol. 315, Nov. 1989, pp. 69-87.
[3] D. Sinha, A. H. Tewfik, "Low bit rate transparent audio compression using adapted wavelets", IEEE Trans. Signal Processing, vol. 41, No. 12, Dec. 1993, pp. 3463-3479.
[4] A. H. Tewfik, D. Sinha, P. Jorgensen, "On the optimal choice of a wavelet for signal representation", IEEE Trans. Information Theory, vol. 38, No. 2, Mar. 1992, pp. 747-765.
[5] ISO CD 11172-3, Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s, Part 3: Audio.
[6] Digital Audio Compression (AC-3) Standard.
Automatic Determination of Region Importance and JPEG Codec Reflecting Human Sense
Rina Hayasaka, Jiying Zhao, Yoshihisa Shimazu, Koji Ohta and Yutaka Matsushita
Matsushita Lab., Dept. of Instrumentation Engg., Faculty of Science and Technology, Keio University, Japan
1. Introduction
Looking at a picture, human beings pay attention only to important objects, while they may ignore less important ones in the same picture. Many image processing techniques have been proposed to date, but very few consider the local importance of regions within an image, so it may be said that such techniques do not take the human sense into account. Presented here is a technique which determines the importance of regions automatically. If we can obtain the importance of regions, many image processing techniques may be improved and computers may get closer to human beings. We also developed a new compression method, an importance-adaptive baseline JPEG, which compresses important parts at higher quality while lowering the quality of the unimportant parts and the bit rate as a whole, and which surely reflects the sense of human beings.
2. System Outline
The outline of this system is shown in Fig. 1. The method takes these three steps:
- segment out regions from an image
- determine the importance of each region
- compress the image using the importances

Figure 1: System Outline
The following sections explain each of these steps.
3. Segmentation of an Image
First, the original image is transformed from RGB into the CIE L*a*b* representation, and segmented into clusters by clustering in CIE L*a*b* space [1]. Then the clusters in CIE L*a*b* space are written back into the original image. Considering the contiguity relationship and the color difference between the connected regions, fuzzy reasoning is used to merge regions and obtain the final regions.
4. Determination of Region Importance
After segmentation, each region's features are calculated, and its importance is automatically determined through fuzzy reasoning.
4.1 Importance Determinative Region Features
Objects may be considered visually important when they are outstanding to the eyes of human beings, or when they are meaningful or attractive to human beings. Whether a region is outstanding ("pop-out" [2]) to human beings depends on the features of the region. We found through experiment that the following features contribute to the importance:

- Area Ratio: the percentage of the whole image area that the region occupies.

  Arearatio_m = Σpixel_m / (Width × Height)

  where Σpixel_m is the total number of pixels belonging to region m.

- Position: describes how far the center of gravity of region m is from the center of the image.

  Position_m = √((m_x − C_x)² + (m_y − C_y)²)

  where (C_x, C_y) is the coordinate of the image center, and (m_x, m_y) is the coordinate of the center of gravity of region m.

- Compactness:

  Compact_m = 4π × area_m / perimeter_m²

  where area_m and perimeter_m are respectively the area and perimeter of region m. The compactness shows how compact region m is; it equals 1 when the region is round, and becomes smaller as the boundary of the region gets more complicated.

- Border Connection:

  Border_m = Σconnect_m / (2 × (Width + Height))

  where Σconnect_m is the number of pixels which are on the boundary of region m and connect with the image border.

- Region Color: the mean value of each CIE L*a*b* color component of the region, used separately.

- Outstandingness: describes how outstanding the region is compared with its neighbor regions.

  Outstand_m = Σ_{k=1, k≠m} ||color_m − color_k||² × (1 − distance_mk) × Arearatio_k

  where

  distance_mk = √((m_x − k_x)² + (m_y − k_y)²) / √(Width² + Height²)

  and ||color_m − color_k|| is the Euclidean distance in CIE L*a*b* space.
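The four geometric features above can be computed from a binary region mask as sketched below; the perimeter here is estimated by counting exposed pixel edges, which is an approximation of the true contour length:

```python
import numpy as np

# Sketch of the geometric region features: area ratio, position,
# compactness, and border connection, computed from a binary mask.
def region_features(mask):
    h, w = mask.shape
    ys, xs = np.nonzero(mask)
    area = ys.size
    area_ratio = area / (w * h)
    cy, cx = ys.mean(), xs.mean()            # centre of gravity
    position = np.hypot(cx - w / 2, cy - h / 2)
    padded = np.pad(mask, 1)
    # perimeter: number of region-pixel edges exposed to the background
    perimeter = (np.count_nonzero(mask & ~padded[:-2, 1:-1])
                 + np.count_nonzero(mask & ~padded[2:, 1:-1])
                 + np.count_nonzero(mask & ~padded[1:-1, :-2])
                 + np.count_nonzero(mask & ~padded[1:-1, 2:]))
    compactness = 4 * np.pi * area / perimeter ** 2
    border = (np.count_nonzero(mask[0]) + np.count_nonzero(mask[-1])
              + np.count_nonzero(mask[:, 0]) + np.count_nonzero(mask[:, -1]))
    border_connection = border / (2 * (w + h))
    return area_ratio, position, compactness, border_connection

mask = np.zeros((8, 8), dtype=bool)
mask[2:6, 2:6] = True                        # a 4x4 square region
ar, pos, comp, bc = region_features(mask)
```

With this edge-counting perimeter a square scores π/4; a disc would score close to 1, matching the definition above.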
4.2 Automatic Tuning of Fuzzy Reasoning Rules
Fuzzy reasoning is used to determine the importance of regions, using the above 8 features as input. Although fuzzy logic can encode expert knowledge directly and easily using rules with linguistic labels, it usually takes a lot of time to design and tune the membership functions which quantitatively define these linguistic labels. So, we chose the automatic tuning method proposed by Nomura [3]. In this method, neural network learning techniques automate the tuning of the rules and substantially reduce development time and cost while maintaining performance [4]. To reflect human knowledge, the reasoning rules are tuned by learning from a large amount of data from a subjective assessment experiment. The subjects are shown both an original image (Fig. 1(a)) and a segmented image (Fig. 1(b)), and they rate the importance of each region on 3 levels. The experiment was carried out with 15 subjects and 20 frames of images, and the data were averaged to form the input/output data. Using this dataset, we obtain rules reflecting human knowledge, which give the importance of each region. One evaluated image is shown in Fig. 1(c), which presents the importance level of each region as whiteness: the whiter, the more important.
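The flavour of delta-rule tuning of fuzzy rules can be sketched as follows. This is a minimal stand-in, not Nomura's full method [3]: it assumes a zero-order Sugeno model with fixed Gaussian memberships over a single synthetic feature, and tunes only the rule consequents by gradient descent; all constants and data are illustrative:

```python
import numpy as np

# Delta-rule tuning of rule consequents in a tiny fuzzy system
# (fixed Gaussian memberships, one input feature, five rules).
rng = np.random.default_rng(0)
centers = np.linspace(0.0, 1.0, 5)
sigma = 0.15
consequents = np.zeros(5)                   # the tunable rule outputs

def infer(x):
    mu = np.exp(-((x - centers) ** 2) / (2 * sigma ** 2))
    return mu @ consequents / mu.sum(), mu

# synthetic training data: "importance" grows with the feature value
xs = rng.uniform(0, 1, 200)
ys = xs ** 2

eta = 0.3
for _ in range(200):                        # delta-rule epochs
    for x, y in zip(xs, ys):
        out, mu = infer(x)
        w = mu / mu.sum()
        consequents += eta * (y - out) * w  # gradient of squared error

err = np.mean([(infer(x)[0] - y) ** 2 for x, y in zip(xs, ys)])
```

After tuning, the inference output tracks the target rating closely; the real system does this over the eight region features and the averaged subjective ratings.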
5. The Importance Adaptive Codec Scheme
Now we have the importance information for each region. Using it, the image can be compressed keeping the important parts at high quality, while the compressed file size does not grow much because the unimportant parts are reduced to low quality. First, we have to add the importance level of each region to the encoded image data, because the decoder needs it to decode the compressed stream. For this we employed a data structure called the "MDU-Map", a two-dimensional array. Each element of it stands for one MDU, defines the quantization table scale for all the blocks of that MDU, and indirectly describes the importance level of the MDU. An MDU (Minimum Data Unit) is the smallest group of interleaved data units. A 0 in the MDU-Map denotes background, i.e. not important. In region or object based compression, the source model is no longer the fixed-size square block; it has become the arbitrarily shaped region. It then becomes necessary to code not only the region content but also the region contour/shape description. In our scheme, regions are represented as a boundary described by Freeman chain codes, chosen for their minimal storage requirements, good curvature description, and simplicity. Our encoding scheme is based on the JPEG coding system, which has been widely used in diverse applications of still image compression. In the JPEG data structure there is a part called the application data segment which can be ignored by standard JPEG decoders. To remain in accordance with the JPEG standard [5], we use this segment to store the region description. In the baseline JPEG syntax, each component of the image is first grouped into 8 × 8 pixel blocks. Each block is then independently transformed by an 8 × 8 Forward Discrete Cosine Transform (FDCT), and each of the 64 DCT coefficients is uniformly quantized in conjunction with a 64-element Quantization Table, which must be specified by the application (or user) as an input to the encoder.
Each element can be any integer value from 1 to 255, which specifies the step size of the quantizer for its corresponding DCT coefficient. The purpose of quantization is to discard information which is not visually significant. The Quantization Table enables us to compress important parts at higher quality and unimportant parts at lower quality. The encoder goes through each block of the original image, checking by reference to the MDU-Map whether the block is important or not and, if important, what the scale for the quantization table is. Blocks of different levels are compressed using different quantization scales. It is the encoder's responsibility to translate the MDU-Map into a boundary description using chain codes, and to put it into the application data segment of the data stream. All important regions of an image are described as non-zero elements in the MDU-Map. The decoder takes the compressed data stream as input and gets the region boundary descriptions from the application data segment; through region filling it reproduces each region, and through pasting forms an MDU-Map that is exactly the same as the original one. By referring to the map, the decoder can decode each block accordingly.
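The MDU-Map-driven quantization can be sketched as below. The base table values and the level-to-scale mapping are illustrative assumptions, not the paper's numbers:

```python
import numpy as np

# Importance-adaptive quantization driven by an MDU-Map: each map
# entry selects a quantization-table scale for all blocks of its MDU
# (0 = background).  Table and scales are stand-in values.
base_qtable = np.full((8, 8), 16, dtype=np.int32)

def quantize_block(dct_block, importance_level):
    # higher importance -> smaller scale -> finer quantization
    scale = {0: 4.0, 1: 2.0, 2: 1.0, 3: 0.5}[importance_level]
    q = np.maximum(np.rint(base_qtable * scale), 1)
    return np.rint(dct_block / q).astype(np.int32)

mdu_map = np.array([[0, 0, 1],
                    [0, 3, 2]])              # 2x3 grid of MDUs

block = np.full((8, 8), 40.0)                # a stand-in DCT block
coarse = quantize_block(block, mdu_map[0, 0])  # background MDU
fine = quantize_block(block, mdu_map[1, 1])    # most important MDU
```

The same map, recovered by the decoder from the application data segment, selects the matching dequantization scale per block.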
(a) JPEG
(b) our method
Figure 2: result images
6. Experiment Result and Conclusion
An experiment result is shown in Fig. 2: (b) is the image compressed by our method, and (a) is the one compressed uniformly by JPEG. They have almost the same data size. Compared with the image uniformly compressed by JPEG, we can easily see that the background part has obviously lower quality, while the other, important parts keep good quality. This method allocates higher quality only to the important regions, reduces file size by further compressing the unimportant parts, and ensures the best visual quality at a given compression ratio. Besides compression, many image processing techniques can be expected to be improved using the region importance concept.
References
[1] A. Khotanzad and A. Bouarfa. Image segmentation by a parallel, non-parametric histogram based clustering algorithm. Pattern Recogn., Vol. 23, No. 9, pp. 961-973, 1990.
[2] J. Davidoff. Cognition through Color. MIT Press, 1991.
[3] H. Nomura, I. Hayashi, and N. Wakami. A self-tuning method of fuzzy reasoning by delta rule and its application to a moving obstacle avoidance. J. of Japan Society for Fuzzy Theory and Systems, Vol. 4, No. 2, pp. 379-388, April 1992.
[4] K. Asakawa and H. Takagi. Neural networks in Japan. Comm. ACM, Vol. 37, No. 3, pp. 106-112, March 1994.
[5] ISO/IEC 10918-1. Information technology - Digital compression and coding of continuous-tone still images: Requirements and guidelines. 1994.
Directional Image Coding on Wavelet Transform Domain
Dong Wook Kang
Kookmin University
Abstract A novel method of directional image coding on the wavelet transform domain is devised for efficient compression of image data: Directionally decomposed (=filtered and decimated) versions of an image are obtained by manipulating the wavelet transform coefficients, and then the coefficients of each version are segmented into vectors of reasonable dimension, and finally the segmented vectors are quantized based on the gain-shape vector quantization. The proposed method yields excellent quality of reconstructed images at very low bit rates, much better than JPEG or other transform domain vector quantization algorithms.
Background Recently, vector quantization techniques have been widely studied by many researchers to efficiently encode images. One of them is transform-domain vector quantization, which utilizes the transformation of the image to compact most of the energy of the signal into a few almost-but-not-perfectly decorrelated coefficients, and then vector quantization (VQ) to exploit the remaining correlations to the utmost. The typical procedure of transform-domain vector quantization is as follows: It first transforms an image or a sub-block of it with a two-dimensional kernel. After that, it segments the corresponding coefficients into a small number of vectors of reasonable dimension. And finally it quantizes the vectors by replacing them with building blocks or codewords pre-designed with a training sequence. For example, the classified vector quantization on the discrete cosine transform domain uses a block of size 8 × 8 with the 2-dimensional DCT kernels, adaptively segments the DCT coefficients based on the classification of that block, and finally quantizes each and every segment with the best-matched codeword in its corresponding VQ codebook. Another example is the wavelet domain vector quantization: It first decomposes an image frame into a set of subband images with the wavelet kernels. It then segments the coefficients based on the affinity of the coefficients in the corresponding spatial domain, and finally quantizes the segments with VQ codebooks. As expected from information theory, transform domain VQ techniques reveal better performance than transform domain scalar quantization techniques in the sense of the optimum achievable performance. The performance of a transform domain VQ scheme mainly depends on the vector configuration and the methodology for constructing codebooks. Directional image coding has been known to be very good at very low bit rate image coding.
It exploits the characteristics of the human vision system that contains direction-sensitive neurons in the visual cortex [1]. Therefore, the reconstructed images of the directional image coding are of better
subjective quality than those of other first-generation image coding algorithms. In this paper, in order to inherit the advantages of directional image coding, which are outstanding when an image is encoded at a very low bit rate, we introduce the directional decomposition of images using the wavelet coefficients, and a gain-shape VQ technique to encode the decomposed vectors based on the threshold coding principle.
Fig. 1. Directional decomposition of an image and the vectorization of decomposed versions.
Encoding Algorithm
Fig. 1 shows the procedure used to construct the directionally decomposed versions from an original image. First, we construct 64 subband images by filtering and decimating the input image three times. At each stage, all the subband images, high-band as well as low-band, are filtered and decimated 2:1 horizontally and vertically. Fig. 1(b) shows the decomposed subband images. After that, the 64 coefficients located at the same position in each of the 64 subband images are gathered to construct an 8x8 block which we call a basicblock. Fig. 1(c) shows the segmented basicblocks. Next is the directional decomposition of a basicblock. It is necessary that the resulting subvectors not only are of reasonable dimension but also convey key information about one of the directional images. A basicblock is decomposed into 17 subvectors, of which 15 are representatives of directional images and the other two are for the low-pass and high-pass images. Fig. 1(d) shows the windows for directional decomposition of a basicblock. Since each subvector conveys one of the directional images, it can be independently encoded by the threshold coding method: each subvector is tested as to whether it deserves transmission, and it is quantized if and only if it is significant enough. In this case, both the directions and the VQ indices of the significant subvectors are transmitted. The test of significance is accomplished by quantizing the gain of a vector with a deadzone: if the quantized gain is not zero, then the vector is automatically considered significant, and the shape of it is quantized. To improve the efficiency of the encoder, the directions of the subvectors are variable-length encoded. In addition, to equalize the quantization distortion from each vector, different sizes of shape codebooks are allotted according to the statistical characteristics of the vectors.
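The gain-shape quantization with a deadzone test can be sketched as follows; the tiny codebook, step size, and deadzone threshold are illustrative assumptions (real shape codebooks are trained offline):

```python
import numpy as np

# Gain-shape VQ with a deadzone on the gain, as in the threshold
# coding step above: insignificant subvectors are simply not sent.
shape_codebook = np.array([[1.0, 0.0, 0.0, 0.0],
                           [0.5, 0.5, 0.5, 0.5],
                           [0.7, -0.7, 0.0, 0.0]])
shape_codebook /= np.linalg.norm(shape_codebook, axis=1, keepdims=True)

def encode_subvector(v, gain_step=1.0, deadzone=2.0):
    gain = np.linalg.norm(v)
    # deadzone quantizer on the gain
    qgain = 0 if gain < deadzone else int(round((gain - deadzone) / gain_step)) + 1
    if qgain == 0:
        return None                               # insignificant
    shape = v / gain
    idx = int(np.argmax(shape_codebook @ shape))  # best correlation
    return qgain, idx

skipped = encode_subvector(np.array([0.1, 0.2, 0.0, 0.1]))
code = encode_subvector(np.array([3.0, 3.0, 3.0, 3.0]))
```

Because unit-norm shapes are compared by inner product, the nearest codeword is the one with maximum correlation.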
Simulation Results
We applied the proposed directional vector quantization technique to encode the test image, Lena, which is outside the training set of the shape codebooks. We compared the PSNR performance of the proposed scheme with those of the DCT domain classified vector quantization (DCT-CVQ) [2], the DCT domain directional vector quantization (DCT-DVQ) [3], and Shapiro's embedded zerotree wavelet algorithm (WT-EZT) [4]. The simulation results show that the proposed scheme yields the best performance among them at almost every bit rate below 0.5 bpp. For example, at about 0.25 bpp it produces 33.4 dB, while both JPEG and the DCT-CVQ produce 31 dB, and the WT-EZT 33.2 dB. Fig. 2 shows the results. The subjective quality of the reconstructed images is also significantly improved by the proposed scheme. Fig. 3 shows magnified versions of the reconstructed images. The reason for the high subjective quality is that the proposed scheme reproduces the edges and contours of the image even at very low bit rate encoding.
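The PSNR figures quoted above follow the usual definition for 8-bit images, which can be computed as:

```python
import numpy as np

# PSNR for 8-bit images: 10 * log10(255^2 / MSE).
def psnr(original, reconstructed):
    mse = np.mean((original.astype(float) - reconstructed.astype(float)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)

a = np.zeros((4, 4), dtype=np.uint8)
b = np.full((4, 4), 5, dtype=np.uint8)   # uniform error of 5 grey levels
value = psnr(a, b)                        # about 34.15 dB
```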
Conclusions
We proposed a new directional image coding technique in the wavelet transform domain. The scheme consists of directional decomposition of an image with the wavelet coefficients, and threshold coding of the decomposed vectors based on gain-shape vector quantization. Simulation results show that the proposed scheme yields excellent encoding performance in the objective as well as the subjective sense. In addition, the new scheme has several advantages. First, it is very practical because it can easily encode an image at various bit rates according to the budget of the encoder. Second, since it uses fixed decomposition windows and a single set of VQ codebooks regardless of the bit rate, the complexity of the encoder is much reduced in comparison with conventional VQ schemes.
References
1. M. Kunt, A. Ikonomopoulos, and M. Kocher, "Second generation image coding techniques," Proc. IEEE, vol. 73, pp. 549-574, Apr. 1985.
2. J. W. Kim and S. U. Lee, "A transform domain classified vector quantizer for image coding," IEEE Trans. Circuits Syst. Video Technology, vol. CASVT-2, pp. 3-14, March 1992.
3. D. W. Kang, J. S. Song, H. B. Park, and C. W. Lee, "Sequential vector quantization of directionally decomposed DCT coefficients," Proc. IEEE ICIP-94, pp. 114-118, Austin, TX, Nov. 1994.
4. J. M. Shapiro, "Embedded image coding using zerotrees of wavelet coefficients," IEEE Trans. Signal Processing, vol. 41, pp. 3445-3462, Dec. 1993.
Fig. 2. The PSNR performances.
Fig. 3. The magnified versions of the reconstructed image: (a) 0.126 bpp, 30.46 dB; (b) 0.369 bpp, 35.04 dB.
Session H: VIDEO CODING I: MPEG
A Universal MPEG Decoder with Scalable Picture Size
Ram Prabhakar
Cirrus Logic, Inc.
3100 W. Warren Ave., Fremont, CA 94538
knicks@corp.cirrus.com
Wei Li
Logitech, Inc.
6505 Kaiser Dr., Fremont, CA 94555
Wei_Li@logitech.com
1. Introduction
In recent years, MPEG has become a video compression standard widely accepted by both hardware and software multimedia compression/decompression professionals. Current hardware MPEG decoders (e.g. GD 5518) decode a compressed bitstream into a single format, such as the 4:2:0, 4:2:2 or 4:4:4 format specified on the encoder side. In practice, however, the display could be either a VGA terminal (mostly 4:2:2 format) or a D-1 recorder (4:4:4 format). In order to display the decoded video at the appropriate resolution, the problem of down-scaling and up-scaling the chrominance components (Cb and Cr) must be addressed. This paper proposes a unified hardware decoding solution to the image scaling, which covers the whole MPEG market from broadcasting down to CD-ROM video games. The decoder is programmable to output video images in 4:2:0, 4:2:2 or 4:4:4 format. The widely practiced methods for image up-scaling and down-scaling operate in the spatial domain. When scaling is implemented in the spatial domain, it requires a large memory and a large latency. A new scaling approach in the DCT domain [2][3] has recently been reported. It performs the scaling by manipulating the DCT coefficients according to the format one would like to have. The following figure shows a simplified MPEG decoder.
Fig. 1: Simplified MPEG decoder (coded data, variable length decoding, inverse scan, inverse quantisation, inverse DCT, motion compensation with frame store memory, decoded data).
The functionality of the highlighted blocks in Fig. 1 is affected when scaling is performed in the DCT domain. In the following paragraphs, the functionality of each of the highlighted blocks is discussed.
2. Inverse DCT
There are two different types of IDCT. The MPEG committee suggests the type-2 IDCT. The one-dimensional type-2 IDCT is given as
x(n) = (2/N) Σ_{m=0}^{N-1} C(m) X(m) cos((2n+1)mπ/2N)    (1)

where C(m) = 1/√2 for m = 0, and C(m) = 1 for m = 1, 2, ..., N-1.
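Equation (1) can be checked directly in code. The sketch below pairs it with the matching forward DCT so that the round trip recovers the input; numpy is used purely for illustration:

```python
import numpy as np

def dct2(x):
    # Forward type-2 DCT matching Eq. (1):
    # X(m) = C(m) * sum_n x(n) cos((2n+1) m pi / 2N)
    N = x.size
    n = np.arange(N)
    C = np.where(n == 0, 1 / np.sqrt(2), 1.0)
    cos = np.cos((2 * n[None, :] + 1) * n[:, None] * np.pi / (2 * N))
    return C * (cos @ x)

def idct2(X):
    # Type-2 IDCT of Eq. (1):
    # x(n) = (2/N) * sum_m C(m) X(m) cos((2n+1) m pi / 2N)
    N = X.size
    n = np.arange(N)
    C = np.where(n == 0, 1 / np.sqrt(2), 1.0)
    cos = np.cos((2 * n[:, None] + 1) * n[None, :] * np.pi / (2 * N))
    return (2.0 / N) * (cos @ (C * X))

x = np.array([1.0, 2.0, 3.0, 4.0, 4.0, 3.0, 2.0, 1.0])
xr = idct2(dct2(x))     # round trip recovers x
```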
The type-2 IDCT for 4:2:0 to 4:2:0 is performed on an 8x8 block, first horizontally and then vertically, which amounts to 64 coefficients. The type-2 IDCT for 4:2:0 to 4:4:4 is instead performed on a 16x16 block (up-scaling), or for 4:4:4 to 4:2:0 on a 4x4 block (down-scaling), which amounts to 256 coefficients. The following figure shows the modified IDCT block diagram.
Fig. 2: Modified IDCT block diagram for up-scaling/down-scaling (the desired picture format selects the up/down-sampling, the anti-aliasing/anti-imaging filter coefficients, and the IDCT size applied to the inverse-quantised DCT coefficients; the output goes to motion compensation).
2.1 Changes for Up-Scaling
The up-scaling/down-scaling of images only affects the chrominance components (Cb, Cr). For example, to up-scale a 4:2:0 format to a 4:2:2 format, the chrominance components are doubled in the vertical direction only. This is accomplished by first upsampling an 8x8 inverse-quantised 4:2:0 block in the vertical direction and then multiplying the upsampled block with the anti-imaging low-pass filter coefficients for interpolation. The IDCT is then applied to the 16x8 interpolated block.
Manipulation of Pixel Values in the DCT Domain
The DCT coefficients are manipulated using the following equation, where the subscript ii stands for the type-2 IDCT:

X_u(m) = (X_ii(m) - X_ii(N-m))/√2,  m = 0, 1, ..., N-1    (2)

Anti-Imaging Low-Pass Filter
The anti-imaging low-pass filter is an even-length symmetric filter with a cut-off frequency of π/2. It is designed using the Remez exchange algorithm. Let the low-pass filter be h(n), n = -L/2, ..., 0, ..., L/2-1. Then the filter right-half is defined by

h_r(n) = h(n) for n = 0, 1, 2, ..., L/2-1;  h_r(n) = 0 for n = L/2, ..., N-1    (3)

where L is the number of filter coefficients of h(n) and N is the block size of the IDCT. In order to apply this filter in the DCT domain, the filter transform coefficients in the DCT domain are computed using

H_r(m) = 2 Σ_{n=0}^{N-1} h_r(n) cos(πm(n+1/2)/N),  m = 0, 1, ..., N-1    (4)

Inverse DCT
The manipulated coefficients are multiplied with the filter coefficients, and the resulting coefficients are IDCT transformed to obtain the interpolated block.
2.2 Changes for Down-Scaling
To down-scale images, a similar procedure is used. For example, to down-scale a 4:4:4 format to a 4:2:0 format, the chrominance components are halved in both the horizontal and vertical directions. The 8x8 inverse-quantised block is first multiplied with the anti-aliasing low-pass filter coefficients and then down-sampled in the horizontal and vertical directions. The IDCT is applied to the 4x4 decimated block.
Anti-Aliasing Low-Pass Filter
The inverse-quantised DCT coefficients are low-pass filtered in the DCT domain using

Y(m) = H_r(m) X(m) for m = 0, 1, ..., N-1;  Y(m) = 0 for m = N    (5)

Manipulation of the Coefficients in the DCT Domain
The filtered coefficients are manipulated as follows:

Y_d(m) = (Y_ii(m) - Y_ii(N-m))/√2,  m = 0, 1, ..., N-1    (6)

Inverse DCT
The inverse DCT of the above equation results in the down-sampled spatial block.
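The spirit of DCT-domain resizing can be illustrated with the simpler zero-padding variant sketched below; this is not the paper's scheme (it omits the anti-imaging filter of Eqs. (3)-(4) and the coefficient manipulation of Eq. (2)), but it shows how taking a longer IDCT over the DCT coefficients yields an interpolated block:

```python
import numpy as np

def dct_matrix(N):
    # Orthonormal type-2 DCT matrix.
    n = np.arange(N)
    M = np.cos((2 * n[None, :] + 1) * n[:, None] * np.pi / (2 * N))
    M *= np.sqrt(2.0 / N)
    M[0] /= np.sqrt(2)
    return M

def upsample2_dct(x):
    # Interpolate x by 2: transform, zero-pad the DCT spectrum,
    # and take the inverse transform at twice the length.
    N = x.size
    X = dct_matrix(N) @ x
    Xp = np.concatenate([X, np.zeros(N)])
    return np.sqrt(2) * (dct_matrix(2 * N).T @ Xp)

x = np.full(8, 3.0)
y = upsample2_dct(x)      # length 16; a constant stays constant
```

The √2 factor compensates for the length change so that amplitudes are preserved; the paper's filtered scheme plays the analogous role for general signals.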
3. Motion Compensation
Motion estimation is performed on the luminance and chrominance components in the MPEG encoder, but the motion vectors are computed only for the luminance components. Image scaling affects only the motion compensation of the chrominance components. When motion compensation is performed in the MPEG decoder, the motion vectors for the chrominance components are derived from those of the luminance components, based on the scaling factor. For example, when an image is up-scaled from 4:2:0 to 4:4:4, the motion vectors for Cb and Cr are the same as those for Y.
4. Frame Store Memory
Because of the scaling of the images, the frame store memory has to be large enough to store the biggest format, i.e. 4:4:4. For a frame size of 352x240 at 4:4:4 resolution (24 bits/pixel), one would need a buffer size of 253 kbytes.
5. Simulation Results
Upon examining the 512x512 interpolated image Lena (the original image is of size 256x256) in the spatial domain and the interpolated image in the DCT domain, the DCT domain interpolation outperforms the spatial domain interpolation (see Fig. 3). For symmetric convolution, the maximum number of filter coefficients can be twice the DCT block size. The longer the filter, the sharper its frequency response, and thus the smoother the resized image. Since using the longest filter does not significantly add to the number of operations, for a DCT block size of 16x16 we can use a 32-tap filter for symmetric convolution, with the filter right half defined as in Eq. (3). Note that by using DCT domain interpolation, the longest possible filter can be used without any extra hardware or latency, because of which the interpolated image in the DCT domain will outperform the 7-tap spatial domain interpolation. On the contrary, increasing the spatial domain interpolation filter to 32 taps would significantly add to the hardware cost.
Although we have discussed interpolation by a factor of two in each direction, decimation and other interpolation factors are possible. The following is an approximate comparison of the number of operations required to interpolate a 176x120 SIF picture using DCT domain interpolation (32-tap filter) and spatial domain interpolation (7-tap filter).
1. Interpolation in the spatial domain, 4:2:0 to 4:2:2: Interpolation is performed on a 4:2:0 SIF picture, whose chrominance size is 176x120, by filtering the pixels with a 7-tap filter whose coefficients are [-12 0 140 256 140 0 -12]/256 (the ISO recommended SIF to CCIR 601 interpolation filter) [6]. Interpolating one pixel takes 17 operations, assuming 3 multiplies and 2 adds, with each multiply counted as 3 shifts and 2 adds. Interpolating 176x120 = 21120 pixels therefore takes 359040 basic operations per chrominance component. For both chrominance components, spatial domain interpolation from 4:2:0 to 4:2:2 takes approximately 720000 operations.
2. Interpolation in the DCT domain, 4:2:0 to 4:2:2: After manipulating the inverse quantized DCT coefficients in the DCT domain, we perform a type-2 IDCT on a 16x8 block. The basic operations per 16x8 block are 160 multiplies and 864 adds, which translates to 1724 basic operations per block, assuming one multiply as 4 shifts and 3 adds. For a chrominance size of 176x120, the number of blocks after manipulating the DCT coefficients is 330. Interpolating 330 blocks in the DCT domain takes about 570000 basic operations per chrominance component, or approximately 1140000 operations for both chrominance components.
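The operation counts above can be reproduced directly from the per-pixel and per-block costs quoted in the text (a sketch; the totals in the text are rounded):

```python
# 1. Spatial domain: 7-tap filter, 3 multiplies + 2 adds per pixel,
#    each multiply counted as 3 shifts + 2 adds.
ops_per_pixel = 3 * (3 + 2) + 2          # 17 basic operations
spatial = ops_per_pixel * 176 * 120      # per chrominance component
print(spatial, 2 * spatial)              # 359040 and 718080 (~720000)

# 2. DCT domain: 1724 basic operations per 16x8 block, 330 blocks.
dct = 1724 * 330                         # per chrominance component
print(dct, 2 * dct)                      # 568920 per component, 1137840 for both
```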
Figure 3. (a) Interpolation by spatial filtering; (b) interpolation in the DCT domain. Note that (b) is sharper and more visually pleasing than (a).
Conclusions
In the rapidly evolving multimedia technology, there is a need for higher resolution picture quality. We have shown that interpolation in the DCT domain is programmable, simply by changing the IDCT coefficients and the block size. Even though the number of operations in the DCT domain increases significantly over the spatial domain, the extra hardware needed to perform equivalent interpolation in the spatial domain far exceeds that required by DCT domain interpolation. The image resized in the DCT domain has better resolution than the image resized in the spatial domain. Unlike spatial domain interpolation, DCT domain image resizing can upscale and downscale between different MPEG picture formats without changing the underlying hardware. This unique architecture can process MPEG bitstreams encoded in any format and decode them into any desired format.
References
1. Coding of Moving Pictures and Associated Audio, ISO/IEC JTC1/SC29/WG11 N0702, March 1994.
2. Stephen A. Martucci, "Image Resizing in the Discrete Cosine Transform Domain," in Proc. 1995 International Conference on Image Processing, pp. 244-247, Washington, Oct. 1995.
3. Balas K. Natarajan and Vasudev Bhaskaran, "A Fast Approximate Algorithm for Scaling Down Digital Images in the DCT Domain," in Proc. 1995 International Conference on Image Processing, pp. xxx-xxx, Washington, Oct. 1995.
4. Vasudev Bhaskaran and Konstantinos Konstantinides, Image and Video Compression Standards: Algorithms and Architectures, Kluwer Academic Publishers, 1995, Chapter 10, "Architectures for the DCT".
5. K. R. Rao and P. Yip, Discrete Cosine Transform: Algorithms, Advantages, Applications, Academic Press, 1990.
6. ISO-IEC/JTC1/SC29/WG11, Coded Representation of Picture and Audio Information: Test Model 5, p. 19, April 1993.
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
The Influence of Impairments from Digital Compression of Video Signal on Perceived Picture Quality
Sonja Bauer, Branka Zovko-Cihlar, Mislav Grgić
University of Zagreb, Faculty of Electrical Engineering and Computing, Unska 3, 10000 Zagreb, CROATIA, E-mail: [email protected]
Abstract - Picture impairments in digital video systems are different from those that occur in analogue systems and depend on the methods of coding and redundancy reduction employed. With the increasing application of MPEG coding, the assessment of MPEG coding impairments becomes very important. Subjective methods for conventional television picture quality and impairment assessment given in ITU-R Rec. BT.500-4 were modified and applied to an MPEG codec. The aim of this paper is to present results of picture quality assessment in relation to impairments from MPEG-1 coding for workstation and personal computer applications, where pictures are displayed within a window on a monitor. Statistical analysis of the test results was performed; based on this analysis, observations and conclusions are given.
1. Introduction
With the increasing application of digitally compressed video signals for transmission, storage and processing, the assessment of coding impairments is of growing importance [1]. Generally, the performance of a video compression system can be evaluated objectively or subjectively [2]. The objective methods are based on computable distortion measures such as mean squared error or signal-to-noise ratio (SNR). Objective measurements [8] have limited effectiveness in predicting the quality of compressed images as seen by observers. On the other hand, subjective assessment is geared directly toward the properties of the human visual system. Subjective assessments are controlled psycho-physical experiments designed to find out how observers would judge picture quality. This is why subjective assessment is the most effective method for determining the influence of a video compression method on picture quality [9]. Video compression algorithms [4] are designed to reduce the bit rate of the original source as much as possible while still maintaining the subjective picture quality required by the application. Bit rate reduction can be obtained by removing spatial (intraframe) and temporal (interframe) redundancies of the pictures. The ISO/IEC Moving Picture Experts Group (MPEG) standardized a video compression algorithm for digital storage media [3] (so-called MPEG-1) which uses both interframe and intraframe coding to reduce temporal and spatial redundancy and achieve high quality full motion video at low bit rates. MPEG-1 was originally designed for storage applications, but it covers a variety of other applications, especially in multimedia services. MPEG-1 video compression is now increasingly accepted as a tool that brings digital video and computers together, because MPEG-1 compression cards (hardware) and software are available for integrating digital video into workstations and personal computers.
With the increasing application of MPEG coding, the assessment of MPEG coding impairments becomes very important. Subjective methods for conventional television picture quality and impairment assessment given in ITU-R Rec. BT.500 [5] can be applied to an MPEG codec, but the test sequences should be chosen very carefully, with different picture contents and scene types.
2. Problem of Choosing Test Material
The fundamental difficulty in designing subjective evaluations is knowing which pictures or sequences to use. The scene content being viewed influences the perception of quality irrespective of the technical parameters of the system. Normally, a series of pictures is selected which are average in terms of how difficult they are for the system being evaluated. The guideline is that the pictures or sequences chosen should be "critical but not unduly so" [5]. This general philosophy worked well for many years with analogue TV systems and with low compression digital television systems. It reaches the limit of its usefulness with high compression digital systems, because the quality available from high compression systems depends very much on the content of the picture. To obtain a balance of critical and moderately critical material, four kinds of test material were identified:
S1. One person facing the camera directly and talking (a video with little motion), Fig. 1(a),
S2. The same as S1, but a title appears at the bottom of the picture, Fig. 1(b),
S3. Two persons talking (a video in which the motion of the speakers is relatively large), Fig. 1(c),
S4. A group of people (a video in which the motion of the people is large, with many details), Fig. 1(d).
These sequences may be considered a good approximation of the typical test materials identified in [6], where a random sampling procedure was developed to collect representative material from TV programmes.
3. Video Capture and Compression System
The Sun Video system [10] was used to provide the test sequences with multiple compression ratios (bit rates) and various levels of video quality. Sun Video is a real-time video capture and compression system that consists of a Sun Video card for the Sun SPARCstation, which provides on-board hardware video compression and supports several video compression techniques including MPEG-1. The Sun Video card and its supporting software capture, digitize and compress unmodulated NTSC, PAL or component Y/C video (S-Video) signals from video sources such as video cameras, VCRs and videodisks. The Sun Video system is designed to work closely with the XIL 1.1 Imaging Library [11]. By itself, XIL provides functions for image processing and software-based image compression and decompression. When used in conjunction with the Sun Video system, XIL provides functions to access and control the video capture and compression facilities of the Sun Video card. In our program, the COMPRESSOR_BITS_PER_SECOND attribute was used to tell the encoder how many bits it can use to encode one second's worth of pictures. This attribute controls the output data rate of the MPEG bit stream. Similarly, the COMPRESSOR_PATTERN attribute was used to specify the pattern of picture types employed by the compression; in our example, intraframe and predictive coded pictures were used (IP). For the compressor bit rate of 1152000 bit/s, the decompression attribute DECOMPRESSOR_QUALITY was varied. The value of this attribute provides the trade-off between the quality of the reconstructed pictures and the speed of decoding: the MPEG decompressor increases speed by decreasing the number of quantized coefficients that it uses in reconstruction. The valid values for this attribute are integers in the range 1 to 100.
A value of 100 is a request that the decoder produce the highest quality pictures possible, and a value of 1 is a request that the decompressor decode pictures as fast as possible. DECOMPRESSOR_QUALITY was set to three values: 1, 50 and 100 (Q1, Q50, Q100). The Sun Video card digitizes the video signal as specified in ITU-R Rec. BT.601 [7]. This gives a full picture resolution of 768x576 pixels (PAL). The frame size is reduced to the standard interchange format (SIF) of 384x288 pixels, which is the input picture format for MPEG-1 compression [3]. Each test sequence was 300 frames long.
Figure 1. Test sequences (a) S1, (b) S2, (c) S3, (d) S4.
4. Test Method
The testing methodology was the double-stimulus impairment scale method with a five-grade impairment scale (the "EBU method") described in ITU-R Rec. BT.500-4 [5]. In [9] it was shown to be a very suitable method when impairments are small. In our trials, MPEG-1 codec performance was examined in terms of basic decoded picture quality and the impairment associated with the coding process. The test sequences were made using the Sun Video system. The level of impairment in a test sequence depends on the allowed bit rate; the impairment associated with the decoding process depends on the number of quantized coefficients used in picture reconstruction. All the test sequences were recorded directly on videotape using an S-VHS video camera and played using an S-VHS videotape recorder. The component Y/C output signal from the videotape recorder was delivered to the Sun Video card. The compressed video sequences were stored on hard disk, decompressed using software-based decompression and displayed within a window on a workstation. The size of the window was 15x20 cm. The viewing distance was chosen to be 60 cm (4H, H = window height). A total of 32 observers participated in the tests. They were non-experts with normal visual acuity. A test session contained alternating 12-second (300 frames x 40 ms) presentations of the reference and test sequences divided by a mid-gray interval (3 s). The original source sequence without compression was used as the reference. Assessors were asked to grade the test sequence during the mid-gray interval (10 s) that comes after the test sequence. A test session comprised 44 presentations. Three types of test session with different orders of presentation were arranged to balance out effects of tiredness or adaptation from session to session. 11 presentations were shown twice within the same session to check coherence, which means that 33 different presentations were used in every test session.
5. Test Results and Conclusions
Statistical analysis of the test results was performed. The mean opinion score (MOS) and variance were computed for every combination of test condition and test sequence. The analysis of the data confirmed that the observers performed within accepted limits of consistency. The coherence of the results was checked by examining the grades given by the same observer to the same picture in the same test session. The grand mean score (the average value of all grades) is 3.193. This indicates that the test material was chosen carefully, so that all grades were used by the majority of observers.

Figure 2. Mean Opinion Score for Test Sequence S1 (MOS grades 1-5 vs. bit rate, 10 kbit/s to 10 Mbit/s).

Figure 3. Mean Opinion Score for Test Sequence S3 (MOS grades 1-5 vs. bit rate, 10 kbit/s to 10 Mbit/s).
The MOS of test sequences with a low level of activity and test sequences with a high level of activity, coded at different bit rates, shows that the MOS increases with increasing bit rate and tends to saturate at higher bit rates, Figs. 2 and 3. Grades measured on the y-axis are expressed as values between 1 and 5, which correspond to the MOS for each test sequence. The statistical results show that sequence S1, with a low level of activity, receives higher grades than sequence S3, with a high level of activity, at the same bit rates. This is more obvious in the low bit rate region. The content of sequence S3 is more difficult for the coder to handle because it contains less redundant information than S1. When the MPEG-1 coder has extracted the redundant information and this is still not enough to reduce the bit rate to the required level, it makes a series of approximations to the picture until the bit rate comes down to the value needed. The result is typically noise and blocking in the picture when a given "bit rate threshold" (determined by redundancy) is exceeded. The bit rate threshold is the bit rate achieved when all redundant data have been extracted. The sequence with more redundant data (S1) has a lower bit rate threshold than the sequence with less redundant data (S3). Beyond this threshold, quality gets worse because more approximations need to be made. This means that the codec works better for pictures without details and subject movement.
The grades probability distributions for the different bit rates and different test sequences show that the distributions for the sequences which contain less redundant information (S3 and S4) are shifted towards the low grades region for bit rates beyond the bit rate threshold. Fig. 4 shows the grades probability distributions at bit rates of 64 kb/s and 2 Mb/s for the test sequences S1 and S3. The distribution for sequence S3 at 64 kb/s is shifted to the low grades region. This is a consequence of the many approximations the coder has to make in sequence S3 to meet the required low bit rate. The grades distributions at 2 Mb/s have a similar shape for both sequences because their bit rate thresholds are not exceeded and the coder need not make approximations. The same conclusions can be drawn from Table 1, which shows the grades of test sequences decompressed with different decompression qualities (Q1, Q50, Q100). For the lowest decompression quality, sequence S4 (many details and subject movement) has the lowest MOS: this sequence can be reconstructed only with many approximations, which result in very low picture quality. For the highest decompression quality, all sequences have a MOS larger than 4. For all quality levels, sequence S2, with the most redundant data, has the largest MOS. This confirms that the codec works better for sequences with more redundant data. The evaluation of quality for digital systems is a complex affair, and it is a mistake to believe that a few simple quality grades characterize a system. However, it may be helpful to evaluate how valuable picture quality is to the users of multimedia window applications.
Figure 4. Grades Probability Distribution at Bit Rates of (a) 64 kb/s and (b) 2 Mb/s for the S1 and S3 sequences.

Table 1. Mean Opinion Scores for Various Decompression Quality Factors

Test sequence    Q1       Q50      Q100
S2               2.553    3.553    4.547
S3               2.056    3.324    4.361
S4               2.013    3.272    4.278
REFERENCES
[1] M. Drury, "Picture Quality Issues in Digital Video Compression," IBC'95, Conference Publication No. 413, Amsterdam, 1995, pp. 13-18.
[2] M. Ardito, M. Visca, "Correlation between Objective and Subjective Measurements for Video Compressed Systems," IBC'95, Conference Publication No. 413, Amsterdam, 1995, pp. 7-12.
[3] ISO/IEC IS 11172-2, Information Technology - Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to about 1.5 Mbit/s: Video, Aug. 1993.
[4] N. Jayant, P. Noll, Digital Coding of Waveforms: Principles and Applications to Speech and Video, Prentice Hall, 1984.
[5] ITU-R Rec. BT.500-4, Methods for the Subjective Assessment of the Quality of TV Pictures, ITU, Geneva, 1993.
[6] Y. Zou, K. Ellsworth, J. A. Kutzner, P. J. Hearty, "Subjective Testing of Broadcast-Quality Compressed Video," SMPTE Journal, pp. 789-800, Dec. 1994.
[7] ITU-R Rec. BT.601, Encoding Parameters of Digital Television for Studios, ITU, Geneva, 1993.
[8] ITU-R Rec. BT.813, Methods for Objective Picture Quality Assessment in Relation to Impairments from Digital Coding of Television Signals, ITU, Geneva, 1993.
[9] N. Narita, "Subjective-Evaluation Methods for Quality of Coded Images," IEEE Trans. on Broadcasting, vol. 40, no. 1, pp. 7-13, March 1994.
[10] "Sun Video 1.0 User's Guide," Sun Microsystems Publication, Oct. 1993.
[11] "Solaris XIL 1.1 Imaging Library Programmer's Guide," Sun Microsystems Publication, Nov. 1993.
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
On scalable coding of image sequences
Erwan LAUNAY, Thomson Multimedia R&D France
Abstract: It is now widely accepted that the only interesting way to perform scalable coding of TV or HDTV image sequences relies on the concepts and techniques introduced in the MPEG2 scalable extensions [1]. However, the performance achievable by such systems, and the conditions for an efficient use of scalability, still remain unclear. The focus of this work is to try to shed some light on the mechanisms and domains of validity of scalable coding.
Introduction
Scalable coding of image sequences transmits data in two distinct bitstreams, corresponding to two transmission layers: the base layer and the enhancement layer. These two layers can be decoded together, yielding maximum quality, or the base layer can be decoded separately, yielding a lower quality or lower resolution image sequence. The idea behind this partitioning is twofold:
- When transmitting data through a noisy channel, it enables us to distribute the information between two separate bitstreams according to its importance in the final decoding process. We then obtain graceful degradation of quality with channel noise power by providing higher protection to the base layer through channel coding.
- It eases interworking of video services and compatibility with existing standards. For example, depending on the connection used, the type of receiver or the fees paid, each consumer will be able to decode both layers or just the base layer, and will thus have access either to the baseline service (Digital Television) or to a high quality service (HDTV).
However, as underlined in [2], scalability has a cost. A scalable decoder, decoding both layers, will be more complex than a simple decoder intended for non-scalable transmission and yielding the same quality. Furthermore, the transmission bit rate necessary for achieving a given full resolution image quality will be higher when splitting the bitstream into two scalable streams than when transmitting one single non-scalable bitstream. As we will see in the next section, the question of which scalabilities are valuable to implement, and how to optimize their implementation, has been extensively addressed in the past and led to the "scalable extensions" in the recent MPEG2 ISO norm [11]. However, with this coding norm a new problem arises, on which this work will focus. Since the MPEG2 norm gives manufacturers the possibility to implement some form of scalability in their decoders, it is important to know for which services and in which cases these "scalable extensions" are helpful. Until now the literature has given a very incomplete answer to this question: most articles only gave isolated performance measures, based on PSNR vs. rate evaluations on one sequence, and most of these studies were only concerned with contribution coding. Recently, one article tried to analyze the performance of scalability on several sequences with some theoretical support [2]. However, it failed to take the influence of motion compensation into account, thus leading to a somewhat biased evaluation of scalability. This contribution is intended to complete and shed a new light on the results of [2] by providing a new analysis of scalability focused on the concept of "refinement".
Baseline coding scheme
There are several kinds of scalable coding [1], but here we will mainly focus on what is called "spatial scalability". In this technique the base layer represents the original sequence at a lower (quarter-) resolution, and the enhancement layer contains the additional information necessary to reconstruct the original full-resolution sequence. This scalability is the most interesting to study for two reasons. First, as underlined in [2], among all those retained by MPEG2, it is the only one to really require some additional complexity in the decoder. Secondly, a careful study shows that results obtained on the complex problem of spatial scalability can easily be extended to the other main scalable extensions defined in MPEG2: SNR scalability and data partitioning. The question of spatial scalable coding was first addressed in [8], [9], and after attempts to develop customized coding schemes for scalable coding, such as in-band motion compensation [6], [10], [14] and 3D subband coding [3], [4], [5], it was concluded that only a simple extension of hybrid coding schemes could yield valuable implementations of scalable coding. The problem was then to optimize the implementation of these extensions. A lot of work was done on this subject, and it led to several interesting results:
- Among the several possible ways to share information between the two layers, Bosveld [13] showed, basing his demonstration on rate-distortion theory, that the best sharing is obtained when the
information associated with the low resolution is transmitted in two steps: first with a rough quantization in the base layer, and secondly the base-layer quantization error is quantized with a finer step and transmitted in the enhancement layer¹.
- If one wants to implement a "drift-free" base layer decoder, one has to use two embedded motion compensation loops in the scalable coder, each of them corresponding to one resolution.
- The transformation used [12], [9] has to enable easy high-quality reconstruction of the lower resolutions using only part of the transform coefficients. A very interesting transformation was provided by [12], yielding subband PQMF filters with complexity similar to the DCT but splitting the frequency space in a more adequate way than the DCT for hierarchical coding.
- The quantizers [3] used in the two layers are linked by certain constraints: in the case of uniform quantization, the quantization step of the base layer has to be a power of two times that of the enhancement layer.
Based on these considerations, and in order to have as precise an analysis of scalable coding as possible, we used as a baseline for our experiments a scheme that significantly deviates from the MPEG2 specifications. In this scheme, depicted in Figure 1, we use a PQMF subband transformation instead of an 8x8 DCT, and the quantization step of the scalar quantization (SQ) is constrained to be a power of two. The rest of the scheme, however, conforms to the MPEG2 specifications: it uses hierarchical block matching, an IBP GOP structure and, most importantly, takes practical implementation constraints into account (such as the limited precision of number representation, real VLC encoding and costs based on the construction of a structured bitstream). For implementation purposes we also had to restrict ourselves to the study of TV and ¼ TV spatial scalability.
Though it would have been more accurate to work directly on TV and HDTV, our conclusions on TV and ¼ TV can easily be generalized to that case.
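The two-step quantization and the power-of-two constraint described above can be sketched as follows (an illustrative toy example in my own notation, not the authors' coder; `base_step` and `ratio` are arbitrary choices):

```python
def quantize(x, step):
    """Uniform scalar quantizer: round to the nearest multiple of step."""
    return step * round(x / step)

def two_layer(x, base_step=16, ratio=4):
    """Base layer codes x coarsely; the enhancement layer re-quantizes the
    base-layer error with a step a power of two finer (ratio = 2**k)."""
    base = quantize(x, base_step)                # coarse base-layer decode
    enh = quantize(x - base, base_step / ratio)  # refinement of the error
    return base, base + enh                      # base-only and full decode

print(two_layer(37.5))                           # (32, 36.0)
```

The enhancement layer thus only carries the residual of the base-layer quantization, which is what makes the base layer independently decodable.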
Scalable coding scheme
The scheme in Figure 1 could be called a "simulcast" coding scheme, since it codes each resolution separately. We used this scheme as a reference for our study, since simulcast coding is the only way to transmit several resolutions with non-scalable schemes. Another reference is the "standalone" scheme, which uses the total bit rate for coding only the full resolution. Starting from the simulcast scheme, spatial scalability can be introduced as an extra refinement of the high resolution temporal prediction using the decoded low resolution sequence. We first wanted to study the influence of this "refinement" on scalable performance, so we designed several spatial scalable schemes using several refinement modes:
1. The first is a frequency scalability scheme: instead of choosing between spatial and temporal prediction in the spatial domain, this choice is made in the transform domain. This technique is not only simpler (no interpolation required), it is also more accurate, since PQMF filtering leads to perfect correspondence between low and high resolution transform coefficients [12].
2. The second is the MPEG-like scheme of [1], in which the decoded base layer is interpolated in the spatial domain.
3. A third possibility investigated is to directly predict the high resolution residual from the low resolution coded residual [15]. This approach has a simple implementation but suffers from the mismatch of the low resolution and high resolution temporal predictions. It tends to decorrelate the residuals and makes the refinement less efficient.
These three schemes were tested on several sequences². As expected, the more we try to have a spatial prediction coherent with the high resolution predicted signal, the more efficient the scalable scheme is. So, as can be concluded from the results in Table 1, the first scheme performed better than the second and much better than the third. However, as shown by the performance on MOBCAL given in Table 1, the performance of spatial scalable coding is very dependent on the content of each sequence.
The repartition of the scalable coding gain, shown in Table 2, illustrates the influence of the motion content of each sequence on scalability. Sequences with high motion like HORSE lead to poor motion estimation and thus a high scalable coding gain, while sequences with less motion like MOBCAL yield good motion estimation, so that the only scalable coding gain is provided by the prediction of intra-coded pictures. We also notice that no gain is achieved on the chrominance planes, which are too poorly coded in the base layer.
Table 1: Entropy gained by spatial scalable coding vs. base layer bit rate (base layer = 2 Mb/s, total = 4 Mb/s)

Scheme                    MOBILE   FLOWER   HORSE
Frequency scalability     14%      25%      15%
MPEG-like interpolation   12.5%    14%      12%
Residual prediction       12%      8%       10%
Table 2: Repartition of the entropy gained by spatial scalable coding between each type of image (base layer = 2 Mb/s, full resolution = 4 Mb/s; each triple totals 100%)

                          MOBILE CALENDAR         HORSE
Scheme                    I (%)   P (%)   B (%)   I (%)   P (%)   B (%)
MPEG-like interpolation   108     5       -12     31      38      30.5
Frequency scalability     115     5       -20     57.5    33      10
Residual prediction       125     -6      -19     94      13      -7
We then investigated the influence of the coding rate on scalable performance. As shown in Figure 2, the efficiency of scalability tends to saturate when the lower layer bit rate is too low (in this case the only gain comes from intra pictures), and it is also a non-increasing function of the total bit rate. The reason is that in these cases the spatial prediction obtained from the lower layer is unable to compete with the high resolution, high quality temporal prediction. We also notice that the maximum gain for FLOWER seems to saturate around 35%, even when coding the lower layer more precisely than the full resolution². This loss is linked to the partition of frequencies between the layers in scalable coding, and it proves the importance of the high frequencies even for estimating the low frequencies of the next image.
Conclusion
In this paper we studied spatial scalability and showed that its performance is mainly determined by the quality of the motion compensation in the main layer compared to the coding quality of the base layer. We also showed that, in opposition to what is conjectured in [2], the performance of spatial scalability is not a non-increasing function of the base layer bit rate. However, the analysis in [2] completes our approach by taking the rate-distortion function of hybrid coding into account, and the combination of our results with those of [2] gives a very pertinent analysis of the general problem of scalable coding.
References
[1] J. Delameilleure, S. Pallavicini, "Scalability in MPEG2," Proc. HAMLET RACE 2110 Workshop, pp. 69-75, February 27-28, Rennes, France, 1996.
[2] J. Delameilleure, S. Pallavicini, Signal Processing: Image Communication, vol. 4, pp. 245-262, 1992.
[3] J. F. Vial, "Multiresolution coding schemes with layered bitrate regulation," Proc. VIth International Workshop on HDTV, Ottawa, Canada, 1993.
Figure 1: Baseline scheme.
Figure 2: Efficiency of scalability as a function of rate (FLOWER).
1. The work of Bosveld confirms the statement of [2] that SNR scalability is in general more interesting than data partitioning.
2. We evaluate the spatial scalability gain as the entropy spared by adding spatial prediction, based on the decoded base layer, to the simulcast coder. Thus we evaluate this gain as a rate gain, the quality being kept constant.
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
IMAGE TRANSMISSION PROBLEMS BETWEEN IP AND ATM NETWORKS
Prof. Mkrttel/V~., Ass. Prof. Eramlan JLV., Shdmt i w ~ a n LL, Republic of Armenia
Abstract: There is a need everywhere for fast data communication in public and private networks. Asynchronous Transfer Mode (ATM) has been chosen by CCITT as the target switching and multiplexing technique for the B-ISDN. Traditional Local Area Networks (LANs) like Ethernet, Token Ring and Token Bus are limited in speed (10 Mbit/s) and thus are limited to particular types of (mainly data) applications. For multimedia applications the bandwidth requirement is high, the information is a combination of voice, video and data, and it requires a transfer mode capable of transporting and switching these different types of information. ATM satisfies this requirement for LANs. ATM allows transmission capacities up to 622 Mbit/s, which is enough for most LAN applications. ATM carries all types of information - voice, data and video - in a common cell (packet) with a standard format of 53 bytes (48 bytes of user information and 5 bytes of control information). ATM also offers increased bandwidth and greater flexibility and manageability. On the other hand, nowadays most of us are surrounded by powerful computer systems with graphics-oriented input and output, including the entire spectrum of PCs, professional workstations and supercomputers. The existing local area networks are primarily based on shared media interconnections, which are likely to become potential bottlenecks, not only because of new multimedia applications but also because of the rapid growth of services employing simple data transfers. ATM, a switching and multiplexing standard for broadband integrated networks, is viewed as an emerging technology capable of removing this bottleneck. But its success as a LAN technology depends on its ability to provide LAN-like services compatible with existing protocols and applications.
This approach takes the idea of LAN emulation, and allows ATM switches to be transparently interconnected with shared media legacy LANs running the IEEE 802 family of LAN protocols. This article aims to resolve a number of problems arising when transmitting images between IP and ATM networks. The problems are resolved at the network layer.
INTRODUCTION
Most existing LANs are based on shared media interconnections and employ the IEEE 802 family of LAN protocols, which includes Ethernet and Token Ring. On the other hand we have ATM, which is connection-oriented. Given the current interest in ATM technology, it is likely that ATM switches and interfaces will become faster and cheaper than the shared medium technologies, so the problem of configuring IP over ATM is growing. The goal of this article is to allow compatible and interoperable implementations for transmitting IP datagrams and ATM Address Resolution Protocol (ATMARP) requests and replies over ATM Adaptation Layer 5 (AAL5) [1]. Any reference to virtual connections, permanent virtual connections or switched virtual connections applies only to virtual channel connections used to support IP and address resolution over ATM, and thus is assumed to be using AAL5. This article describes the initial deployment of ATM within "classical" IP networks as a direct replacement for local area networks (Ethernets) and for IP links which interconnect routers, either within or between administrative domains. The "classical" model here refers to the treatment of the ATM host adapter as a networking interface to the IP protocol stack operating in a LAN-based paradigm. Characteristics of the classical model are:
- The same maximum transmission unit (MTU) size is used for all VCs in a LIS [2].
- Default LLC/SNAP encapsulation of IP packets.
- IP addresses are resolved to ATM addresses by use of an ATMARP service within the LIS.
- ATMARPs stay within the LIS.
From a client's perspective, the ATMARP architecture follows the model presented in [1].
MAIN BODY
The deployment of ATM into the Internet community is just beginning and will take many years to complete. During the early part of this period, we expect deployment to follow traditional IP subnet boundaries. Initial deployment of ATM provides a LAN segment replacement for Local Area Networks (e.g., Ethernets, Token Rings and FDDI). In such cases local IP routers with one or more ATM interfaces will be able to connect islands of ATM networks. Characteristics and features of ATM networks are different from those found in LANs:
1) ATM provides a Virtual Connection (VC) switched environment. VC set-up may be done on either a Permanent Virtual Connection (PVC) or dynamic Switched Virtual Connection (SVC) basis.
2) Data to be passed over a VC is segmented into 53-octet quantities called cells (5 octets of ATM header and 48 octets of data). With respect to IP and other network layer protocols, ATM can be configured either as a direct network interface or as a MAC (medium access control) protocol below the LLC (logical link control) [3].
The latter approach is the key idea behind LAN emulation, and allows ATM switches to be transparently interconnected with shared media legacy LANs running the IEEE 802 family of LAN protocols. To clarify the above, I must explain that LAN emulation simply means that the point-to-point ATM switch should give the appearance of a virtual shared medium. Also, although ATM is connection-oriented, the broadcast feature can be emulated in an ATM network using dedicated servers. Each ATM host is assigned an ATM address that can be based either on a hierarchical 8-byte-long ISDN telephone number or on a 20-byte address proposed by the ATM Forum [2]. That is why implementing IP directly over ATM requires translating an IP address to an ATM address. Thus we need to maintain a server referred to as the IP-ATM-ARP Server. This server needs to maintain tables that can translate an IP address to an ATM address.
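The 53-octet cell format described above (a 5-octet header plus 48 octets of payload) can be illustrated with a minimal sketch. This is a simplified toy, not the real UNI format: the actual header also carries GFC, PTI, CLP and HEC fields, and real AAL5 adds a trailer and CRC before segmentation.

```python
def segment_into_cells(payload, vpi, vci):
    """Simplified sketch of ATM segmentation: split data into 48-octet
    payloads and prepend a 5-octet header. The header layout here is
    illustrative only (VPI/VCI packed into toy positions)."""
    CELL_PAYLOAD = 48
    cells = []
    for i in range(0, len(payload), CELL_PAYLOAD):
        # Pad the final chunk to a full 48-octet cell payload.
        chunk = payload[i:i + CELL_PAYLOAD].ljust(CELL_PAYLOAD, b'\x00')
        header = bytes([vpi & 0xFF,
                        (vci >> 8) & 0xFF, vci & 0xFF,
                        0x00,   # placeholder for PTI/CLP bits
                        0x00])  # placeholder for HEC octet
        cells.append(header + chunk)
    return cells

cells = segment_into_cells(b'x' * 100, vpi=1, vci=42)
assert all(len(c) == 53 for c in cells)  # standard 53-octet cells
assert len(cells) == 3                   # 100 octets -> 3 padded cells
```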
The interaction between the hosts and the IP-ATM-ARP Server can be implemented using a simple query/response protocol. The host interface will be the same as that in the case of LAN Emulation: to send a packet to a destination IP host, the host obtains the corresponding ATM address from the address cache and passes the IP packet and the ATM address to the processing entity that performs the connection management function.
In the LIS scenario, each separate administrative entity configures its hosts and routers within a closed IP subnetwork. Each LIS operates and communicates independently of other LISs on the same ATM network. Hosts connected to ATM communicate directly with other hosts within the same LIS. Communication to hosts outside of the local LIS is provided via an IP router. This router is an ATM endpoint attached to the ATM network that is configured as a member of one or more LISs. This configuration may result in a number of disjoint LISs operating over the same ATM network. Hosts of differing IP subnets MUST communicate via an intermediate IP router even though it may be possible to open a direct VC between the two IP members over the ATM network. The requirements for IP members (hosts, routers) operating in an ATM LIS configuration are:
- all members have the same IP network/subnet number and address mask;
- all members within a LIS are directly connected to the ATM network;
- all members outside of the LIS are accessed via a router;
- all members of a LIS must have a mechanism for resolving IP addresses to ATM addresses via ATMARP and vice versa via InATMARP (based on [4]);
- all members within a LIS MUST be able to communicate via ATM with all other members in the same LIS; i.e., the connection topology underlying the intercommunication among the members is fully meshed.
The following list identifies a set of ATM-specific parameters that must be implemented in each IP station connected to the ATM network:
- ATM Hardware Address.
The ATM address of the individual IP station;
- ATMARP Request Address, that is, the ATM address of an individual ATMARP server located within the LIS.
The default MTU size for IP members operating over the ATM network shall be 9180 octets. IP members must register their ATM endpoint address with their ATMARP server using the ATM address structure appropriate for their ATM network connection.
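The query/response interaction with the ATMARP server can be sketched as follows. This is a hypothetical illustration: the class and method names are invented, and `query_server` stands in for the actual ATMARP request/reply exchange over AAL5.

```python
class ATMARPClient:
    """Toy sketch of a host-side ATMARP address cache; not a real stack."""

    def __init__(self, query_server):
        # query_server: callable mapping an IP address to an ATM address,
        # standing in for the ATMARP exchange with the LIS server.
        self.query_server = query_server
        self.cache = {}  # IP address -> ATM address

    def resolve(self, ip_addr):
        """Return the ATM address for ip_addr, querying the ATMARP
        server only on a cache miss."""
        if ip_addr not in self.cache:
            self.cache[ip_addr] = self.query_server(ip_addr)
        return self.cache[ip_addr]

# Usage: a fake server table stands in for the real ATMARP service
# (both addresses below are hypothetical).
table = {"192.0.2.1": "47.0005.80ff.e100"}
queries = []
def fake_server(ip):
    queries.append(ip)
    return table[ip]

client = ATMARPClient(fake_server)
assert client.resolve("192.0.2.1") == "47.0005.80ff.e100"
assert client.resolve("192.0.2.1") == "47.0005.80ff.e100"
assert queries == ["192.0.2.1"]  # second lookup served from the cache
```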
In an SVC environment, ATMARP requests are sent to this address for the resolution of target protocol addresses to target ATM addresses. That server must have authoritative responsibility for resolving ATMARP requests of all IP members within the LIS. I must also note that if the LIS is operating with PVCs only, then this parameter may be set to null and the IP station is not required to send ATMARP requests to the ATMARP server. ATM does not support broadcast addressing; therefore there are no mappings available from IP broadcast addresses to ATM broadcast services. ATM does not support multicast address services; therefore there are no mappings available from IP multicast addresses to ATM multicast services. As to ATM switching, it is also known as fast packet switching. An ATM switching node transports cells from incoming links to outgoing links using the routing information contained in the cell header and information stored at each switching node by the connection set-up procedure. Two functions are performed at each switching node by the connection set-up procedure:
1. the unique connection identifier at the incoming link and the unique connection identifier at the outgoing link are defined for each connection;
2. routing tables at each switching node are set up to provide an association between the incoming and outgoing links for each connection.
VPI and VCI are the two connection identifiers used in ATM cells. Thus the basic functions of an ATM switch can be stated as follows:
- routing (space switching), which indicates how the information is internally routed from the inlet to the outlet;
- queuing, which is used in solving contention problems if two or more logical channels contend for the same output;
- header translation: all cells which have a header equal to some value on an incoming link are switched to an outlet and their header is translated to a new value, say k.
There are also many functions involved in the traffic control of ATM networks [5].
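The routing and header-translation functions above can be sketched as a table lookup. This is an illustrative model only; the table layout and names are invented, not taken from any ATM switch implementation.

```python
def make_switch():
    """Toy ATM switch: the connection set-up procedure fills a routing
    table keyed by (input port, incoming VPI/VCI); forwarding looks up
    the table, translates the header, and emits the cell on the chosen
    output port."""
    table = {}

    def set_up(in_port, in_vpivci, out_port, out_vpivci):
        # Performed once per connection at call set-up time.
        table[(in_port, in_vpivci)] = (out_port, out_vpivci)

    def forward(in_port, cell):
        # cell = (VPI/VCI header value, payload)
        vpivci, payload = cell
        out_port, new_vpivci = table[(in_port, vpivci)]
        return out_port, (new_vpivci, payload)  # header translated

    return set_up, forward

set_up, forward = make_switch()
set_up(in_port=0, in_vpivci=(1, 42), out_port=3, out_vpivci=(7, 99))
out_port, cell = forward(0, ((1, 42), b"data"))
assert out_port == 3 and cell == ((7, 99), b"data")
```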
For example, Connection Admission Control. This can be defined as a set of actions taken by the network during the call set-up phase to establish whether a VC/VP connection can be made. A connection request for a given call can only be accepted if sufficient network resources are available to establish the end-to-end connection, maintaining its required quality of service and not affecting the quality of service of existing connections in the network by this new connection [5].
CONCLUSIONS
In this article we have addressed the issues that are involved in implementing IP in ATM LANs. In LAN emulation, ATM is configured as an IEEE 802 MAC protocol below the LLC. This allows ATM switches to be transparently integrated with the IEEE 802 family of LAN protocols. The key issues in this configuration are resolving IP addresses to ATM addresses and providing broadcast transparency. ATM can provide the full functionality of the network layer and data link protocols, and as a result, it is possible to implement transport layer protocols such as TCP directly over ATM. This resolves the IP/ATM image processing and transmission difficulties at the network layer.
REFERENCES
[1] J. Heinanen, Multiprotocol Encapsulation over ATM Adaptation Layer 5, RFC 1483, Telecom Finland, July 1993.
[2] ATM Forum, ATM User-Network Interface Specification Version 3.0, Prentice Hall, 1993.
[3] M. Laubach, Classical IP and ARP over ATM, RFC 1577, Information Sciences, January 1994.
[4] D. Plummer, An Ethernet Address Resolution Protocol or Converting Network Addresses for Transmission on Ethernet Hardware, STD 37, RFC 826, MIT, November 1982.
[5] V. S. Mkrttchian, A. V. Eranosian and H. L. Karamyan, Resolving the Problem of IP on ATM Local Area Networks, The Problems of the Efficiency Improvement of the Control Systems of Technological Processes (in Russian), AACC, vol. 4, 1995.
Proceedings IWISP '96; 4- 7 November 1996; Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors) 9 1996 Elsevier Science B.V. All rights reserved.
A Scalable Video Coding Scheme Based on Adaptive Infield/Inframe DCT and Adaptive Frame Interpolation
Masatoshi Asada and Katsutoshi Sawada
Department of Information Network Engineering, Aichi Institute of Technology, Yakusa-cho, Toyota-shi, Aichi-ken 470-03, Japan
ABSTRACT
This paper describes a spatio-temporal resolution scalable coding scheme. Resolution scalability means a coding property where lower partial resolution pictures can be obtained by decoding only subsets of the total coded bit stream, while the full resolution picture is reconstructed by decoding the total bit stream. This scheme employs frame subsampling associated with adaptive interpolation for temporal scalability and adaptive infield/inframe DCT for spatial scalability. The proposed scheme provides four different spatio-temporal resolutions of a video sequence -- two temporal resolutions, each consisting of two spatial resolutions. It can be applied effectively to interlaced video sequences. Computer simulation results have demonstrated that this scheme has better coding performance compared to conventional non-adaptive methods.
1. INTRODUCTION
In resolution scalable coding [1]-[6], lower resolution pictures can be obtained by decoding only subsets of the total coded bit stream, while the full resolution picture is reconstructed by decoding the total bit stream. This has some important applications such as compatibility [2] between different resolution video systems, progressive resolution transmission and image database browsing. It is also useful for error resilient transmission in the case of digital television terrestrial broadcasting [3] and ATM video transmission. There are two kinds of resolution scalability -- spatial scalability and temporal scalability. A spatio-temporal scalable coding scheme based on the subband technique is presented in [4]. The spatio-temporal scalable coding scheme proposed in this paper employs frame subsampling associated with adaptive interpolation for temporal scalability and adaptive infield/inframe DCT for spatial scalability. This scheme can be applied effectively to interlaced video sequences.
Section 2 describes the outline of the scheme, and sections 3 and 4 discuss the details of temporal and spatial scalable coding, respectively. Section 5 presents computer simulation experimental results.
2. OUTLINE OF THE SPATIO-TEMPORAL SCALABLE CODING SCHEME
Fig. 1 shows the block diagram of the proposed spatio-temporal scalable video coding scheme. An input interlaced video sequence is first separated into odd frames and even frames. DCT-based spatial scalable coding is performed on each frame. At the decoder, temporally low resolution pictures are constructed by decoding only odd frames and interpolating even frames, while temporally full resolution pictures are reconstructed by decoding both odd frames and even frames. Spatially low resolution pictures are obtained by decoding only low frequency DCT coefficients, while spatially full resolution pictures are reconstructed by decoding both low and high frequency DCT coefficients. Thus, this scheme provides four different spatio-temporal resolutions of a video sequence -- two temporal resolutions, each consisting of two spatial resolutions. Table 1 shows the four kinds of spatio-temporal resolutions and the corresponding coded data.
[Fig. 1: block diagram of the spatio-temporal scalable coding scheme. Legend: o: odd frame; e: even frame; L: spatially low component; H: spatially high component.]

Table 1 Reconstructed pictures and corresponding coded data.

  Spatial resolution   Temporal resolution   Corresponding coded data
  Low                  Low                   Lo
  Full                 Low                   Lo, Ho
  Low                  Full                  Lo, Le
  Full                 Full                  Lo, Ho, Le, He

[Fig. 2 panels: (a) stationary portion, (b) moving portion. O: decoded pixel, x: interpolated pixel.]
Fig. 2 Adaptive interpolation of even frames.
3. ADAPTIVE INTERPOLATION IN TEMPORAL SCALABLE CODING
In this scheme, odd frames and even frames are used as temporal resolution base layer pictures and enhancement layer pictures, respectively. Temporally low resolution pictures are obtained at the decoder by decoding only the odd frame data and interpolating the missing even frames from adjacent odd frames. This scheme uses two different interpolation methods adaptively for moving and stationary portions, as shown in Fig. 2. Each even frame is first segmented into moving portions and stationary portions based on the interframe difference values between the previous and the next odd frames. For the stationary portions, simple frame interpolation is applied. That is, the 1st and the 2nd fields of the missing even frame are interpolated from the 1st and the 2nd fields of the previous odd frame, respectively. In this case, the spatial resolution can be preserved in the stationary portions. This conventional frame interpolation, however, causes an annoying "backward and forward motion" artifact in the moving portions. In order to avoid this degradation, the 1st and 2nd fields of the missing even frame are interpolated from the 2nd field of the previous odd frame and the 1st field of the next odd frame, respectively.
4. SPATIAL SCALABLE CODING BASED ON ADAPTIVE INFIELD/INFRAME DCT
4.1 Spatial scalable coding based on DCT and MC prediction
This scheme employs spatial scalable coding based on DCT [1],[3],[5] and motion compensation (MC) prediction as shown in Fig. 3. The coder configurations for odd frames and even frames are different, because the MC prediction methods are different in these two cases. For both cases, the DCT is first performed on an input picture, then the DCT coefficients are separated into low frequency (L) and high frequency (H) components. MC prediction coding is carried out in each DCT coefficient domain.
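The L/H separation of DCT coefficients just described can be sketched as follows. The 8x8 block and 4x4 low-frequency corner are illustrative choices; the paper does not specify the split size, and the naive DCT here is only for demonstration.

```python
import math

def dct_1d(x):
    """Orthonormal 1-D DCT-II (naive O(N^2) version, for illustration)."""
    N = len(x)
    out = []
    for k in range(N):
        s = sum(x[n] * math.cos(math.pi * (n + 0.5) * k / N) for n in range(N))
        scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
        out.append(scale * s)
    return out

def dct_2d(block):
    # Separable 2-D DCT: transform rows, then columns.
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(c) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def split_lh(coeffs, low=4):
    """Separate DCT coefficients into a low-frequency component L (the
    low x low corner) and a high-frequency component H (the rest)."""
    L = [row[:low] for row in coeffs[:low]]
    H = [[0.0 if (i < low and j < low) else coeffs[i][j]
          for j in range(len(coeffs))] for i in range(len(coeffs))]
    return L, H

block = [[(i + j) % 8 for j in range(8)] for i in range(8)]
L, H = split_lh(dct_2d(block))
# A spatially low resolution picture is decoded from L only; the full
# resolution picture needs both L and H.
assert len(L) == 4 and len(L[0]) == 4
assert H[0][0] == 0.0  # low corner excluded from H
```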
Fig. 3(a) shows the configuration for the odd frames, where forward MC prediction using the previous odd frame is employed. Because the MC prediction is carried out in the picture domain, an inverse DCT (IDCT) and a DCT are positioned before and after the MC prediction, respectively. In order to improve the MC prediction performance for the H component, not only the decoded H component but also the decoded L component is fed to the IDCT [5] in the MC prediction loop of the H component. That is, the MC prediction for the H component is carried out in the full resolution picture domain. Fig. 3(b) shows the configuration for the even frames, where bidirectional MC prediction using the previous and next odd frames is employed. At the decoder, spatially low resolution pictures are obtained by decoding only the L component data. On the other hand, spatially full resolution pictures are reconstructed by decoding both the L and H component data.
4.2 Adaptive infield/inframe DCT
This scheme employs an adaptive infield/inframe DCT [6], where the DCT block construction is switched adaptively between field-based blocks and frame-based ones. Fig. 4 shows the block construction method for the adaptive DCT. An input picture frame is first divided into frame-based blocks of 8x16 size. Then each block is again separated into either two field-based blocks of 8x8 or two frame-based blocks of 8x8. The field-base or frame-base decision is made according to the intrafield and interfield vertical absolute difference values, respectively. This adaptive DCT can improve the coding performance, especially the low resolution picture quality in stationary portions.
Fig. 3 Spatial scalable coder configuration based on DCT and MC prediction: (a) coding of odd frame pictures; (b) coding of even frame pictures.
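The field/frame decision just described can be sketched as follows. The activity measure is a plausible reading of "intrafield and interfield vertical difference absolute values"; the function name and tie-breaking rule are invented, not taken from the paper.

```python
import numpy as np

def split_block(block):
    """Sketch of the field/frame decision for an 8(wide) x 16(high) block
    given as a (16, 8) array of lines. Returns two 8x8 blocks, either
    field-based (same-parity lines) or frame-based (consecutive lines),
    whichever shows smaller vertical difference activity."""
    b = block.astype(int)
    # Interfield activity: differences between consecutive lines,
    # which belong to opposite fields in an interlaced frame.
    interfield = np.abs(np.diff(b, axis=0)).sum()
    # Intrafield activity: differences between same-field lines (stride 2).
    intrafield = np.abs(b[2:] - b[:-2]).sum()
    if intrafield <= interfield:
        # Field-based 8x8 blocks: 1st-field lines, then 2nd-field lines.
        return b[0::2], b[1::2]
    # Frame-based 8x8 blocks: top half and bottom half.
    return b[:8], b[8:]

# A vertical gradient (stationary-like content) favors frame-based blocks;
# an interlaced "comb" (moving content) favors field-based blocks.
grad = np.repeat(np.arange(16), 8).reshape(16, 8)
top, bottom = split_block(grad)
assert (top == grad[:8]).all()  # frame-based split chosen

comb = np.zeros((16, 8), dtype=int)
comb[1::2] = 50
f1, f2 = split_block(comb)
assert (f1 == 0).all() and (f2 == 50).all()  # field-based split chosen
```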
5. SIMULATION EXPERIMENTS
Computer simulation experiments were carried out in order to evaluate the performance of this scalable coding scheme. It was confirmed that four different resolution pictures were obtained by the proposed coding scheme. Concerning the temporally low (and spatially low or full) resolution pictures, adaptive interpolation showed better picture quality than the conventional frame interpolation for moving portions. No "backward and forward motion" was observed. It also showed better picture quality than field interpolation for the stationary portions. The coding performance of the adaptive infield/inframe DCT was compared to those of infield DCT and inframe DCT. The results are shown in Fig. 5 and Fig. 6. For the spatially full resolution pictures, there were no significant performance differences. However, the adaptive DCT showed better performance for spatially low resolution pictures.
6. CONCLUSION
This paper has described a spatio-temporal scalable video coding scheme which employs frame subsampling associated with adaptive interpolation for temporal scalability and adaptive infield/inframe DCT for spatial scalability. The proposed scheme provides four different spatio-temporal resolutions of an interlaced video sequence. Computer simulation results have shown that this scheme has better coding performance compared to the conventional non-adaptive schemes.
Fig. 4 Block construction of adaptive infield/inframe DCT.
Fig. 5 Experimental results for temporally-full/spatially-full resolution pictures.
Fig. 6 Experimental results for temporally-full/spatially-low resolution pictures.
REFERENCES
[1] C. Gonzales and E. Viscito, "Flexibly Scalable Digital Video Coding," Signal Processing: Image Communication, vol. 5, nos. 1-2, pp. 5-20, 1993.
[2] T. Chiang, H. Sun and J. W. Zdepski, "Spatial Scalable HDTV Coding," Proc. ICIP'95, vol. 2, pp. 571-574, 1995.
[3] G. Schamel, "Graceful Degradation and Scalability in Digital Coding for Terrestrial Transmission," Proc. HDTV 1992, vol. 2, pp. 72/1-72/9, 1992.
[4] G. Lilienfield and J. W. Woods, "Scalable High-Definition Video Coding," Proc. ICIP'95, vol. 2, pp. 567-570, 1995.
[5] M. Nakamura and K. Sawada, "Scalable Coding Schemes based on DCT and MC Prediction," Proc. ICIP'95, vol. 2, pp. 575-578, 1995.
[6] M. Asada and K. Sawada, "MC-DCT Scalable Coding for Interlaced Image," Proc. Tokai-section Joint Conference, no. 553, 1995 (in Japanese).
Session I:
IMAGE SUBBAND, WAVELET CODING AND REPRESENTATION
Unified Image Compression Using Reversible and Fast Biorthogonal Wavelet Transform
Hyung Jun Kim and C. C. Li
Department of Electrical Engineering, University of Pittsburgh, Pittsburgh, PA 15261, USA
Abstract
We present a fast image compressor using biorthogonal wavelet transforms which gives high computational speed and excellent compression performance. Special spline biorthogonal wavelets are used whose filter coefficients are dyadic rational numbers, so that convolutions with these filters can be performed using only integer arithmetic shifting and addition operations. Following the transform, Hilbert scanning is used for encoding to gain additional compression.

1 Introduction
Wavelet transforms are known to have excellent energy compaction characteristics and, therefore, are ideal for signal and image compression. Although this approach has been vigorously developed in recent years, general orthogonal (or biorthogonal) wavelet transform filters and most subband coding filters have many taps and require many floating point multiplications. That is one of the reasons why wavelet transform/subband coding often takes longer processing time in comparison to JPEG. Another problem is that an additional step must be taken to implement lossless compression, even though the wavelet transform can achieve perfect reconstruction in theory [1]. In the proposed compressor, special spline biorthogonal wavelets are used whose filter coefficients are dyadic rational numbers, so that convolutions with these filters can be performed using only arithmetic shifting and addition operations without any multiplication. Wavelet-based image coding prefers smooth filters of relatively short support with some degree of regularity. If some quality degradation is acceptable after choosing a short-support wavelet filter, our main concern will be the support length for fast processing. The other concern is the regularity, especially that of the synthesis filters, as it influences the quality of the reconstructed images. In biorthogonal filter banks with H_0(z) and H_1(z) as lowpass and highpass
analysis filters, and G_0(z) and G_1(z) as lowpass and highpass synthesis filters respectively, perfect reconstruction with FIR filters means that

    H_0(z)G_0(z) + H_0(-z)G_0(-z) = 2.                        (1)

We choose

    H_0(z) = (1/(4\sqrt{2})) (-z^2 + 2z + 6 + 2z^{-1} - z^{-2}),   (2)
    H_1(z) = (1/(2\sqrt{2})) (-z + 2 - z^{-1}),                    (3)
    G_0(z) = (1/(2\sqrt{2})) (z + 2 + z^{-1}),                     (4)
    G_1(z) = (1/(4\sqrt{2})) (-z^2 - 2z + 6 - 2z^{-1} - z^{-2}).   (5)

We avoid the factors 1/\sqrt{2} in the expressions by multiplying by 1/\sqrt{2} in the analysis parts and by \sqrt{2} in the synthesis parts so as to make filter banks with dyadic rational coefficients.

2 Shift and superposition operations
Let us first consider a data sequence for reconstruction given by {.., a, b, c, d, e, f, ..} and a 5-tap 1-D filter with dyadic rational coefficients {1/8, 1/4, 1/2, 1/4, 1/8}, as illustrated in Figure 1. The first step of reconstruction is interpolation of the data sequence, i.e., putting zeros in between two successive data points, resulting in a new data sequence {.., a, 0, b, 0, c, 0, d, 0, e, 0, f, 0, ..} for convolution with the 5-tap filter. The computation can be done by shift and superposition operations as follows. We fetch one data point, for example b, which is aligned with the center of the filter, and then scale it using either one (for 1/2), two (for 1/4) or three (for 1/8) shift-right operations to obtain the first intermediate data set {b/8, b/4, b/2, b/4, b/8}. As illustrated in Figure 1, it looks as if the center value is spreading out toward both ends of the filter, weighted by the respective filter coefficients. At the next clock cycle, we do not need to calculate an intermediate data set, since the input is zero. At the third clock cycle, we do the similar shift operations, resulting in another data set {c/8, c/4, c/2, c/4, c/8}. After 5 clock cycles, we add up all the intermediate data sets (excluding the two zero sets) to obtain the value b/8 + c/2 + d/8 at the position of the input data point c, which is the same as what we would get from the usual convolution operation. At the next clock cycle, one gets another output, c/4 + d/4. Repeat
the same procedure until we cover all the input data points in the sequence. The benefit of shift and superposition operations is a gain in processing speed. In the usual convolution computation, even when many data points are zero, such as in the highpassed data of the wavelet decomposition, we still have to perform the convolution operation, which is obviously a waste of time. In the shift and superposition case, on the other hand, if the center value is zero, we can skip the whole set of shift and addition operations. This saving will be more pronounced in the 2-D case, and will be very useful in accelerating the reconstruction process.
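A minimal sketch of this upsample-then-shift-and-superpose reconstruction follows; the helper name is invented, and right shifts implement the dyadic taps 1/8, 1/4, 1/2, 1/4, 1/8 exactly as in the text (exact for inputs divisible by 8).

```python
# Dyadic filter taps expressed as right-shift amounts: 1/8, 1/4, 1/2, 1/4, 1/8.
SHIFTS = [3, 2, 1, 2, 3]

def reconstruct(coarse):
    """Upsample `coarse` by 2 and convolve with the 5-tap dyadic filter,
    using only shift and add operations: each nonzero center value
    spreads out to its neighbors, scaled by right shifts."""
    n = 2 * len(coarse)
    out = [0] * (n + 4)  # padded by the filter half-length on both sides
    for i, v in enumerate(coarse):
        if v == 0:
            continue  # zero centers contribute nothing -- the speed gain
        center = 2 * i + 2  # position in the padded, upsampled sequence
        for k, s in enumerate(SHIFTS):
            out[center + k - 2] += v >> s
    return out[2:n + 2]  # drop the padding

# With {.., b, c, d, ..} = {8, 16, 24}, the output at c's position is
# b/8 + c/2 + d/8 and the next sample is c/4 + d/4, as in the text.
y = reconstruct([8, 16, 24])
assert y[2] == 8 // 8 + 16 // 2 + 24 // 8   # = 12
assert y[3] == 16 // 4 + 24 // 4            # = 10
```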
3 Reconstruction using 2-D masks
In 2-D processing, if the wavelet transform is performed first horizontally and then vertically, the inverse transform must be performed vertically first and then horizontally. We may use the tensor product of two separate 1-D filters for decomposition, but for reconstruction we have developed a fast reconstruction algorithm using 2-D filter masks to take advantage of the characteristics of the wavelet transformed data. Although processing by 2-D filters generally takes more time than processing by two separable 1-D directional filters, our special algorithm based on 2-D masks and shift-addition operations provides a fast reconstruction process. In order to minimize the data fetch operations and to process only the nonzero-valued pixels in the decomposed images, we perform the inverse wavelet transform using 2-D filter masks as shown in Table 1, instead of using two 1-D (vertical and horizontal) processes. These masks are the four tensor products of the two 1-D filters {1/2, 1, 1/2} and {-1/4, -1/2, 3/2, -1/2, -1/4}, which correspond to \sqrt{2} G_0(z) and \sqrt{2} G_1(z), respectively. Analogous to the 1-D case, the center pixel value in the mask spreads out to the neighboring pixels via shift and superposition operations, as illustrated in Figure 2. For illustration, let us consider a 3 x 3 input image for reconstruction having pixel values {a, b, c; d, e, f; g, h, i} and a reconstruction filter (LL-filter) as shown in Table 1(a). The input image is first interpolated by putting zeros in between two neighboring pixels to obtain a 7 x 7 fine scale image for convolution with the LL-filter. The center pixel value is spread out to the eight neighbors (for the LL-filter case) after being weighted by the mask values. In the figure, the dark squares represent pixel values at the coarser level and the white squares show the interpolated pixels at the finer level.
In the case of the center pixel value e, the center weight of the LL-filter mask is 1, therefore the center value remains "e" without any change. The weights of north, south, west, and east are all 1/2, so the e value itself shifts right once and becomes "e/2". Similarly, the weights of northwest, northeast, southwest, and southeast are all 1/4, thus the center pixel value e shifts right twice and becomes "e/4". Repeat the same process for the next center pixel until we finish the shift-and-superposition processing for the whole input image. Similar processing is performed using the LH (3x5), HL (5x3), and HH (5x5) filters and the corresponding input images until we complete the processing at one reconstruction level; we then go to the next level. Since there are many zero-valued pixels in the three high-passed subimages at each level, the reconstruction process is carried out only for the nonzero values. Thus the whole reconstruction process can be significantly accelerated compared to the usual tensor product of two 1-D processings. Because all four subimages are obtained simultaneously with the four masks, this method can be easily adapted to progressive reconstruction.

4 Hilbert scanning
An additional improvement in compression can be achieved by Hilbert scanning of the wavelet transformed data [2]. Scanning while maximizing correlation is an important consideration in image compression, since the higher the correlation we obtain at the preprocessing stage, the more efficient the data compression will be. The main advantage of Hilbert scanning is that the scanning curve remains in a region as long as possible before moving to a neighboring area. The Hilbert scan is built recursively within an image using the basic definition on one scale level and the recursive definition on consecutive levels. Let us denote the upper left, upper right, lower left, and lower right quadrants by (0,0), (0,1), (1,0), and (1,1), respectively. We define four basic scans:
    R : (0,0) -> (0,1) -> (1,1) -> (1,0),
    D : (0,0) -> (1,0) -> (1,1) -> (0,1),
    L : (1,1) -> (1,0) -> (0,0) -> (0,1),
    U : (1,1) -> (0,1) -> (0,0) -> (1,0).      (6)

The four recursive scans are defined by the following:

    R : D -> R -> R -> U,
    D : R -> D -> D -> L,
    L : U -> L -> L -> D,
    U : L -> U -> U -> R.                      (7)
An image is divided into four quadrants, each of which is further divided into four quadrants, and so on, until subimages of size 2 x 2 are formed. The Hilbert scan of a 2-D surface, instead of a linear scan, has the advantage of exploiting more efficiently the two-dimensional correlation existing on the surface. To take advantage of this scanning, a run-length coder is used prior to Huffman coding, since the performance of the latter is not affected by the sequence of the data.

5 Experiments
A unified compression, which was not achieved by JPEG, was developed and reported in [1, 3]. Its performance on lossless compression was compared with that of JPEG and is shown in Table 2; a 10% improvement was obtained. We have studied three cases for lossy compression. The first method (Lossy1) involves just tensor products of regular 1-D decompositions and of regular 1-D reconstructions. The second method (Lossy2) uses the ordinary tensor products in decomposition processing, but uses 2-D masks in reconstruction to accelerate the reconstruction process. The third one (Lossy3) is the same as Lossy2 except that it includes the Hilbert scanning after the wavelet decomposition to improve the compression ratio. Obviously, there is no difference among Lossy1, Lossy2, and Lossy3 in reconstructed image quality, since Lossy2 only speeds up the reconstruction process. However, Lossy3 can improve the compression ratio over Lossy1 and Lossy2 at the expense of a little more processing time, since it exploits the correlation in the wavelet decomposed data. The performances of all three methods are compared with that of JPEG [4]. In Table 3, we compare the processing time of the proposed compressor with that of JPEG for the same reconstruction quality in terms of peak-to-peak signal-to-noise ratio (PSNR of about 30 dB). The test was performed on a SUN Microsystems SPARC-10 station using the standard gray-scale images Lena, Barbara, Peppers, and Airplane (512 x 512 x 8). The compression/decompression time means the total processing time from the beginning to the end, and W.Dec./W.Rec. means wavelet decomposition/reconstruction time only. Lossy compression performances are shown in Table 4, and a graph of compression ratio versus PSNR for the Lena image is shown in Figure 3. An example of reconstructed images (about 30 dB) using JPEG and the proposed method is also shown in Figure 4.
References
[1] A. Zandi, J. Allen, E. Schwartz, and M. Boliek, "CREW: Compression with Reversible Embedded Wavelets," Proceedings of IEEE Data Compression Conf., March 1995, pp. 212-221.
[2] S. N. Efstratiadis, B. Rouchouze, and M. Kunt, "Image Compression Using Subband/Wavelet Transform and Adaptive Multiple Distribution Entropy Coding," Proc. SPIE, v. 1818, Visual Communications and Image Processing '92, 1992, pp. 753-764.
[3] H. Kim and C. C. Li, "A Fast Reversible Wavelet Compressor," Proc. SPIE, v. 2825, Mathematical Imaging: Wavelet Applications in Signal and Image Processing IV, August 1996.
[4] W. B. Pennebaker and J. L. Mitchell, JPEG Still Image Data Compression Standard, Van Nostrand Reinhold, New York, 1993.
Table 1: 2-D filter masks for fast reconstruction

(a) LL-filter
1/4  1/2  1/4
1/2   1   1/2
1/4  1/2  1/4

(b) LH-filter
-1/8  -1/4  3/4  -1/4  -1/8
-1/4  -1/2  3/2  -1/2  -1/4
-1/8  -1/4  3/4  -1/4  -1/8

(c) HL-filter (transpose of the LH-filter)
-1/8  -1/4  -1/8
-1/4  -1/2  -1/4
 3/4   3/2   3/4
-1/4  -1/2  -1/4
-1/8  -1/4  -1/8

(d) HH-filter
6 Conclusions
The compression performance of the proposed method is superior to that of JPEG, and the visual quality of the reconstructed images is better even when the PSNR values are the same. We have significantly improved the processing time of the wavelet transform, but the overall processing time is still longer than JPEG. This is because JPEG divides an image into small subimages and then processes them using diagonal zigzag scanning and the fast 1-D DCT. Therefore, JPEG is effectively 1-D processing, while the wavelet transform method is 2-D processing. Although JPEG processing speed and compression ratio are good, there is a noticeable blocking artifact at high compression. In contrast, there is no blocking effect at all in images reconstructed by wavelet-based methods. In fact, the performance of the proposed method is far superior to that of JPEG at high compression ratios. The processing speed may be improved further in a hardware implementation, since the proposed method uses only integer shift and addition operations.
1/16   1/8  -3/8   1/8  1/16
 1/8   1/4  -3/4   1/4   1/8
-3/8  -3/4   9/4  -3/4  -3/8
 1/8   1/4  -3/4   1/4   1/8
1/16   1/8  -3/8   1/8  1/16

Table 2: Lossless compression ratio of test images
Image      Lossless & Huff.   Lossless & Arith.   Lossless JPEG   Baseline JPEG
Lena       1.67:1             1.68:1              1.51:1          1.48:1
Barbara    1.54:1             1.55:1              1.38:1          1.45:1
Peppers    1.68:1             1.69:1              1.46:1          1.51:1
Airplane   1.85:1             1.87:1              1.46:1          1.51:1
Table 3: Processing time (seconds) for test images

           Prop.   JPEG
Com.       1.10    0.50
Decom.     1.00    0.40
W.Dec.     0.35    -
W.Rec.     0.25    -
Table 4: Lossy compression ratio (CR and bpp) of test images (PSNR ≈ 30 dB)

Image      Lossy1&2         Lossy3           JPEG
Lena       33.34:1 (0.24)   38.64:1 (0.21)   27.89:1 (0.29)
Barbara    14.91:1 (0.54)   18.22:1 (0.49)   11.39:1 (0.70)
Peppers    40.10:1 (0.20)   46.25:1 (0.17)   32.19:1 (0.25)
Airplane   34.23:1 (0.23)   39.83:1 (0.20)   29.88:1 (0.27)

Figure 1: An example of shift and superposition operations for a 5-tap 1-D filter
Figure 2: An example of 2-D shift and superposition operations for the LL-filter (3x3 2-D filter mask)
Figure 3: Compression ratio vs. PSNR for the Lena image
Figure 4: Reconstructed images (PSNR ≈ 30 dB) using JPEG (top) and the proposed method (bottom)
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
Subband Image Coding Using Adaptive Fuzzy Quantization Step Controller Peter Planinsic, Franc Jurkovic, Zarko Cucej, Dali Donlagic Faculty of Electrical Engineering and Computer Science Smetanova 17, 2000 Maribor, Slovenia E-mail:[email protected]
Abstract
An adaptive fuzzy quantization step controller for achieving a desired picture quality in PSNR is described. The possibility of constructing other picture quality measures and of fuzzy bit allocation is also proposed.
1. INTRODUCTION
Subband coding using multirate filter banks is an effective method to achieve data compression in image signal processing, Safranek et al., 1988 [1], Planinsic, 1993 [2]. The compression rate in bpp is determined by the choice of the quantization step Q. The larger the quantization step Q, the larger the compression factor and the quantization error. This relationship differs from picture to picture and depends, as does the quality of the decompressed picture, on the picture statistics. The goal of digital image compression is the representation of a source digital image with as few bits as possible, while still maintaining adequate fidelity for the particular application. The most commonly used quality measure of a compressed image is the mean square error (MSE) per sample, or reconstruction error, which is defined by:
MSE = (1/N) · (x - x̂)^T · (x - x̂)
(1)
where x is the grey-level representation of the original image and x̂ that of the reconstructed image. The peak signal-to-noise ratio (PSNR) for 8-bit resolution is defined as:
PSNR = 10 · log10(255^2 / MSE)
(2)
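As a sketch (our own helper function, assuming the standard 8-bit definition above and images given as flat pixel lists):

```python
import math

def psnr_8bit(original, reconstructed):
    """PSNR for 8-bit data: 10*log10(255^2 / MSE), with the MSE of eq. (1)
    computed per sample over flattened pixel lists."""
    n = len(original)
    mse = sum((x - y) ** 2 for x, y in zip(original, reconstructed)) / n
    return 10.0 * math.log10(255.0 ** 2 / mse)
```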
This measure is tractable, and it usually, but not always, correlates reasonably well with subjective criteria. Since the human observer is the final receiver of most of the transmitted image information, an objective measure based on visual perception would be very useful to predict picture quality. Finding such a measure still remains an area for future research. There have been several attempts in the past to derive visual models to predict picture quality by means of an objective measure. We suggest the use of fuzzy quantization step control to achieve a desired picture quality. For the sake of simplicity we used PSNR as the picture quality measure.
2. SUBBAND IMAGE CODING
In the one-dimensional pyramid subband decomposition scheme with a two-band QMF we used a linear-phase FIR filter H0(z). We get the solution:
Ĥ(z) = H0(z),   Ĝ(z) = Ĥ(-z) = H0(-z);   H(z) = 2·H0(z),   G(z) = -2·H0(-z)
(3)
The amount of amplitude distortion introduced by a nontrivial linear-phase FIR filter in the QMF can be minimised by optimising the FIR filter coefficients. For the filter bank the QM-filters designed by Johnston [6] have been employed (Johnston's 16b).
In Figure 1 a transform image coding scheme is presented. In our image coding application with a separable multiresolution transformation, we simply applied the elementary step of the one-dimensional pyramid scheme, first to the rows and then to the columns of the image matrix (Figure 2).
Figure 1: Transform image coding scheme
Figure 2: Two-dimensional multiresolution transformation (filtering of the rows, then of the columns)
The quantizer we used is an equidistant quantizer. The discretization and reconstruction mappings are defined as:
t_Q : R -> Z,   x |-> floor(x/Q + 1/2)
(4a)
r_Q : Z -> R,   x |-> x · Q
(4b)
respectively. In our method, only integer quantization distances are admissible.
Figure 3: The disposition and scanning of the transformed subimages
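The mappings of eqs. (4a) and (4b) can be sketched directly (our own function names):

```python
import math

def quantize(x, Q):
    """Discretization mapping of eq. (4a): x -> floor(x/Q + 1/2), i.e.
    rounding x to the nearest multiple of the integer step Q."""
    return math.floor(x / Q + 0.5)

def reconstruct(k, Q):
    """Reconstruction mapping of eq. (4b): k -> k*Q."""
    return k * Q
```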
Each subimage T_i (i = 1, ..., K) of the transformed image T is assigned a quantization distance (factor) Q_i, and the amplitudes of the subimage are quantized according to the mapping t_{Q_i}. A much simpler alternative is to use a unique quantization distance over the whole transform domain, i.e.:
Q_1 = Q_2 = ... = Q_K = Q
(5)
The quantized transform coefficients of the detail subimages and the approximation image at r resolution levels are scanned or vectorized in the manner shown in Figure 3. We have applied an adaptive coding scheme, the so-called recency rank coder, Elias, 1987 [3]. Reconstruction is performed using decoding and the synthesis filter bank. 3. ADAPTIVE FUZZY CONTROLLER The block scheme of the adaptive fuzzy controller is shown in Figure 4. A self-tuning controller is used [7]. With this approach we obtain a constant desired picture quality at the output. For identification, the method of W. Pedrycz [8] is used. The relation matrix, as the process model, is the result of the identification. The controller design is based on the method of M. Togai [9].
Figure 4: Fuzzy controller
The basic task of the fuzzy controller is to determine the quantization factor Q for the desired PSNR. The control error is:
e(n) = PSNR_SET - PSNR(n)
(6)
The fuzzy controller consists of three parts: the fuzzification part, the fuzzy inference machine based on fuzzy rules, and the defuzzification part. The goal of using the adaptive fuzzy controller is adaptation to different non-linear processes. We can prescribe the desired controller behaviour, e.g. fast convergence with no overshoot.
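A toy version of such a controller might look as follows (the membership shapes, error ranges, and maximum step change are our assumptions, not taken from the paper):

```python
def tri(x, a, b, c):
    """Triangular membership function with support [a, c] and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def fuzzy_step_update(Q, psnr_set, psnr, dq_max=4.0):
    """One iteration of a toy fuzzy controller for the quantization step Q.
    Positive error (picture quality too low) -> decrease Q (finer steps);
    negative error -> increase Q. Centroid defuzzification over three rules."""
    e = psnr_set - psnr                      # control error, eq. (6)
    mu_neg = tri(e, -20.0, -10.0, 0.0)       # error is negative
    mu_zero = tri(e, -10.0, 0.0, 10.0)       # error is about zero
    mu_pos = tri(e, 0.0, 10.0, 20.0)         # error is positive
    # rule consequents: the change of Q proposed by each fuzzy error class
    num = mu_neg * dq_max + mu_zero * 0.0 + mu_pos * (-dq_max)
    den = mu_neg + mu_zero + mu_pos
    dq = num / den if den > 0 else 0.0
    return max(1.0, Q + dq)                  # Q must stay a positive step
```

An adaptive version would additionally tune the membership ranges from the identified process model, as the paper does via the relation-matrix identification.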
4. RESULTS OF SIMULATIONS We made simulations with different pictures, CIRCLE (black), LENA, BABOON, RANDOM (white random noise), for different initial values. The comparisons of the results for different pictures are shown in Figure 5.
Figure 5: Comparison of simulations for different pictures (Q_ini = 10): a) PSNR as a function of the iteration, b) relation Q - PSNR
5. CONCLUSIONS
The use of an adaptive fuzzy controller to control picture quality in terms of PSNR has been described. With adaptation, acceptable solutions for different pictures can be achieved. With fuzzy logic, other quality criteria can be constructed. One of the advantages of using a fuzzy controller is that we do not need complex models of the process. Process identification enables the construction of adaptive bit allocation. The bit allocation algorithm can be made with fuzzy rules, too. The state of signal processing technology enables the implementation of such algorithms in real time. It is also possible to regulate the compression rate in bpp. This method can be applied in any other transform coding compression method.
6. REFERENCES
[1] Safranek, R. J., MacKay, K., Jayant, N. S., and Kim, T. (1988): Image Coding Based on Selective Quantization of the Reconstruction Noise in the Dominant Sub-band. Proc. 1988 IEEE ICASSP, pp. 765-768, April 1988.
[2] P. Planinsic, J. Mohorko, Z. Cucej, D. Donlagic, P. Filip: Image compression based on the discrete wavelet transform. Proceedings of the 15th International Conference on Information Technology Interfaces ITI'93, Pula, Croatia, June 15-18, pp. 477-481, 1993.
[3] P. Elias: Interval and Recency Rank Source Coding: Two On-line Adaptive Variable-Length Schemes. IEEE Trans. Inform. Theory, vol. IT-33, pp. 3-10, January 1987.
[4] Chuen Chien Lee: Fuzzy Logic in Control Systems: Fuzzy Logic Controller, Part I, II. IEEE Trans. on Systems, Man, and Cybernetics, vol. 20, no. 2, pp. 404-418, pp. 419-435, 1990.
[5] Eiichi Tsuboka and Jun'ichi Naka: On the Fuzzy Vector Quantization based Hidden Markov Model. Advanced Program ICASSP'94, Adelaide, South Australia, April 1994.
[6] J. D. Johnston: A Filter Family Designed for Use in Quadrature Mirror Filter Banks. Proc. of IEEE ICASSP, pp. 291-294, 1980.
[7] K. J. Astrom: Theory and Applications of Adaptive Control. A Survey. Automatica, vol. 19, no. 5, pp. 471-486, 1983.
[8] W. Pedrycz: An identification algorithm in fuzzy relational systems. Fuzzy Sets and Systems, vol. 13, pp. 153-167, 1984.
[9] M. Togai, P. P. Wang: Analysis of a fuzzy dynamic system and synthesis of its controller. International J. Man-Machine Studies, vol. 22, pp. 355-363, 1985.
EZW Algorithm Using Visual Weighting in the Decomposition and DPCM
Laurent Lecornu and Czeslaw Jedrzejek (1)
EFP - The Franco-Polish School of New Information and Communication Technologies, ul. Mansfelda 4, 60-854 Poznan, Poland
Abstract
Some modifications of the EZW algorithm are described in this paper. First, EZW algorithms using bi-orthogonal wavelet decomposition and quincunx wavelet decomposition are compared. After studying the statistics of the wavelet coefficients, a DPCM quantization is used for coding the low-pass component. In the EZW algorithm using the bi-orthogonal wavelet transform, the coefficients are also weighted using a visual weighting function. The visual perception of the result seems a little better, but the PSNR is a little worse. In conclusion, this modification adds some complexity to the algorithm without a very significant result. However, the modifications of the standard Shapiro EZW algorithm shed light on the excellent performance of the recent Said and Pearlman EZW coder.
1. Introduction
Embedded zerotree wavelet (EZW) coding, introduced by Shapiro [1], is one of the best techniques for image compression. The principle of this algorithm is the exploitation of self-similarity across different scales of a wavelet-transformed image and the ordering of the magnitudes of the wavelet coefficients. The coding can be terminated when a target rate or a target distortion is reached. The EZW algorithm starts with a discrete wavelet transform followed by embedded coding using successive-approximation quantization. Two kinds of wavelet decomposition are used and compared, namely the bi-orthogonal wavelet transform and the quincunx wavelet transform. The main interest is in the pre-processing performed before the embedded coding. A modification of the EZW algorithm using the bi-orthogonal wavelet transform is introduced using a visual weighting function. Also, DPCM is employed for the quantization of the lowest subband. All comparisons are made for the Lena 512x512, 8-bit image.
2. EZW algorithm
One can divide the EZW into three steps: - wavelet decomposition, - building the tree and the quantisation of the coefficients, - encoding of the tree and constructing the compressed file. The modified EZW also contains weighting of the wavelet-decomposed subbands (Fig. 1). The building of the tree and the encoding part of the EZW algorithm consist of:
I. Computation of the maximum value Val_max of the absolute values of the wavelet coefficients
II. Threshold (T) determination: T = Val_max/2
III. Dominant pass
IV. Subordinate pass
V. T = T/2
VI. Go to III.
Each symbol of the significance map is compressed using arithmetic coding. To the compressed file, a header containing the image size, the maximum absolute value and the number of levels of the decomposition is added.
Figure 1: Scheme of the modified EZW algorithm (wavelet decomposition -> weighting of the decomposition -> quantization and zerotree building -> encoding of the tree)
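A skeleton of the successive-approximation loop (steps I-VI), without the zerotree symbol coding and the arithmetic coder, might look as follows (our own simplification, not Shapiro's full algorithm):

```python
def ezw_passes(coeffs, n_iter=4):
    """Successive-approximation quantization: coefficients become significant
    when their magnitude reaches the current threshold T (dominant pass) and
    already-significant magnitudes are refined by T/2 (subordinate pass);
    T is halved after each round."""
    val_max = max(abs(c) for c in coeffs)        # step I
    T = val_max / 2.0                            # step II
    significant = {}                             # index -> reconstructed value
    for _ in range(n_iter):
        # dominant pass (step III): detect newly significant coefficients
        for i, c in enumerate(coeffs):
            if i not in significant and abs(c) >= T:
                significant[i] = 1.5 * T * (1 if c >= 0 else -1)
        # subordinate pass (step IV): refine known values by +/- T/2
        for i, r in significant.items():
            c, s = coeffs[i], (1 if coeffs[i] >= 0 else -1)
            significant[i] = r + (T / 2) * s if abs(c) >= abs(r) else r - (T / 2) * s
        T /= 2.0                                 # step V; the loop is step VI
    return significant
```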
(1) This work was partially done while the author visited the NPAC, Syracuse University, Syracuse, NY 13244, USA.
D: Dominant pass
S: Subordinate pass
Figure 2: Example of a compressed file after 4 iterations with a bit rate constraint
3. Wavelet decomposition
a- bi-orthogonal separable wavelet transform
One of the best and most often used filters is the 9/7-tap filter [2] with the following coefficients:

n       0          1          2           3           4
h(n)    0.602949   0.266864   -0.078223   -0.016864   0.026749
h*(n)   0.557543   0.295636   -0.028772   -0.045636   0

Table 1: Filter coefficients (bi-orthogonal wavelets)
g_n = (-1)^n h*_{-n+1},    g*_n = (-1)^n h_{-n+1}
(1)
The signal is convolved with √2·h(n) and √2·g(n) for the decomposition, and with √2·h*(n)/2 and √2·g*(n)/2 for the reconstruction. There exist reports that varying the size of the filters at the various levels of the wavelet decomposition leads to better results [3]. At the first levels of decomposition a long filter is preferable, while at the later stages a shorter filter is advantageous. The problems are the transmission of the filter coefficients and the increase in complexity.
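Equation (1) amounts to a modulation of the mirrored complementary low-pass filter. With filters stored as {index: coefficient} dictionaries (our own convention, indices symmetric about 0), a sketch is:

```python
def highpass_from_lowpass(h):
    """Build the high-pass filter from the complementary low-pass filter via
    eq. (1): g[n] = (-1)**n * h[1 - n]."""
    return {1 - m: ((-1) ** (1 - m)) * c for m, c in h.items()}
```

Applying it to the 9/7 analysis filter of Table 1 yields a filter whose coefficients sum to (numerically) zero, as a high-pass filter should.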
b- quincunx wavelet transform
In general, one expects non-separable filters to have better compression properties than separable ones. Venkataraman and Levy [4] proposed a non-separable filter with a smaller product of spatial and frequency localisations compared to the 9/7-tap filter, which, in particular, should have better edge performance. In [5], it has been suggested (but not explicitly demonstrated) that the quincunx wavelet transform, being non-separable and non-oriented, gives better results than the bi-orthogonal wavelet transform. In another work, Jee and Haddad [6] designed optimal filters of a given class that provide minimum reconstruction error for a vector-quantized M-channel subband codec. Surprisingly (they do not comment on this), separable paraunitary and bi-orthogonal filters produce a smaller mean-square error (MSE) at a bit rate of 0.36 bits/pixel than a non-separable paraunitary filter at a bit rate of 0.50 bits/pixel. The quincunx transform decomposes the original image with a multiresolution scale factor of √2. This means that the analysis will be twice as fine as a dyadic multiresolution analysis. The scaling function is defined by the following formula:
φ_{m,n}(x, y) = 2^{-m} φ(L^{-m}(x, y) - n),    n ∈ Z^2, n = (n_x, n_y)
(2)
where L(x, y) = (x + y, x - y) is a linear transform, with m ∈ Z and L^m = L ∘ L ∘ ... ∘ L (m times). This isotropic multiresolution analysis provides, at each resolution level, only one wavelet coefficient sub-image and one low-resolution sub-image.

       h(n)        h*(n)
a      0.001671    -
b      -0.002108   -0.005704
c      -0.019555   -0.007192
d      0.139756    0.164931
e      0.687859    0.586315
f      0.006687    -
g      -0.006324   -0.017113
i      -0.052486   -0.014385
j      0.010030    -

Table 2: Filter coefficients (quincunx wavelets); the letters indicate the positions of the coefficients in the 2-D filter mask.
273
Figure 3 shows the parent-children dependencies in the case of the quincunx wavelet transform. The scanning order is: LL7, HH6, HH5, HH4, ... This wavelet transform organises the information of the original image into several resolutions. The image is "split" into M images of wavelet coefficients (resolutions 0 to M-1) and an image at the lowest resolution.
Figure 3: Parent-child dependencies (quincunx wavelet)
Our result is as follows: the mathematical quality (PSNR) of the EZW using the quincunx wavelet transform with 5-12 decomposition levels (on average the best result is for the 6-level decomposition) is 0.2-0.3 dB lower than the PSNR results for the bi-orthogonal wavelet transform with a 6-level decomposition, for bit rates of 0.1-2 bits/pixel.
c- statistics of the coefficient subimages
In [2], the statistics of the coefficient subimages were modelled by the generalised Gaussian, a particular case of which is the Laplacian function. Following [2], we use this Laplacian approximation to study the performance of the quantizer. The statistics of the coefficients serve to provide an optimal partitioning of bits between the various subbands, in this case LL and the remaining subbands. In the original Shapiro work [1] the zerotree is built starting from the lowest LL subband. However, although the statistics of the subimage at the lowest resolution are similar to the statistics of the original image, there is less similarity between the lowest resolution and the higher subbands. This suggests another technique for the coding of the lowest-resolution subband. This has been partially implemented in the highly successful Said and Pearlman modification of the EZW [7], which constructs the zerotree in such a way that only a fraction of the pixels in the lowest subband are roots, while the rest are scalar quantised. In [5], a DPCM technique is used to encode the lowest subband. In this work the DPCM is introduced into the original EZW algorithm. In this way the LL component can be encoded more precisely than the rest of the coefficients. However, again the PSNR results for EZW with DPCM encoding of the lowest subband are consistently lower, by up to 0.2 dB for bit rates of 0.1-2 bits/pixel, compared with the standard Shapiro algorithm.
4. Results for the modified EZW algorithm
The main modification studied in this work is the use of a weighting function based on visual perception. The assignment of the weights is based on the fact that the human eye is not equally sensitive to signals at all spatial frequencies. Following [2], on the basis of contrast sensitivity data, to obtain a controlled degree of noise shaping across the subimages one considers a function B_{m,d} such that:
B_{m,d} = γ · log(σ²_{m,d})
(3)
where σ_{m,d} is the standard deviation corresponding to a subimage (m, d). The values of γ and B_{m,d} are chosen experimentally in order to match human vision. The wavelet coefficients are multiplied by the function B_{m,d} before the embedded coding. Our results indicate that decoded images obtained for a given compression rate have the same (or slightly better) visual perception, but the PSNR is between 1 and 2 dB smaller. These weights affect the positions of the significant coefficients in each subband after the coding. Mazzari and Leonardi [8] presented an algorithm to estimate the
normalising coefficient for perceptual purposes based on a quantisation noise-like criterion. Then they proceeded to consider a method based on a steepest descent algorithm that was not related to visual perception but assigned arbitrary weights to subbands in order to minimise the total distortion. Due to numerical difficulties they resorted to an ad-hoc minimisation. As a consequence they used similar techniques to improve visual quality or PSNR, but not both of them. The question arises to what degree the original filter properties are lost through such a procedure. In general, it is only necessary that the transform be invertible. Orthogonality and other filter properties are not required for image coding [9,10].
5. Conclusions
The results obtained with the EZW method using the bi-orthogonal wavelet transform are better than for the quincunx wavelet transform, and we do not have an easy explanation for this fact. Using a visual weighting function gives the same (or slightly better) visual perception but a worse PSNR. Recently Said and Pearlman [7] demonstrated the excellent performance of their modified EZW coder. Although a lot remains to be understood, there are two apparent differences between their coder and the original Shapiro coder. One is that the Said and Pearlman coder uses scalar quantisation for a portion of the LL band. Then every root has 4 children, contrary to the Shapiro coder, in which the root in LL has only three children. The main reason for the much better performance of the Said and Pearlman coder is not the difference in coding the LL band, but rather that they code descendants immediately after coding the significant coefficients. This allows them to save bits when all descendants are insignificant. In addition to having better quality, the Said and Pearlman coder [7] is also much faster than not only the original Shapiro coder but also the recent Shapiro coder, which employs a technique [11] that builds a zerotree map before encoding. Further speed-up, important for video coding with wavelet-coded key frames, can be achieved with much shorter all-pass filters [12]. There exists a possibility of improving the Said and Pearlman coder by using conditioning rather than prediction for the coding, similarly to what has been done by Algazi et al. [13] with regard to the Shapiro coder [1]. For small (CIF and QCIF size) images, outside the realm of wavelet packets only the vector wavelet transform [14] with lattice vector quantization is potentially more promising than the Said and Pearlman coder, since zerotree quantization loses its effectiveness for a small number of decompositions. This work has been supported by the Polish Scientific Committee (KBN) grant 8 T11E 035 10. C.
Jedrzejek also acknowledges the support of the USAF Rome Laboratory (AFSC) Collaboration and Interactive Visualisation grant F30602-95-C-0273-01.
References
[1] J.M. Shapiro, "Embedded image coding using zerotrees of wavelet coefficients", IEEE Trans. Signal Processing, vol. 41, no. 12, pp. 3445-3462, Dec. 1993.
[2] M. Antonini, M. Barlaud, P. Mathieu and I. Daubechies, "Image Coding Using Wavelet Transform", IEEE Transactions on Image Processing, vol. 1, no. 2, pp. 205-220, April 1992.
[3] S. A. Martucci and I. Sodagar, "Zerotree entropy coding of wavelet coefficients for very low bit rate coding", IEEE Conf. on Image Proc. ICIP-96, Lausanne, 1996 (to appear).
[4] S. Venkataraman and B.C. Levy, "Nonseparable orthogonal linear phase perfect reconstruction filter banks and their application to image compression", Proc. IEEE Conf. on Image Proc. ICIP-94, Austin, TX, vol. 3, pp. 334-338, 1994.
[5] M. Barlaud, P. Sole, T. Gaidon, M. Antonini, and P. Mathieu, "Pyramidal lattice vector quantization for multiscale image coding", IEEE Trans. on Image Processing, vol. 3, no. 4, pp. 367, July 1994.
[6] I. Jee and R. A. Haddad, "Modeling and design of multidimensional vector-quantized M-channel subband codecs", Proc. IEEE Conf. on Image Proc. ICIP-95, Washington, vol. 3, pp. 85-88, 1995.
[7] A. Said and W.A. Pearlman, "A new fast and efficient image codec based on set partitioning in hierarchical trees", IEEE Trans. Circ. and Syst. Video Tech., vol. 6, pp. 243-250, June 1996.
[8] A. Mazzari and R. Leonardi, "Perceptual embedded image coding using wavelet transforms", Proceedings ICIP-95, vol. 1, pp. 586-589, Washington, Oct. 1995.
[9] E. H. Adelson and E. Simoncelli, "Orthogonal pyramidal transforms for image coding", SPIE, vol. 845, pp. 50-58, 1987.
[10] J.P. Andrew, P.O. Ogunbona, and F.J. Paoloni, "Coding gain and spatial localisation properties of discrete wavelet transform filters for image coding", IEE Proc.-Vis. Image Signal Process., vol. 142, no. 3, pp. 133-139, June 1995.
[11] J.M. Shapiro, "Techniques for fast implementation of the Embedded Zerotree Wavelet (EZW) algorithm", Proc. ICASSP'96, vol. 3, pp. 1455-1458, Atlanta, GA, May 1996.
[12] C.D. Creusere and S.K. Mitra, "Image coding using wavelets based on perfect reconstruction IIR filter banks", IEEE Trans. Circ. and Syst. Video Tech. (to appear).
[13] V. Ralph Algazi and R. R. Estes, Jr., "Analysis based coding of image transform and subband coefficients", IEEE Trans. Circ. and Syst. Video Tech. (to appear).
[14] W. Li and Y.-Q. Zhang, "A study of wavelet transform coding of subband-decomposed images", IEEE Trans. Circ. and Syst. Video Tech., vol. 4, no. 4, pp. 383-391, 1994.
Efficient 3-D Subband Coding of Colour Video
Marek Domanski and Roger Swierczynski
Poznan University of Technology, Institute of Electronics and Telecommunication, Poznan, Poland
ABSTRACT
This paper presents a simple and fast technique for coding video sequences at bit rates of about 150 kbps. The technique is intended for low-priced video applications. It consists of two basic elements: the well-known 3-D subband analysis and synthesis, and a novel subband coding technique. For the inter-frame (temporal) subband analysis and synthesis the simple Haar wavelets are good enough. On the other hand, for the intra-frame (spatial) analysis and synthesis we use highly efficient recursive filter banks. The proposed subband coding technique uses the base subband (low-low frequency) of the low temporal-frequency subsequence in a simple detector of the moving areas of the scene. Information in the other subbands is coded only in the areas related to significant movements. The most important feature of the proposed technique is its simplicity. We need approximately 1.5 seconds per input QCIF frame on a PC DX2 66 MHz machine using non-optimised software.
1. INTRODUCTION
The rapid development of image communication and multimedia technology has stimulated great interest in video coding for low-bitrate channels, i.e., for channels of about 100 kbps. Some multimedia applications need simple techniques which can be implemented in low-cost hardware. As highly efficient modern video data compression techniques usually need large computational power, this paper deals with a relatively simple and efficient technique which is very easy to implement on cheap hardware. We accept that the relation between compression ratio and reconstructed image quality may be worse than with more sophisticated techniques.
In order to avoid the problems faced by the application of the hitherto popular block-based DCT techniques, techniques which are free of blocking effects are being examined extensively. Among them, the object-based methods [3] as well as the methods based on extensions of the well-known image subband coding (SBC) (e.g., [4,5]) are of particular interest for video coding. The first approach to subband coding of video consists in replacing the DCT block-based coding inside the prediction loop that implements interframe prediction with motion compensation [4,6,8]. The prediction error is encoded using subband decomposition. A disadvantage of this approach is that the regions of significant prediction error tend to be unnecessarily extended due to the downsampling [9]. Another approach, recently more and more popular, consists in the application of three-dimensional (3-D) subband coding, where the three-dimensional spatio-temporal frequency band of the input video sequence is split into low- and high-frequency bands in the temporal frequency domain, and then each of these channels is split into subbands in the domain of the spatial frequencies [10-14]. The proposals combine 3-D SBC with classic motion-compensated prediction [13,14] or geometrical vector quantization [11,12]. The paper [10] describes encoding of the channels obtained by a 3-D subband analysis using a hybrid technique where the base temporal subband is coded with a DCT block-based coder while vector quantization is applied in the other subbands. 2. SUBBAND ANALYSIS AND SYNTHESIS In this paper, we use the 3-D subband coding approach based on the application of spatial recursive filter banks in reversive arrangements [2,4,15] and short temporal FIR filters. The input colour video sequence is processed componentwise, i.e., the luminance as well as the chrominance components are processed independently.
The chrominance components (usually denoted as U and V) are decimated prior to coding. Each component is analysed by temporal, horizontal and vertical filter banks. In contrast to the papers [10-14], for the spatial analysis, both horizontal and vertical, separable recursive half-band filters with polyphase implementations are used because of their simplicity [1,2,4,15]. Elliptic 5th-order bireciprocal (power-symmetric) filters with a minimum stopband attenuation of 43 dB are found to be close to optimum for spatial analysis [2]. Nevertheless, in the temporal analysis, we follow the suggestions of [11,13], using very simple linear-phase two-tap Haar wavelets:
H(z) = 0.5(1 ± z^-1),
(1)
where "+" and "-" correspond to the low- and high-pass filters, respectively. This filter bank has a very simple implementation and exhibits a small group delay (half of the sampling period), resulting in small system response times, which are very critical for videophone and videoconferencing applications. It enables perfect reconstruction in the very practical DFT arrangement.
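The Haar pair of eq. (1), applied pixelwise to two consecutive frames, gives a two-line analysis/synthesis sketch (our own helpers, with frames as flat pixel lists):

```python
def haar_analysis(frame_a, frame_b):
    """Temporal Haar filters of eq. (1): low = 0.5*(a + b), high = 0.5*(a - b)."""
    low = [0.5 * (a + b) for a, b in zip(frame_a, frame_b)]
    high = [0.5 * (a - b) for a, b in zip(frame_a, frame_b)]
    return low, high

def haar_synthesis(low, high):
    """Perfect reconstruction: a = low + high, b = low - high."""
    a = [l + h for l, h in zip(low, high)]
    b = [l - h for l, h in zip(low, high)]
    return a, b
```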
3. SUBBAND CODING TECHNIQUE
The basic idea of the proposed technique for encoding channel information is very simple (see Fig. 1). We exploit the lower spatial resolution of the human visual system for the high temporal-frequency channel. Therefore the high temporal-frequency subbands are decimated, i.e., only spatial subband 0 is encoded and transmitted. However, four spatial subbands are encoded and transmitted in the low temporal-frequency subband. Encoding of subband 0 (low spatial frequencies) from the high temporal-frequency channel as well as subbands 1-3 from the low temporal-frequency channel is controlled by the signal DL. The signal DL is the difference of the two consecutive frames from subband 0 in the low temporal-frequency channel; i.e., in order to avoid the "dirty window" effect, a frame of the DL signal is created from the information originating from four consecutive frames (see Fig. 1). Moreover, the signal DL is quantized using a quantizer with a dead zone. The width of the dead zone is set as a parameter, usually small (typically equal to 1 or 2). A pixel is active if the respective value of DL is positive. Only active pixels from the four above-mentioned channels are transmitted. The other pixels from these channels are reconstructed in the decoder from the previous frames. In contrast to these four channels, subband 0 from the low temporal-frequency subband is reconstructed by differential decoding, that is, by adding the samples of the signal DL to the samples of the previous frame. At the output of the coder, Huffman coding is applied to all the signals. The compression gain can be increased by increasing the number of subbands in the spatial domain.
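The DL-based movement detector described above can be sketched as follows (our own reading: a pixel counts as active when its dead-zone-quantized DL value is non-zero; the function names are ours):

```python
def dead_zone_quantize(d, dead_zone):
    """Dead-zone quantizer sketch: values within +/-dead_zone map to 0."""
    if abs(d) <= dead_zone:
        return 0
    return d - dead_zone if d > 0 else d + dead_zone

def active_mask(prev_ll, curr_ll, dead_zone=1):
    """DL is the difference of two consecutive frames of spatial subband 0 in
    the low temporal-frequency channel; mark pixels with non-zero quantized
    DL as active (to be transmitted)."""
    return [dead_zone_quantize(c - p, dead_zone) != 0
            for p, c in zip(prev_ll, curr_ll)]
```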
Fig. 1. Channel encoding principle.
4. EXPERIMENTAL RESULTS
The goal of the experiments is to verify the proposed coding technique and the used combination of nonlinear-phase spatial filters with temporal short FIR filters. The technique has been examined using standard videophone test sequences like "Salesman" and "Miss America". In this paper we present only results for the luminance channel; the two chrominance channels are coded independently in the same way. Their bit rates do not exceed 10% of the luminance bit rate for each component, when we tolerate small colour distortions. The input luminance sequences are in the CIF format, i.e., 288 lines and 352 columns with 10 frames per second. At the first step, the sequence is decimated horizontally and vertically to the size of 144 x 176. Then the above-described technique is applied. The signal is reconstructed using the same filter banks used for the analysis. Nevertheless, the spatial filters process the data in the opposite direction to that used in the coder. Therefore the spatial phase shifts are compensated and the nonlinear-phase characteristics of the spatial recursive filters have no influence on the reconstructed sequence [1,2,15]. The reconstructed sequence is interpolated to the CIF format for visualisation. The obtained bit rates are 145 kbps for the sequence "Miss America" and 167 kbps for "Salesman", as shown in figures 2 and 3, and tables 1 and 2. (Please remember that one has only 5 frames per second for the LT and HT sequences, if the input rate is 10 fps.) The subjective quality is satisfactory, as shown in figure 4 (at the end of this paper), which is obtained for not the easiest frames from the relatively difficult sequence "Salesman".
The experimental results show that application of the colour transformation from the YUV to the Lab space slightly improves the coding efficiency.
Fig 2. Bytes used to encode the individual frames after temporal analysis, the sequence: "Miss America"; LT and HT denote the low and high temporal frequency channel, respectively.

Table 1. Bitrates (in bytes per frame) for the sequence "Miss America".

Frame:  2     3     4     5     6     7     8     9     10    11    12    13    14    15    16    17    18    19
Total:  3195  2081  3475  2374  2705  2417  3043  3004  3414  3956  3830  4991  5550  5233  4437  4564  4445  4357
LT:     2761  1699  2935  2007  2230  1915  2540  2456  2698  2989  3114  3654  4118  3953  3403  3437  3423  3399
HT:     434   382   540   367   475   502   367   548   716   967   716   1337  1432  1280  1034  1127  1022  958
Fig 3. Bytes used to encode the individual frames after temporal analysis, the sequence: "Salesman".

Table 2. Exact bitrates (in bytes per frame) for the sequence "Salesman".

Frame:  2     3     4     5     6     7     8     9     10    11    12    13    14    15    16    17    18    19
Total:  4861  5998  5042  2996  4301  3723  4029  4061  4656  5716  5022  3226  1977  2531  2023  2724  3779  5017
LT:     3841  4844  4480  2565  3651  3218  3369  3559  3279  4662  4346  2940  1743  2241  1782  2371  3157  4011
HT:     1020  1154  562   431   650   505   660   502   927   1055  676   286   234   290   241   353   622   1006
5. CONCLUSIONS AND FUTURE AIMS
A very fast technique has been proposed. The average encoding and decoding time for a QCIF frame is about 1.5 seconds on a PC 486 DX2 66 MHz machine. The high processing speed has been obtained by using recursive filters for spatial analysis as well as a simple subband encoding method. The most important future work is increasing the quality-compression ratio. In order to achieve this, we employ switchable recursive filter banks together with the switching strategy described in [7]. This will improve the subjective quality of the decoded image at the same bit rate. The corresponding results, as well as this technique, will be presented at the conference.
Fig. 4. Two consecutive frames from the reconstructed sequence.

REFERENCES
[1] M. Domański, R. Świerczyński, Subband coding of images using hierarchical quantization, Signal Proc. VII, 1994, pp. 1218-1221.
[2] M. Domański, R. Świerczyński, Design of nonlinear-phase filter banks for subband coding of images, Proc. IEEE Int. Conf. Image Proc., Austin TX, 1994, pp. 893-897.
[3] H.G. Musmann, Object-based analysis-synthesis coding, Proc. Int. Symp. Circuits Syst., IEEE 1994.
[4] J.W. Woods (ed.), Subband image coding, Kluwer 1991.
[5] M. Domański, R. Świerczyński, Subband coding in the YUV space, XVIIth Nat. Conf. Circuit Theory and Electronic Networks, Wrocław - Polanica Zdrój, 1994, pp. 221-226.
[6] P.H. Westerink, Subband coding of images, PhD thesis, Delft Univ. of Technology, 1989.
[7] S. Aase, Image subband coding artifacts: analysis and remedies, PhD Thesis, Trondheim 1993.
[8] D. Qian, A motion compensated subband coder for very low bit rates, Image Communication, 1995.
[9] H.G. Musmann, private communication, 1995.
[10] K. Ngan, W. Chooi, Very low bit rate video coding using 3D subband approach, IEEE Trans. Circ. Syst. Video Techn., vol. 4, 1994, pp. 309-316.
[11] C. Podilchuk et al., Three-dimensional subband coding of video, IEEE Trans. Image Proc., vol. 4, 1995, pp. 125-139.
[12] C. Podilchuk, Low bit rate subband video coding, Proc. IEEE Int. Conf. Image Proc., Austin TX, 1994, pp. III 280-284.
[13] J. Ohm, Three-dimensional subband coding with motion compensation, IEEE Trans. Image Proc., vol. 3, 1994, pp. 559-571.
[14] D. Taubman, A. Zakhor, Multirate 3-D subband coding of video, IEEE Trans. Image Proc., vol. 3, 1994, pp. 572-588.
[15] M. Bleja, M. Domański, Image data compression using subband coding, Annales des Telecommunications, vol. 45, 1990, pp. 477-486.
This work has been supported by KBN grant no. 8S504 002 06. Roger Swierczynski is a fellow of The Foundation for Polish Science.
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
Adaptive Wavelet Packet Image Coding with Zerotree Structure
Tsuyoshi Otake, Kouichi Fukuda and Akira Kawanaka
Faculty of Science and Technology, Sophia University, Japan

Abstract
In this paper, we propose an image compression method combining a wavelet packet and a zerotree coding scheme. The wavelet packet is employed to reduce the correlation of the image pixels. To encode the coefficients, we apply the zerotree structure, which can reduce the correlation across the decomposed component images. In this case, the decomposition criterion for each decomposition level is computed by a simplified local zerotree structure. If a target bit rate or distortion is given, we can obtain the most suitable decompositions with less computation. In the simulation, the proposed method shows a coding efficiency better than JPEG by about 3 dB and better than the octave decomposition by about 1 dB in PSNR at 0.4 bpp with the standard image "Barbara."
I Introduction
Image compression is essential for applications such as transmission and storage in databases. The wavelet transform, which makes possible a multiresolution analysis, is employed in order to suppress the blocking effects which appear in image coding schemes using a block transform such as the DCT. Ordinarily, the decomposition has been applied only to the lower frequency component image recursively. However, several studies suggest that the decomposition of the high frequency components is effective for images containing many sharp edges. On the other hand, it causes a loss of coding efficiency for some images. Therefore, an adaptive decomposition of the high frequency bands is required. The wavelet packet [1][2] is one such adaptive method, introduced as a generalized wavelet decomposition. In this paper, we show that the suitable decompositions are determined by the coding gain, simply calculated by means of the variances of the decomposed frequency components and the coefficients of the wavelet basis. To encode the adaptive wavelet packet coefficients, we apply the zerotree structure [3], which can reduce the correlation across the decomposed frequency components by exploiting the self-similarity in the image. In this case, when a lower bit rate is desired, it is not necessary to decompose a component image whose coefficients are all smaller than the smallest threshold of the zerotree structure. We accordingly derive a decomposition criterion corresponding to the desired coding rate or distortion. In this method, the total data amount and the reconstruction distortion for each component image are computed by a simplified local zerotree structure. If a target bit rate or a target distortion is given, we can obtain suitable decompositions which require less computation in the reconstruction process.
II Adaptive Wavelet Packet
In this section, we consider the frequency decomposition which gives the higher coding efficiency. For example, the short time Fourier transform (STFT) gives an unchanged tiling over the whole space and frequency domain. The wavelet octave decomposition provides a fine signal representation localized in both the space and frequency domains; however, it causes a loss of coding efficiency for images containing many sharp edges. This can be suppressed by the wavelet packet algorithm introduced in [1][2] as a generalized wavelet decomposition. The wavelet packet decomposes an image adaptively. An example of a two-dimensional frequency decomposition is shown in Fig. 1. To obtain the coding efficiency of the frequency decomposition, we introduce the coding gain function. The coding gain G is defined as

G = 10 \log_{10} \prod_{k=0}^{K-1} \left( \frac{1}{A_k B_k} \right)^{\alpha_k} = \sum_{k=0}^{K-1} G_k, \qquad G_k = -10\,\alpha_k \log_{10}(A_k B_k),

where A_k is the ratio of the variance of the k-th decomposed component image to that of the original image, B_k is the ratio of the variance of the quantization error for each component image to that of the reconstruction error, \alpha_k is the ratio of the number of pixels in each component image to that in the original image, and K is the number of decomposed component images. The decomposition criterion on each decomposition level is whether the higher coding gain is obtained or not. This coding gain criterion can be applied to biorthogonal filters, which have a linear phase property.
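The coding gain above can be computed directly from the per-band ratios. The sketch below is illustrative only; the toy values of A_k, B_k and alpha_k are not from the paper, and the reconstructed form of the formula is assumed.

```python
import math

def coding_gain(A, B, alpha):
    """Coding gain sketch: G = sum_k G_k with
    G_k = -10 * alpha_k * log10(A_k * B_k)  (assumed reconstruction)."""
    Gk = [-10.0 * a * math.log10(Ak * Bk) for Ak, Bk, a in zip(A, B, alpha)]
    return sum(Gk), Gk

# Toy numbers (hypothetical): four equal-size subbands, most of the
# energy concentrated in the first band, no extra quantization penalty.
G, Gk = coding_gain(A=[0.7, 0.2, 0.06, 0.04],
                    B=[1.0, 1.0, 1.0, 1.0],
                    alpha=[0.25] * 4)
```

A decomposition step is accepted when it raises G; a band whose A_k B_k product exceeds its size share contributes a negative G_k.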
Fig. 1: 2-D frequency decomposition.
III Zerotree Coding of Wavelet Packet Coefficients
In this section, we consider the entropy coding of the wavelet packet coefficients. The wavelet packet coefficients, decomposed at arbitrarily fine space and frequency resolutions, have less correlation between neighboring pixels; however, correlation across the component images remains. To remove this redundancy, we apply the zerotree structure [3] to the entropy coding. Every wavelet coefficient at a given scale can be related to a set of coefficients at the next finer scale of similar orientation. The coefficient at the coarse scale is called the parent, and all coefficients corresponding to the same spatial location at the next finer scale of similar orientation are called children. The parent-children relation is shown in Fig. 2 (a). All parents have four children except for the lowest frequency band, where each parent has three children. In the adaptive wavelet packet case, this parent-children relationship is shown in Fig. 2 (b). Given a threshold T, a coefficient x is called a zerotree root if it and all of its descendants are smaller than T, in which case they are called insignificant. The significance map can then be represented as a string of symbols from a 4-symbol alphabet. The four symbols used are zerotree root (ZTR), isolated zero (IZ), which means that the coefficient is insignificant but has some significant descendants, positive significant (POS) and negative significant (NEG). Each time a coefficient is encoded as significant, its magnitude is appended to the subordinate list. This quantization using the threshold T is called successive approximation quantization (SAQ). The SAQ sequentially applies a sequence of thresholds T_0, ..., T_{N-1} to determine significance, where the thresholds are chosen so that T_i = T_{i-1}/2. The initial threshold T_0 is chosen so that |x_j| < 2T_0 for all coefficients x_j.
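The threshold selection and the 4-symbol significance classification described above can be sketched as follows. This is a simplified, hypothetical sketch: it ignores the scanning order and the subordinate (refinement) pass of the full zerotree coder.

```python
def initial_threshold(coeffs):
    """Pick T0 as the largest power of two with |x| < 2*T0 for all x
    (one common convention; assumed here)."""
    m = max(abs(c) for c in coeffs)
    T0 = 1
    while 2 * T0 <= m:
        T0 *= 2
    return T0

def classify(x, descendants, T):
    """4-symbol significance map of the zerotree scheme (sketch)."""
    if abs(x) >= T:
        return "POS" if x >= 0 else "NEG"
    # insignificant coefficient: ZTR if every descendant is also
    # insignificant, otherwise an isolated zero (IZ)
    if all(abs(d) < T for d in descendants):
        return "ZTR"
    return "IZ"
```

Successive passes would then halve the threshold (T_i = T_{i-1}/2) and re-classify the coefficients not yet found significant.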
Fig.2: Parent-child dependencies.
IV Adaptive Decomposition with Local Zerotree Structure
To obtain a new decomposition criterion adequate for the zerotree entropy coding, we introduce the simplified local zerotree structure. When the coefficients which have the same parent are all insignificant, it is not necessary to generate their symbols, since their ancestor becomes a ZTR symbol. The local zerotree symbols are as follows.
POS  if |x| >= T_i and x >= 0,
NEG  if |x| >= T_i and x < 0,
ZERO if |x| < T_i and, among the coefficients which have the same parent, there is at least one significant coefficient,

where x is the value of the input coefficient and T_i is a given threshold. To generate these local zerotree symbols, the SAQ is applied. The flow chart for encoding a coefficient of the local zerotree is shown in Fig. 3. The data amount of the symbols R is simply calculated as

R = \sum_{i=0}^{N-1} \big\{ -N_{POS}(T_i)\log_2 P_{POS}(T_i) - N_{NEG}(T_i)\log_2 P_{NEG}(T_i) - N_{ZERO}(T_i)\log_2 P_{ZERO}(T_i) + \text{additional data of nonzero values} \big\},

where N(T_i) and P(T_i) denote the number of each symbol, indicated by its subscript, and its probability. The first three terms in the sum represent the data amount of the significance map and the fourth term represents the data amount of the nonzero values. Then, the cost of the symbols can be expressed as

Cost = \log D + \lambda R,

where D is the distortion of the reconstructed image with regard to the threshold T_i and \lambda is the proportional multiplier. The decomposition criterion is whether the lower cost is obtained by the decomposition or not.
Fig. 3: Flow chart for local zerotree coding.
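The entropy part of the rate estimate R can be sketched as below. This is illustrative only: it measures the significance-map bits from the empirical symbol probabilities at each threshold pass and omits the additional magnitude data of the nonzero values.

```python
import math
from collections import Counter

def symbol_rate(symbol_stream_per_threshold):
    """Entropy estimate of the significance map across SAQ passes:
    bits = sum over passes i and symbols s of -N_s(T_i)*log2 P_s(T_i).
    (Sketch: the magnitude data term of R is left out.)"""
    bits = 0.0
    for symbols in symbol_stream_per_threshold:   # one list per T_i
        counts = Counter(symbols)
        total = len(symbols)
        for n in counts.values():
            p = n / total
            bits += -n * math.log2(p)
    return bits

# Toy single pass (hypothetical data): mostly ZERO symbols, so the
# significance map compresses well.
R = symbol_rate([["ZERO"] * 6 + ["POS", "NEG"]])
```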
V Simulation Results
The proposed coding scheme was applied to the standard black and white 8 bpp test image "Barbara" (512x512 pixels). Two-dimensional separable length-9 quadrature mirror filters (QMF) [4] were used. The performance of the proposed method was compared with the wavelet octave method and with JPEG. First, we decide the depth of the octave decomposition adapted to the characteristics of the image. The depth criterion for the desired coding rate or distortion is considered at each depth of the octave decomposition. The relation between the rate and the distortion at each depth is shown in Fig. 4. It is clear that the depth-4 octave decomposition is enough to encode the image "Barbara." Second, the higher frequency components are decomposed adaptively and we obtain the zerotree symbols. Finally, all strings of symbols are entropy coded using an adaptive arithmetic coder. Fig. 5 shows the frequency tilings of the proposed method. It can be seen that the high frequency bands are decomposed adaptively and the decomposition characteristics change with the target bit rate or distortion. In these simulations, the depth of the zerotree structure is fixed at 6, and a parent pixel and its children pixels are related in the same way as in the depth-6 octave decomposition. The coding efficiency was evaluated in terms of PSNR versus the average bpp, as shown in Fig. 6. As can be seen, the proposed methods perform better than JPEG by about 3 dB in PSNR. They also have a coding efficiency better than the wavelet octave case by about 1 dB. Fig. 7 shows parts of the reconstructed images. "Barbara" was encoded using the proposed method at 0.21 bpp with 27.65 dB in PSNR. The wavelet octave method was then applied to "Barbara" at 0.21 bpp, and the resulting PSNR was 26.30 dB. The reconstructed image of the proposed method has a better PSNR at a lower bit rate than that of the wavelet octave method. The high frequency stripes of "Barbara" are improved in the local zerotree case.
Obviously, there are noticeable blocking artifacts in the JPEG version.
Fig. 4: Rate-Distortion curves for each depth of the octave decomposition.
Fig. 6: Simulation results for various image coding schemes.
Fig. 5: Frequency tiling of "Barbara."
Fig. 7: Parts of the reconstructed images.
VI Conclusion
An image coding scheme combining the adaptive wavelet packet with the zerotree structure was described. The wavelet packet localizes most of the energy, and the zerotree coding reduces the correlation among the decomposed component images which appears as self-similarity in the image. If a target bit rate or a target distortion is given, the proposed method achieves a better coding efficiency with less computation in the reconstruction process.
References
[1] R. Coifman, Y. Meyer, S. Quake and V. Wickerhauser, "Signal processing and compression with wave packets," Numerical Algorithms Research Group, New Haven, 1990.
[2] K. Ramchandran and M. Vetterli, "Best Wavelet Packet Bases in a Rate-Distortion Sense," IEEE Trans. Image Processing, vol. 2, no. 2, pp. 160-175, April 1993.
[3] J. M. Shapiro, "Embedded Image Coding Using Zerotrees of Wavelet Coefficients," IEEE Trans. Signal Processing, vol. 41, no. 12, pp. 3445-3462, Dec. 1993.
[4] E. H. Adelson and E. Simoncelli, "Orthogonal pyramid transforms for image coding," SPIE Conf. Visual Commun., Image Process., vol. 845, pp. 50-58, 1987.
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K.
B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
EFFICIENCY OF THE IMAGE MORPHOLOGICAL PYRAMID DECOMPOSITION
Dragana Sandic, Institute for Telecommunications and Electronics IRITEL, Beograd, Yugoslavia
Dragorad Milovanovic, Branimir Reljin, Faculty of Electrical Engineering, Beograd, Yugoslavia

Abstract - In order to measure the objective efficiency of image decomposition by the morphological pyramid, coding and entropy gains are computed. Coding gain is a measure of the energy compaction of the image in the frequency domain, which is a basic condition for efficient compression. Also, a decomposition will be considered more effective if the entropy gain, as a measure of entropy reduction, is higher while maintaining the same fidelity measure in the compressed data. To allow comparison with linear filters, coding and entropy gains for linear wavelet filters are also shown. Simulation results show that the energy compaction measured by the coding gain is somewhat lower using morphological filters, while the bit rate reduction measured by the entropy gain is higher. The choice of the optimal decomposition method also depends on the optimal coding of the subbands and on subjective image quality.
1. INTRODUCTION
The main goal of image compression is to reduce the number of bits for image storage and transmission while preserving subjective image quality. Image data compression methods take advantage of the intrinsic features of images, as well as of their relation to the final human observer, to eliminate redundancy. Digital compression algorithms include a discrete image transformation, quantization and entropy coding. The key to efficient data compression is the image decomposition. Multiresolution image decomposition schemes typically apply linear filters to generate a sequence of subimages with progressively decreasing resolution. Then the subband images can be ranked and processed independently. When linear filters are not efficient enough, the alternative is nonlinear filters. Among the nonlinear filter classes, the class of morphological filters is becoming increasingly popular. The complexity of morphological filter design and implementation is low. This paper describes a subband image decomposition method for monochrome images using morphological filters. In the first part, mathematical morphology operations are reviewed. The pyramidal image decomposition method is described in the second part. The third part contains details about the decomposition efficiency measures. Simulation results are presented in the fourth part and the last part is a summary.
2. MORPHOLOGICAL FILTERS
The foundation of morphology is set theory. Mathematical morphology represents signals as sets, and a morphological operation consists of a set transformation which transforms one set into another [1]. Mathematical morphology examines the geometrical structure of an image by probing its microstructure with certain elementary forms, so-called structuring elements (SE). Both binary and gray-level images can be processed effectively by morphological operations. Let the image X(x) be represented as a function of the coordinates x and let B be a structuring element.
The analytical definitions of the four basic operations (dilation, erosion, opening and closing) [2] are, respectively:

(X \oplus B)(x) = \max_{b \in B} [X(x-b) + B(b)]    (1)
(X \ominus B)(x) = \min_{b \in B} [X(x+b) - B(b)]    (2)
X \circ B = (X \ominus B) \oplus B    (3)
X \bullet B = (X \oplus B) \ominus B    (4)
3. MORPHOLOGICAL PYRAMID IMAGE DECOMPOSITION
The pyramidal image structure has been commonly used for image coding, computer vision applications and progressive image transmission. The pyramid approach is attractive due to its low computational complexity and simple parallel implementation. The advantages of morphological filters are their ability to preserve geometric structure, their direct geometric interpretation, and their simplicity and efficiency in hardware implementation [1,2,3]. An efficient pyramidal image decomposition method based on morphological filters is analyzed. The original image is decomposed into four subband images. Each subband is downsampled, so that the sum of the pixels in the subbands equals the number of pixels in the original image. In a tree-structured decomposition each low-pass subband is decomposed further, whereas the high-pass subbands are not processed any further. If the sequence of images is viewed as a multilevel data structure with the original image at the bottom level and the lowest-resolution image at the top level, the structure resembles a pyramid [4,5].
The pyramidal image structure provides a means for processing images at multiple resolutions. The pyramid representation of an image consists of a sequence of images of decreasing spatial resolution derived from the original image [3]. The subband decomposition is performed by a morphological filter bank [4,5]. The analysis and synthesis filter banks consist of the one-dimensional filters, respectively:

low-pass:  H_0(X) = closing[opening(X)],  F_0(Y) = dilation(Y),
high-pass: H_1(X) = X - closing[opening(X)]    (5),    F_1(Y) = Y - dilation(Y)    (6)

Filter banks for 2-D analysis and synthesis in a four-subband splitting and reconstruction are designed as a separable product of the above 1-D horizontal and vertical morphological filters (Fig. 1). The 2-D morphological analysis and synthesis filters are, respectively:

H_{00} = H_0^v[H_0^h(X)],  H_{01} = H_0^v[H_1^h(X)],  H_{10} = H_1^v[H_0^h(X)],  H_{11} = H_1^v[H_1^h(X)]    (7)
F_{00} = F_0^v[F_0^h(Y)],  F_{01} = F_0^v[F_1^h(Y)],  F_{10} = F_1^v[F_0^h(Y)],  F_{11} = F_1^v[F_1^h(Y)]    (8)

where H_i^h, F_i^h and H_i^v, F_i^v for i = 0,1 are the 1-D horizontal and vertical low- and high-pass filters, respectively.

4. DECOMPOSITION EFFICIENCY
In order to compare the efficiency of morphological and linear filtering, the quantitative measures coding gain and entropy gain are analyzed [6,7]. A decomposition will be considered more effective if the coding and entropy gains are higher while maintaining the same compression ratio and fidelity measure (MSE) in the compressed data, respectively.

4.1 Coding gain
Coding gain (CG) is the factor by which the MSE reconstruction error is decreased by applying separate coding to the subband images instead of fullband coding, at equal bit rates. For M subbands, a fixed total number of bits and optimal bit allocation, assuming equal probability density functions of the subbands and the original image, CG is:

CG = \frac{\frac{1}{M}\sum_{k=1}^{M}\sigma_k^2}{\prod_{k=1}^{M}(\sigma_k^2)^{r_k}}, \qquad \sigma_k^2 = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}(x_{ij}-\bar{x}_k)^2    (9)

where r_k is the ratio of the number of samples in the k-th subband to the number of fullband image samples, \sigma_k^2 is the k-th subband variance, \bar{x}_k is the mean of all pixels in the k-th subband, and mn is the total number of pixels in that subband. The maximal value of CG is obtained when the energy distribution among the subbands differs the most. A good filter has a high coding gain at each stage [6]. The compression efficiency increases with a higher decomposition stage, but the growth of CG decreases with the stage order, and after a certain stage it is not worth decomposing any further.

4.2 Entropy gain
Entropy gain is a measure of the bit rate reduction after image decomposition and compression. The main theoretical result is that the zero-order entropy of the subband images is less than the zero-order entropy of the fullband image. Therefore, a decomposition will be considered more effective for a given image if the zero-order entropy is lower while maintaining the same MSE measure in the coded image [7]. The entropy gain EG is the difference between the entropy of the fullband image and the mean entropy of the subband images:

EG = H_{orig} - H_J,
H_J = \frac{1}{M}\sum_{k=1}^{M} H_k, \qquad H_k = \sum_i (-p_i)\log_2(p_i), \qquad p_i = \frac{n_i}{N}    (10)

where H_{orig} is the fullband image entropy; H_J is the mean entropy of the subbands; H_k is the entropy of the k-th subband; p_i is the probability density of gray level i, n_i is the number of pixels with value i, and N is the total number of pixels.
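The two efficiency measures can be sketched as follows. This is an illustrative reconstruction of eqs. (9) and (10): the unweighted arithmetic mean in the CG numerator and in the mean subband entropy are assumptions, and the toy inputs are hypothetical.

```python
import numpy as np

def coding_gain(subbands):
    """CG sketch of eq. (9): arithmetic mean of subband variances over
    their size-weighted geometric mean (assumed reconstruction)."""
    total = sum(s.size for s in subbands)
    var = np.array([s.var() for s in subbands])          # sigma_k^2
    r = np.array([s.size / total for s in subbands])     # sample ratios r_k
    return var.mean() / np.prod(var ** r)

def entropy(img, levels=256):
    """Zero-order entropy: H = sum_i -p_i * log2(p_i) over gray levels."""
    p = np.bincount(img.ravel(), minlength=levels) / img.size
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def entropy_gain(fullband, subbands):
    """EG sketch of eq. (10): fullband entropy minus mean subband entropy."""
    return entropy(fullband) - np.mean([entropy(s) for s in subbands])

# Toy example: a two-level image split into two constant subbands,
# each of which has zero entropy.
img = np.array([[0, 1], [0, 1]], dtype=np.int64)
subs = [np.zeros((1, 2), dtype=np.int64), np.ones((1, 2), dtype=np.int64)]
EG = entropy_gain(img, subs)
```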
5. SIMULATION RESULTS
For comparison, the test image is processed both with morphological and with linear filters. The monochrome test image "Lena" (256x256 pixels with 256 levels) is processed. The simulation is performed in Matlab for Windows. At the first decomposition stage the image is decomposed by morphological filters into four subbands according to equations (7) and the system diagram in Fig. 2. The decomposed subimages using a structuring element of size 3 are shown in Fig. 3(1). The lowest subband analysis image X00(m,n) contains the most information and gives the general brightness of the picture. The horizontal and vertical subband analysis images X10(m,n) and X01(m,n) have successfully extracted the vertical and horizontal sharp edges, respectively, and the diagonal subband analysis image X11(m,n) contains the diagonal edges of the image.
Fig. 1. Separable four-subband decomposition.
Fig.2 System diagram of four-band splitting using morphological analysis filter bank.
Fig. 3. Scaled decomposed subimages by (1) morphology and (2) wavelet DWT9-7 filters: (a) X00(m,n), (b) X10(m,n), (c) X01(m,n), (d) X11(m,n).
Fig. 4. (a) Coding gain CG and (b) entropy gain EG in five decomposition stages with M = 4, 7, 10, 13, 16 subbands and three sizes of structuring element SE = 3, 5, 7.
Fig. 5. (a) Coding gain CG and (b) entropy gain EG obtained with linear filters DWT 9-3, DWT 9-7, and DWT 5-7 for M = 4, 7, 16 subbands.
Among the available linear filters, wavelet filters are analysed. The Discrete Wavelet Transform is calculated using a hierarchical octave filter bank (Fig. 1) with regular 1-dimensional FIR (Finite Impulse Response) filters [10]. In our simulation biorthogonal filters are used (9 and 5 taps in the analysis stage; 3 and 7 taps in the synthesis stage) [7]. The subimages decomposed by the wavelet filters DWT9-7 are shown in Fig. 3(2). For five morphological pyramidal decomposition stages of the test image and three sizes of structuring elements (SE), the quantitative measures coding gain CG and entropy gain EG are computed according to eqs. (9) and (10) and shown in Fig. 4. The coding and entropy gains for the linear wavelet filters DWT 9-3, DWT 9-7, and DWT 5-7 are shown in Fig. 5.
6. CONCLUSIONS
The computer simulation shows that the decomposed subband images extract vertical, horizontal and diagonal edges more successfully using morphological than linear wavelet filters, but the energy compaction measured by the coding gain is somewhat lower using morphological filters. However, the entropy gain EG is somewhat higher using morphological filters, which indicates the possibility of a higher bit rate reduction. The highest values of CG and EG are obtained with a structuring element of size SE = 3. The decomposition efficiency increases with the decomposition stage. However, the number of decomposition levels is bounded since an image has finite dimensions. Increasing the number of subbands beyond sixteen does not cause a further significant increase in the decomposition efficiency. The simulation results give a comparison of the considered morphological and linear filters. Relatively high coding and entropy gains are necessary, but not sufficient, for good image coding performance. The choice of the decomposition/compression method also depends on the optimal coding of the subbands and on subjective image quality, which is a subject for further research.
REFERENCES
[1] P. Maragos, R. Schafer, "Morphological filters - Part I, II", IEEE Trans. on ASSP, Vol. 35, No. 8, August 1987, pp. 1153-1184.
[2] R.M. Haralick, S.R. Sternberg, X. Zhuang, "Image analysis using mathematical morphology", IEEE Trans. PAMI, Vol. 9, No. 4, 1987.
[3] A. Toet, "A morphological pyramidal image decomposition", Pattern Recognition Letters, No. 9, 1989, pp. 255-261.
[4] S.C. Pei, F.C. Chen, "Subband decomposition of monochrome and color images by mathematical morphology", Optical Engineering, Vol. 30, No. 7, July 1991, pp. 921-933.
[5] Z. Fazekas, K. Fazekas, "Morphological filters for image processing," Journal on Communications, Vol. 45, May-June 1994.
[6] O. Egger, W. Li, M. Kunt, "High Compression Image Coding Using an Adaptive Morphological Subband Decomposition", Proceedings of the IEEE, Vol. 83, No. 2, February 1995, pp. 272-287.
[7] D. Milovanovic, Z. Bojkovic, A. Samcovic, "On objective performance measures in image compression by subband coding", Proceedings of IEEE Workshop NSIP 1995, pp. 202-205.
[8] L. Overturf, M. Comer, E. Delp, "Color Image Coding Using Morphological Pyramid Decomposition," IEEE Trans. on Image Processing, Vol. 40, No. 2, 1995.
[9] R.J. Chen, B.C. Chieu, "Three-dimensional morphological pyramid and its application to color image sequence coding", Signal Processing, 44, 1995, pp. 163-180.
[10] J.P. Andrew, P.O. Ogunbona, F.J. Paolini, "Coding gain and spatial localization properties of discrete wavelet transform filters for image coding", IEE Proc. Vision, Image and Signal Processing, Vol. 142, No. 3, pp. 133-140, June 1995.
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
OPTIMAL VECTOR PYRAMIDAL DECOMPOSITIONS FOR THE CODING OF MULTICHANNEL IMAGES *
Dimitrios Tzovaras and Michael G. Strintzis
Information Processing Laboratory, Electrical and Computer Engineering Department, Aristotle University of Thessaloniki, Thessaloniki 540 06, Greece
Tel.: +3031.996359, Fax: +3031.996398, e-mail: tzovaras@dion.ee.auth.gr

Abstract
In the present paper we determine two families of analysis and synthesis vector filters which achieve an optimal construction of multiresolution vector sequences by minimizing the variance of the error signals between successive pyramid levels. A measure of the entropy reduction achieved by the pyramid is in this way maximized. The effect of this is to ensure that the lower-resolution image produced by the primary subband bears maximum resemblance to the input image. Furthermore, it is assumed that additive transmission noise corrupts the downsampled signal prior to the synthesis stage. It is seen that under noiseless or lossless transmission conditions, the two above families of optimal analysis and synthesis filters coincide. The results are evaluated experimentally for the vector coding of color images.
I Introduction
Subband analysis/synthesis techniques have been extensively studied for image and video coding applications [1]. According to the subband coding technique the image is decomposed by a filter bank into several sub-images in terms of different frequency bands and these sub-images are coded instead of the original image. Pyramidal image coding has also been studied [2, 3] and optimal construction of the pyramid sequence was sought by minimizing the variance of the error image for each level of the pyramid. In this way, a measure of the entropy reduction achieved by the pyramid is maximized. If the pyramid is to be used for the scalable or progressive coding of the sequence, this construction also ensures the production of a same-size copy of the signal or image which at a lower resolution bears as much resemblance to the original as possible. In a typical scalable coding application this copy may be transmitted via a slower communication channel, while the original is perfectly reconstructed from the entire pyramid. Along with scalar processing, vector processing has attracted particular interest in the signal and image processing community recently [4]. Vector transform coding techniques have recently been used for image coding applications [5] to remove the inter-vector correlation. In the present paper the results of [2, 3, 6] are first generalized and the problem of the optimal design of a vector pyramidal coding scheme is addressed. Furthermore, the results are generalized to the case where transmission noise corrupts the downsampled signal prior to the synthesis stage. In the examined scheme the analysis or the synthesis part is considered fixed (i.e. the analysis or the synthesis filters are fixed) and the statistics of the quantization part involved for transform coefficient coding are considered known. 
The problem then is to determine the optimal synthesis (analysis) vector filters that minimize the distortion due to the quantization of the pyramid vector transform coefficients. Thus, specific knowledge about the power spectra of the original signal and the quantization noise can be incorporated to optimally design the vector filter bank so as to minimize the quantization distortion.
II. Optimal Vector Pyramidal and Subband Decompositions
A multiresolution data representation consists of a sequence of linear transformations of the data with successively reduced resolution. If the vector sequence x[m] represents the original data, the construction of the multiresolution sequence begins with the computation of the predicted value u[m] of each x[m] as a local weighted average:

u[m] = \sum_{i=-N}^{N} h[i]\, x[m-i] = (h * x)[m],   (1)
where the asterisk * denotes convolution. Interpolation is then used to revert to the original image size:

v[m] = w[m] \sum_{i} h[i]\, x[m-i] = w[m]\, u[m],   (2)
*This work was supported in part by the ACTS PANORAMA project 092 and the Greek Secretariat for Research and Technology projects NIKA and IHIS.
where w[m] = (1 + (-1)^m)/2.
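As a concrete illustration, the following is a minimal NumPy sketch of one analysis level of the pyramid in the scalar-filter special case (a scalar h[i] applied per channel; the filter taps and the signal shape are illustrative placeholders, not the paper's vector filters):

```python
import numpy as np

def pyramid_level(x, h):
    """One pyramid analysis level in the scalar-filter special case.

    x : (M, C) array, each row x[m] a C-channel sample vector
    h : (2N+1,) symmetric scalar filter standing in for h[i] in eq. (1)
    Returns u (prediction), v (re-expanded signal) and the
    reduced-resolution sequence x'[m] = u[2m].
    """
    M, C = x.shape
    # u[m] = sum_i h[i] x[m-i]  -- eq. (1), applied per channel
    u = np.stack([np.convolve(x[:, c], h, mode="same") for c in range(C)],
                 axis=1)
    # w[m] = (1 + (-1)^m)/2 retains only even-indexed samples -- eq. (2)
    w = ((1 + (-1) ** np.arange(M)) // 2)[:, None]
    v = w * u
    return u, v, u[::2]
```

Repeating the last step on x'[m] = u[2m] produces the next, coarser pyramid level.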
If the pyramid is used in signal or image transmission, the downsampled signal is subject to corruption by noise.

Figure 1: The hierarchical multichannel image coding scheme.

If additive noise n[m] is assumed,

y[m] = \sum_{i=-N}^{N} g[i]\, (v[m-i] + n[m-i]).   (3)

The error image is

e[m] = x[m] - y[m],   (4)

and the total error variance is

E = \mathbf{E}\{e^T[m]\, e[m]\}.   (5)
The process is repeated for the reduction in resolution of the sequence x'[m] = u[2m]. An optimal construction of the pyramid vector sequence may be sought by minimizing, for each level of the pyramid, the variance (5) of the error image. In this way, a measure of the entropy reduction achieved by the pyramid is maximized. The vector sequence u[m] is not wide-sense stationary; however, the time averages

R_{ux}[p] = \lim_{K \to \infty} \frac{1}{2K+1} \sum_{m=-K}^{K} u[m]\, x^T[m-p], \quad
R_{v}[p] = \lim_{K \to \infty} \frac{1}{2K+1} \sum_{m=-K}^{K} v[m]\, v^T[m-p]   (6)
are seen to exist, under nonrestrictive conditions on x[.]. It can be seen that

\Phi_{vx}(z) = \frac{1}{2} H(z)\, \Phi_x(z),   (7)

\Phi_{v}(z) = \frac{1}{2} \left( H(z)\Phi_x(z)H^T(z^{-1}) + H(-z)\Phi_x(-z)H^T(-z^{-1}) \right) = \frac{1}{2} P(z).   (8)
Likewise, the output y of the interpolating filter is seen as before to possess cross- and autocorrelation functions defined as in (6). Their Z-transforms are related by

\Phi_{yx}(z) = G(z)\, \Phi_{sx}(z), \quad \Phi_{y}(z) = G(z)\, \Phi_{s}(z)\, G^T(z^{-1}).   (9)
From (5), the error variance E = \mathrm{tr}[\mathbf{E}\{e[m]\, e^T[m]\}] is found from

2\pi j\, E = \mathrm{tr}\left[ \oint \Phi_e(z)\, z^{-1}\, dz \right],   (10)
where tr[F] is the trace of the matrix F and \Phi_e(z) is the power spectrum of the error e[m]. Clearly \Phi_e(z) = \Phi_x(z) - \Phi_{xy}(z) - \Phi_{yx}(z) + \Phi_y(z), and hence from (9)-(10):

2\pi j\, E = \mathrm{tr}\left[ \oint \left( \Phi_x(z) - 2\, G(z)\, \Phi_{sx}(z) \right) z^{-1} dz \right] + \mathrm{tr}\left[ \oint G(z)\, \Phi_s(z)\, G^T(z^{-1})\, z^{-1} dz \right].   (11)

Thus, the design of the pyramidal decomposition scheme should aim at the minimization of the error variance (11).
III. Optimal FIR and IIR Vector Filters
With arbitrary given h[i], the optimum FIR filter g[i] in (3) will minimize the error variance (5) if the well-known orthogonality condition holds:

\mathbf{E}\left\{ \left( x[m] - \sum_{i=-N}^{N} g[i]\, s[m-i] \right) s^T[m-l] \right\} = 0,

for l = -N, ..., N. This implies

R_{xs}[l] = \sum_{i=-N}^{N} g[i]\, R_s[l-i], \quad l = -N, \ldots, N.
This system may be separated into two sets of equations for the identification, respectively, of the even- and odd-indexed coefficient matrices g[i]:

R_{xs}[2 l_1] = \sum_{i_1} g[2 i_1]\, R_s[2 l_1 - 2 i_1], \quad
R_{xs}[2 l_2 + 1] = \sum_{i_2} g[2 i_2 + 1]\, R_s[2 l_2 - 2 i_2],

which fully define g[i], i = -N, ..., N. The optimal IIR filters are found by direct minimization of (11). We shall consider the noiseless case first.
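In the scalar special case, the normal equations above reduce to a Toeplitz linear system that can be solved directly. A small sketch, where the correlation functions are illustrative placeholders (an AR(1)-like autocorrelation, not statistics from the paper):

```python
import numpy as np

def optimal_fir(Rxs, Rs, N):
    """Solve R_xs[l] = sum_{i=-N}^{N} g[i] R_s[l-i], l = -N..N,
    in the scalar special case of the matrix normal equations.

    Rxs, Rs : callables returning the correlations at an integer lag
    Returns g[i] as an array indexed by i = -N..N.
    """
    lags = np.arange(-N, N + 1)
    A = np.array([[Rs(l - i) for i in lags] for l in lags])
    b = np.array([Rxs(l) for l in lags])
    return np.linalg.solve(A, b)

# Self-consistency check: synthesize R_xs from a known filter g_true,
# then recover g_true from the normal equations.
Rs = lambda l: 0.5 ** abs(l)
g_true = np.array([0.1, 0.3, 0.5, 0.3, 0.1])   # i = -2..2
Rxs = lambda l: sum(g_true[i + 2] * Rs(l - i) for i in range(-2, 3))
g = optimal_fir(Rxs, Rs, 2)
```

The even/odd separation above would let the same solve be performed on two half-size systems.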
III.1 Noiseless Case

In this case s = v, and hence

\Phi_{sx}(z) = \Phi_{vx}(z) = \frac{1}{2} H(z)\, \Phi_x(z), \quad
\Phi_{s}(z) = \Phi_{v}(z) = \frac{1}{2} P(z),   (12)
with P(z) given by (8). The error variance is found to be

2\pi j\, E = \mathrm{tr}\left[ \oint \left( \Phi_x(z) - H(z)\Phi_x(z)G(z) \right) z^{-1} dz \right] + \mathrm{tr}\left[ \oint H(z)\, \Phi_x(z)\, H^T(z^{-1})\, Q(z)\, z^{-1} dz \right],   (13)

where

Q(z) = \frac{1}{2} \left( G^T(z^{-1})\, G(z) + G^T(-z^{-1})\, G(-z) \right).   (14)
The optimal pyramidal decomposition is obtained by minimizing the error-variance expression (13). Assuming first that the analysis filter H(z) is fixed and given, the optimum corresponding synthesis filter minimizing (13) is

G(z) = \Phi_x(z)\, H^T(z^{-1})\, P^{-1}(z).   (15)

Conversely, if the synthesis filter G(z) is fixed, the optimum analysis filter is

H(z) = Q^{-1}(z)\, G^T(z^{-1}),   (16)

where Q(z) is given by (14). The globally optimum filter pair (H(z), G(z)) is found by either (15) or (16) and the minimization of the resulting expression in (13). The minima found either way are easily seen to be identical.
III.2 Noisy Case

With considerable insight into the structure of the optimal pyramids and filter banks gained from the noiseless case, the noisy case may now be considered. Again, if the analysis filter is fixed, the optimum synthesis filter in the pyramidal configuration is found by minimizing (11). It can be shown that it is given by

G(z) = \Phi_{sx}^T(z^{-1})\, \Phi_s^{-1}(z).   (17)
This is a completely general expression for the optimal synthesis filter, which can be further analyzed under additional simplifying assumptions. For example, the additive noise may be assumed to be uncorrelated with the input:

\Phi_{vn}(z) = 0.   (18)

This assumption is reasonable in the instance of transmission noise and is justified for a large class of practical quantizers, including fine and dithered quantizers [1]. In this case

\Phi_{sx}(z) = \Phi_{vx}(z) = \frac{1}{2} H(z)\, \Phi_x(z), \quad
\Phi_{s}(z) = \Phi_{v}(z) + \Phi_{n}(z) = \frac{1}{2} P(z) + \Phi_{n}(z),   (19)

where \Phi_n(z) is the noise power spectral density. Note also that

\Phi_n(z) = \Phi_n(-z).   (20)

From (17), the optimal G(z) is

G(z) = \Phi_x^T(z^{-1})\, H^T(z^{-1})\, \left[ P(z) + 2\, \Phi_n(z) \right]^{-1}.   (21)
IV. Experimental Results
The proposed vector pyramidal coding method was tested for the coding of multichannel images. The results are evaluated in the coding of the color RGB image "Peppers" of size 256 x 256. In all the cases examined, uniform quantization of the transform coefficients was applied. For the definition of the two families of analysis/synthesis filters, the conversion between the standard Red-Green-Blue (RGB) format and the YUV format,

x_{rgb} = A\, x_{yuv}, \quad x_{yuv} = B\, x_{rgb},

was used, where A and B are the conversion matrices. For the first family of filters, a good choice for the synthesis vector filter is G(z) = A\, \Lambda(z), where \Lambda(z) = \mathrm{diag}[\Lambda_1(z), \Lambda_2(z), \Lambda_3(z)] and the \Lambda_i(z), i = 1, ..., 3, are FIR low-pass filters. In this case the analysis filter is given by

H(z) = \left( \Lambda^T(z^{-1}) A^T A\, \Lambda(z) + \Lambda^T(-z^{-1}) A^T A\, \Lambda(-z) \right)^{-1} \Lambda^T(z^{-1})\, A^T.

In Figure 2 the proposed technique is compared, in terms of PSNR versus bit rate, with scalar pyramidal coding using the same type of filters. As seen, vector coding considerably improves the reconstruction quality, compared to scalar coding, at the same bit rate. The RGB-to-YUV conversion may also be used for the definition of the second family of analysis/synthesis filters. In this case a good choice for the analysis filters is H(z) = B\, \Lambda(z)\, \Phi_x(z), while the synthesis filters are given by (15), under the assumption that the input signal is modeled by a multichannel AR model. Simulations were also performed using the second family of vector filters, and the results were comparable with those obtained with the first choice of filters.
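The paper does not give numeric values for the conversion matrices; assuming the common ITU-R BT.601 YUV definition, the pair (A, B) can be sketched as:

```python
import numpy as np

# x_yuv = B x_rgb with ITU-R BT.601 coefficients (an assumption: the
# paper leaves A and B unspecified); A is simply the inverse of B.
B = np.array([[ 0.299,  0.587,  0.114],
              [-0.147, -0.289,  0.436],
              [ 0.615, -0.515, -0.100]])
A = np.linalg.inv(B)

# For a white pixel, Y = 1 while the chrominance components vanish.
white_yuv = B @ np.array([1.0, 1.0, 1.0])
```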
Figure 2: Bit rate versus PSNR performance of the vector pyramidal coding technique (VP), compared to the scalar pyramidal coding (SP) of each vector component.
V. Conclusions
In the present paper two families of analysis and synthesis vector filters were determined which achieve optimal construction of multiresolution vector sequences by minimizing the variance of the error signals between successive pyramid levels. Experimental results were given demonstrating the superiority of vector pyramidal coding when compared to scalar pyramidal coding.
References

[1] P. P. Vaidyanathan, Multirate Systems and Filter Banks, Prentice-Hall, 1993.
[2] M. G. Strintzis, "Optimal Filters for the Generation of Multiresolution Sequences," Signal Processing, vol. 39, no. 2, pp. 55-68, June 1994.
[3] M. G. Strintzis, "Optimal Biorthogonal Wavelet Bases for Signal Representation," IEEE Trans. Signal Processing, vol. 44, no. 6, pp. 1406-1418, June 1996.
[4] W. Li and Y.-Q. Zhang, "A Study of Vector Transform Coding of Subband-Decomposed Images," IEEE Trans. on Circuits and Systems for Video Technology, vol. 4, pp. 383-391, Aug. 1994.
[5] J. Wus and W. Li, "Vector Subband Coding of High Resolution Images," in Picture Coding Symposium (PCS'94), Sacramento, pp. 123-125, Sep. 1994.
[6] S. N. Efstratiadis, D. Tzovaras and M. G. Strintzis, "Hierarchical Partition Priority Wavelet Image Compression," IEEE Trans. Image Processing, vol. 5, no. 7, pp. 1111-1124, July 1996.
[7] X.-G. Xia and B. W. Suter, "Vector-Valued Wavelets and Vector Filter Banks," IEEE Trans. on Signal Processing, vol. 44, no. 3, Mar. 1996.
Session J: SEGMENTATION
MULTILINGUAL CHARACTER SEGMENTATION USING MATCHING RATE
Kyung-Ae Moon†, Su-young Chi†, Jong-Won Park‡ and Weon-Geun Oh†
†Systems Engineering Research Institute, Taejon, KOREA
‡Department of Information Communications Engineering, ChungNam National University, Taejon, KOREA
ABSTRACT
Character segmentation, which affects the performance of an optical character recognition (OCR) system, is very difficult, especially when one character splits into two or three components or several characters touch each other. Some methods have been proposed to solve these problems. However, heuristically driven character pitch information is not sufficient for solving the splitting and touching problems that occur in documents with various character sizes and styles. A multistage graph search algorithm using dynamic programming can improve the segmentation results, but it needs combinatorially increasing computing time. This paper describes a character segmentation method using the matching rate between an input character and two finally selected candidate characters in documents consisting of alphanumeric, symbol, Korean and Chinese characters. The method determines the exact cutting and merging points character by character, and consequently needs little computing time. The experimental results prove that the proposed method is efficient and accurate enough to enhance the performance of a document recognition system.
1. INTRODUCTION
Extraction of characters from text lines is the final step in the document segmentation process and is one of the keys to putting OCR technology to practical use. However, the existence of touching and separated characters, which occur frequently in oriental compositions, makes it difficult to design an effective segmentation algorithm because of the ambiguity in determining suitable cutting or merging positions. To solve this character segmentation problem, many studies have been carried out in universities and industry [1-11]. Akiyama [8] and Hung [9] use character pitch and size information (rectangular areas of pixels) to merge separated components. This information is very effective for Chinese characters, which are almost rectangular in shape, but it does not extend to multilingual compositions mixing alphanumeric, Korean and Chinese characters, because of their diversity in shapes. Most alphabetic characters consist of a single connected component, but with varying pitch, and many Korean characters are very similar to alphabetic letters in pitch. So heuristically driven character pitch information does not solve the split and touch problems occurring in documents with two languages. Ariyoshi [10] and Fujisawa [11] introduce a multistage graph search algorithm using dynamic programming (a multiple hypothesis and verification method) to solve these problems. It is remarkable that they use character pattern information for determining the cutting points and for character recognition, and can improve the segmentation results, but this method needs combinatorially increasing computing time. In this paper, we present a new approach to character segmentation based on the concept of matching rate. The matching rate expresses the similarity between two character features, an input feature and a reference feature, and is
calculated simply from mesh patterns; as a matter of course, a correctly segmented character has a very high matching rate, and vice versa. We apply two kinds of matching rate in our OCR system: one for reducing candidates and the other for finding a suitable border line between two character patterns. The proposed approach has been tested on a large number of documents and proves efficient and accurate enough to enhance the performance of a document recognition system. The paper is organized as follows. In Section 2, we explain the matching-rate concept in more detail, and in Section 3 the character segmentation algorithm is described. Finally, experimental results, evaluations and conclusions are discussed in Sections 4 and 5, respectively.
2. MATCHING RATE
In this section, mesh information of a character is used to calculate the matching rate of characters. Mesh information is very simple, but it is widely used as a feature even for script recognition because of the flexibility in its size and data form. After an extracted character image is normalized to a 48x48 image size, the mesh feature vector, of size 16x16, is generated from this normalized image. Then, we compute two matching rates: M_r is the similarity between the input and a reference character, and M_x is the similarity between the input character and the exclusive-OR (⊕) data acquired from the two finally selected candidate characters:

M_r = d(V_i, V_r)
M_x = d(V_i, (V_cand1 ⊕ V_cand2) × W)

where
V_i : the feature vector of the input pattern
V_r : the reference feature vectors of the final candidate patterns
W : a weight heuristically assigned based on the mesh condition

M_r is used for reducing or rejecting candidates, and M_x is used for finding the exact cutting and merging point. As M_x is acquired only from the difference between the two candidates, we can overcome overlapping of the characters.
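A sketch of the mesh feature and the two matching rates follows. The paper does not specify the similarity function d or the weight W; here d is taken as the fraction of agreeing mesh cells and W defaults to uniform, both assumptions:

```python
import numpy as np

def mesh_feature(img48, thresh=0.5):
    """16x16 binary mesh from a 48x48 normalized character image:
    each mesh cell is 1 if its 3x3 pixel block is mostly ink."""
    blocks = img48.reshape(16, 3, 16, 3).mean(axis=(1, 3))
    return (blocks > thresh).astype(np.uint8)

def matching_rate(v_in, v_ref, w=None):
    """d(V_i, V_ref): weighted fraction of agreeing mesh cells
    (the exact d is not given in the paper; this is an assumption)."""
    if w is None:
        w = np.ones_like(v_in, dtype=float)
    return float(((v_in == v_ref) * w).sum() / w.sum())

def m_x(v_in, v_cand1, v_cand2, w=None):
    """M_x: match the input against the exclusive-OR of the two final
    candidates, so only the cells where the candidates differ count."""
    return matching_rate(v_in, np.bitwise_xor(v_cand1, v_cand2), w)
```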
3. CHARACTER SEGMENTATION
Character segmentation is a critical part because incorrectly segmented characters are not likely to be correctly recognized. Kahan et al. [3] suggested that a document recognition system is required to read texts accurately with at least a 99.9% recognition rate in practical applications. However, it is not simple to recognize ill-formed and ill-spaced printed characters. In particular, segmentation of separated and touching characters in multilingual documents including Korean, Chinese and alphanumeric characters is a very difficult problem. The proposed method follows these steps for efficient character segmentation.
STEP 1. The First Simple Character Segmentation

First, text blocks are extracted from a document image through a document analysis process, and text lines are extracted from the text block image. After that, individual characters, including separated and touching characters, are extracted from the text lines using vertical projection. Fig. 1(b) shows the individual characters, containing separated characters and a touching character, obtained as the result of this step. In order to decide whether individual characters are separated or touching characters, the character rectangle's pitch information, such as character height, width, gap and interval, is calculated.
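The vertical-projection pass of STEP 1 can be sketched as follows (a minimal version operating on a binary text-line image; the box representation is an illustrative choice):

```python
import numpy as np

def vertical_projection_boxes(line_img):
    """Split a binary text-line image (1 = ink) into character boxes
    at columns whose vertical projection is zero -- the simple
    first-pass segmentation of STEP 1."""
    ink = line_img.sum(axis=0) > 0
    boxes, start = [], None
    for x, on in enumerate(ink):
        if on and start is None:
            start = x                  # a run of ink columns begins
        elif not on and start is not None:
            boxes.append((start, x))   # run ends: one character box
            start = None
    if start is not None:
        boxes.append((start, len(ink)))
    return boxes
```

Boxes produced this way are then classified as separated or touching using the pitch information described above.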
STEP 2. Merging Separated Characters

Some Korean and Chinese characters are composed of two or more character rectangles. Among the 990 frequently used Korean characters, 8% are composed of two character rectangles; among 5401 Chinese characters, 13% are composed of two or more character rectangles. Because of this feature of Korean and Chinese characters, an additional merging process is required. However, in multilingual document processing it is ambiguous to distinguish one or more alphanumeric characters from a single Korean or Chinese character using only the character rectangle's pitch information: for example, the two alphanumeric characters 'or' versus one Korean character, and the three alphanumeric characters '111' versus the one Chinese character '川' in Fig. 2. In this paper, we calculate the matching rate for separated characters and then decide whether two or more separated characters should be merged according to the matching rate.
Fig. 1. Example of character segmentation. (a) Original image. (b) Simple character segmentation using vertical projection. (c) Results after merging the separated characters. (d) Results after segmenting the touching character. (e) Candidate splitting points (p0, p1) for the touching character.
Fig. 2. Example of ambiguous characters to be merged. (a) Incorrectly merged character 'or'. (b) Numeric '111'. (c) Chinese character '川'.
STEP 3. Segmentation of Touching Characters

In general document images, image degradation generates touching characters because of printer quality and scanner resolution. For splitting a touching character, several candidate splitting points are determined from the vertical-direction histogram, as in Fig. 1(e), and the appropriate splitting points among them are selected according to the matching rate of the character rectangles produced by each candidate splitting point.

4. EXPERIMENTAL RESULTS
We tested the proposed character segmentation method on more than one hundred documents including various fonts and sizes. The hardware environment for the experiments was a PC-486 and a Microtek image scanner with 300 DPI resolution. Table 1 shows the character segmentation rate at each step. In STEP 1, using only the character's pitch information without recognition results, the character segmentation error is about 10% for each document. After STEP 2 and STEP 3, the character segmentation error is reduced to less than 1%, excepting ambiguous cases that cannot be solved without contextual knowledge. Hence, the final character segmentation rate is more than 99% on our sample documents.
                        Text A    Text B    Text C
Result after STEP 1      88%       90%       89%
Result after STEP 2     99.8%      98%       96%
Result after STEP 3     100%      99.6%     99.4%

Table 1. Character segmentation rate at each step.
These experimental results prove that the method is efficient for printed Korean, Chinese and alphanumeric characters with touching and separated characters.

5. CONCLUSIONS
We have described a method of character segmentation which uses two matching rates. The matching rates are simply calculated between input and reference characters, and between an input character and the exclusive-OR data acquired from the two finally selected candidate characters. After extracting the rectangles of pixels from a text line using vertical projection, the merging or splitting of incorrectly segmented characters is decided based on these matching rates. Consequently, the proposed method needs little computing time and is free from character overlapping, because one of the matching rates is acquired only from the difference between the two candidate character patterns.
REFERENCES

[1] S. Liang, M. Shridhar and M. Ahmadi, "Segmentation of Touching Characters in Printed Document Recognition," Pattern Recognition, Vol. 27, No. 6, pp. 825-840, June 1994.
[2] Y. Lu, "On the Segmentation of Touching Characters," Proc. 2nd International Conference on Document Analysis and Recognition, Tsukuba, Japan, pp. 440-443, Oct. 1993.
[3] S. Kahan, T. Pavlidis and H. S. Baird, "On the Recognition of Printed Characters of any Font and Size," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. PAMI-9, No. 2, pp. 274-288, March 1987.
[4] T. Bayer and U. Kresel, "Cut Classification for Segmentation," Proc. 2nd International Conference on Document Analysis and Recognition, Tsukuba, Japan, pp. 565-568, Oct. 1993.
[5] S. W. Lam and S. N. Srihari, "Multi-Domain Document Layout Understanding," 1st International Conference on Document Analysis and Recognition, Saint Malo, France, pp. 112-120, 1991.
[6] K. Y. Wong, R. G. Casey and F. M. Wahl, "Document Analysis System," IBM J. Res. Develop., Vol. 26, No. 6, pp. 647-656, 1982.
[7] D. Wang and S. N. Srihari, "Classification of newspaper image blocks using texture analysis," Computer Vision, Graphics, and Image Processing, Vol. 47, pp. 327-352, 1989.
[8] T. Akiyama, S. Saito and I. Masuda, "A Method of Character Extraction from Printed Documents Guided by Positions of Non-Overlapping Characters," Institute of Electronics, Information and Communications Engineers, Vol. J61-D, No. 10, pp. 1194-1201, 1984.
[9] Yea-Shuan Hung and Wen-Wen Lin, "Field Segmentation and Character Isolation Method In Free-Format Chinese Printed Document," International Conference on Computer Processing of Chinese and Oriental Languages, pp. 151-155, Aug. 1988.
[10] Shunji Ariyoshi, "A Character Segmentation Method for Japanese Printed Documents Coping with Touching Character Problems," Proc. 11th ICPR, pp. 313-316, 1992.
[11] Hiromichi Fujisawa, Yasuaki Nakano and Kiyomichi Kurino, "Segmentation Methods for Character Recognition: from Segmentation to Document Structure Analysis," Proceedings of the IEEE, Vol.80, No.7, pp. 1079-1092, July 1992.
Architecture of an Object-based Tracking System Using Colour Segmentation

Rafael Garcia-Campos, Joan Batlle, and Rainer Bischoff*
Computer Vision and Robotics Group, Dep. of Electronics, Computer Science and Automation, University of Girona, Av. Lluis Santaló, s/n, 17003 Girona (Spain)
Tel. +34.72.418400, Fax +34.72.418399, email: {rafa, jbatlle}@ei.udg.es
*Institute of Measurement Science, Federal Armed Forces University Munich (Germany), email: [email protected]
Abstract This paper presents a robust specialised architecture for real time tracking, using colour segmentation. Most of the existing object-tracking algorithms extract the features which constitute the object (usually called tokens), and track them from one frame to another. We propose a new architecture to solve the tracking problem at object level. The first step is the extraction of the object from the scene. The proposed architecture uses the discriminatory properties of three colour attributes: Hue, Saturation and Intensity, in order to segment the object. The second step consists of the computation of the centroid of the object by using the enhanced image from the segmentation module. As a last step, tracking of this centroid is achieved in a sequence of images. The processor presented here can provide the sequence of centroids at video rate.
Introduction

Tracking moving objects over time is a complex problem in computer vision and has been an important research subject over the last few years [1], [2], [3]. Impressive tracking systems have been developed for some specific applications [4], [5]. We assume, initially, that we want to track just one object in the scene, although other moving objects may be present. Some of the methods presented in the literature have a serious matching problem in this case, and often mistake one object for another. In such a situation, when the moving objects have colour properties different from those of the object to be tracked, our system does not suffer from the influence of the other objects. Kinematics-based algorithms may not work properly when the objects abruptly change their motion from one frame to the next [6], [7]; such situations can be due to collisions or sharp turns. Furthermore, these algorithms detect motion in a sequence of frames by extracting tokens such as edges, corners, interest points, etc. [8], [9]. Methods extracting two-dimensional models mistake objects when there is a change in their 2-dimensional shape [10], [11]. This paper presents a new system which can cope with the problems explained above for some special applications, because we are not tracking tokens but the object itself (or, more accurately, the centroid of the object). The basic condition imposed by our method is that the colour features of the object be maintained throughout a sequence of images. In the first part of the paper, we describe the general architecture of our tracking processor: the different modules of the system and their interconnections are shown, and then a detailed description of each module is provided. Finally, some results and experiments are presented.
General Overview

The simple architecture proposed takes advantage of the robustness of combining the Saturation and Hue properties of the object. Furthermore, the dynamic range of the camera is increased by performing real-time re-scaling of the RGB channels as shown in [12]. First, a colour camera takes an image, and its sampled RGB signal feeds the Colour Conversion module. RGB is a widely known representation of colour, but it is not well suited to colour vision applications. Our system transforms the RGB signal into the Hue, Saturation and Intensity (HSI) model as shown in [13]. The Colour Conversion module performs the HSI conversion in real time, as stated in equations (1), (2) and (3), by means of three Look-Up Tables (LUTs): one for Hue, one for Saturation, and one for Intensity.
H = \cos^{-1}\left[ \frac{\frac{1}{2}\left[(R-G) + (R-B)\right]}{\sqrt{(R-G)^2 + (R-B)(G-B)}} \right]   (1)

S = 1 - \frac{3\, \min(R, G, B)}{R+G+B}   (2)

I = \frac{R+G+B}{3}   (3)
Every component of RGB is sampled into 8 bits. The 5 most significant bits are taken from each component, generating a 15-bit bus. This bus simultaneously addresses the three 32Kb LUTs, which are programmed through the system bus using (1), (2) and (3) by means of specific software. As the LUTs are programmed dynamically, it is possible to reconfigure them during execution, loading, for example, a different colour model as shown in [14]. The block diagram of the system is shown in Figure 1.
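The LUT construction can be sketched in software as follows. The exact 15-bit address layout is not stated in the paper, so the bit packing below (R in the top 5 bits) is an assumption consistent with the 32K table size:

```python
import numpy as np

def build_hsi_luts():
    """32K-entry LUTs for H, S and I, indexed by a 15-bit address
    formed from the 5 MSBs of R, G and B (eqs. (1)-(3))."""
    lv = np.arange(32) / 31.0
    R, G, B = np.meshgrid(lv, lv, lv, indexing="ij")
    eps = 1e-9
    I = (R + G + B) / 3.0                                    # eq. (3)
    S = 1.0 - 3.0 * np.minimum(np.minimum(R, G), B) / (R + G + B + eps)  # eq. (2)
    num = 0.5 * ((R - G) + (R - B))
    den = np.sqrt((R - G) ** 2 + (R - B) * (G - B)) + eps
    H = np.arccos(np.clip(num / den, -1.0, 1.0))             # eq. (1)
    H = np.where(B > G, 2.0 * np.pi - H, H)  # resolve arccos ambiguity
    return H.ravel(), S.ravel(), I.ravel()

def lut_address(r8, g8, b8):
    """15-bit LUT address from the 5 MSBs of each 8-bit channel
    (bit layout assumed: R in the top 5 bits)."""
    return ((r8 >> 3) << 10) | ((g8 >> 3) << 5) | (b8 >> 3)
```

A pixel's H, S and I values are then three table lookups, which is what makes the conversion feasible at video rate in hardware.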
Figure 1: Block Diagram of the System

Once the HSI conversion has been performed, the image can be thresholded by choosing the preferred minimum and maximum values for Hue, Saturation and Intensity, depending on the colour characteristics of the object and the illumination conditions of the scene. For the same comparator (see Figure 2) we can define multiple intervals, allowing an object to be represented by a certain number of colour attributes. This operation is performed in the Colour Segmentation module, which is inside the FPGA. The upper and lower limits of Hue, Saturation and Intensity are set by specific software running on the PC, and these values are multiplexed in the FPGA in order to save I/O pins. For every pixel sampled in the A/D converter, we obtain its Hue component. The time for this computation is equal to the access time of the Hue LUT (approx. 50 ns). This value is passed to the FPGA, where it is compared to the minimum and maximum levels allowed for the Hue, and the comparator generates a signal indicating whether or not the value lies within the limits. The same idea applies to Saturation and Intensity. All the circuitry in the FPGA has been developed using parameterised VHDL descriptions; in this way, future improvements of the system, such as increasing the number of objects to track, can be easily made.

We define the region R as the set of pixels which are part of the object in the segmented image. This image is used to compute the centroid of the object to be tracked. The computation is performed in two independent phases, which can be pipelined: a first stage where the data from the thresholded image is acquired and accumulated (Pre-Detection module), and a second one which actually computes the centroid of the object. The pre-detection consists of accumulating the x-coordinates of all the pixels in the image which are part of the region R.
The same idea is applied to the y-coordinate, as well as counting the number of pixels in this region. So x_c(R) and y_c(R) are given by (4) and (5), that is

x_c(R) = \frac{\sum_{(x,y) \in R} x}{A(R)},   (4)

y_c(R) = \frac{\sum_{(x,y) \in R} y}{A(R)},   (5)

where A(R) is the area of the region R (the number of pixels in this region). Robustness in the computation of x_c(R) and y_c(R) is achieved because (4) and (5) actually act as noise filters. This computation is carried out by the PC microprocessor, because the hardware implementation of a divider requires too much space in the FPGA. Once the coordinates of the centroid of region R have been found, the centroid can be expressed as

c(R) = (x_c(R), y_c(R)).   (6)
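Equations (4)-(6) amount to the following, sketched here on a boolean mask of the segmented image:

```python
import numpy as np

def centroid(mask):
    """c(R) = (x_c(R), y_c(R)) of the segmented region R (eqs. 4-6).
    mask : 2-D boolean array, True where the pixel belongs to R."""
    ys, xs = np.nonzero(mask)
    area = xs.size          # A(R), the number of pixels in R
    if area == 0:
        return None         # no object pixels found in this frame
    return (xs.sum() / area, ys.sum() / area)
```

Averaging over the whole region is what gives the noise-filtering behaviour noted above: isolated misclassified pixels shift the centroid only slightly.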
Figure 2 shows the block diagram of the internal structure of the FPGA.
Figure 2: Internal Structure of the FPGA

So, the processor is organised as a two-stage pipeline. The first stage is dedicated to the acquisition of data from the scene. While the data for frame k (\sum_{(x,y) \in R} x, \sum_{(x,y) \in R} y, A(R)) is being acquired and accumulated, the second stage is processing the previous data from frame k-1. In this way, new data may be acquired while the previous data is being processed, which allows the processor to compute c(R) at video rate. Because of the pipeline, the time available to compute the centroid of the object in one scene is a whole frame period (20 ms in the CCIR standard); the centroid of frame k is obtained 20 ms later, so the tracking is performed with a delay of one frame. This compares favourably with systems using a frame-grabber, which normally have to wait for the acquisition of the frame into memory before starting to process it: our processor has already pre-processed the image while normal frame-grabbers are still storing it. Eventually, a sequence of centroids is provided by the microprocessor in real time in order to track the object in a 2-dimensional approach, and the marked object is displayed on a monitor.
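The two-stage pipeline can be mimicked in software as a generator that accumulates frame k while emitting the centroid of frame k-1 (a behavioural sketch only, not the FPGA implementation):

```python
import numpy as np

def tracking_pipeline(frames):
    """Yield one centroid per frame, delayed by one frame, mirroring
    the two-stage pipeline: stage 1 accumulates (sum x, sum y, A(R))
    for frame k while stage 2 divides the accumulators of frame k-1
    (the division is done on the PC in the paper's design)."""
    pending = None
    for mask in frames:
        if pending is not None:
            sx, sy, a = pending
            yield (sx / a, sy / a) if a else None
        ys, xs = np.nonzero(mask)
        pending = (xs.sum(), ys.sum(), xs.size)
    if pending is not None:          # flush the final frame
        sx, sy, a = pending
        yield (sx / a, sy / a) if a else None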
Experiments and Results

Two experiments have been carried out. In the first test we used a video sequence of motorbikes on a circuit where visually similar colours with different characteristics were present. We could keep track of them although the perspective and form of the bikes and pilots changed. In the second experiment we tracked a painted mobile robot. The robot was painted in two colours, one for the front part and another for the back; the centroid of each colour was computed, yielding the direction vector of the vehicle. The only condition imposed in our experiments is a difference in colour properties between the object to be tracked and the other objects and the background, respectively. Although this condition may look too restrictive, experience has shown that very few objects have the same Hue and the same Saturation in the tested images. Tests with the Intensity property showed that it is too susceptible to lighting conditions, and only in very few cases can it help in the segmentation process.
Conclusions We have briefly presented a simple but robust architecture which allows the 2-dimensional tracking of an object even when there are other moving objects in the same scene. The system is robust enough to keep track of both rigid and non-rigid objects even if there are changes in their perspective. Since we do not use any heuristics, we are not affected by abrupt changes in the motion and/or occlusions, as kinematics-based systems are. Our approach just computes the centroid of the object, while other approaches, like token-based tracking, characterise every token (feature) by its position and orientation [15]. The orientation is not provided by our system but, on the other hand, we need less computation to determine the position. Moreover, tracking is achieved without the need of a memory for storing and scanning the image, saving time and costs. Further Work This system can easily be extended due to its scalable architecture. It allows tracking of more than one moving object in the scene by adding more comparators and accumulators to the FPGA. The two-stage pipeline leaves enough time for the computation of multiple centroids. The accuracy of the colour segmentation module will be improved by increasing the number of bits for the HSI conversion from 5 to 7/8 per channel. A second prototype is being developed where the divisions of equations (4) and (5) are performed in the FPGA, instead of passing the values to be divided over the system bus to the CPU. The use of parameterised VHDL descriptions helps rapid testing and prototyping. Future research work will extend to multiple camera configurations in order to extract depth information. Our real-time tracking processor could then be used in a wide range of applications, such as object grasping by assembly robots, visual feedback for automated navigation, etc.
References [1] Lowe, D.G., "Robust model-based motion tracking through the integration of search and estimation," International Journal of Computer Vision, vol. 8:2, pp. 113-122, 1992. [2] Coombs, D., and Brown, C., "Real-time smooth pursuit tracking for a moving binocular robot," Proc. IEEE, pp. 23-28, 1992. [3] Huttenlocher, D.P., Noh, J.J., and Rucklidge, W.J., "Tracking non-rigid objects in complex scenes," Proc. IEEE, pp. 93-101, 1993. [4] Dickmanns, E.D., Graefe, V., "Applications of dynamic monocular machine vision," Machine Vision and Applications, vol. 1, pp. 241-261, 1988. [5] Frau, J., Casas, S., Balcells, Ll., "A dedicated pipeline processor for target tracking applications," Proc. IEEE International Conference on Robotics and Automation, pp. 599-604, 1992. [6] Shariat, H., Price, K.E., "Motion estimation with more than two frames," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12:5, pp. 417-432, 1990. [7] Horn, B.K.P., Schunck, B., "Determining optical flow," Artificial Intelligence, vol. 17, pp. 185-203, 1981. [8] Zhuang, X., Huang, T.S., Ahuja, N., Haralick, R.M., "A simplified linear optic flow-motion algorithm," Computer Vision, Graphics and Image Processing, vol. 42, pp. 334-344, 1988. [9] Davis, L.S., Wu, Z., and Sun, H., "Contour-based motion estimation," Computer Vision, Graphics and Image Processing, pp. 313-326, 1982. [10] Bergevin, R., and Levine, M.D., "Extraction of line drawing features for object recognition," Pattern Recognition, vol. 25:3, pp. 319-334, 1992. [11] Cédras, C., and Shah, M., "Motion-based recognition: a survey," Image and Vision Computing, vol. 13:2, pp. 129-155, 1995. [12] Regincós, J., and Batlle, J., "A system to reduce the effect of CCDs saturation," Proc. International Conference on Image Processing, 1996 (to appear). [13] Gonzalez, R.C., and Woods, R.E., Digital Image Processing, Addison-Wesley Publishing Company, 1992.
[14] Pujas, Ph., and Aldon, M.J., "Robust colour image segmentation," Proc. International Conference on Advanced Robotics, pp. 145-155, 1995. [15] Deriche, R., and Faugeras, O., "Tracking line segments," Proc. European Conference on Computer Vision, pp. 259-268, 1990.
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
Segmentation of Retinal Images Guided by the Wavelet Transform Dr. T. Morris, Mrs. Z. Newell Department of Computation, UMIST, P.O. Box 88, Manchester, M60 1QD, U.K. Abstract Glaucoma is one of the major causes of preventable blindness in the world. It induces damage to the optic nerve head (that region of the retina where nerve fibres and blood vessels pass through the eye) via increased pressure in the ocular fluid. It is presently detected either by regular inspection of the retina, by measurement of the intra-ocular pressure (IOP) or by a loss of vision. It has been observed that nerve damage precedes the latter two events and that direct observation of the nerve head could therefore be the best method of detecting glaucoma, if the observations could be made reliably. This paper describes our work in isolating the optic nerve head in images of the retina: we describe previous attempts that have been made using simple image processing techniques and the current multiresolution approaches we are taking, and we present a sample of our initial results. Once the nerve head has been located, its shape will be quantified using measurements that have already been shown to be effective.
Introduction. The neuroretinal rim forms the outer boundary of the optic nerve head: that region of the retina where blood vessels and nerve fibres pass out of the eye. It is normally a circular structure, but is known to change shape due to nerve damage in glaucoma. It has been suggested that the nerve damage occurs before the intra-ocular pressure (IOP) increases. Since measuring the IOP is the primary screening test for glaucoma, damage to the eye will have occurred by the time the disease is diagnosed. The progress of glaucoma and its treatment is assessed by further measurement of IOP and by changes in the shape of the neuroretinal rim. At present, the shape of the rim is assessed manually, either subjectively by direct inspection or by tracing it from photographs of the optic disk. Both of these methods have been shown to be unreliable [1]. In this project we are concerned with automatically locating the neuroretinal rim using a multiresolution algorithm based on the wavelet transform. In doing this we shall eventually provide ophthalmologists with an additional tool for diagnosing and assessing the treatment of glaucoma. A second strand in this work is the investigation of multiresolution segmentation techniques. What may be termed classical image segmentation algorithms take a single-resolution viewpoint: the data is captured and segmented at the same, highest possible, resolution. Whilst this is the resolution at which results are required, the approach results in unfocused processing, since effort is expended in examining regions that are clearly either object or background (and could be identified as such by other techniques) rather than concentrating on the more problematical boundary regions. Pyramid and hierarchical algorithms achieve focused processing by firstly examining a low resolution version of the data and coarsely segmenting the image.
The segmentation is progressively refined by increasing the resolution at which the data is examined until the original resolution is regained. We view the wavelet transform in this light: a wavelet-transformed image will contain representations of the image at varying resolutions. This information may be used to initiate a segmentation and progressively refine the boundaries between object and background (cf. [2]). The algorithm is more efficient than the traditional technique as regions that may be clearly labelled are
identified early and excluded from further processing, which concentrates on the boundary regions.
Background. Automatically identifying the neuroretinal rim is not an easy task. The images are taken under low light conditions and are therefore noisy. (The data used in the present investigation was collected by attaching a video camera to the eyepiece of a Zeiss fundus camera. The retina was illuminated with a photographic flash and the video image grabbed. Cox and Wood [3] describe the instrumentation in more detail.) The structure itself is indistinct and partially obscured by blood vessels. In fact, if a simple edge detector is applied to these images, it is the blood vessel boundaries that give the strongest response; the required features respond very weakly, if at all. Figure 1 shows a normal retina; portions of the required boundaries are clearly visible. A significant amount of work has been reported on attempts at automatically detecting the neuroretinal rim, though none has met with significant success. The earliest was by an ophthalmologic equipment manufacturer who simply thresholded an image of the optic disk. The thresholded region was approximated by an ellipse and thus characterised by the ellipse's properties. Cox and Wood [3] presented a semi-automated method: an observer indicated extremal points on the boundary which were automatically connected by tracing along the boundary. They showed how important it was for the same observer to perform all of the measurements, since the inter-observer variability was similar to the difference between the normal and abnormal classes [1]. Morris and Wood [4] initially presented a completely automatic method which traced between points on the boundary identified automatically by their grey level gradient properties. They have latterly returned to semi-automated methods, which are proving to be more reliable [5]. Lee and Brady [6], and Donnison and Morris [7] have both investigated using active contour (snake) methods to locate the boundary.
Both sets of authors have highlighted the importance of pre-processing the images to emphasise the difference between the retinal and optic disk regions of the image before searching for the boundary. Donnison and Morris appear to have had more success in their implementation, probably due to their formulation of the active contour. Multiresolution methods have long been seen as an attractive method of segmenting images. From a theoretical viewpoint, they mimic the human visual system; pragmatically, they allow us to make early decisions as to the approximate locations of image features and focus attention on just those areas for further processing. A number of approaches have been followed. Image pyramids [e.g. 8] are generated by progressively reducing the size of an image from the base (the original full resolution image) to the top (a single pixel whose value is the average grey value of the image). They seem to have been used most often in edge detection applications. Scale spaces [e.g. 9] may be generated by applying a feature detector at varying resolutions to the original image; a volume is generated: two axes coincide with the spatial axes of the original image, the third represents scale. As the scale is varied, the size of the detected feature changes. In a typical application, Marr's edge detector could be used [10]; the scale parameter would equate to the σ of this operator. The common theme in all of these approaches to segmentation is that information extracted at a lower resolution is used to guide the extraction of information at the next level of the hierarchy. In this project we are using the wavelet transform as a means of generating the hierarchical representation of the image data and thus to guide its segmentation.
Materials and Methods. The wavelet transform generates an image which can be divided into four quadrants. Three represent horizontally oriented, vertically oriented and corner features at some scale.
The fourth quadrant repeats this basic structure at a smaller scale. It is our intention to use this structure to guide the interpretation of the original image. Suppressing the coefficients corresponding to small scale image features effectively enhances the gross image features we are seeking, but it does not allow their boundaries to be accurately delineated. Figure 2 shows a retinal image which has been filtered using the Daubechies wavelet: the fourth stage coefficients have been set to zero and the data inverse transformed. It is apparent that gross features of the image remain and small scale features (the blood vessels) have been removed. We have thus enhanced (portions of) the boundary between the optic nerve head and the retina and may therefore suggest the approximate location of this boundary. More importantly, we may suggest regions of the image that are definitely nerve head and regions that are definitely retina. The suggested boundary may be refined by considering the information that is regained by reinstating coefficients from lower stages of the transform. Ultimately we would use the information contained in the image at its original resolution.
Figure 1. Original Fundus Camera Image.
Figure 2. Wavelet Filtered Image.
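The quadrant structure and the coefficient-suppression idea described above can be illustrated with a one-level 2-D Haar transform. The paper uses a Daubechies wavelet; Haar is chosen here only to keep the sketch self-contained in numpy, and all names are ours:

```python
import numpy as np

def haar2d(img):
    """One level of a 2-D Haar-style transform on an even-sized image:
    returns the four quadrants (approximation LL and details LH, HL, HH)."""
    a = (img[0::2] + img[1::2]) / 2.0            # row averages
    d = (img[0::2] - img[1::2]) / 2.0            # row differences
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, lh, hl, hh

def ihaar2d(ll, lh, hl, hh):
    """Exact inverse of haar2d."""
    a = np.empty((ll.shape[0], 2 * ll.shape[1]))
    d = np.empty_like(a)
    a[:, 0::2], a[:, 1::2] = ll + lh, ll - lh
    d[:, 0::2], d[:, 1::2] = hl + hh, hl - hh
    out = np.empty((2 * a.shape[0], a.shape[1]))
    out[0::2], out[1::2] = a + d, a - d
    return out

# Zeroing the finest-scale detail quadrants before inverting suppresses
# small features (e.g. blood vessels) while keeping the gross structure:
def suppress_fine_detail(img):
    ll, lh, hl, hh = haar2d(img)
    return ihaar2d(ll, np.zeros_like(lh), np.zeros_like(hl), np.zeros_like(hh))
```

Cascading `haar2d` on the LL quadrant reproduces the repeated quadrant structure at progressively smaller scales.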
Having derived an algorithm for locating the neuroretinal rim, its shape will be characterised using measures known to be clinically relevant. Essentially these are measures of the vertical eccentricity of the structure, since it is known that nerve damage occurs most readily in these areas, which coincide with the blood vessels. We shall perform a retrospective study on data collected in previous work. This data consists of approximately 500 images collected from normal and abnormal classes. These will be analysed blind by one of the authors and the results validated by the other. If the results are satisfactory, we shall then perform a prospective study in conjunction with a local eye hospital. If these trials are successful we shall have developed a tool which could be used in the diagnosis of glaucoma, and certainly in assessing its treatment.
Conclusions.
In this paper we have discussed the importance of quantifying the shape of the optic nerve head and described previous attempts at doing this. We have outlined the use of the wavelet transform in guiding the analysis of images of the retina, specifically to segment the optic disk from the retinal background. We have also described how we intend to validate the algorithm using clinical data. Finally, we have shown how we have progressed towards our goal of deriving a useful tool in the diagnosis and treatment of glaucoma.
References.
1. M.J. Cox, I.C.J. Wood. "Inter- And Intra-Image Variability In Computer-Assisted Optic Nerve Head Assessment." Ophthal. Physiol. Opt., vol. 11, 1991, pp. 36-43.
2. R. Machiraju, A. Gaddipati, R. Yagel. "Wavelet Based Feature Driven Identification And Enhancement Of Medical Images." Technical Report OSU-CISRC-2/96-TR09, Ohio State University.
3. M.J. Cox, I.C.J. Wood. "Computer-Assisted Optic Nerve Head Assessment." Ophthal. Physiol. Opt., vol. 11, 1991, pp. 27-35.
4. D.T. Morris, M.J. Cox, I.C.J. Wood. "Automated Extraction Of The Optic Nerve Head Rim." American Association of Optometrists Annual Conference, Boston, Dec. 1993.
5. T. Morris, I. Wood. "The Automatic Extraction Of The Optic Nerve Head." American Academy of Optometrists, biennial European meeting, Amsterdam, May 1994.
6. S. Lee, J.M. Brady. "Integrating Stereo And Photometric Stereo To Monitor The Development Of Glaucoma." Proceedings of the British Machine Vision Conference, 1990, pp. 193-198.
7. C. Donnison, T. Morris. "Identifying The Neuroretinal Rim Boundary Using Dynamic Contours." Submitted to Image and Vision Computing.
8. A. Rosenfeld. "Pyramids: Multiresolution Image Analysis." Proc. Third Scandinavian Conference on Image Analysis, Copenhagen, July 1983, pp. 23-28.
9. T. Lindeberg. "Scale-Space Theory: A Basic Tool For Analysing Structures At Different Scales." J. of Applied Statistics, 21(2), 1994, pp. 224-270.
10. D. Marr, E. Hildreth. "Theory Of Edge Detection." Proc. Roy. Soc., vol. B207, 1980, pp. 187-217.
An Adaptive Fuzzy Clustering Algorithm for Image Segmentation
Yannis A. Tolias, Ph.D. Student, and Stavros M. Panas, Assoc. Prof.
Dept. of Electrical and Computer Engineering, Telecommunications Division, Aristotle University of Thessaloniki, Thessaloniki, GR-54006 Greece. e-mail: {tolias,panas}@psyche.ee.auth.gr Abstract-In this paper we present a novel adaptive fuzzy clustering scheme for image segmentation. In the proposed method, the non-stationary nature of images is taken into account by modifying the prototype vectors as functions of the sample location in the image, and the inherent high inter-pixel correlation is modelled using neighbourhood information. A multi-resolution model is utilised for estimating the spatially varying prototype vectors for different window sizes. The fuzzy segmentations at different resolutions are combined using a data fusion process in order to compute the final fuzzy partition matrix. The results provide segmentations having lower fuzzy entropy when compared to the Possibilistic C-Means algorithm, while maintaining the image's main characteristics. In addition, due to the neighbourhood model, the effects of noise in the form of single-pixel regions are minimised. 1. Introduction Many algorithms have been presented in the literature dealing with the fuzzy segmentation of images. Some of them rely on finding the best threshold, given an image of some special nature, by utilising fuzzy entropy and index of fuzziness [1],[2]; some use the Fuzzy C-Means clustering algorithm [3],[4]; and others try to re-formulate various random field models using the fuzzy approach [5],[6]. All these clustering schemes except the last one [6] do not adapt to the local characteristics of the available data set. Some early results of adapting the clustering schemes to specific data properties, e.g. the high correlation of image pixels belonging to the same neighbourhood, have been presented by the authors in their previous work [9]. The basic problem of clustering is to separate a set of objects O = {o1, o2, ..., ok} into C self-similar groups (clusters) according to a similarity criterion and using the available data X = {x1, x2, ..., xn}. A vector β = [β1, ..., βC] denotes the most valid prototypes, chosen from a prototype family.
A real C × n matrix U can be used to represent the result of cluster analysis of X by interpreting [u_ik] as the degree to which x_k belongs to cluster i. Both the FCM (Bezdek [7]) and PCM (Krishnapuram and Keller [8]) algorithms attempt to find good cluster structure descriptors (U*, β*) as minimisers of a particular member of the family of objective functions J_m(U, β, η) = Σ_{i=1}^{C} Σ_{k=1}^{n} u_ik^m D(x_k, β_i) + Σ_{i=1}^{C} η_i Σ_{k=1}^{n} (1 − u_ik)^m, where U is the fuzzy partition matrix, β is the vector of the prototypes, η_i is a penalty term for the i-th cluster [3], m > 1 is the degree of fuzzification, C ≥ 2 is the number of clusters, n is the number of data samples and D(·) is the deviation of the data vector x_k from the i-th cluster prototype. One may notice in this formulation that the cluster prototypes β are fixed for the entire range of the data set. This may give good results if the data are stationary, but in the context of image segmentation this is not the case. Images are highly non-stationary signals, and the implicit assumption of stationarity made by fixing the values of β to constants throughout the image does not result in good segmentations in terms of index of fuzziness and fuzzy entropy. This paper presents an algorithm specially optimised for image segmentation that incorporates, via a fuzzy multiresolution approach, the non-stationarity and the neighbourhood correlation features inherently present in all non-trivial images. In Section 2.1 we present the multiresolution, spatially constrained model for the adaptive segmentation of images; in Section 2.2 we discuss the non-stationary estimation of the cluster prototypes; in Section 2.3 we analyse the inter-pixel correlation model. Finally, in Section 3 we discuss the results of the proposed scheme and the effects of different parameters on the segmentation results.
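For reference, the baseline (non-adaptive) FCM that minimises the first term of this family can be sketched in a few lines of numpy. Variable names are ours; this is the textbook alternating update on 1-D intensity data, not the authors' adaptive scheme:

```python
import numpy as np

def fcm(x, c=2, m=2.0, iters=50, seed=0):
    """Standard Fuzzy C-Means on a 1-D data set x (e.g. pixel intensities).
    Returns the (c x n) fuzzy partition matrix U and the c prototypes beta."""
    rng = np.random.default_rng(seed)
    u = rng.random((c, x.size))
    u /= u.sum(axis=0)                          # columns of U sum to one
    for _ in range(iters):
        um = u ** m
        beta = um @ x / um.sum(axis=1)          # prototype (centre) update
        d = (x[None, :] - beta[:, None]) ** 2 + 1e-12  # squared deviations D
        u = 1.0 / d ** (1.0 / (m - 1))          # membership update ...
        u /= u.sum(axis=0)                      # ... normalised over clusters
    return u, beta
```

The alternation between estimating β and updating U is exactly the iteration the proposed method retains, with the crucial difference that here β is one constant vector for the whole data set.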
2. Analysis of the Algorithm
2.1 The multiresolution non-stationary image segmentation model The key element of the proposed family of algorithms is that the final segmentation should incorporate all the available segmentation information calculated at the various resolution levels r, having utilised the non-stationary modelling of the cluster prototypes and the spatial constraints. If we assume that the segmentation U^r performed at each resolution level is correct, then each segmentation result should contribute -to some degree- to the calculation of the final fuzzy partition matrix; however, the interpretation of the results of segmentation at each resolution level and the restrictions imposed by the various resolutions should be considered. Let X denote the image data, having values that typically range from 0 to 255. Let x_k denote the intensity of a pixel at location k, with k ∈ [0, M×N−1], M,N being the image dimensions. The fuzzy segmentation of the image into c regions (clusters) is obtained by finding the fuzzy partition matrix U = [u_ik]. In the proposed model, the prototype vectors β vary
with the location k, i.e. β = β_i(k). Like both FCM and PCM, our approach iterates between estimating β and updating the partition matrix U using the calculated estimates of β_i(k). The prototype values are estimated using a hierarchical approach. We construct a pyramid of images X^r at different resolutions r, having dimensions M^r × N^r, starting from the highest resolution image (r=0) by ideal low-pass filtering and decimating by two. Let β_{i,W}^r(k) denote the estimated cluster prototype for cluster i out of c, at a resolution level r, using a window of size W. Let also U^r denote the fuzzy partition matrix for a certain resolution level r, having dimensions c × (2^{-r}M × 2^{-r}N). At the lowest resolution image, typically of dimensions 32×32 (r=3), either the FCM or the PCM algorithm is applied, its results being an initial segmentation: U^3 = PCM(X^3) or U^3 = FCM(X^3). For each resolution level, the following calculations take place. The values of β_i are calculated (as described in Sec. 2.2) in a window of size W_size that is equal to half the image size; then, the fuzzy partition matrix U^r is calculated in the following manner:

u_i^r(k) = 1 / ( 1 + ( D(x_k, β_{i,W}^r(k)) / η_i )^{1/(m-1)} )    (1)
The η_i values, which define the inter-cluster distance, are calculated as the standard deviations between the prototype values estimated within the window W and the image data. Finally, the spatial constraints are taken into account, thus modifying U^r as described in Sec. 2.3. When the calculation of U^r has converged for a certain window size, the window size is reduced by a factor of two and the whole process is repeated until a minimum window size W_min = 8 is reached. The calculation of U^r has converged for a window size if the number of changes in the fuzzy partition matrix is lower than a specified threshold; a good threshold was found to be 5% of the last number of changes. Typically, 3 to 5 iterations are adequate. When the algorithm has converged for the minimum window size at a resolution level, we have the segmented image for that resolution. The values of β_{i,W}^r obtained are expanded by a factor of 2, and the process of re-estimating β_{i,W}^r and updating the fuzzy partition matrix is repeated for the next resolution level, until the original resolution level is reached. The convergence of U^r at the original resolution is followed by a data fusion procedure that utilises all the segmentation information obtained for the different resolutions to calculate the final segmentation. If we assume a multiresolution quad-tree structure for the segmented pixels, then each segmented pixel at resolution r has four children at resolution r+1. We define an information gain metric (IGM) for measuring the knowledge that the calculation of U at the higher resolution has provided for a cluster i, as the difference between the parent's possibility of belonging to class i and the average of the children's class assignments, that is:
IGM_i^r[k,l] = u_i^r[k,l] − (1/4) Σ_{[m,n] ∈ Children(k,l)} u_i^{r+1}[m,n]    (2)

If IGM_i^r[k,l] is close to zero, then the existence of a homogeneous region is implied and the updated partition matrix results for cluster i are correct with possibility 1 − IGM_i^r[k,l]; otherwise, details must have emerged and the cluster assignments of the lower resolution segmentation are correct with a lower possibility. If U_MN^r denotes the results of segmentation at resolution r expanded to dimensions M×N, the final fuzzy partition matrix U is calculated in the following manner:

U = (1/K) Σ_{r=r_min}^{0} (1 − IGM^r) · U_MN^r    (3)

where K is a normalising constant. The factor 1 − IGM^r removes the bias towards the results of lower resolution segmentation only when details emerge in higher resolutions, providing consistent segmentation of homogeneous regions.
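The quad-tree parent-minus-children average of Eq. (2) is a small array operation. This numpy sketch (our own naming) operates on one cluster's membership map at a time and assumes the finer matrix is exactly twice the size of the coarser one:

```python
import numpy as np

def igm(u_coarse, u_fine):
    """Information gain metric, Eq. (2)-style: each coarse pixel's membership
    minus the mean of its four children in the finer partition matrix."""
    m, n = u_coarse.shape
    # view the fine map as (m, 2, n, 2) so each 2x2 child block can be averaged
    child_mean = u_fine.reshape(m, 2, n, 2).mean(axis=(1, 3))
    return u_coarse - child_mean
```

A value near zero signals a homogeneous region; a large magnitude signals detail emerging at the finer level, which is exactly what the fusion weight 1 − IGM^r penalises.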
2.2 Non-stationary estimation of the cluster prototypes The estimation of the non-stationary cluster prototypes is one of the key elements for the performance of the proposed algorithm. We assume that there exists an ordering such that:

β_1(k) < β_2(k) < ... < β_c(k)    (4)
i.e., the cluster prototypes are ordered before their localised estimation begins. That is a meaningful assumption, especially when trying to assign linguistic variables to the segmentation («dark» objects are always darker than «bright» ones). The ordering is performed after the initial application of FCM or PCM to the lowest resolution image. One can easily observe that, for a given window size W, the following relation holds:

min_{k∈W} {x^r(k)} < β_{1,W}^r < ... < β_{c,W}^r < max_{k∈W} {x^r(k)}    (5)

A good estimation of β_{i,W}^r(k) for a specific window size W may be achieved if we let

β_{1,W}^r(k) = min_{k∈W} {x^r(k)},    β_{c,W}^r(k) = max_{k∈W} {x^r(k)}    (6)

and calculate the rest by bilinear interpolation. The calculation of β_{i,W}^r(k) for a specific resolution r and window size W is performed for a grid of points spaced equal to half the window size (50% overlap); the rest of the values are calculated by bilinear interpolation. In order to maintain reliable estimations for the cluster prototypes, we assign the values resulting from Eq. (6) to them only if the number of pixels in the window assigned to each cluster with possibility greater than 0.5 is greater than the window size; otherwise, the prototype value for each cluster is assigned the value calculated with double the window size. This method of estimation of the non-stationary cluster prototypes is robust and reliable. When the range of image values within a window is small, only the dominant cluster for that window is affected. The results of the adaptive estimation of the cluster prototypes are shown in Fig. 1.

Fig. 1. Adaptive estimation of the cluster prototypes for image <>, c=2, m=2.0. The profile of the image centreline is shown in solid line, while the dashed and dotted lines show the localised estimation of the cluster prototypes 1 (bright region) and 0 (dark region), respectively.
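A simplified sketch of the Eq. (6)-style estimation follows, using non-overlapping windows and piecewise-constant prototypes instead of the paper's 50%-overlap grid with bilinear interpolation (those refinements are straightforward additions); all names are ours:

```python
import numpy as np

def local_prototypes(img, w):
    """Estimate the extreme cluster prototypes beta_1 (dark) and beta_c
    (bright) as the local min / max over w x w windows, spread back over
    the image as piecewise-constant maps."""
    m, n = img.shape
    dark = np.empty((m, n), dtype=float)
    bright = np.empty((m, n), dtype=float)
    for i in range(0, m, w):
        for j in range(0, n, w):
            block = img[i:i + w, j:j + w]
            dark[i:i + w, j:j + w] = block.min()    # beta_1 estimate, Eq. (6)
            bright[i:i + w, j:j + w] = block.max()  # beta_c estimate, Eq. (6)
    return dark, bright
```

By construction the two maps bracket the image everywhere, which is the windowed form of the ordering relation (5).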
2.3 Refinement of the Fuzzy Partition Matrix using spatial constraints The inter-pixel correlation properties of the image are taken into account by finding, for each pixel's 8-connected neighbourhood N_8, the cluster assignment y that has the greatest possibility of being correct:
∀s ∈ N_8:  y_s = argmax_{i=1,...,c} {u_i^r(s)}    (7)

Then, the fuzzy partition matrix is updated using the rule:

u_ik^r = u_ik^r + e if y_k = y_s, and u_ik^r = u_ik^r − e if y_k ≠ y_s, ∀i    (8)

Fig. 2. Segmentation results for the <> image. (a) c=2, m=2.0, (b) c=2, m=2.5, (c) c=4, m=2.5, (d) c=4, m=2.5.
e is a positive constant that regulates our a priori knowledge of the neighbourhood strength, i.e., the inter-pixel correlation. When e is high, the region sizes increase and their boundaries are smoothed. In our implementation, a fixed value of 1/8 and a decreasing sequence using the rule e(r) = 2^{4−r} / 256 were used. The latter smoothes the regions in the low resolution levels of segmentation while leaving the higher resolution images unchanged.
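One plausible reading of rules (7)-(8) in numpy — our interpretation, since the printed notation leaves the neighbourhood bookkeeping implicit — nudges every membership by ±e according to whether its cluster wins the 8-connected neighbourhood vote:

```python
import numpy as np

def spatial_refine(u, e=1/8):
    """Sketch of the spatial-constraint update: u has shape (c, M, N).
    For each pixel, the cluster that wins most often over the 8-connected
    neighbourhood gains +e, all other clusters lose e (boundary pixels use
    the valid part of the neighbourhood). Memberships are clipped to [0,1]."""
    c, m, n = u.shape
    hard = u.argmax(axis=0)                       # per-pixel winning cluster
    out = u.copy()
    for i in range(m):
        for j in range(n):
            wins = [hard[i + di, j + dj]
                    for di in (-1, 0, 1) for dj in (-1, 0, 1)
                    if (di, dj) != (0, 0)
                    and 0 <= i + di < m and 0 <= j + dj < n]
            dominant = max(set(wins), key=wins.count)
            out[:, i, j] -= e                     # penalise all clusters ...
            out[dominant, i, j] += 2 * e          # ... net +e for the winner
    return np.clip(out, 0.0, 1.0)
```

As the text notes, a large e grows regions and smooths their boundaries, because isolated pixels are repeatedly pulled towards their neighbourhood's dominant cluster.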
3. Results and Discussion In this section, the results of the application of the proposed scheme are presented. In general, it can be stated that the proposed scheme results in a simple representation of the image while preserving its basic characteristics. In addition, there are no single-pixel regions (a direct consequence of the presence of noise) in the segmented images. In Fig. 2 the segmentation results for the image <> are presented for different values of the number of clusters c and the fuzzifier m.
The way m affects the final segmentation results is evident. For low m values, the fuzzy partition matrix values are high, approaching a hard partition of the data for almost all data vectors. This results in an inconsistent calculation of the prototype vectors at small window sizes, because almost every data vector contributes to their calculation. On the contrary, when m is greater than 2.0, only the vectors that are most similar to the prototypes are used for the prototype updates. Good segmentations have resulted by setting m = 2.0. In Fig. 3, the effects of e on the final segmentation are presented. It is evident that when e is kept constant at 1/8, the results are biased towards the low resolution segmentations. The opposite holds when e varies according to Section 2.3.
Fig. 3. Segmentation results for the <> image using different values of e. In both cases, c=2, m=2.0. (a) e = 1/8, (b) e(r) = 2^{4−r} / 256.
TABLE I. COMPARISON OF THE CLUSTER'S FUZZY ENTROPY FOR DIFFERENT SEGMENTATIONS OF THE
Another interesting property of the proposed algorithm is the reduction of the fuzzy entropy of the cluster assignments. We have computed the fuzzy entropy of each cluster using the definition provided by Pal & Pal [2]:

H(A_k) = K Σ_{i=1}^{n} [ u_ki e^{1−u_ki} + (1 − u_ki) e^{u_ki} ]    (9)
where n is the number of data vectors, K = 1/n, and A_k is the fuzzy set represented by each cluster assignment. The fuzzy entropies calculated for different segmentations are shown in Table I, which shows a reduction of fuzzy entropy that ranges between 1 and 20%, with the greater values resulting when the number of clusters increases. Lower fuzzy entropy implies better segmentation results in terms of interpretation, a result that should be expected because of the local adaptation of the cluster prototypes. 4. Conclusions We have presented a hierarchical fuzzy algorithm specially optimised for image segmentation that takes into account the inherent image properties, namely the non-stationarity and the high inter-pixel correlation. We have proposed a new approach for the estimation of the cluster prototypes, the modelling of the inter-pixel correlation and, finally, a data fusion method for the incorporation of the segmentation results of different resolutions into the final segmentation. The results are better in terms of fuzzy entropy compared to the PCM algorithm. Further research should include better modelling of the pixels' correlation and enhancement of the data fusion method.
References [1] S.K. Pal, R.A. King and A.A. Hashim, "Automatic grey level thresholding through index of fuzziness and entropy," Pattern Recognition Letters, vol. 1, no. 3, pp. 141-146, March 1983. [2] N.R. Pal and S.K. Pal, "Object background segmentation using new definitions of entropy," IEE Proceedings E (Computers and Digital Techniques), vol. 136, no. 4, pp. 284-295, July 1989. [3] R.L. Cannon, J.C. Dave and J.C. Bezdek, "Segmentation of a thematic mapper image using the fuzzy c-means clustering algorithm," IEEE Trans. on Geoscience and Remote Sensing, vol. 24, no. 3, pp. 400-408, May 1986. [4] F. Wang, "Fuzzy classification of remote sensing images," IEEE Trans. on Geoscience and Remote Sensing, vol. 28, no. 2, pp. 194-201, March 1990. [5] H. Caillol, A. Hillion and W. Pieczynski, "Fuzzy random fields and unsupervised image segmentation," IEEE Trans. on Geoscience and Remote Sensing, vol. 31, no. 4, pp. 801-810, July 1993. [6] B.B. Devi and V.V.S. Sarma, "Fuzzy approximation scheme for sequential learning in pattern recognition," IEEE Trans. Syst., Man, Cybern., vol. 16, no. 5, pp. 668-679, Sept.-Oct. 1986. [7] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms. New York: Plenum, 1981. [8] R. Krishnapuram and J.M. Keller, "A possibilistic approach to clustering," IEEE Trans. Fuzzy Systems, vol. 1, no. 2, pp. 85-110, 1993. [9] Yannis A. Tolias and Stavros M. Panas, "An enhanced fuzzy possibilistic clustering algorithm for image segmentation," in Proc. of the 8th Intern. Symp. on Theoretical Electrical Engineering, ISTET '95, Thessaloniki, Sept. 22-23, 1995, pp. 200-202.
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K.
B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
Hy2: A Hybrid Segmentation Method
F. Marino, G. Mastronardi
DEE, Politecnico di Bari, ITALY
Abstract
Two parametrical methods (Hyperbole and Hysteresis) are presented and compared in order to derive another one (Hy2) that merges the merits of both. The goal is to solve some typical problems of segmentation algorithms with a fixed threshold, which is unfit for the whole image. The benchmark was an image with superimposed Gaussian noise. The same image, without noise, was binarized and used as a reference in order to derive the parameters characterizing the method as a function of the measured quality.
Introduction
Segmentation is a preliminary operation in many image processing procedures and is fundamental in the analysis of edges. After the choice of a threshold T (global, local or locally adaptive), segmentation consists of reducing the grey levels of an image in order to obtain a synthetic representation of some of its characteristics, such as object edges or significant luminance changes [1]. The problem is the suitable choice of the threshold T: it can be chosen arbitrarily or obtained from the histogram of the source image. A global threshold does not guarantee quality comparable to that of locally adaptive thresholds; nevertheless, problems arise in the presence of non-prefiltered noise. The algorithm that we introduce here is a match between two others and performs well directly on noisy images.
The Hyperbole Law
We start from the simple idea that a single light (dark) pixel in a dark (light) region is, with high probability, due to noise. So, if we consider Av, the average of the 8 pixels adjacent to the one being processed, and Gain, a value that parameterizes the method (Q is the quantization level of the input image, commonly 256), a hyperbolic local threshold T may be introduced by (1). It evaluates pixels as a function of the brightness of the adjacent ones and, because of the inverse proportionality, it treats them as noise when they clash with their neighbourhood:

    T = (Gain · Q) / (Av + 1)    (1)
In the formula, divisions by zero are avoided and a certain flexibility is granted by the Gain: obviously, this parameter depends on the histogram of the input image [2]. The law needs 1 division and 8 additions for each pixel to be segmented but may, of course, be approximated by computing a single value of T for each part of a pre-partitioned image. In every case, it reaches its goal (to protect the segmentation from noise) well enough.
Experimental Results: Although segmentation is commonly used on images for medical, biological or mechanical applications, in order to have a more valid test with which to evaluate and compare the efficiency of the different algorithms, we preferred to use a pictorial image: the former, in fact, are often represented by a restricted number of grey levels and exhibit a partial segmentation from the outset; moreover, pictorial images present less marked shading. The benchmark was an image with superimposed Gaussian noise (Fig. 1b). The source image (Fig. 1a), before the noise superimposition, was segmented (Fig. 2a) and used as a reference in order to derive the parameters characterizing the methods as a function of the measured quality. This segmentation was obtained with a global threshold equal to the median of the source image. First, the image was tested with different Gain values to detect the optimal one. For the examined image (and in general for pictorial ones) the best performance is reached with a value of the Gain equal to 70-80% of the median. The percentage errors with respect to the reference image are plotted in Fig. 5a. A qualitative analysis is aided by Fig. 3a and Fig. 4a, which show the best performance of the algorithm and the related error. Of course, segmentation by hyperbole is less noisy than that by a global threshold (Fig. 2b). Nevertheless, the hyperbole law generates a marked erosion (see Fig. 4a) in the proximity of edges: this erosion is proportional to the Gain and to the brightness variation. In order both to reduce this erosion and to remove strongly noisy pixels (provided they are isolated), the hysteresis principle may be adopted.
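A minimal NumPy sketch of the hyperbole law follows, assuming Eq. (1) reads T = Gain·Q/(Av + 1) (the printed formula is partly illegible, so the exact numerator is our assumption) and that a pixel is set "high" when its grey level exceeds its local threshold T. The function name and the test values of Gain are ours.

```python
import numpy as np

def hyperbole_threshold(img, gain, Q=256):
    """Hyperbolic local thresholding, assumed form of Eq. (1):
    T = gain * Q / (Av + 1), where Av is the mean of the 8
    neighbours of each pixel (edges handled by replication).
    Returns a binary image: pixel high iff its grey level > T."""
    img = np.asarray(img, dtype=float)
    padded = np.pad(img, 1, mode='edge')
    # sum of each 3x3 window, minus the centre pixel, divided by 8 -> Av
    win = sum(padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
              for dy in range(3) for dx in range(3))
    av = (win - img) / 8.0
    T = gain * Q / (av + 1.0)
    return img > T
```

With this rule an isolated bright pixel in a dark region faces a very high local threshold (small Av) and is suppressed, while pixels inside a uniformly bright region pass, which is exactly the behaviour described above.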
The Hysteresis Principle
This second algorithm adopts two thresholds, T1 (cut-threshold) and T2 (security-threshold, higher than the first), and is based on the principle that a pixel (with brightness higher than the cut-threshold T1) may be considered "high" only if it is connected to a seed pixel, i.e. one with brightness higher than the security-threshold T2. The algorithm recovers points with luminance greater than T1 if they are contiguous to a seed pixel but, above all, by a suitable choice of T2, it removes isolated noise. For these reasons it solves the problems typical of the hyperbole law, but, as we will show in the next paragraph, it performs very badly in homogeneous areas. It is no accident, in fact, that it was introduced in [4] expressly for edge maps.
Experimental Results: Because of the dependence on the parameters, a test with different pairs of values (T1, T2) was also necessary for the hysteresis algorithm. The percentage errors with respect to the reference image have been computed; from these, a set of curves may be plotted, as shown in Fig. 5b. The graph leads to the conclusion that different values of T1 (provided T1 is lower than the average of the grey levels of the source image) produce no significant differences; on the contrary, T2 must be carefully chosen. With respect to the reference image, the least error was obtained with T2 equal to 110-120% of the median. Fig. 3b and Fig. 4b show the merits (bounded erosion, removal of isolated noise) and the limits (bad performance in homogeneous areas) of hysteresis. Fortunately, these are complementary to those of the hyperbole, so a match between the two algorithms may usefully be considered in order to derive an efficient novel hybrid algorithm.
The Hybrid Hy2 Method
The Hysteresis-Hyperbole (Hy2) algorithm aims to retain the merits both of the local hyperbolic thresholding and of the hysteresis principle in order to directly threshold images affected by noise.
The implementation of Hy2 is similar to that of Hysteresis, but it is characterized by the local computation of T2 using the hyperbolic law. The resulting segmentation will foreseeably improve on that of hysteresis, for the same reason that a local threshold performs better than a global one. Moreover, to make the algorithm fully local, a consideration helps to generate T1 from T2. The idea is that the narrowness of the hysteresis gap may be assumed to be proportional to the local homogeneity: a wide gap, in fact, is advisable only in boundary areas, to correctly process pixels belonging to an edge. Therefore, we define the homogeneity h as:
    h = ( |a − b − c + d| + |a + b − c − d| + |a − b + c − d| ) / (3Q)    (2)
where A, B, C and D are the corner pixels of the 3x3 kernel centered on X, the pixel to be thresholded; a, b, c and d are the grey values of the corresponding pixels; and Q has the same meaning as in (1) and is used to normalize the ratio to 1. So, if we compute the cut-threshold T1 by (3):
the hysteresis gap will be wide in the boundary areas and narrow in the homogeneous ones. Nevertheless, to make the processing less expensive, the computation of (3) is enabled only when required: in fact, the quality of the segmented image does not depend strongly on T1 variations, provided T1 lies within a certain range (see Fig. 5b).
Experimental Results: In order to compare Hy2 with the algorithms shown previously, the same experimental steps were carried out. A set of tests (see Fig. 5a) provided us with the percentage errors with respect to the reference image as a function of the Gain. These errors may easily be compared with those of Hyperbole: both algorithms, in fact, are strongly related to the same single parameter, the Gain. A similar behaviour is evident, but two differences are easily remarked: 1) the two curves are shifted with respect to the Gain axis, and 2) the performances around the minimum error are different: Hy2 performs better than Hyperbole; moreover, Hy2 strongly improves on the minimum error of Hysteresis. The first phenomenon comes from the different use, in the two algorithms, of the hyperbolic Gain. Hy2, in fact, needs two thresholds and adopts the Gain to compute the higher one (T2): so, because the cut is made by T1, it is easily explained why P is obtained with a higher Gain than R. Nevertheless, the second point is the most remarkable: it demonstrates that the goal of directly thresholding images affected by noise has been reached by suitably improving, through the hysteresis principle, a noise-removing local thresholding algorithm. From a qualitative point of view, as shown in Fig. 4c, some noisy pixels of course remain, but they are strongly bounded in homogeneous areas; moreover, the edges are sufficiently well defined.
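The Hy2 scheme can be sketched as follows. T2 is computed by the hyperbolic law (assumed here to be T2 = Gain·Q/(Av + 1)); h follows Eq. (2); Eq. (3), which derives T1 from T2 and h, did not survive in our copy of the paper, so the relation T1 = T2·(1 − h) used below is our own placeholder that merely reproduces the stated behaviour (wide gap where h is large, at boundaries; narrow gap in homogeneous areas). Seed growing uses 8-connectivity; all names are ours.

```python
import numpy as np
from collections import deque

def hy2(img, gain, Q=256):
    """Sketch of the hybrid Hy2 thresholding described above."""
    img = np.asarray(img, dtype=float)
    H, W = img.shape
    p = np.pad(img, 1, mode='edge')
    # T2: hyperbolic local threshold (assumed form of Eq. (1))
    av = (sum(p[dy:dy + H, dx:dx + W]
              for dy in range(3) for dx in range(3)) - img) / 8.0
    T2 = gain * Q / (av + 1.0)
    # h: corner homogeneity of Eq. (2); a, b, c, d are the corner pixels
    a, b, c, d = p[:-2, :-2], p[:-2, 2:], p[2:, :-2], p[2:, 2:]
    h = (np.abs(a - b - c + d) + np.abs(a + b - c - d)
         + np.abs(a - b + c - d)) / (3.0 * Q)
    T1 = T2 * (1.0 - h)           # placeholder for the lost Eq. (3)
    seeds = img > T2
    cand = img > T1
    out = seeds.copy()
    q = deque(zip(*np.nonzero(seeds)))
    while q:                      # grow seeds through candidate pixels
        y, x = q.popleft()
        for ny in range(max(0, y - 1), min(H, y + 2)):
            for nx in range(max(0, x - 1), min(W, x + 2)):
                if cand[ny, nx] and not out[ny, nx]:
                    out[ny, nx] = True
                    q.append((ny, nx))
    return out
```

As in the text, an isolated noisy pixel fails to reach T2, is not rescued by T1, and is removed, while a genuine bright region survives through its seed pixels.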
In conclusion, the Hy2 thresholding algorithm has suitably both joined the merits and removed the limits of Hyperbole and Hysteresis: as a consequence of this merge, it is able to generate an acceptable segmentation directly from a noisy image. Moreover, in order to reduce the processing time, the algorithm can easily be implemented in parallel by partitioning the data of the input image.
References
[1] V.S. Nalva, T.O. Binford, "On Detecting Edges", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-8, No. 6, 1986, pp. 699-714
[2] S. Brofferio, L. Carnimeo, D. Comunale, G. Mastronardi, "A Background Updating Algorithm for Moving Object Scenes", Time-Varying Processing and Moving Object Recognition, Vol. 2, edited by V. Cappellini, Elsevier Science Publishers B.V., 1990, pp. 297-307
[3] G. Mastronardi, S. Sangiovanni, "An Adaptive Filter for Image Processing in occam2", Proc. of ISMM Int. Symp. on Industrial, Vehicular and Space Applications of Microcomputers, New York, 1990, pp. 69-72
[4] Z. Perkovic, T. Ilakovac, "Thresholding of Edge Map Using Hysteresis", Proc. of MIPRO'94, Rijeka, 1994, pp. 3.62-3.67
Fig. 5a: Hyperbole vs Hy2 method: Gain vs % error curves.
Fig. 5b: Hysteresis method: (T1, T2) vs % error curves.
Fig. 1: (a) original image; (b) with superimposed Gaussian noise; (c) noise distribution. Fig. 2: (a) reference image for the error computation, generated directly from Fig. 1a; (b) segmentation of Fig. 1b by global threshold T = Median = 198; (c) Fig. 2b - Fig. 2a error (= 13.99%). Segmentation by: (3a) Hyperbole (Gain = 145); (3b) Hysteresis (T2 = 220, T1 = 170); (3c) Hy2 (Gain = 155). Errors: (4a) Fig. 3a - Fig. 2a (= 10.57%); (4b) Fig. 3b - Fig. 2a (= 10.17%); (4c) Fig. 3c - Fig. 2a (= 8.73%).
Session K: IMAGE ENHANCEMENT/RESTORATION
Efficient computation of the 2-dimensional RGB vector median filter
Stephen J. Sangwine and Anthony J. Bardos
Department of Engineering, The University of Reading, UK
Abstract
The vector median filter may be used for filtering colour images, with all the advantages of the scalar median filter for monochrome images, such as preservation of edges. Vector median filtering in two dimensions is computationally demanding because of the need to compute distance metrics between all pairs of pixels within a neighbourhood. A straightforward naive implementation of the filter using a 3 × 3 neighbourhood on RGB images of 512 × 512 pixels using integer arithmetic requires about 15 seconds of computation on a Pentium 100 MHz PC. The authors have studied the efficient implementation of this filter and have achieved a threefold reduction in execution time, but have encountered the problem of diminishing returns. Execution time below 5 seconds has been achieved, but at the expense of generality, making the coding absolutely specific to the case of a 3 × 3 square neighbourhood, for example. Surprisingly, it appears that significant speedup of the computation of this filter is not possible without parallel processing or hardware implementation, and a fast parameterised software implementation which would work for, say, a 5 × 5 neighbourhood has eluded us.
1. Background
The median filter is well known in monochrome image processing. It selects the pixel with median intensity from among the pixels in a neighbourhood and has the useful property of removing impulsive noise without blurring edges [1]. The vector median filter [2, 3] may be viewed as a generalisation of the median filter to vector-valued signals and images. Colour images may be processed using vector filtering so that pixel values are treated as vector quantities rather than as separate components. In this paper, we consider vector median filtering of RGB colour images, but the conclusions apply also to, say, HSI or CIELAB images, only the distance metric in colour space being different in these other cases. The advantage of vector filtering is obvious: impulsive noise, for example, will be removed by a vector median filter, replacing the noisy pixel with one of its neighbours, thus preserving the local colour properties; whereas median filtering of the RGB image components separately may replace, say, the noisy red component of a pixel with one of its neighbours, leaving the noise-free green and blue components unaltered, achieving only a change of colour of the noise-corrupted pixel rather than its removal. The vector median filter may be evaluated as follows, assuming an RGB image and a simple Pythagorean distance metric in RGB space. We assume a neighbourhood such as a 3 × 3 pixel square, although other shapes and sizes of neighbourhood may be used, just as with conventional median filtering [1]. For each pixel in the chosen neighbourhood the following sum must be computed [2]:

    S_i = Σ_{j=1}^{N} || x_i − x_j ||,    i = 1, ..., N
where x_i is the ith pixel in the neighbourhood, and N = 9 for a 3 × 3 neighbourhood. The pixel which yields the minimum value of S is selected as the vector median and used at the corresponding pixel position in the filtered image. In rectangular RGB space the norm || x_i − x_j || is the Pythagorean distance between the two pixels; thus
    || x_i − x_j || = sqrt( (r_i − r_j)² + (g_i − g_j)² + (b_i − b_j)² )
where r_i is the red component of pixel x_i, and so on. Thus S is simply the sum of the distances in RGB space from the pixel to all the other pixels in the neighbourhood. The selection of the pixel with minimum S may readily be visualized as finding the pixel nearest the "centre" of the pixels within the neighbourhood, viewed as a cluster in RGB space. In this sense the vector median filter can be seen as a simple generalisation of the scalar case. It is not, of course, necessary to compute square roots, since the ordering of the squared norms is the same as the ordering of the norms themselves. It may be noted that the use of other distance metrics would lead to other filters with properties similar to the vector median filter, but requiring simpler computation without multiplication, which could be significant in a fast hardware implementation. The norm used above is the Euclidean distance measure in RGB space.
Other possibilities include the city block distance or the square distance [4]. None of these alternatives yields the same 'ordering' of pixels as the Euclidean distance.
2. Naive implementation
An obvious straightforward implementation of the vector median filter is to iterate over all pixels in the image (excluding edge pixels) and to compute S for each pixel in the 3 × 3 neighbourhood, calculating the distances between pixels afresh each time they are required. The nine values of S generated can be compared as they are computed, retaining only the minimum value found so far and the corresponding pixel value (or its location). Once all nine values of S have been computed, the minimum is known and the corresponding pixel can be output to the filtered image. All calculations may be done in integer arithmetic. As soon as the partial value of S exceeds the best minimum value found so far, further summation may be abandoned for the pixel in question, since the resulting value of S cannot be the minimum. This implementation is very simply coded and easily modified for a different neighbourhood shape. (Change of neighbourhood size should be readily achieved through parameterization.) There are several obvious inefficiencies which can be eliminated. A trivial matter is to eliminate calculation of the norm || x_i − x_j || for the case i = j, because the norm is zero. A more significant inefficiency is the recalculation of norms within a neighbourhood. Since || x_i − x_j || = || x_j − x_i ||, some saving in computation can be made. We refer to this as norm symmetry and discuss it in the next section. An apparently more important issue is that, when the processing of one neighbourhood is complete and the next neighbourhood (around an adjacent pixel) is processed, many of the norms previously evaluated are required again, and similarly when processing moves to the next line in the image.
Avoidance of this recalculation requires some form of caching of norms and we discuss this also in the next section.
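The naive filter described above can be sketched in a few lines of Python/NumPy (the authors' actual implementation is in Ada 95; this is an illustration only, without the early-abandonment optimisation, and the function name is ours). Note that the square root is kept here: summing squared norms instead of norms can, in general, change which pixel minimises the sum.

```python
import numpy as np

def vector_median_3x3(img):
    """Naive 3x3 RGB vector median: for each interior pixel, compute
    S_i = sum of Euclidean RGB distances from neighbourhood pixel x_i
    to the other eight, and output the pixel with minimum S_i.
    Edge pixels are copied through unfiltered."""
    img = np.asarray(img, dtype=np.int64)
    H, W, _ = img.shape
    out = img.copy()
    for y in range(1, H - 1):
        for x in range(1, W - 1):
            nb = img[y-1:y+2, x-1:x+2].reshape(9, 3).astype(float)
            diff = nb[:, None, :] - nb[None, :, :]       # pairwise differences
            S = np.sqrt((diff * diff).sum(axis=2)).sum(axis=1)
            out[y, x] = nb[np.argmin(S)]
    return out
```

On an image corrupted by a single impulsive colour pixel, the filter replaces the outlier by one of its neighbours, preserving local colour, exactly as argued in Section 1.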
3. Norm symmetry and caching
In theory, major computational savings are possible by caching calculated norms used in the evaluation of S, both to exploit symmetry within a neighbourhood and to exploit neighbourhood overlap between adjacent pixels on the same line and between neighbouring pixels on consecutive lines. To quantify the savings we consider a 3 × 3 neighbourhood as shown in Figure 1. The naive implementation outlined in the previous section would evaluate distances (norms) between the pairs of pixels shown in Table 1 (assuming that the trivial step of eliminating distances from a pixel to itself has been taken). Table 1 shows that 72 norms are evaluated by the naive implementation. Clearly this can be reduced to 36 by not evaluating the entries in the upper triangle, if a scheme is used to save the values in the lower triangle. In principle, further savings may be made by caching norms calculated along the whole of the current line and the two previous lines in the image. Once the cache has been filled (after three complete lines have been processed) each new neighbourhood introduces only one new pixel, as shown in Figure 2, requiring only eight new norms to be calculated. Thus, to a good approximation, the number of norms evaluated over the whole image would be reduced to eight times the number of pixels in the image, rather than 36 times when exploiting symmetry alone. As we discuss below, achievement of this level of efficiency is prevented by the overheads associated with caching the norms and then looking up the cached norms. (In short, array address calculations take time comparable to the time taken to recompute the norms from scratch.) We have therefore resorted to a simple scheme which evaluates the norms, storing the values in 36 variables (not arrays, because of the time taken up by address calculations).
With this scheme, caching of 15 of the 36 norms is easily and efficiently implemented by a sequence of variable assignments each time the neighbourhood advances to the next pixel on a line. At the start of a new line, all 36 norms are calculated. The price paid for efficiency, however, is a loss of generality, as the code cannot be parameterised for other sizes of neighbourhood.
4. Caching implementation
There are 36 norms to be evaluated for each pixel position in the image (neglecting edge pixels, for a 3 × 3 neighbourhood). These norms are stored in 36 variables named according to the identifiers shown in the lower triangular part of Table 1. All 36 must be evaluated for the first pixel position of a new line. When processing the next pixel on a line, however, all the norms involving pixels b, c, e, f, h and i can be cached and not recalculated. Norms involving pixels c, f and i of the new neighbourhood must be calculated. These number 21. Table 2 shows the 21 norms of a given neighbourhood (in italic type) which are not available from the previous neighbourhood to the left. The 15 norms shown in normal type are cached by a sequence of 15 variable assignments from the corresponding norms of the previous neighbourhood. Our attempts to achieve better gains in efficiency have been thwarted by increased execution time due to array address calculations, since any larger caching scheme requires the use of arrays to store the cached norms for three entire lines of the image.
5. Theoretical performance gains for larger neighbourhoods
We take the case of a 3 × 3 neighbourhood as our base and assume that only 36 norms are needed per pixel in the image, exploiting symmetry and avoiding zero norms ('distances' from a pixel to itself). Clearly this figure of 36 norms is (9² − 9)/2. The number of norms to be calculated for a 5 × 5 neighbourhood would be (25² − 25)/2, or 300. The number of norms which can be cached for a 3 × 3 neighbourhood is the number of norms between 6 pixels (omitting the column of three which are not part of the next neighbourhood), which is 5 + 4 + 3 + 2 + 1 = 15. Clearly, then, the number of norms which can be cached for a 5 × 5 neighbourhood is the number of norms between 20 pixels (omitting the column of five which is not part of the next neighbourhood), which is:

    Σ_{n=1}^{19} n = 190
The saving in computation using caching for a 5 × 5 neighbourhood is thus about 63%, although efficient computation using scalar variables rather than arrays would be somewhat tedious to code.
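The pair counts used in this section are easy to verify mechanically; the helper names below are ours. Note that the sum over n = 1..19 is 190, so a 5 × 5 window must compute 300 − 190 = 110 norms afresh, a saving of about 63%.

```python
def norms_per_window(n):
    """Distinct pixel pairs in an n x n window: (n^2 choose 2)."""
    return (n * n) * (n * n - 1) // 2

def cacheable_norms(n):
    """Norms among the n*(n-1) pixels shared with the next window on
    the same line (one column of n pixels drops out of the window)."""
    m = n * (n - 1)
    return m * (m - 1) // 2
```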
6. Actual performance comparisons for a 3 × 3 neighbourhood
The algorithm has been implemented in Ada 95 and compiled with the GNAT compiler under Windows 95. The figures in Table 3 were obtained with optimisation on a 100 MHz Pentium PC. The naive version was as described in Section 2, without exploitation of norm symmetry but using integer arithmetic throughout. The cached version used the scheme described in Section 4. The final version incorporates a 256 × 256 look-up table for calculating (a − b)². The look-up table gives a reduction in execution time of about 1 second. Instrumentation of the program reveals that initialisation of the look-up table requires 60 ms, which is included in the figure shown in Table 3. As can be seen, a reduction in execution time of about 75% is achieved from the naive version to the final version, but even on a fast PC the processing of a 512 × 512 RGB image still requires over 4 seconds. It is clear from our work that further reduction in execution time cannot be achieved easily on a serial processor, a result which we have found both surprising and frustrating for what appears to be a simple filter, which is required only to select one pixel out of a small neighbourhood.
References
1. M. Sonka, V. Hlavac and R. Boyle, Image Processing, Analysis and Machine Vision, Chapman and Hall, 1993.
2. J. Astola, P. Haavisto and Y. Neuvo, 'Vector Median Filters', Proc. IEEE, 78, (4), April 1990, 678-689.
3. J. Zheng, K.P. Valavanis and J.M. Gauch, 'Noise Removal from Color Images', J. of Intelligent and Robotic Systems, 7, June 1993, 257-285.
4. R. Beale and T. Jackson, Neural Computing: an Introduction, Institute of Physics Publishing, 1990.
       a    b    c    d    e    f    g    h    i
  a    -    ab   ac   ad   ae   af   ag   ah   ai
  b    ba   -    bc   bd   be   bf   bg   bh   bi
  c    ca   cb   -    cd   ce   cf   cg   ch   ci
  d    da   db   dc   -    de   df   dg   dh   di
  e    ea   eb   ec   ed   -    ef   eg   eh   ei
  f    fa   fb   fc   fd   fe   -    fg   fh   fi
  g    ga   gb   gc   gd   ge   gf   -    gh   gi
  h    ha   hb   hc   hd   he   hf   hg   -    hi
  i    ia   ib   ic   id   ie   if   ig   ih   -

Table 1: Pixel pairs within a 3 × 3 neighbourhood
  ab   ac*  ad   ae   af*  ag   ah   ai*
       bc*  bd   be   bf*  bg   bh   bi*
            cd*  ce*  cf*  cg*  ch*  ci*
                 de   df*  dg   dh   di*
                      ef*  eg   eh   ei*
                           fg*  fh*  fi*
                                gh   gi*
                                     hi*

Table 2: Norms within a 3 × 3 neighbourhood which are not part of the previous neighbourhood and which must therefore be calculated (the 21 marked *; shown in italic type in the original); the remaining 15 are cached.
  naive    14.8
  cached    5.2
  final     4.2

Table 3: Performance results (execution time in seconds for a 512 × 512 RGB image)
  a  b  c
  d  e  f
  g  h  i

Figure 1: 3 × 3 neighbourhood
Figure 2: Single new pixel introduced by new neighbourhood, assuming caching of previous lines.
Image Restoration for Millimeter Wave Images by Hopfield Neural Network
Kenichiro Yuasa 1, Hidefumi Sawai 2, Kenichi Watabe 3, Koji Mizuno 3, Masahide Yoneyama 1
1 Dept. of Information & Computer Science, Toyo University
2 Kansai Advanced Research Center, Communications Research Laboratory, Ministry of Posts and Telecommunications
3 Research Institute of Electrical Communication, Tohoku University
ABSTRACT
A discrete-type Hopfield neural network is proposed for use as a post-processor for an imaging radar using a 60 GHz millimeter wave. The millimeter wave image obtained by the imaging radar is
generally very degraded because of design limitations of the imaging system. The role of the post-processor is to restore the degraded millimeter wave images. The Hopfield neural network is capable of recalling the correct original images from the degraded input image by means of its associative memory effect. Some experimental results using two-dimensional objects are reported.
1.
INTRODUCTION
A vision system is a key technology for developing intelligent robots that work in extremely severe environments in place of human workers. Such a robot vision system is required to reconstruct an object's image and telecommunicate this image to an operator. To create such a vision system, many technologies have already been proposed and researched. In an earlier project, one of the authors proposed a system in which ultrasonic imaging and a feedforward-type neural network were used [1]-[4]. As the ultrasonic wave propagates with much attenuation, it is impossible to measure the backscattered wave from an object far from the receiver array. However, a robot vision system is generally required to obtain information about objects located far from the receiver array. Here, we propose another type of robot vision system which combines millimeter wave imaging and a Hopfield neural network. The authors have already proposed a system using millimeter wave imaging and a neural network [6], [7].
This paper presents recent advances in this work. It is possible to obtain information about an object located at a comparatively long distance from the receiver array by using millimeter waves. By using a Hopfield-type neural network instead of a feedforward type, it becomes easy to teach the network in cases where teaching samples of degraded images cannot be obtained in advance. Image restoration of very degraded millimeter wave images can be performed by means of the associative memory effect of the Hopfield neural network.
2.
MILLIMETER WAVE IMAGING SYSTEM
The arrangement used to measure the backscattered wave from an object is shown in Fig. 1. An object is illuminated by an electromagnetic millimeter wave emitted by a transmitting antenna. As shown in Fig. 1, a virtual plane S(x, y), parallel to the object plane O(x', y'), is assumed to be placed at a distance z from the origin of O(x', y'). The scalar potential φ at any point r(x, y) on the virtual plane S(x, y) is expressed by equation (1). In this equation, ξ(x', y') denotes the reflection coefficient and ζ(x', y') the surface function of the object.
    φ(x, y) = j ∫∫ (exp(jkr)/r) A(r) ξ(x', y') exp(jV·r') dx' dy'    (1)

where
    r = |r|
    V = (vx, vy, vz)
    vx = −k(x/r − sinθ)
    vz = −k(z/r + cosθ)
    r' = (x', y', ζ(x', y'))
    ξ(x', y') : reflection coefficient
    z' = ζ(x', y') : surface function
Figure 1. Arrangement for measuring the backscattered millimeter wave.
In the case of a 2-dimensional object (ζ(x', y') = 0), the shape of the object is equal to its reflection coefficient ξ(x', y'). By setting a dielectric lens at the position of the virtual plane, the scalar potential of the wave produced at the backward focal plane of the lens is proportional to the reflection coefficient pattern of the object, as shown by equation (2):

    φ(xf, yf) = a · ξ(−z·xf/f, −z·yf/f)    (2)
where a is a proportionality coefficient. By distributing many receiving antennas at the focal plane of the lens, it is possible to measure a discrete millimeter wave image ξ(−z·xf/f, −z·yf/f) which is similar to the object shape. In the experiment, we used a receiver array consisting of 10 antennas arranged in a vertical line. This vertical antenna line array is scanned horizontally in 10 steps to make an image of 10 x 10 pixels. However, this millimeter wave image is usually very degraded, owing to the loss of object information caused by design limitations of the imaging system. For example, the small number of receivers and the limited size of the lens are important factors degrading the image quality. Thus, it is almost impossible to recognize the object's shape ξ(x', y') from its electrical field image. The approach of our research is to use a Hopfield neural network as a post-processor connected behind the millimeter wave imaging system. The associative memory effect of the Hopfield neural network makes it possible to recall the correct images from their degraded millimeter wave images.
3.
DISCRETE HOPFIELD NEURAL NETWORK AS AN ASSOCIATIVE MEMORY
As we used 2-dimensional objects in the experiment, only binary-valued data need be treated. Although the millimeter wave images measured by the receiver array take analog values, those values are converted to binary values using a threshold, because the analog intensity values of those images carry no information in the case of 2-dimensional objects. Therefore, the discrete type is the better choice for the Hopfield neural network. The Hopfield neural network used in the experiment has 100 neurons. The neuron allocation is shown in Fig. 2: 10 x 10 neurons are located in a square. The neuron allocation corresponds to the allocation of the receivers of the receiver array, and it is also the same as the pixel allocation of the millimeter wave image. In the teaching mode, the connecting weights Wij are determined by the following equation:
    Wij = (1/N) Σ_{α=1}^{P} Σ_{β=1}^{P} Xαi Xβj (C⁻¹)αβ    (3)
where N is the number of neurons, P is the number of teaching patterns, Xαi is the input value from the ith neuron of the αth teaching pattern, Xβj is the input value from the jth neuron of the βth teaching pattern, and (C⁻¹)αβ is the αβth element of the inverse correlation matrix. In the recalling mode for memorized patterns, the stable state of the network is searched for using the following equation:
    Xi(t+1) = f [ Σ_{j=1}^{N} Wij Xj(t) ]    (4)

where N is the number of neurons and

    f(u) = +1 (u > 0); −1 (u ≤ 0)
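Equations (3) and (4) can be sketched directly in NumPy; the function names are ours, and the stored patterns are bipolar (±1) vectors (of length 100 in the paper; length 10 here, for brevity). With the inverse-correlation-matrix rule, each stored pattern is an exact fixed point of the recall dynamics.

```python
import numpy as np

def train_weights(patterns):
    """Eq. (3): W = (1/N) X^T C^{-1} X, with X the P x N matrix of
    bipolar teaching patterns and C the P x P correlation matrix
    C_ab = x_a . x_b / N.  This inverse-correlation-matrix rule
    makes every stored pattern an exact fixed point of Eq. (4)."""
    X = np.asarray(patterns, dtype=float)
    N = X.shape[1]
    C = X @ X.T / N
    return X.T @ np.linalg.inv(C) @ X / N

def recall(W, x, max_steps=50):
    """Eq. (4): synchronous update X(t+1) = f(W X(t)), with
    f(u) = +1 for u > 0 and -1 for u <= 0, iterated to stability."""
    x = np.asarray(x, dtype=float).copy()
    for _ in range(max_steps):
        new = np.where(W @ x > 0, 1.0, -1.0)
        if np.array_equal(new, x):   # stable state reached
            break
        x = new
    return x
```

Flipping a few bits of a stored pattern and running the recall dynamics recovers the original pattern, which is the associative memory effect exploited above.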
Figure 2. Neuron allocation
4. EXPERIMENTAL RESULTS
Two samples of the teaching patterns for the neural network used in the simulation experiment are shown in Fig. 3.
Figure 3. Teaching patterns
These teaching patterns have binary values and consist of 10 x 10 pixels. They were made according to the shapes of the objects. Two samples of binary degraded millimeter wave images, measured from objects shaped as the capital letters C and D, are shown in Fig. 4(a) and Fig. 4(b) respectively.
Figure 4. Samples of experimental results
The restored image recalled from the degraded image of Fig. 4(a) using the Hopfield neural network is shown in Fig. 4(a)'. Fig. 4(b)' is the restored image recalled from the degraded image of Fig. 4(b).
5. CONCLUSION
The rate of success in recalling the correct images is strongly related to the following two factors: one is the degree of degradation of the input image and the other is the number of images memorized by the network. In the experiment, as we used heavily degraded millimeter wave images as the input patterns to the neural network, the rate of success in recalling the correct images was a value between 50
To improve the restoration capability of this system, it may be effective to increase the number of neurons of the network. Increasing the number of neurons raises the network capacity but, at the same time, makes the convergence time of the computation much longer.
REFERENCES
[1] S. Watanabe and M. Yoneyama, "Ultrasonic Robot Eyes Using Neural Networks," IEEE Trans. UFFC, vol. 37, no. 2, pp. 141-147, 1990.
[2] S. Watanabe and M. Yoneyama, "An Ultrasonic Robot Eyes Using Neural Network," Acoustical Imaging, vol. 18, pp. 83-95, Plenum Press, New York, 1991.
[3] S. Watanabe and M. Yoneyama, "An Ultrasonic Visual Sensor for Three-Dimensional Object Recognition Using Neural Networks," IEEE Transactions on Robotics and Automation, vol. 8, no. 4, p. 4049, April 1992.
[4] S. Watanabe and M. Yoneyama, "A 3D Object Classification Method Combining Acoustical Imaging With Probability Competition Neural Networks," Acoustical Imaging, vol. 20, pp. 65-72.
[5] K. Watabe, K. Yamada, K. Kobayashi, K. Suzuki, M. Yoneyama, K. Mizuno, "Millimeter Wave Imaging Arrays," 1994 Asia-Pacific Microwave Conference Workshops Digest, pp. 103-105, Dec. 1994.
[6] K. Yamada, K. Watabe, K. Kobayashi, T. Suzuki, K. Mizuno, M. Yoneyama, "Neural Network Post Processor for Object Recognition by Millimeter-Wave Imaging," EUFIT'95 (European Congress on Intelligent Technologies and Soft Computing), Aachen, August 1995.
[7] C. Guilhou, H. Sawai, M. Yoneyama, K. Watabe and K. Mizuno, "Reconstructing Millimeter-Wave Image Using Neural Networks With Associative Memory," NOLTA'95 (1995 International Symposium on Nonlinear Theory and its Applications), Las Vegas, December 1995.
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K.
B.G. Mertzios and P. Liatsis (Editors)
© 1996 Elsevier Science B.V. All rights reserved.
IMAGE RESTORATION OF MEDICAL DIFFRACTION TOMOGRAPHY USING FILTERED MEM
Kazuhiko Hamamoto, Tsuyoshi Shiina¹ and Toshihiro Nishimura²
Dept. of Communications Engineering, Tokai University, Hiratsuka, 259-12 Japan.
¹Inst. of Infor. Sciences and Electron., University of Tsukuba, Tsukuba, 305 Japan.
²Dept. of Electrical Engineering, Oita University, Oita, 870-11 Japan.
I. INTRODUCTION
Diffraction tomography is an ultrasonic imaging technique which can take the diffraction effect into account in the reconstruction calculation [1]. It is expected to reconstruct, quantitatively, a higher-resolution image than pulse-echo ultrasonic imaging. In clinical use, however, the range of insonifying angles is often limited, and reflection mode, which is more appropriate for clinical use than transmission mode, suffers from a lack of low-frequency information. To address these problems we have proposed a restoration method called "MEM+IR" [2,3,7]. The method is a fast algorithm combining IR (Iterative Revisions [4]) and the FFT with the MEM applied in radio astronomy [5]. In this image processing, maximization of the image entropy acts as a smoothing of the image. Therefore, when the number of pieces of frequency information obtained from the measured scattered field is not sufficient to restore an image, the edges in the image become blunted, and as a result the quantitative accuracy of the image declines. In this paper, a filtered maximum entropy image restoration method is proposed. In this technique a high-pass filter is applied within MEM+IR to emphasize edges and prevent a decline in the quantitative accuracy of the reconstructed image. Simulation and experimental results show that the image quality can be improved beyond that of MEM+IR.
II. DIFFRACTION TOMOGRAPHY
(a) transmission mode diffraction tomography
The theory of diffraction tomography is based on an inverse scattering problem for the wave equation: the problem of obtaining an object function which represents the refractive index distribution. Fundamental to diffraction tomography is the Fourier diffraction theorem, which is valid when the Born or Rytov approximation is applicable. It is illustrated in Fig. 1.
Fig. 1. The Fourier diffraction theorem in the case of transmission mode.
According to this theorem, by illuminating an object from many different directions and measuring the diffracted data, the frequency domain can be filled with samples of the Fourier transform of the object over an ensemble of circular arcs.
(b) reflection mode diffraction tomography
In reflection mode, ultrasound is irradiated differently: the object is insonified by a pulsed plane wave which contains many different temporal frequency components. The Fourier diffraction theorem in the case of monochromatic plane wave insonification is shown in Fig.2. As Fig.2 shows, reflection mode lacks low-frequency information.
Fig.2. The Fourier diffraction theorem in the case of reflection mode.
(c) limited angle diffraction tomography
Under the limited-angle condition, a lack of frequency information occurs, and the image reconstruction problem therefore becomes ill-posed.
III. RESTORATION METHOD
The relationship relating the reconstructed image vector f = (f_1, ..., f_N)^T (N denotes the number of pixels), the frequency-information vector y = (y_1, ..., y_M)^T obtained from the received signals (M denotes the number of pieces of frequency information), and the M x N Fourier transform matrix F_{M \times N} is expressed by (1).

y = F_{M \times N} f    (1)

We define the image entropy in (2) [5].
H = \sum_{t=1}^{N} \log f_t    (2)
A reconstructed-image energy function J(f) is defined from (1) and (2) in (3).
J(f) = -H + \eta \sum_{t=1}^{M} F_t \left( y_t - (F_{M \times N} f)_t \right)^2    (3)
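An energy of the form J(f) = -H + η Σ F_t (y_t - (Ff)_t)² can be minimized, for illustration, by plain gradient descent on a positive image (a sketch only: the weight array Fw, the step size, the iteration count and the function name are assumptions, and the authors' actual MEM+IR method is a faster FFT/IR scheme):

```python
import numpy as np

def filtered_mem_restore(y, freqs, Fw, N, eta=5.0, lr=1e-3, steps=2000):
    """Gradient-descent sketch of minimizing
        J(f) = -sum_t log f_t + eta * sum_k Fw_k |y_k - (E f)_k|^2,
    where E is the DFT restricted to the measured frequencies `freqs`."""
    E = np.exp(-2j * np.pi * np.outer(freqs, np.arange(N)) / N)
    f = np.full(N, max(np.abs(y).mean() / N, 1e-6))   # positive start
    for _ in range(steps):
        r = y - E @ f                                  # frequency-domain residual
        grad = -1.0 / f - 2 * eta * np.real(E.conj().T @ (Fw * r))
        f = np.clip(f - lr * grad, 1e-8, None)         # entropy needs f > 0
    return f
```

Note how the entropy term pulls every pixel away from zero, which is exactly the smoothing behaviour described in the introduction.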
F_t is a high-pass filter, expressed by (4). The image restoration problem then becomes one of finding the reconstructed image vector that minimizes the energy function J(f), where \eta is a Lagrange multiplier.
IV. RESULTS OF TRANSMISSION MODE DIFFRACTION TOMOGRAPHY
(a) simulation results
Reconstructed and restored images are shown in Fig.3. The simulation model consists of four cylinders, with a maximum radius of 56 mm. The sampling period is 1 mm and N is 65536. The insonifying plane wave has a wavelength of 1 mm, and the interval between insonifying angles is 2°. Fig.3 shows that Filtered MEM+IR is superior to MEM+IR.
(b) experimental results
Experimental images are shown in Fig.4. The gelatin object, whose refractive index is 1.024, is 40 mm in diameter and perforated with four holes of 9 mm diameter [8]. An RF burst plane wave with a frequency of 803 kHz is used as the insonifying plane wave, and the scattered wave is received by a hydrophone [6,8]. The other conditions are the same as those of the simulation. Fig.4 also shows the merit of Filtered MEM+IR.
Fig.3. Reconstructed and restored images, and numerical comparison of true (solid line) and reconstructed (dashed line) values on the centerline in simulation, in the case where the range of insonifying angles is 90°. (a): inverse Fourier transform, error = 40%. (b): MEM+IR, error = 37.9%. (c): Filtered MEM+IR, error = 32.2%. In (c) the edge is emphasized and the quantitative values do not decline compared with (b).
Fig.4. Experimental images. (a): reconstruction from complete data, error = 33.0%. (b): IFFT (limited angle: 90°), error = 59.3%. (c): MEM+IR, error = 47.0%. (d): Filtered MEM+IR, error = 45.6%. The cross section is on the centerline of a hole which is below and to the right in the image.
V. RESULTS OF REFLECTION MODE DIFFRACTION TOMOGRAPHY
In clinical applications, backscattered waves are easier to measure and more desirable than forward-scattered waves, both to avoid shadowing by bones and to decrease the size of the equipment. Therefore, reflection mode is more appropriate than transmission mode. The reconstruction problem, however, becomes ill-posed since it suffers from a lack of low-frequency information. Reconstructed images are shown in Fig. 5. The insonifying angle is 100°. The insonifying pulse wave has a bandwidth of 312.5 kHz (the highest frequency component: 375 kHz; the lowest frequency component: 62.5 kHz). Fig. 5 shows that Filtered MEM+IR is superior to MEM+IR, but the difference is not as clear as in the case of transmission mode. This shows that the influence of the lack of low-frequency information is serious.
VI. CONCLUSIONS
We proposed "Filtered MEM+IR" to prevent a decline in the quantitative accuracy of limited-angle diffraction tomography. Simulation and experimental results show that it can improve the image quality beyond that of MEM+IR. However, a more effective reconstruction method for reflection mode will have to be developed in the future.
Fig.5. Reconstructed images and numerical comparison of true (solid line) and reconstructed (dashed line) values on the centerline in simulation, in the case of reflection mode. The insonifying angle is 100°. (a): inverse Fourier transform, error = 96.5%. (b): MEM+IR, error = 35.1%. (c): Filtered MEM+IR, error = 33.8%.
REFERENCES
[1] A.C. Kak and M. Slaney, "Tomographic imaging with diffracting sources," in Principles of Computerized Tomographic Imaging. New York: IEEE Press, 1987.
[2] K. Hamamoto and T. Shiina, "Investigation on maximum entropy image restoration of limited angle diffraction tomography," Keisoku-Jidoseigyo Gakkai Ronbunsyu, vol. 29, no. 8, pp. 867-875, Aug. 1993. [in Japanese]
[3] K. Hamamoto, T. Shiina and M. Ito, "An Investigation on Practicability of Reflection Mode Diffraction Tomography," Jpn. J. Appl. Phys., vol. 33, Part 1, no. 5B, pp. 3181-3186, 1994.
[4] C.Q. Lan, K.K. Xu and G. Wade, "Limited angle diffraction tomography and its application to planar scanning system," IEEE Trans. Sonics and Ultrason., vol. SU-32, pp. 9-16, 1985.
[5] S.J. Wernecke and L.R. D'Addario, "Maximum entropy image reconstruction," IEEE Trans. Computers, vol. C-26, pp. 351-364, 1977.
[6] A. Yamada and K. Kurahashi, "Experimental image quality estimation of ultrasonic diffraction tomography," Jpn. J. Appl. Phys., vol. 32, Part 1, no. 5B, pp. 2507-2509, 1993.
[7] K. Hamamoto, T. Shiina, T. Nishimura and M. Saito, "Study on maximum entropy image restoration of limited angle diffraction tomography," in Computer Simulations in Biomedicine, ed. H. Power and R.T. Hart, Computational Mechanics Publications, pp. 453-460, 1995.
[8] K. Hamamoto and T. Shiina, "Experimental investigation on maximum entropy image restoration of limited angle diffraction tomography," Denki-gakkai Ronbunshi C, vol. 115-C, no. 12, pp. 1384-1389, Dec. 1995. [in Japanese]
Directionally adaptive image restoration
Xavier Neyt¹, Marc Acheroy², Ignace Lemahieu³
Electrical Engineering Dept. (SIC) of the Royal Military Academy of Belgium
30 av de la Renaissance, B-1000 Brussels, Belgium
Abstract
A directionally selective restoration method is described. The adaptivity is based on the estimation of the local SNR of the ideal image. To achieve directional adaptivity, a directionally selective transform, based on a discrete short-time Fourier transform, is presented. The results obtained are compared with a non-directionally selective restoration method.
I. INTRODUCTION
Image restoration aims at removing blur in the presence of observation noise. The inversion problem being ill-posed, regularisation procedures are commonly used to make the deblurring well-behaved. Many restoration methods have been proposed [1], [2], but most of these consider space-invariant restoration. Adaptive restoration methods, which consist in computing an inversion filter depending on the local properties of the image [3], [4], have only recently been proposed. These methods often involve local image descriptions using a development in basis functions with local support, such as wavelets, windowed polynomials or Gabor functions. In [3], adaptivity is attained by estimating the local SNR of the original image and selecting the restoration filter based on its value. Hence, this adaptive scheme always treats the image as isotropic, which is obviously not the case in the presence of an edge, for instance. This results in a well-deblurred edge transition, but much noise is introduced along the edge. This can be solved by introducing directional adaptivity, i.e. estimating the SNR locally in various directions, making it possible to capture the anisotropies in the image. To carry out this directional selectivity, a directionally selective transform has to be devised. A modified version of the STFT was selected; it is detailed in section 2. The core of the restoration method, i.e. the computation of the restoration filters, is explained in section 3, while section 4 shows how these filters can be used to perform an adaptive restoration. Finally, section 5 shows some results.
II. DIRECTIONAL SELECTIVE TRANSFORM
Since directional selectivity is of great importance, separable filters, which confuse patterns in directions a and -a, cannot be used. Good candidate transforms are the STFT and the Gabor transform.
These two transforms are widely used in pattern recognition, where they serve to detect and classify patterns according to their response in the space-frequency plane. Considering discrete signals, let V(i,j) be the localisation window; the signal L(i,j) localised at location (k\Delta i, l\Delta j) equals L(i,j)V(i - k\Delta i, j - l\Delta j). If we consider a window function of finite dimension (N, M), for instance a binomial function

V^2(i,j) = \binom{N-1}{i + N/2} \binom{M-1}{j + M/2},    (1)

the windowed signals are also of finite length and can be decomposed using the DFT. The discrete Fourier transform consists in decomposing the original signal in a basis of plane waves e^{-2j\pi(kr/N + ls/M)}, where k and l determine the frequency and the orientation of the waves. But since the original signal is real, some simplifications can be made, leading to a transform that closely resembles a decomposition in Fourier series. The basis functions of this decomposition are \cos 2\pi(kr/N + ls/M) and \sin 2\pi(kr/N + ls/M). Due to the parity properties of the sine and cosine functions, the functions corresponding to (k, l) and to (-k, -l) are not independent. Moreover, the sine function corresponding to (k, l) = (0, 0) is also useless. For convenience of notation, these basis functions will be denoted by C_{n,m}(r,s), where odd n corresponds to functions with k > 0, even n corresponds to functions with k < 0, and n = 0 corresponds to k = 0. Something similar holds for m, where odd m corresponds to cosine basis functions and even m to sine basis functions. Hence, up to a constant multiplicative factor, the functions C_{n,m} define an orthonormal basis on R^{N \times M}. Defining D_{n,m}(i,j) = V(-i,-j)C_{n,m}(-i,-j), the decomposition of the localised original signal in the basis functions C_{n,m} can be carried out by sampling the original signal filtered with the filters D_{n,m}
L_{n,m}(k\Delta i, l\Delta j) = [L(i,j) * D_{n,m}(i,j)]_{(k\Delta i, l\Delta j)}    (2)
Similarly, defining P_{n,m}(i,j) = \frac{C_{n,m}(i,j)V(i,j)}{W(i,j)}, where W(i,j) = \sum_{k,l} V^2(i - k\Delta i, j - l\Delta j), and provided W(i,j) is non-zero for all i, j, the reconstruction of the original signal can be carried out by interpolating the coefficients L_{n,m}(k\Delta i, l\Delta j) using the pattern functions P_{n,m}

L(i,j) = \sum_{k,l} \sum_{n,m} P_{n,m}(i - k\Delta i, j - l\Delta j) L_{n,m}(k\Delta i, l\Delta j)    (3)
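A one-dimensional sketch of this analysis/synthesis scheme (binomial window, DFT basis, overlap-add reconstruction normalised by the accumulated squared window; the window length and hop are illustrative choices):

```python
import numpy as np
from math import comb

def binomial_window(N):
    """Window V(i) with V^2(i) proportional to binomial coefficients (cf. eq. 1)."""
    v2 = np.array([comb(N - 1, i) for i in range(N)], dtype=float)
    return np.sqrt(v2 / v2.max())

def stft_analyze(signal, win, hop):
    """Windowed-DFT coefficients of each localised segment (cf. eq. 2)."""
    N = len(win)
    frames = [signal[k:k + N] * win
              for k in range(0, len(signal) - N + 1, hop)]
    return np.array([np.fft.fft(f) for f in frames])

def stft_synthesize(coeffs, win, hop, length):
    """Overlap-add reconstruction divided by W = sum of shifted V^2 (cf. eq. 3)."""
    out = np.zeros(length)
    W = np.zeros(length)
    for idx, c in enumerate(coeffs):
        k = idx * hop
        out[k:k + len(win)] += np.real(np.fft.ifft(c)) * win
        W[k:k + len(win)] += win ** 2
    return out / np.where(W > 0, W, 1.0)
```

Wherever the shifted squared windows cover the signal (W > 0), the reconstruction is exact, which is the role of the normalisation by W in the pattern functions.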
¹ X. Neyt is holder of an IBM grant from the NFWO (Belgian National Fund for Scientific Research), [email protected]
² M. Acheroy is with the EE Dept. (SIC) of the Royal Military Academy of Belgium, Marc.Acheroy@elec.rma.ac.be
³ I. Lemahieu is with the EE Dept. (ELIS) of the University of Ghent, Belgium, Ignace.Lemahieu@rug.ac.be
Fig. 1. D_{n,m} filter functions for n, m = 0, ..., 6.
Fig. 2. |D_{n,m}| transfer functions for n, m = 0, ..., 6.
The analysis functions D_{n,m} are depicted in figure 1, together with their transmittances in figure 2. The directional selectivity of these filters is clearly illustrated in the latter figure. It should be noticed that, while showing some similarities with the DCT, the described decomposition exhibits true directional selectivity, which is not the case for the DCT. Moreover, due to the effect of the window function V(i,j), the DC component of the even (i.e. cosine) filters is not zero.
III. COMPUTATION OF THE RESTORATION FILTERS
The degradation model considered is a linear blur filter with additive white noise. Hence, if L_b denotes the blurred signal and L the ideal signal, we have L_b(i,j) = L(i,j) * B(i,j) + Q(i,j), where B(i,j) is the blurring filter and Q(i,j) represents the noise. Image restoration is obtained by computing an estimate of the coefficients of the transform of the ideal image, starting from the degraded image. We assume that these estimates can be obtained from the blurred image using filters H_{n,m} in an expression similar to (2). Those estimates are then used to reconstruct an estimate of the ideal signal using (3). The coefficients of the filters H_{n,m} are obtained by minimising the mean squared error (MSE) between the unknown coefficients and their estimates \hat{L}_{n,m}. If the noise is assumed to be zero-mean, white and uncorrelated with the ideal signal, the only noise characteristic of interest is its variance E(Q(i,j)^2) = s^2_Q. Moreover, the autocorrelation function of the ideal signal in the neighbourhood of (k\Delta i, l\Delta j) is given by R_{k,l}(p, q; r, s) = E(L(p,q)L(r,s)), and if the signal is assumed to be stationary in the same neighbourhood, i.e. if the statistical characteristics of the signal are shift-invariant there, the autocorrelation becomes R_{k,l}(p, q; r, s) = R_{k,l}(p - r, q - s) = R_{k,l}(r - p, s - q), and the MSE can be rewritten as
\varepsilon^2_{n,m} = s^2_Q \sum_{i,j} H^2_{n,m}(i,j) + [D_{n,m}(-p,-q) * D_{n,m}(p,q) * R_{k,l}(p,q)]_{(0,0)}
    - 2 \sum_{i,j} H_{n,m}(i,j) [D_{n,m}(p,q) * B(-p,-q) * R_{k,l}(p,q)]_{(i,j)}
    + \sum_{i,j} \sum_{i',j'} H_{n,m}(i,j) H_{n,m}(i',j') [B(-p,-q) * B(p,q) * R_{k,l}(p,q)]_{(i-i',j-j')}    (4)
At the minimum of this error, the partial derivative of \varepsilon^2_{n,m} with respect to H_{n,m}(i,j) equals zero, which yields

s^2_Q H_{n,m}(i,j) + \sum_{i',j'} H_{n,m}(i',j') [B(-p,-q) * B(p,q) * R_{k,l}(p,q)]_{(i-i',j-j')} = [D_{n,m}(p,q) * B(-p,-q) * R_{k,l}(p,q)]_{(i,j)}    (5)
Since the latter equality must be satisfied for all values of (i, j), a system of linear equations is obtained whose solution gives the coefficients of the desired filters H_{n,m}. It should be noticed that the matrix on the left-hand side of eq. (5) is independent of the order (n, m) of the filter; hence only one matrix inversion is required to compute all filters H_{n,m}. To simplify these equations, the support of the signal autocorrelation is assumed to be much smaller than the blurring kernel B(i,j), i.e. B(i,j) * R_{k,l}(p,q) = s^2_{k,l} B(p,q) + m^2_{k,l} \sum_{r,s} B(r,s), where s^2_{k,l} is the local signal variance and m_{k,l} its local mean. Even though this hypothesis is not realistic for the whole image,
it remains a good approximation where the deblurring will have the largest effect, i.e. where there are large intensity changes and hence a small correlation length.
If the normalisation condition \sum_{i',j'} H_{n,m}(i',j') = \frac{\sum_{p,q} D_{n,m}(p,q)}{\sum_{p,q} B(p,q)} is enforced, equation (5) finally becomes

\sum_{i',j'} H_{n,m}(i',j') \left[ \frac{s^2_Q}{s^2_{k,l}} \delta_{ii'} \delta_{jj'} + [B(p,q) * B(-p,-q)]_{(i-i',j-j')} \right] = [D_{n,m}(p,q) * B(-p,-q)]_{(i,j)}    (6)
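A one-dimensional sketch of solving a system of this shape (a regularised normal-equation solve: the centred-kernel convention, the function names and the single-variance parameter `snr` = s²_{k,l}/s²_Q are assumptions for illustration):

```python
import numpy as np

def blur_matrix(B, n):
    """Convolution matrix of a centred 1-D kernel: (Bf)[i] = sum_j B[i-j+c] f[j]."""
    c = len(B) // 2
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if 0 <= i - j + c < len(B):
                M[i, j] = B[i - j + c]
    return M

def restoration_filter(B, D, snr):
    """Sketch of eq. (6): the system matrix (I/snr + Bm Bm^T) is shared by
    all orders (n,m); only the right-hand side Bm^T D depends on the
    analysis filter D."""
    n = len(D)
    Bm = blur_matrix(B, n)
    A = np.eye(n) / snr + Bm @ Bm.T        # noise term + blur autocorrelation
    return np.linalg.solve(A, Bm.T @ D)
```

Because A does not depend on D, a single factorisation of A can serve all filter orders, which is the computational point made above about eq. (5).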
The size of the system to be solved depends on the support size of the filter kernel H_{n,m}(i,j), which in turn depends on the width of the localisation window and on the support size of the blurring kernel. Due to the presence of noise, it will not be possible to accurately estimate the coefficients L_{n,m} of high order.
IV. ADAPTIVE RESTORATION
The filters H_{n,m} described in the previous section depend on the local signal variance s^2_{k,l} of the image. Non-adaptive restoration consists in taking a constant value for the variance used when computing the filters H_{n,m}; this constant value results from a compromise between good restoration where possible and little noise amplification in constant areas of the image. By locally adapting the restoration filter H_{n,m} to the local signal variance, optimal restoration can be achieved, taking into account the difference in information content at different places in the image. In order to evaluate what can be gained from an adaptive restoration, let us first compute the estimation error on the coefficients L_{n,m} as a function of the local SNR, for a given restoration filter H_{n,m}. From (4), taking into account the small correlation length of the signal with respect to the width of the impulse responses of B and D_{n,m}, introducing the relations (6) together with the normalisation condition, and defining the following amplification factors

A_Q(n,m) = \sum_{i,j} H^2_{n,m}(i,j)

A_F(n,m) = \sum_{i,j} H_{n,m}(i,j) [D_{n,m}(p,q) * B(-p,-q)]_{(i,j)}    (7)

A_S(n,m) = [D_{n,m}(-p,-q) * D_{n,m}(p,q)]_{(0,0)} - A_F(n,m) - \frac{s^2_Q}{s^2} A_Q(n,m)

where s^2 is the value of the signal variance that was used when computing the filters H_{n,m}, we obtain

\frac{\varepsilon^2_{n,m}}{s^2_Q} = A_Q(n,m) + \frac{s^2_{k,l}}{s^2_Q} A_S(n,m)    (8)
This equation gives the reduced estimation error on the coefficient L_{n,m}(k\Delta i, l\Delta j) as a function of the local SNR, when using restoration filters computed with a given signal variance s^2. From this equation it can be seen that a filter optimised for a high SNR will only give an improvement with respect to a filter optimised for a low SNR if the local SNR (s^2_{k,l}/s^2_Q) exceeds a certain threshold. This threshold increases with the order. Moreover, for coefficients of low order this threshold will be very small; hence low-order coefficients will always be computed using filters optimised for high SNR. Conversely, for high-order coefficients, the filters optimised for low SNR will always be used. For intermediate orders, we switch between the two filters based on the local SNR. An estimate of the local SNR of the ideal signal can be obtained from the estimates of the coefficients of the decomposition. Indeed, from the decomposition equation giving \hat{L}_{n,m}, taking the hypothesis of the small autocorrelation length together with the normalisation condition into account, and introducing the amplification factors defined in (7), we finally obtain an expression (9) giving a way to compute an estimate of the local signal variance of the ideal signal from an estimate of the coefficients of the decomposition of the ideal image. The term in m^2_{k,l} in eq. (9) vanishes if the DC component of the corresponding filter D_{n,m} is zero. For simplicity, we will assume that the above equation is only used when this is verified, i.e. for the sine-based filters. Summing eq. (9) over the (n,m) corresponding to filters D_{n,m} with similar directional selectivity a, we obtain an estimate of the signal variance in the corresponding direction
s^2_a = \frac{\sum_{(n,m) \in a} \hat{L}^2_{n,m}}{\sum_{(n,m) \in a} \gamma(n,m)}    (10)
The influence of the local SNR in the computation of the restoration filters H_{n,m} lies in the extent to which the high frequencies of the ideal signal will be recovered. For low SNR, the high frequencies of the blurred signal are considered as being covered by noise, while for high SNR these high frequencies are recovered by the restoration filters. In other words, filters optimised for a low SNR will always return zero HF coefficients (high-order coefficients), while giving the same low-order coefficients as the filters optimised for high SNR. Hence the adaptive restoration method is as follows: compute a family of restoration filters optimised for a high SNR (30 dB for instance) and compute an estimate of the low-order coefficients in all directions and for all window positions. These estimates are then used to compute an estimate of the local signal variance, and at the window positions where the local signal variance in certain directions exceeds a threshold, compute the higher-order coefficients corresponding to that direction and location. All estimated coefficients are then used to reconstruct an estimate of the ideal image before degradation.
V. RESULTS
The blurred image (fig. 3) used as test image was taken with an out-of-focus CCD camera. It should be noticed that the blur obtained is quite severe and that the CCD camera used was very noisy. The blurring filter, which must be known in order to apply this algorithm, was determined starting from a parametrised model of an out-of-focus camera; the parameters of the model were determined by trial and error. Figure 4 presents the
Fig. 3. Degraded image
Fig. 4. Simple adaptive method
Fig. 5. Directional adaptive method
same picture restored using the restoration scheme described in [3], where local but non-directional adaptivity with respect to the local image SNR is considered. Although the noise in the constant areas of the image is smoothed out, artifacts appear near large intensity transitions. Figure 5 presents a restoration conducted using the directional adaptive method described in this paper. No precautions were taken to avoid the border effects visible in figure 5.
VI. CONCLUSIONS
A directional adaptive restoration method has been presented. This method is based on the estimation of the local SNR of the ideal image. To achieve directional selectivity, a directionally selective decomposition scheme, based on a variant of the discrete short-time Fourier transform, was developed. The major improvement of this method with respect to non-directional adaptive restoration methods is particularly visible along edges, where the noise in the direction of the edge is greatly reduced.
REFERENCES
[1] G. Demoment, "Image Reconstruction and Restoration: Overview of Common Estimation Structures and Problems," IEEE Trans. ASSP, vol. 37, no. 12, pp. 2024-2036, Dec. 1989.
[2] J. Biemond, R.L. Lagendijk, R.M. Mersereau, "Iterative Methods for Image Deblurring," Proc. of the IEEE, vol. 78, no. 5, pp. 856-883, May 1990.
[3] J.B. Martens, "Deblurring Digital Images by Means of Polynomial Transforms," Computer Vision, Graphics and Image Processing, vol. 50, pp. 157-176, 1990.
[4] M.K. Özkan, A.M. Tekalp, M.I. Sezan, "POCS-Based Restoration of Space-varying Blurred Images," IEEE Trans. on Image Processing, vol. 3, no. 4, pp. 450-454, July 1994.
[5] J.B. Martens, "The Hermite Transform -- Theory," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 38, no. 9, pp. 1595-1606, September 1990.
Optimal Matching of Images at Low Photon Level
Mireille Guillaume, Thierry Amouroux and Philippe Réfrégier
Laboratoire Signal et Image, ENSPM, Domaine universitaire de Saint-Jérôme, 13397 Marseille cedex 20, France
Bruno Milliard and Antoine Llebaria
Laboratoire d'Astronomie Spatiale, CNRS, BP 8, traverse du Siphon, 13376 Marseille cedex 12, France
Abstract
We consider the problem of matching astronomical images with a very low photon level. We analyze the performance of the optimal technique when the noise of the shifted images is no longer Gaussian and the classical linear intercorrelation method fails.
A classical problem in image processing consists in determining the translation from one noisy image to a reference image [1,2]. For the astronomical application that we consider, the image detector is fastened to a shuttle whose motion blurs the observed images. In order to preserve the quality of the image, the exposure time of the detector is very small, and as a consequence very few photons may be present in each observation. All the observed images s_p are matched and added together in order to build a final image r. At low photon levels, the observed images s_p are mainly corrupted by the discrete nature of the photons, which is described by Poisson statistics. In this paper, we propose to analyze the performance of the optimal technique for the estimation of the translation of the observed image in the presence of such Poisson noise, in comparison with the classical linear intercorrelation method. Let s(i) denote the intensity of the observed image at pixel i, with i \in [1, N], where N denotes the number of pixels of the image. Let r(i) denote the intensity of the reference image and \tau the amplitude of the translation between the reference image and the observed image. The probability of observing s(i) photons at pixel i is then, according to the Poisson law:
P[s(i) | \tau] = \exp[-\mu(i)] \frac{[\mu(i)]^{s(i)}}{s(i)!}    (1)
where \mu(i) is the average number of photons at pixel i. This value is proportional to the intensity of the reference translated by \tau and to the exposure time: thus, we can write \mu(i) = \lambda r(i - \tau), with \lambda proportional to the exposure time. In the framework of classical parameter estimation theory [3], when no a priori information is available, the optimal choice of \tau is the one which maximizes the log-likelihood of the hypothesis. Furthermore, if we assume cyclic boundary conditions for the reference, one can show that the optimal estimate of \tau is obtained by maximizing:
l(\tau) = \sum_{i=1}^{N} s(i) \cdot \ln[r(i - \tau)]    (2)
Let us recall that under the assumption of additive Gaussian noise, the optimal translation estimate would be obtained by maximizing the classical linear intercorrelation between s(i) and r(i).
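Criterion (2) is itself a cross-correlation, of the observed counts with the logarithm of the reference, so it can be evaluated for all cyclic shifts at once with an FFT (a 1-D sketch; the function name is illustrative and the reference must be strictly positive):

```python
import numpy as np

def poisson_match(s, r):
    """Estimate the cyclic shift tau maximizing eq. (2),
    l(tau) = sum_i s(i) ln r(i - tau), via a circular
    cross-correlation of the counts s with ln r."""
    log_r = np.log(r)                       # requires r > 0 everywhere
    # c[tau] = sum_i s[i] * log_r[(i - tau) mod N]
    c = np.real(np.fft.ifft(np.fft.fft(s) * np.conj(np.fft.fft(log_r))))
    return int(np.argmax(c))
```

Replacing `log_r` by `r` in the same routine gives the classical linear intercorrelation estimator that the paper compares against.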
The optimal estimation, in the presence of Poisson noise, thus consists in maximizing the intercorrelation between s(i) and \ln[r(i)]. Let us now analyze the gain in performance between the classical linear intercorrelation algorithm and the optimal method. For that purpose, we have numerically determined, for each method, the probability of correct estimation of the translation between the observed image and the reference image. This study has been done with the simulated reference image of fig. 1a, as a function of the normalised exposure time \lambda. We show in figs. 1b and 1c different realizations of s(i), i = 1, 2, ..., N, for two values of \lambda:
fig. 1a; fig. 1b, \lambda = 10^{-3}; fig. 1c, \lambda = 3 \cdot 10^{-4}. Simulated image of the sky used in the test as the reference image, and examples of numerical images having the statistical distribution expressed in Eq. 1, for different values of \lambda.
One can note that although the optimal processor expressed in Eq. 2 does not depend on \lambda, the probability of correct estimation depends on it, and thus on the exposure time of the detector. We can see in Fig. 2a that the optimal method is appreciably more efficient than the classical linear intercorrelation. The gain vanishes at high photon levels (i.e. high values of \lambda). This behavior can easily be understood, since Poisson distributions converge to Gaussian distributions at high \lambda values. It is interesting to analyze the conditions under which one can expect a large (or a small) difference between the linear intercorrelation and the optimal method. For astronomical problems, the minimum value of r(i) is equal to a positive constant, the mean brightness of the sky in the regions devoid of stars. We have studied the influence of this value on the probability of correct estimation of the translation of the image s(i). Let r_max (resp. r_min) be the maximum (resp. minimum) value of the image. For a given value of r_max, we have analysed the influence of r_min, which is equivalent to modifying the contrast ratio. Indeed, a multiplicative factor of luminance can be considered as a modification of \lambda. One can observe in Figs. 2a and 2b, which correspond to different values of r_min, that the difference between the two methods increases as r_min decreases, and that the difference in performance is significant for small \lambda values. It is equally interesting to take into account the shortening of the exposure time that can be expected, because it permits a reduction of the blurring effect.
The algorithm of Eq. 2 has been applied to real images of the sky at the Spatial Astronomy Laboratory of Marseille, associated with a fast correlation algorithm, and a new image r has been rebuilt. The reference image was a previous one, rebuilt with matching from the classical correlation method. The stars in the new image have a PSF (point spread function) 3.9 pixels wide, close to the optical resolution, which is 3.7 pixels wide.
fig. 2a, r_min = 0.1; fig. 2b, r_min = 8. Probability of correct estimation of the translation vector as a function of \lambda, with \lambda varying between 0.0 and 0.003. Different values of r_min are used, with r_max = 256.
In this paper, we have analyzed the problem of the optimal estimation of the translation between an observed image corrupted by Poisson noise and a nonnoisy reference image. We show that the nonlinear intercorrelation becomes more effective than the linear one when there are regions of the observed image with very few photons, whose statistical properties are very different from the Gaussian distribution. A substantial improvement can then be obtained at the price of a negligible increase in the complexity of the algorithm. These results assume knowledge of a nonnoisy reference image. Of course, in many applications this ideal situation does not occur, and we have to estimate the reference from a noisy image sequence. In the framework of Bayesian theory, the maximum-likelihood estimate of the reference image r is sought. One must then consider a set S of P noisy and shifted images, S = (s_1, s_2, s_3, ..., s_P). Let \tau_p be the translation of the image s_p and T = (\tau_1, \tau_2, \tau_3, ..., \tau_P). The maximum-likelihood estimates of r and T are r^{MV} and T^{MV}:

(r^{MV}, T^{MV}) = \arg\max_{r,T} P(r, T | S)
A simple calculation leads to:

r^{MV}(i) = \frac{1}{P} \sum_{p=1}^{P} s_p(i + \tau_p)    (3)
So the maximum likelihood image is the mean of the images. And:
\tau_p^{MV} = \arg\max_{\tau_p} \sum_i r^{MV}(i) \log r^{MV}(i)    (4)
It is striking that, when the noise present in the images has a Poisson distribution, the criterion to take into account for an optimal matching of the images is the entropy of the mean image. The optimal estimate of the shift \tau_p is the one that minimizes this entropy. This can be compared with the likelihood estimate of the reference r when the noise is additive and Gaussian:

\tau_p^{MV} = \arg\max_{\tau_p} \sum_i | r^{MV}(i) |^2    (5)
In this case the likelihood estimate is the one that maximizes the energy of the mean image. Thus, when the reference image is unknown, the optimal processor is no longer the one given in Eq. 2. This is currently under investigation in our laboratories and will be discussed during the presentation.
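The optimal processor of Eq. 2 is not reproduced legibly in this excerpt; the sketch below assumes the standard Poisson maximum-likelihood form, in which the observed counts are correlated with the logarithm of the reference rather than with the reference itself. Function names and the 1-D test signal are illustrative only.

```python
import numpy as np

def estimate_shift(s, r, use_log=True):
    # Exhaustive search over cyclic shifts t of the criterion
    #   C(t) = sum_i s(i) * ref(i - t)
    # with ref = log(r) (assumed Poisson ML form of Eq. 2)
    # or ref = r (classical linear intercorrelation).
    ref = np.log(r) if use_log else r
    C = np.real(np.fft.ifft(np.fft.fft(s) * np.conj(np.fft.fft(ref))))
    return int(np.argmax(C))

# Illustrative 1-D "sky": sparse bright stars on a faint uniform background
rng = np.random.default_rng(0)
r = 0.1 + 20.0 * (rng.random(256) < 0.05)
s = rng.poisson(np.roll(r, 37))      # photon-limited observation, shifted by 37
print(estimate_shift(s, r, use_log=True))
```

At high photon levels the two criteria agree, as noted in the text; the difference appears when the background term rmin is small and the counts are sparse.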
References
[1] A.K. Jain, Fundamentals of Digital Image Processing (Prentice Hall Information and System Sciences Series, New Jersey, 1989), pp. 400-406.
[2] B.V.K. Vijaya Kumar, F.M. Dickey, and J.M. DeLaurentis,
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors) (c) 1996 Elsevier Science B.V. All rights reserved.
A Method for Controlling the Enhancement of Image Features by Unsharp Masking Filters

E. Cernadas, L. Gomez, A. Casas, P. G. Rodriguez and R. G. Carrion
Laboratorio de Imagen Digital, Departamento de Electronica y Computacion, Facultad de Fisica, Campus Universitario Sur, 15706 Santiago de Compostela, SPAIN. E-mail: [email protected].

Abstract: Unsharp masking is a well-known and useful image enhancement technique. This paper describes how to design unsharp masking filters by relating the MTF of the filter to the DFT of the original image.

Introduction. Unsharp masking is an edge-enhancement technique that has proved to be more effective than other kinds of filters in numerous image processing applications, ranging from clinical diagnosis [2, 4, 7] to the detection of star clusters [1]. Although unsharp masking filters have been extensively explained in the image processing literature [5, 6], their parameters have in almost all practical cases been determined by trial and error or by ROC (Receiver Operating Characteristic) methods. On the basis of a general analysis of the MTF (Modulation Transfer Function) of these filters, this work describes how to design filters for image enhancement by examining the DFT (Discrete Fourier Transform) of the image and the image features it is desired to enhance.

Theoretical foundations. Unsharp masking consists in depressing the low frequencies of an input image I_inp by subtracting from it a smoothed version of itself, I_s. In the case of images represented by a discrete matrix, the output image I_out is thus given by

I_out(n_x, n_y) = k_0 I_inp(n_x, n_y) - k_1 I_s(n_x, n_y)    (1)
where k 0 and k 1 are real constants. One of the several ways of calculating the "unsharp" mask I s [3] is by convolution of the input image with a two-dimensional discrete square pulse p(x,y):
{~ Ixl<-~
p (x, y) =
otherwise
'
lyl<---2t-x
(2)
where L is an odd integer. The smoothed image is then given by

I_s(n_x, n_y) = [I_inp * p](n_x, n_y)    (3)
Substitution of (3) in (1) yields

I_out(n_x, n_y) = k_0 [I_inp * \delta](n_x, n_y) - k_1 [I_inp * p](n_x, n_y) = [I_inp * h](n_x, n_y)    (4)
where \delta is the two-dimensional discrete Dirac function and h, the point spread function of the filter, is given by

h(n_x, n_y) = k_0 \delta(n_x, n_y) - k_1 p(n_x, n_y)    (5)
If the size of the image matrix is NxN, the discrete Fourier transform of h is defined on the discrete set {(m_x, m_y) : m_x, m_y \in [-N/2, N/2]} by

H(m_x, m_y) = k_0 - \frac{k_1}{L^2} \sum_{n_x=-(L-1)/2}^{(L-1)/2} \sum_{n_y=-(L-1)/2}^{(L-1)/2} W_N^{m_x n_x + m_y n_y}    (6)
where W_N = exp(-2\pi j / N). For practical purposes, the cut-off frequency of the filter may be defined as the lowest frequency, along either coordinate direction, for which |H| = k_0. A simple calculation from equation (6) shows that m_cut, the value of m corresponding to this frequency, is given by

m_cut = N / L    (7)
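Equation (6) can be checked numerically: building the point spread function h of Eq. (5) on an NxN grid and taking its 2-D DFT shows that |H| returns to k_0 at m = N/L, as Eq. (7) states. The sketch below uses illustrative values N = 135, L = 5 (chosen so that N/L is an integer and the zero of the sum is exact).

```python
import numpy as np

N, L, k0, k1 = 135, 5, 1.0, 0.5   # illustrative sizes, not the paper's 512x512 case
# Point spread function of Eq. (5): h = k0*delta - k1*p, with p an L x L
# box of height 1/L^2 centred at the origin (cyclic indexing).
h = np.zeros((N, N))
h[0, 0] += k0
half = L // 2
for dx in range(-half, half + 1):
    for dy in range(-half, half + 1):
        h[dx % N, dy % N] -= k1 / L**2
H = np.fft.fft2(h)
m_cut = N // L
print(abs(H[0, 0]), abs(H[m_cut, 0]))   # DC gain is k0 - k1; gain at the cut-off is k0
```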
Filter design. Appropriate choice of the parameters L, k_0 and k_1 depends on both a priori considerations and the nature of the image to be processed.
1) To avoid overshooting the allowed dynamic range while minimising compression of the dynamic range of the image, the gain of the filter at frequencies above cut-off should be unity. In other words, if there are no sub-cutoff frequencies in the image, then I_out should be equal to I_inp. Examination of equation (1) shows that this condition implies k_0 = 1. Note that if I_out = I_inp at high frequencies, the unsharp masking filter does not emphasise noise in the original image.
2) For a constant signal (I_inp(n_x, n_y) = const.), I_s = I_inp and I_out = (k_0 - k_1) I_inp. Thus, to prevent the output signal from having a negative grey level, k_0 - k_1 must be non-negative, and as k_0 = 1, k_1 must be less than or equal to 1. At the same time, however, k_1 must be positive if the filter is to have the desired effect of depressing the low frequencies present in the smoothed function (see equation (1)), i.e. k_1 > 0. From these conditions, k_1 must fall in the interval (0, 1].
3) The value of L depends on the desired cut-off frequency, which in turn depends on the frequencies present in the image to be processed and the image features it is wished to enhance. The cut-off frequency may be known a priori from the size of the image features to be enhanced, or its selection may be assisted by examination of the discrete Fourier transform of the image (see results and discussion). Once m_cut has been decided on, L is given by equation (7).

Results and discussion. We illustrate the above theory by discussing the digitised 512x512 pixel image I_inp shown in figure 1.a. Choosing k_1 in the permissible range (0, 1] amounts to specifying a compromise between improving contrast (high k_1; almost total suppression of low frequencies) and maintaining dynamic range (low k_1).
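The design rules above can be sketched in a few lines (function and variable names are ours; border handling by edge replication is a choice the paper does not specify):

```python
import numpy as np

def unsharp_mask(img, L, k0=1.0, k1=0.5):
    # Eq. (1): I_out = k0*I_inp - k1*I_s, where I_s is the input convolved
    # with an L x L box filter of height 1/L^2 (Eqs. (2)-(3)), L odd.
    assert L % 2 == 1, "mask size L must be odd"
    pad = L // 2
    padded = np.pad(np.asarray(img, float), pad, mode="edge")
    kernel = np.ones(L) / L
    # Separable box filter: running mean along rows, then along columns
    smooth = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    smooth = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, smooth)
    return k0 * np.asarray(img, float) - k1 * smooth

def mask_size_for_cutoff(N, m_cut):
    # Eq. (7): m_cut = N / L  =>  L = N / m_cut, rounded to an odd integer
    L = int(round(N / m_cut))
    return L if L % 2 == 1 else L + 1

print(mask_size_for_cutoff(512, 17), mask_size_for_cutoff(512, 102))  # -> 31 5
```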
The optimal compromise will depend on the kind of image to be processed. For the example of this paper we take k_1 = 0.5. The choice of cut-off frequency depends on the image features which it is desired to enhance in each application. Its value can be suggested by analysis of the DFT of the original image. The DFT of a two-dimensional image is difficult to interpret (figure 1.b shows an image of intensity log(1 + |F(u,v)|), where F(u,v) is the Fourier transform of figure 1.a), but it can be analysed by taking sections through its midpoint. Examination of the morphology of these sections suggests that the frequency range of interest may be the medium frequencies between the central peak and about m = 100 pixels. For illustrative purposes, we consider the effect of two extreme cut-offs: one (m_cut1 = 17) depressing only the broadest wavelengths in the original image, and the other (m_cut2 = 102) depressing all but the finer details. By equation (7), these values of the cut-off parameter correspond to mask sizes L_1 = 31 (512/17 = 30.1, but the nearest odd integer must be used) and, in the same way, L_2 = 5. The filters specified above reduce and shift the dynamic range of the original image (figure 2). To expand the dynamic range of the output from the unsharp masking filter I_out, we use the simple piecewise linear function that is also shown in figure 2. The two points at which the three linear segments of this function intersect are fixed where a threshold line (0.05% of all the pixels of the image; 131 pixels for a 512x512 image matrix) intersects the I_out histogram, approaching from the lowest and the highest values of I_out (also shown in figure 2). For a more complex kind of expansion function, see ref. [7]. Figure 3 shows the results of processing figure 1.a using a mask of size 31 (figure 3.a) or 5 (figure 3.b) and the dynamic range expansion function and k_i values discussed above.
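The breakpoints of the expansion function are located where a 0.05% threshold line meets the I_out histogram from each end; the sketch below approximates this with the corresponding quantiles (an assumption, since the paper works with histogram bins), then stretches linearly and clips:

```python
import numpy as np

def expand_dynamic_range(out_img, frac=0.0005, lo=0.0, hi=255.0):
    # Piecewise-linear expansion: grey levels below the low breakpoint map to
    # `lo`, above the high breakpoint to `hi`, linear in between.
    # Breakpoints approximated by the `frac` and (1 - frac) quantiles
    # (frac = 0.0005, i.e. 0.05% = 131 pixels for a 512x512 image).
    flat = np.sort(np.asarray(out_img, float).ravel())
    n = flat.size
    t_low = flat[int(frac * n)]
    t_high = flat[int((1 - frac) * n) - 1]
    stretched = (out_img - t_low) / max(t_high - t_low, 1e-12) * (hi - lo) + lo
    return np.clip(stretched, lo, hi)
```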
In the image produced with the smaller smoothing mask it is the fine details that stand out (particularly the hair),
whereas the larger smoothing mask enhances features such as the chairback. Similar results were obtained at 256x256 pixel resolution using smoothing masks that were half the size of those used for 512x512 resolution.
Figure 1.- a) Original image after digitization to a 512x512 pixel matrix, and b) log(1 + |F(u,v)|), where F(u,v) is the discrete Fourier transform of figure 1.a.
Figure 2.- Histograms of I_out and I_inp, and the dynamic range expansion function used to obtain figure 3.a.
Figure 3.- Figure 1.a after unsharp masking with k_0 = 1, k_1 = 0.5 and a mask size L of a) 31 pixels or b) 5 pixels, followed by dynamic range expansion by means of the function shown in figure 2 (figure 3.a) or an analogous function (figure 3.b).
Conclusions. We have developed a method for the objective design of unsharp masking filters in accordance with those features of the original image which it is desired to enhance. The filters so designed do not emphasise the noise of the original image. This approach is an advance on previous methodologies, which have been based on subjective evaluation of the final appearance of the image and/or on ROC studies.

References
[1] BENEDICT, G. F., JEFFERYS, W. H., DUNCOMBE, R., HEMENWAY, P. D., SHELUS, P. J., WHIPPLE, A. L., NELAN, E., STORY, D., MCARTHUR, B., MCCARTNEY, J., FRANZ, O. G., FREDRICK, L. W., and VAN ALTENA, WM. F.: "NGC 4314. II. Hubble Space Telescope I-band surface photometry of the nuclear region", The Astronomical Journal, April 1993, 105, (4), pp. 1369-1377
[2] CHAN, H., VYBORNY, C. J., MACMAHON, H., METZ, C. E., DOI, K., and SICKLES, E. A.: "Digital mammography: ROC studies of the effects of pixel size and unsharp-mask filtering on the detection of subtle microcalcifications", Investigative Radiology, Jul. 1987, 22, (7), pp. 581-589
[3] JAHNE, B.: Digital Image Processing (Springer-Verlag, 1991)
[4] LAINE, A., FAN, J., and YANG, W.: "Wavelets for contrast enhancement of digital mammography", IEEE Engineering in Medicine and Biology, Sep/Oct 1995, pp. 536-550
[5] LIM, J. S.: Two-dimensional Signal and Image Processing (Prentice Hall, 1990)
[6] PRATT, W. K.: Digital Image Processing, 2nd ed. (Wiley-Interscience, 1991)
[7] TAHOCES, P. G., CORREA, J., SOUTO, M., GONZALEZ, C., GOMEZ, L., and VIDAL, J. J.: "Enhancement of chest and breast radiographs by automatic spatial filtering", IEEE Trans. on Medical Imaging, September 1991, 10, (3), pp. 330-335. Also in Yearbook of Medical Informatics 92, pp. 223-229
Image Noise Reduction Based on Local Classification and Iterated Conditional Modes*

K. Haris 1,3, S.N. Efstratiadis 1,2, N. Maglaveras 1, and C. Pappas 1
1 Lab. of Medical Informatics, Faculty of Medicine, Aristotle University, Thessaloniki 54006, Greece; e-mail: [email protected]
2 Information Processing Lab., Dept. of Electrical & Computer Eng., Aristotle University, Thessaloniki 54006, Greece
3 Dept. of Informatics, School of Technological Applications, Technological Educational Institute of Thessaloniki, Sindos 54101, Greece

Abstract
In this paper, we propose a novel method for smoothing multidimensional piecewise constant images corrupted by Gaussian white noise. The method applies local hypothesis testing to decide between the presence or absence of image intensity discontinuities (structure) in each pixel neighborhood. In the first case, the local data are considered a sample of a mixture of two Gaussian distributions, and the mixture parameters are estimated using the moment estimation method, which provides closed-form estimates. The estimated parameters allow the local classification of the data and the estimation of the true value of the current neighborhood's central pixel. The latter result is used as input to the Iterated Conditional Modes (ICM) algorithm, which further improves the estimation results significantly. The proposed method is compared with a number of other image smoothing methods based on their noise reduction and edge preservation properties. Experimental results using synthetic and real 2D images are presented.

Keywords: image smoothing, hypothesis testing, local classification, Markov random field.
1 Introduction
The reduction of the noise that corrupts almost all acquired images, e.g. due to sensor imperfections, is one of the most important tasks in image processing and computer vision. Most higher-level computer vision tasks are directly affected by errors in the images, which introduce uncertainties and lead to wrong decisions [1]. Since linear filtering methods do not preserve intensity discontinuities and distort fine detail features, numerous nonlinear methods have been proposed [2]. The approaches based on Markov random field models allow the formulation of the image smoothing problem as an optimization problem using principles such as the Maximum A Posteriori principle, the Maximization of Posterior Marginals principle [3], and the iterative maximization of the conditional marginals given an initial image estimate [4]. Other approaches take into account the underlying image structure (adaptive smoothing techniques) using various adaptation methods based, for example, on Anisotropic Diffusion [5, 6]. An efficient nonlinear adaptive noise reduction approach using statistical local classification was proposed in [7, 8]. In this paper, extending the work in [7, 8], we propose a two-stage adaptive smoothing algorithm based on the assumption that the true image is piecewise constant and is contaminated with white Gaussian noise. More specifically, let L = {0, 1, ..., L_m} be the possible intensity values and G = {(i,j) : 1 <= i <= N_r, 1 <= j <= N_c} the spatial pixel coordinates of an N_r-row by N_c-column image. The m x n neighborhood of pixel (i,j) is defined as N_{mxn}(i,j) = {(k,l) \in G : |i-k| <= m, |j-l| <= n}. The observed image J is given by
J(i,j) = I(i,j) + n(i,j),   (i,j) \in G,
where n(i,j) is zero-mean Gaussian noise with standard deviation \sigma. It is also assumed that a) the true image is piecewise constant, and b) small pixel neighborhoods contain either one (homogeneous area) or
*This work was funded in part under the project I4C of the Health Telematics programme of the CEC.
two (heterogeneous area) regions [7, 8]. During the first stage of the algorithm, a hypothesis test is applied to each pixel neighborhood in order to decide on its homogeneity or heterogeneity. In the latter case, the neighborhood data is considered a sample of a two-component Gaussian mixture, the parameters of which are estimated by the direct (non-iterative) method of moments. The estimated parameters are used for the classification of the central pixel to one of the two populations. During the second stage, the available estimate of the true image is used as the initial estimate of the iterated conditional modes (ICM) algorithm, which improves the estimation results. In Section 2 the statistical smoothing algorithm is presented, and in Section 3 the ICM algorithm is described. In Section 4, the proposed algorithm is compared with other smoothing techniques when applied to synthetic and real 2D Magnetic Resonance images.
2 Initial Noise Reduction by Local Classification
Homogeneity Testing: For each pixel (i,j), the presence or absence of homogeneity is decided based on its n x n neighborhood N_{nxn}(i,j), where n is odd. A homogeneous N_{nxn}(i,j) is considered a sample of size N = n x n of a Gaussian random variable with mean \mu and variance \sigma^2. A heterogeneous N_{nxn}(i,j) is considered a sample of size N of a random variable following the distribution of a mixture of two Gaussian distributions with prior probabilities P_0, P_1, mean values \mu_0, \mu_1 and common variance \sigma^2. According to the above formulation, the maximum likelihood (ML) ratio test [9] gives: N_{nxn}(i,j) is homogeneous if

S^2 <= (1 + C) \sigma^2,    (1)
where S^2 is the sample variance of N_{nxn}(i,j). Parameter C is determined by the significance level of the test (i.e. the probability of wrongly accepting homogeneity), based on the fact that the random variable N S^2 / \sigma^2 is distributed according to \chi^2_{N-1} under the homogeneity hypothesis. The power of test (1) is defined as the probability of correctly accepting heterogeneity. Under the hypothesis of heterogeneity, and according to our model, the random variable N S^2 / \sigma^2 is distributed according to the non-central \chi^2_{N-1} with non-centrality parameter \lambda = N P_0 P_1 \rho^2, where \rho = |\mu_0 - \mu_1| / \sigma is the signal-to-noise ratio [8, 9]. Therefore, the power of test (1) can be calculated as a function of N, P_0, P_1 and \rho. When \rho > 2, the power of the test is large.

Fast Mixture Decomposition: If N_{nxn}(i,j) is decided to be homogeneous, then the true value of pixel (i,j) is estimated by the sample mean of N_{nxn}(i,j), which is the best estimator (unbiased and of minimum variance) for the case of Gaussian noise. If the decision is that N_{nxn}(i,j) is heterogeneous, then the unknown parameters of the mixture P_0, P_1, \mu_0, \mu_1 are estimated by the method of moments using the first three sample moments [8, 7]. The closed-form estimators are

\hat{\mu}_0 = \frac{\beta - \sqrt{\beta^2 - 4\gamma}}{2},  \hat{\mu}_1 = \frac{\beta + \sqrt{\beta^2 - 4\gamma}}{2},  \hat{P}_1 = \frac{c_1 - \hat{\mu}_0}{\hat{\mu}_1 - \hat{\mu}_0},  \hat{P}_0 = 1 - \hat{P}_1,

where

\beta = \frac{c_3 - c_1 c_2}{c_2 - c_1^2},  \gamma = \frac{c_1 c_3 - c_2^2}{c_2 - c_1^2},  c_1 = m_1,  c_2 = m_2 - \sigma^2,  c_3 = m_3 - 3 m_1 \sigma^2,

and m_l, for l = 1, 2, 3, is the l-order sample moment of N_{nxn}(i,j).

Local Classification: The estimates \hat{\mu}_0, \hat{\mu}_1, \hat{P}_0 and \hat{P}_1 are used in the calculation of the threshold T that separates the two classes, that is

T = \frac{\hat{\mu}_0 + \hat{\mu}_1}{2} + \frac{\sigma^2}{\hat{\mu}_1 - \hat{\mu}_0} \ln\frac{\hat{P}_0}{\hat{P}_1}.

The value of T is used for the classification of the pixel (i,j) as follows:

\hat{I}_{LC}(i,j) = \hat{\mu}_1 if J(i,j) > T; \hat{\mu}_0 otherwise.
The above algorithm reduces the noise quite efficiently while preserving intensity discontinuities very well, especially when the signal-to-noise ratio is above a certain level (\rho > 2).
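The closed-form stage above translates directly into code. This is a sketch under our naming, not the authors' implementation; the constant C is passed in as a parameter, since its value follows from the chosen significance level of the chi-square test:

```python
import numpy as np

def decompose_mixture(data, sigma):
    # Method-of-moments decomposition of a two-component Gaussian mixture
    # with common known variance sigma^2 (closed-form estimators of Section 2).
    data = np.asarray(data, float)
    m1, m2, m3 = (np.mean(data**g) for g in (1, 2, 3))
    c1, c2, c3 = m1, m2 - sigma**2, m3 - 3 * m1 * sigma**2
    denom = c2 - c1**2
    beta = (c3 - c1 * c2) / denom
    gamma = (c1 * c3 - c2**2) / denom
    D = np.sqrt(max(beta**2 - 4 * gamma, 0.0))
    mu0, mu1 = (beta - D) / 2, (beta + D) / 2
    P1 = (c1 - mu0) / (mu1 - mu0)
    return mu0, mu1, 1 - P1, P1

def classify_pixel(J_ij, neigh, sigma, C):
    # First stage for one pixel: homogeneity test (Eq. 1) then, if
    # heterogeneous, threshold classification of the central pixel.
    if np.var(neigh) <= (1 + C) * sigma**2:      # homogeneous: sample mean
        return float(np.mean(neigh))
    mu0, mu1, P0, P1 = decompose_mixture(neigh, sigma)
    T = (mu0 + mu1) / 2 + sigma**2 / (mu1 - mu0) * np.log(P0 / P1)
    return mu1 if J_ij > T else mu0
```

Note that np.var uses the 1/N normalization, matching the sample variance S^2 of the test.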
3 Iterated Conditional Modes (ICM) Algorithm
The estimate of the true image resulting from the first stage, I_LC, may contain errors due to: a) wrong decisions at the hypothesis testing phase, and b) inaccurate parameter estimates and, consequently, inaccurate data classification. This initial estimate may be further improved by exploiting the spatial smoothness of the image. One of the simplest approaches to capture the spatial smoothness of the true image is to consider it as a pairwise interaction Markov random field [4, 10] with probability distribution

P(I) \propto \exp\{-U(I)\},  where  U(I) = \beta \sum_{(i,j) \in G} \sum_{(k,l) \in N_{3x3}(i,j) - \{(i,j)\}} \delta(I(i,j), I(k,l)),
and \delta(a,b) is equal to -1 if a = b, and 0 otherwise. Given the observed image J, the posterior distribution of I is P(I|J) \propto \exp\{-U(I|J)\}, where

U(I|J) = \sum_{(i,j) \in G} \left\{ \frac{1}{2} \ln(\sigma^2) + \frac{[J(i,j) - I(i,j)]^2}{2\sigma^2} + \beta \sum_{(k,l) \in N_{3x3}(i,j) - \{(i,j)\}} \delta(I(i,j), I(k,l)) \right\}.
One approach to estimating I is to maximize the following pixel conditional probability at each pixel [4], that is,

P(I(i,j) | J, G_{ij}) \propto P(J(i,j) | I(i,j)) P(I(i,j) | N'_{3x3}(i,j)),    (2)

where G_{ij} and N'_{3x3}(i,j) are the supports G and N_{3x3}(i,j) without pixel (i,j). The ICM algorithm requires a good initial estimate of the true image, which is provided by the first stage. The entire image is processed iteratively, each time in raster scan order. The intensity value at each pixel is selected to maximize Eq. (2), and parameter \beta needs to be adjusted. Our experiments have shown that the algorithm converges in fewer than 15 iterations and that the best results are achieved when \beta \in [1.0, 2.0].
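A compact sketch of the ICM stage for a two-level image (our naming; the label set, \beta and the 8-neighbourhood follow the model above, while the toy image is illustrative):

```python
import numpy as np

def icm(J, init, labels, sigma, beta=1.5, n_iter=15):
    # Raster-scan updates: each pixel takes the label v minimizing the local
    # energy (J - v)^2 / (2 sigma^2) + beta * sum_neighbours delta(v, I(k,l)),
    # with delta(a,b) = -1 when the labels agree, i.e. maximizing Eq. (2).
    I = init.copy()
    H, W = I.shape
    labels = np.asarray(labels, float)
    for _ in range(n_iter):
        for i in range(H):
            for j in range(W):
                neigh = I[max(i - 1, 0):i + 2, max(j - 1, 0):j + 2]
                # neighbours equal to each candidate label (centre excluded)
                same = np.array([(neigh == v).sum() - (I[i, j] == v) for v in labels])
                energy = (J[i, j] - labels) ** 2 / (2 * sigma ** 2) - beta * same
                I[i, j] = labels[np.argmin(energy)]
    return I

rng = np.random.default_rng(2)
true = np.full((8, 8), 80.0); true[:, 4:] = 110.0   # toy piecewise constant image
J = true + rng.normal(0.0, 5.0, true.shape)
init = np.where(J > 95.0, 110.0, 80.0)              # crude initial estimate
restored = icm(J, init, labels=[80.0, 110.0], sigma=5.0)
```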
4 Experimental Results
In order to quantitatively measure the noise reduction ability of the presented algorithm, we propose the following performance criteria. The set of pixels of a noise-free piecewise constant image is partitioned into two sets D_l, for l = 0, 1. D_0 contains all pixels having a homogeneous neighborhood, that is, all neighborhood pixels have equal intensity values. D_1 contains the remaining pixels, which have a heterogeneous neighborhood, that is, the neighborhood contains at least one pixel with an intensity value different from the rest. The noise reduction ability in either the homogeneous (l = 0) or heterogeneous (l = 1) image areas is defined by

F_l = \frac{E_l}{\hat{E}_l},  where  \hat{E}_l = \frac{1}{|D_l|} \sum_{(i,j) \in D_l} [I(i,j) - \hat{I}(i,j)]^2,
and |D_l| is the size of D_l. Table 1 presents a comparison of various algorithms based on F_0 and F_1 on a synthetic piecewise constant image with Gaussian noise of standard deviation \sigma = 5, 10 and 15. The synthetic image contains several different objects with intensity equal to 110 and background intensity equal to 80. It is clear that the proposed algorithm (Local Classification & ICM) outperforms the rest of the methods with respect to both measures F_0, F_1. In addition, the results of the application of the proposed algorithm (neighborhood size 9 x 9) to a real MR brain image are presented. Figure 1 shows a section of the original image, the initial estimate and the final estimate. The significant image quality improvement due to the ICM algorithm, and the ability of the overall smoothing algorithm to effectively reduce noise while preserving intensity discontinuities, are clearly demonstrated.
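The printed definition of the performance measure is damaged in the scan; the sketch below implements one plausible reading (an assumption on our part): F as the ratio of the mean squared error of the noisy observation to that of the estimate over a pixel set D, so that larger F means stronger noise reduction.

```python
import numpy as np

def noise_reduction_measure(I_true, J_noisy, I_est, D_mask):
    # F = E / E_hat over the pixel set D (D_mask is boolean):
    # MSE of the noisy image divided by MSE of the estimate.
    D = np.asarray(D_mask, bool)
    E = np.mean((I_true[D] - J_noisy[D]) ** 2)
    E_hat = np.mean((I_true[D] - I_est[D]) ** 2)
    return E / E_hat

I_true = np.zeros((4, 4))
J_noisy = np.ones((4, 4))        # unit error everywhere
I_est = np.full((4, 4), 0.5)     # estimate halves the error amplitude
D = np.ones((4, 4), bool)
print(noise_reduction_measure(I_true, J_noisy, I_est, D))  # halved amplitude -> F = 4
```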
5 Conclusions
An efficient two-stage noise reduction method was presented, suitable for approximately piecewise constant images corrupted by stationary Gaussian noise, which preserves image structure. The initial estimate obtained by local classification is improved by applying the ICM algorithm. Based on our experimental comparison, we conclude that the noise reduction and edge preservation properties of the proposed algorithm are superior to those of other well-known smoothing algorithms. Finally, it is noted that the performance of the local classification stage decreases when the noise level is high, and that the application of ICM significantly improves the initial estimation results.
References
[1] K. Haris, S.N. Efstratiadis, N. Maglaveras, and C. Pappas. "Hybrid Image Segmentation Using Watersheds". Volume 2727, pages 1140-51, Orlando, FL, April 1996.
[2] I. Pitas and A.N. Venetsanopoulos. Nonlinear Digital Filters: Principles and Applications. Kluwer Academic Publishers, 1990.
[3] J. Marroquin, S. Mitter, and T. Poggio. "Probabilistic Solution of Ill-Posed Problems in Computational Vision". Journal of the American Statistical Association, 82(397):76-89, March 1987.
[4] J. Besag. "On the Statistical Analysis of Dirty Pictures". J. R. Statist. Soc. B, 48(3):259-302, 1986.
[5] P. Saint-Marc, J. Chen, and G. Medioni. "Adaptive Smoothing: A General Tool for Early Vision". IEEE Trans. on Pattern Anal. and Mach. Intell., 13(6):514-529, June 1991.
[6] P. Perona and J. Malik. "Scale-Space and Edge Detection Using Anisotropic Diffusion". IEEE Trans. on Pattern Anal. and Mach. Intell., 12(7):629-639, July 1990.
[7] K. Haris. A Hybrid Algorithm for the Segmentation of 2D and 3D Images. Master's thesis, University of Crete, Greece, 1994.
[8] K. Haris, G. Tziritas, and S. Orphanoudakis. "Smoothing 2-D or 3-D Images Using Local Classification". In Proceedings of EUSIPCO'94, Edinburgh, September 1994.
[9] Z. Wu. "Homogeneity Testing for Unlabeled Data: A Performance Evaluation". CVGIP: Graphical Models and Image Processing, 55(5):370-380, September 1993.
[10] R. Dubes and A. Jain. "Random Field Models in Image Analysis". Journal of Applied Statistics, 16(2):131, 1989.
Table 1: Comparison of various noise reduction methods on a synthetic image. The table reports the noise reduction measures F_0 and F_1 at noise levels \sigma = 5, 10 and 15 for the following methods, each at window sizes 7x7, 9x9 and 11x11 (or the corresponding iteration numbers): initial Local Classification; Local Classification & ICM; Local Classification & Median; Neighborhood Averaging; Median Filtering; Gradient Inverse Filtering; and Anisotropic Diffusion. [The numerical entries of the table are not recoverable from the scanned original.]
Figure 1: Left: A section of the observed noisy MR image. Middle: The result of the initial noise reduction stage. Right: The result after the application of the ICM algorithm to the initial estimate.
Session L: ADAPTIVE SYSTEMS II: CLASSIFICATION
A Neural Approach to Invariant Character Recognition

Iraklis M. Spiliotis 1, Panagiotis Liatsis 2, Basil G. Mertzios 1 and Yannis P. Goulermas 2
1 Department of Electrical & Computer Engineering, Democritus University of Thrace, GR-67100 Xanthi, Greece
2 Control Systems Centre, UMIST, Manchester M60 1QD, United Kingdom
Abstract
Geometric transformations constitute a difficulty in optical character recognition (OCR) systems. This work describes the development of an intelligent OCR system based on higher-order neural networks (HONNs). These networks can be designed such that their outputs remain invariant to certain geometric distortions, such as translation, rotation and scale. The main obstacle to the practical application of HONNs is the explosion in the number of weights due to the number of input combinations. This problem is tackled using an efficient object representation scheme called image block representation (IBR).
I. Introduction
Invariant object recognition is a major research area in computer vision. A number of approaches have been proposed to address the issue of image correspondence once geometric transformations are applied [1]-[5]. The question remains how to select the set of image features which will ensure an acceptable recognition rate. Higher-order neural networks have been developed over the past decade as an alternative to traditional object recognition approaches [6]-[10]. The basic concept of HONNs is the expansion of the input representation space using higher-order combinations of the input terms, such that the mapping from the input to the output space becomes more readily obtainable. This idea has a certain appeal in object recognition systems, since geometric feature extraction mechanisms can be incorporated within the structure of the HONN. For instance, it has been shown that features such as distances and line slopes defined by point pairs, and angles of similar triangles defined by point triplets, are respectively invariant to translation-rotation, translation-scale, and translation-rotation-scale transformations. As mentioned above, these invariants can be obtained by enriching the input representation space with all possible point combinations (up to a certain order). Clearly, this constitutes a serious limitation for the application of HONNs to invariant image recognition, where images may have a spatial resolution of 512x512 pixels. In particular, for an MxN image and n-order point combinations, the number of input terms is augmented by (MxN)! / ((MxN-n)! n!). To allow the application of HONNs to object recognition problems, the technique of coarse coding has been proposed. This decomposes the image into a set of non-overlapping, offset images of coarse resolution, such that the number of input combinations is reasonably bounded. However, coarse coding does not ensure lossless image representation and thus does not allow perfect image reconstruction [11].
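The combinatorial growth quoted above is just the binomial coefficient; comparing a full-resolution image with a small critical-point set makes the motivation concrete (the figure of ~30 critical points is illustrative):

```python
from math import comb

# Number of unordered n-point combinations: (M*N)! / ((M*N - n)! * n!)
pairs_full = comb(512 * 512, 2)   # second-order terms for a 512x512 image
pairs_cp = comb(30, 2)            # second-order terms for ~30 critical points
print(pairs_full, pairs_cp)
```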
This research proposes the use of a simple yet effective object representation scheme called image block representation [12]-[14]. This method decomposes the object into a set of non-overlapping rectangular regions, which can then be used to extract the so-called critical points, i.e. points of interest on the object. The number of critical points is relatively small (when compared to the number of object pixels) and subsequently they can be used as direct inputs to the HONN architecture. The performance of the system is evaluated in the case of binary character recognition.
II. Higher-Order Neural Networks
A major criticism of single-layered perceptrons [15] was that they were unable to perform non-linear separation, an example being the XOR problem, due to the simplicity of the resulting decision boundaries. One way of dealing with this problem was to generalise the perceptron architecture so that it accommodates intervening layers of neurons, capable of extracting abstract features, thereby resulting in networks that can solve reasonably well any given input-output problem [16]. An alternative approach, based on recent studies of information processing in biological neural networks [17] as well as on Group Method of Data Handling (GMDH) algorithms [18], was the expansion of the input representation space using multi-linear terms. This gave rise to a family of neural networks collectively known as higher-order neural networks. In general, the output of a first-order neural network is defined by [9]
y_i = f( \sum_n T_n^{hid}(i) ) = f(T_0^{hid}(i) + T_1^{hid}(i)),  for hidden nodes
y_i = f( \sum_n T_n^{out}(i) ) = f(T_0^{out}(i) + T_1^{out}(i)),  for output nodes    (1)
where f(net) is a nonlinear threshold function, such as the sigmoid function, and T_0^k(i) is the bias term for output i of the k-th layer (where k takes the values hid and out), given by

T_0^k(i) = w_{i0}^k    (2)

and T_1^k(i) are the first-order terms for the i-th output unit of the k-th layer,

T_1^k(i) = \sum_j w_{ij}^k x_j^{k-1}    (3)

where w_{ij}^k are the interconnection weights between each input x_j of layer (k-1) and output node i of layer k. Generalising to mixed n-th order networks gives [10]

y_i = f( \sum_n T_n^{hid}(i) ) = f(T_0^{hid}(i) + T_1^{hid}(i) + T_2^{hid}(i) + ... + T_n^{hid}(i))    (4)
for the nodes of the hidden layer, where T_2^{hid}(i) and T_n^{hid}(i) are given by

T_2^{hid}(i) = \sum_k \sum_j w_{ijk}^{hid} x_j x_k  (second-order terms)
T_n^{hid}(i) = \sum ... \sum w_{ij...k}^{hid} x_j ... x_k  (nth-order terms)    (5)
Consider an object and any two non-identical points A, B on the object. Next, an arbitrary translation and/or rotation of the object within the image is applied, and points A, B become A' and B'. Since the invariant under translation and/or rotation is the relative distance between any two points on the object, the output of the HONN can be handcrafted to be invariant to this set of transformations by considering only the second-order terms [9],

y_i = f( \sum_k \sum_j w_{ijk}^{hid} x_j x_k )    (6)

and by constraining the input-hidden weights to satisfy

w_{iAB}^{hid} = w_{iA'B'}^{hid},  if  d_{AB} = d_{A'B'},    (7)
where d_{AB} and d_{A'B'} are the Euclidean distances between points A, B and A', B' respectively. The learning rule for higher-order neural networks is the backpropagation algorithm, appropriately modified to accommodate the inclusion of the higher-order terms in the hidden layer. The updating rule for the weights of the hidden layer is then given by [10]
\Delta w_{ijk} = \eta \delta_i \sum_{j,k} x_j x_k,    (8)
where \eta is the learning rate, the \delta's are calculated as in classical backpropagation, and k, j take values which satisfy the invariance constraints.

III. Image Block Representation
A bilevel digital image is represented by a binary 2-D array. Without loss of generality, we consider that the object pixels are assigned to level 1 and the background pixels to level 0. Due to this kind of representation, there are rectangular areas of object value 1 in each image. These rectangles, called blocks, have their edges parallel to the image axes and contain an integer number of image pixels. Consider a set that contains as members all the non-overlapping blocks of a specific binary image, in such a way that no other block can be extracted from the image (or, equivalently, each pixel with object level belongs to only one block). It is always feasible to represent a binary image with a set of all the non-overlapping blocks with object level, and this information-lossless representation is called Image Block Representation (IBR) [12]. Given a specific binary image, different sets of different blocks can be formed. Actually, the non-unique block representation does not have any implications for the implementation of any operation on a block represented image.
The IBR concept leads to a simple and fast algorithm, which requires just one pass over the image and a simple bookkeeping process. In fact, considering an N1 x N2 binary image f(x,y), x = 0,1,...,N1-1, y = 0,1,...,N2-1, the block extraction process requires a pass over each line y of the image. In this pass all object-level intervals are extracted and compared with the previously extracted blocks. As a result, a set of all the rectangular areas of level 1 that form the object is obtained. A block-represented image is denoted as

f(x,y) = {b_i : i = 0,1,...,k-1}    (9)

where k is the number of blocks. Each block is described by four integers, the coordinates of the upper-left and lower-right corner in the vertical and horizontal axes. The block extraction process is implemented easily with low computational complexity, since it is a pixel-checking process without numerical operations. Fig. 1 illustrates the blocks that represent an image of the character d.
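As a concrete illustration, the one-pass block extraction described above can be sketched as follows. This is a hypothetical Python implementation, not the authors' code; in particular, the rule that a line interval extends an open block only when it spans exactly the same columns is our reading of the text.

```python
def extract_blocks(img):
    """One-pass block extraction for a binary image (hypothetical sketch).
    img is a list of rows of 0/1 values; returns blocks as (x1, y1, x2, y2),
    with (x1, y1) the upper-left and (x2, y2) the lower-right corner."""
    blocks = []          # finished blocks
    open_blocks = []     # blocks whose lower edge may still grow: [x1, y1, x2, y2]
    for y, row in enumerate(img):
        # extract object-level intervals [x_start, x_end] on this line
        intervals, x = [], 0
        while x < len(row):
            if row[x] == 1:
                start = x
                while x < len(row) and row[x] == 1:
                    x += 1
                intervals.append((start, x - 1))
            else:
                x += 1
        # compare intervals with previously extracted (still open) blocks
        still_open, matched = [], set()
        for b in open_blocks:
            for i, (s, e) in enumerate(intervals):
                if i not in matched and b[0] == s and b[2] == e:
                    b[3] = y            # extend the block downwards
                    matched.add(i)
                    still_open.append(b)
                    break
            else:
                blocks.append(tuple(b))  # block terminated on the previous line
        for i, (s, e) in enumerate(intervals):
            if i not in matched:
                still_open.append([s, y, e, y])  # start a new block
        open_blocks = still_open
    blocks.extend(tuple(b) for b in open_blocks)
    return blocks
```

Note that, as the text points out, only comparisons and bookkeeping are involved; no numerical operations on pixel values are needed.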
Figure 1. Image of the character d and the blocks.
IV. Critical Points Extraction

An object normalization procedure is first executed in order to facilitate rotation-invariant descriptions of the objects. Specifically, the maximal axis of the object is found and the whole object is rotated so that the maximal axis is vertical and the upper half of the image contains most of the object's mass. At this point, it is necessary to give the following definitions [14]:
1. A group is an ordered set of connected blocks, such that each of its intermediate blocks is connected with two other blocks, while the first and last blocks are connected with only one block.
2. A junction point is a point that is connected with two other points.
3. An end point is a point that is connected with only one other point.
4. A tree point is a point that is connected with more than two other points.
5. A critical point is a junction, end or tree point.
In this research, a fast non-iterative critical point detection method for block-represented binary images is presented. The method has low computational complexity, extracts only critical points, and to a degree appears to be immune to locality problems. This is achieved by the use of a suitable neighbourhood in each case. Specifically, groups of connected blocks are formed. Each group is terminated when no adjacent block exists for its continuation, or when two or more blocks exist for the continuation of the group.
Figure 2. (a) Image of the character B. (b) The extracted blocks. (c) The groups of blocks. (d) The critical points.

Each group defines a local neighborhood and all the necessary processing takes place in this neighborhood. Using a few simple rules, the groups are checked and labeled into certain categories:
- Vertical elongated groups. The absolute value of the angle of these groups with the horizontal axis is usually greater than 30°. The width of each block of a vertical elongated group should not exceed a threshold value. The connections among the blocks result in junction points, which belong to the thinned line that results from the group. For each pair of connected blocks, one junction point (the central point of the common line segment of the two connected blocks) is extracted. For each block we check whether the distance between its junction points and its extremities (i.e. the central points of the edges of the small dimension of the block) exceeds a threshold value.
- Horizontal elongated groups. The absolute value of the angle of these groups with the horizontal axis is smaller than 30°. The width of a horizontal elongated group is significantly greater than its height, and its height shows small variation. For the extraction of the junction points, the algorithm starts from the left end of
a horizontal elongated group and moves to the right with constant-width steps. At each step a junction point is extracted at the middle of the height of the group at this vertical position.
- Angle groups. The angle groups are connected with two other groups that lie on the same vertical or horizontal side of the angle group. The width and the height of an angle group are usually small. An angle group should not be connected to a noisy group: if a group has been labeled as an angle group and it is connected with a noisy group, then the label "angle" is replaced by the horizontal elongated or the vertical elongated label. Three junction points are extracted from an angle group: two due to the connections with the two groups, and another one for the formation of the angle.
- Noisy groups. These are small and spurious branches of the object. The noisy groups have width and height less than a threshold and they are connected to only one group, which is not an angle group. In most cases, the noisy groups are connected from the left or right side to vertical elongated groups, or from the top or bottom side to horizontal elongated groups. In these cases the extraction of junction points from the noisy groups is not acceptable, as a noisy end point would otherwise be created. Junction points are extracted from a noisy group if and only if the noisy group is connected at the end of an elongated group.
Fig. 2 demonstrates (a) an image of the character B, (b) the extracted blocks, (c) the groups of blocks and (d) the critical points.
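The labeling rules above might be sketched as follows. This is a rough, hypothetical paraphrase: the thresholds WIDTH_T and SIZE_T, the rule ordering, and the summary of a group by its bounding box and angle are all our assumptions, since the paper does not give exact values.

```python
def label_group(width, height, angle_deg, is_terminal_branch,
                WIDTH_T=6, SIZE_T=4):
    """Hypothetical labeling rules for a group of connected blocks.
    width/height describe the group's bounding box, angle_deg its angle
    with the horizontal axis, is_terminal_branch whether it is a small
    branch connected to only one (non-angle) group."""
    if width < SIZE_T and height < SIZE_T and is_terminal_branch:
        return "noisy"           # small spurious branch of the object
    if abs(angle_deg) > 30 and width <= WIDTH_T:
        return "vertical elongated"
    if abs(angle_deg) < 30 and width > height:
        return "horizontal elongated"
    if width < SIZE_T and height < SIZE_T:
        return "angle"           # small group joining two other groups
    return "unlabeled"
```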
V. Results

In this work, we examine the application of the IBR and HONN techniques to the problem of recognising typed characters. The binary data consisted of the 26 Latin letters (A-Z) and 10 digits (0-9) with a spatial resolution of 64x64 pixels. Since the digits '6' and '9' are rotationally equivalent, they were considered as the same pattern. The font style selected for training the OCR system was 'Times New Roman'. Next, the techniques were applied to each of the 35 characters presented in 5 random translations and rotations, giving a total of 175 training patterns. The first stage of the system was the pre-processing, which resulted in the critical point extraction. Here, the rotation normalisation procedure is applied to ensure the success of the IBR scheme. Due to the discrete nature of the image grid, some noise was introduced to the characters when their maximal axis was set to the vertical position. Next, each of the characters was decomposed into its resulting blocks, and the groups were labeled into vertical/horizontal elongated, angle and noisy. Finally, the critical points were found, using the procedure described in the last section. The second stage of the system was the classifier. Here, the critical points were fed into a second-order neural network, which had a built-in feature extraction mechanism. This provided invariant classification with respect to translation and rotation. The input layer of the higher-order neural network had 256 inputs. This number was selected to correspond to the maximum number of critical points extracted from any one of the training images. Since a binary representation encoding was employed in the output layer, there were only 8 output units. The number of units in the hidden layer was determined using a genetic optimisation scheme [10], which provides the minimal-optimal network topology for a given classification problem.
The network was able to learn to discriminate between the 35 types of printed characters after 500 epochs. To evaluate the performance of the trained system, two testing procedures were applied. We first tested the system with a set of 700 patterns, obtained by applying 20 random translations and rotations to each of the 35 characters. It was still able to distinguish between all translated and rotated versions of the characters with 100% accuracy. Next, the characters were corrupted using variable percentages of binary salt-and-pepper noise. It was observed that the system was able to discriminate with 100% accuracy for additive noise of up to 10%, and still had a satisfactory performance (recognition accuracy > 70%) for noise levels of 25%.

VI. Conclusions

A new approach to the problem of Optical Character Recognition was presented. The proposed system uses an efficient object representation scheme called image block representation, which decomposes the characters into nonoverlapping rectangular regions, which are then used to find the critical points. Next, the critical points are fed into a higher-order neural network with invariances to translation and rotation. This alleviates the problem of the combinatorial explosion of the higher-order terms associated with the use of HONNs. The optimal number of hidden units for solving the character recognition problem was determined using a Genetic Algorithm (GA) scheme. The structure of the neural network was selected to be 256 inputs - 5 hidden - 8 outputs. The system was able to identify the translated/rotated patterns with 100% recognition accuracy, while it demonstrated robustness to additive noise.
Future work will investigate the performance of the system using a number of font styles as well as handwritten characters. Another interesting application for this type of system is visual inspection, in particular the detection of blemishes in industrial workpieces, where the only discriminating feature between the two classes is the presence (or absence) of the defect.
Acknowledgments The authors wish to acknowledge the British Council and the Greek General Secretariat for Research & Development for providing financial support for this research.
References
[1] M.W. Roberts, M. Koch and D.R. Brown, 'A multilayered neural network to determine the orientation of an object', Proc. Int. Joint Conf. Neural Networks, Vol. 2, pp. 421-424, 1990.
[2] K. Fukushima, 'A hierarchical neural network model for associative memory', Biol. Cybern., Vol. 50, pp. 105-113, 1984.
[3] S.E. Troxel, S.K. Rogers and M. Kabrisky, 'The use of neural networks in PSRI target recognition', Proc. IEEE Int. Conf. Neural Networks, Vol. 1, pp. 569-576, 1988.
[4] E. Barnard and D. Casasent, 'Invariance and neural nets', IEEE Trans. Neural Networks, Vol. 2, No. 5, pp. 498-508, 1991.
[5] N. Papamarkos, I.M. Spiliotis and A. Zoumadakis, 'Character recognition by signature approximation', Int. Jour. Patt. Rec. Art. Intell., Vol. 8, No. 5, pp. 1171-1187, 1994.
[6] T. Maxwell, C.L. Giles, Y.C. Lee and H.H. Chen, 'Nonlinear dynamics of artificial neural systems', in Neural Networks for Computing, AIP Conf. 151, UT, pp. 299-304, 1986.
[7] C.L. Giles and T. Maxwell, 'Learning, invariance, and generalisation in higher-order neural networks', Applied Optics, Vol. 26, No. 23, pp. 4972-4978, 1987.
[8] M.B. Reid, L. Spirkovska and E. Ochoa, 'Simultaneous position, scale and rotation invariant pattern classification using third-order neural networks', Neural Networks, Vol. 1, No. 3, pp. 154-159, 1989.
[9] P. Liatsis, P.E. Wellstead, M.B. Zarrop and T. Prendergast, 'A versatile visual inspection tool for the manufacturing process', Proc. CCA'94, Vol. 3, pp. 1505-1510, 1994.
[10] P. Liatsis and Y.J.P. Goulermas, 'Minimal optimal topologies for invariant higher-order neural architectures using genetic algorithms', Proc. ISIE'95, Vol. 2, pp. 792-797, 1995.
[11] J. Sullins, 'Value cell encoding strategies', Tech. Rep. 165, CS Dept., Rochester Univ., New York, August 1985.
[12] I.M. Spiliotis and B.G. Mertzios, 'Real-time computation of two-dimensional moments on binary images using image block representation', accepted for publication in IEEE Trans. Image Process.
[13] I.M. Spiliotis and B.G. Mertzios, 'Fast algorithms for basic processing and analysis operations on block represented binary images', submitted to Patt. Rec. Letters.
[14] B.G. Mertzios, I.M. Spiliotis and N. Papamarkos, 'Image block representation and its applications to manufacturing and automation', accepted in 5th Int. Work. Time-Varying Image Process. and Moving Object Recognition, Florence, Italy, September 5-6, 1996.
[15] M.L. Minsky and S. Papert, Perceptrons, Cambridge, MA: MIT Press, 1969.
[16] H. White, Artificial Neural Networks: Approximation and Learning Theory, Oxford: Blackwell, 1992.
[17] D.A. Baylor, T.D. Lamb and K.W. Yau, 'Responses of retinal rods to single photons', J. Physiol., No. 288, pp. 613-634, 1979.
[18] S.J. Farlow (ed.), Self-Organising Methods in Modeling: GMDH Algorithms, New York: Marcel Dekker Inc., 1984.
IMAGE SEGMENTATION BASED ON BOUNDARY CONSTRAINT NEURAL NETWORK F.KURUGOLLU, S. BIRECIK, M. SEZGIN, B. SANKUR TUBITAK MARMARA RESEARCH CENTER, INFORMATION TECHNOLOGIES INSTITUTE, P.O. BOX 21 41470 GEBZE KOCAELI-TURKEY E-MAIL: [email protected]
ABSTRACT

Recently, artificial neural network based image segmentation methods have gained acceptance over other methods, due to their distributed architectures which allow real-time implementation. Another important advantage of neural networks is their robustness to unexpected behavior of the input image, such as noise. On the other hand, their disadvantages are that the learning phase can be long and that the resulting segmentation can have noisy boundaries. In this study, a neural network based image segmentation method called the Constraint Satisfaction Neural Network is investigated, and a modification of it is proposed to alleviate both problems. It has been observed that when the edge field is brought in as a constraint, the convergence improves and the boundary noise is reduced.
1. INTRODUCTION: THE SEGMENTATION PROBLEM

Image segmentation, an important step in image analysis, aims to divide an image into uniform and homogeneous segments, hopefully reflecting the semantic content. Image segmentation methods can be divided into three main categories: region based methods, edge based methods and pixel classification based methods [1]. Despite the plethora of segmentation algorithms in the literature, the quest for new innovative methods continues, mainly to:
- reduce the computational complexity for real-time applications,
- achieve robustness in handling as large a variety of scenes as possible,
- match algorithmic results to semantic content.
In this respect, it is believed that neural networks, judiciously used with image information constraints, are a promising approach not only for computational speed but also for robust results. Recently, a number of neural network based schemes have been advanced for image segmentation purposes. Some of these algorithms use measurement-space information, while the others use spatial information. The former usually use the histogram of the image and try to determine its peaks. The latter use spatial information such as gray value, local average, local variance, etc. The main advantages of neural networks are that they have a distributed architecture, that they can be implemented in hardware to meet real-time demands [2], and that they can handle the nonlinear relationships between measurement-space and spatial information.
2. PREVIOUS WORK: CONSTRAINT SATISFACTION NEURAL NETWORK

One such neural network based image segmentation method is the Constraint Satisfaction Neural Network (CSNN) proposed by Lin et al. [3]. In this method, image segmentation is cast as a constraint satisfaction problem. The principle of the method is to assign segment labels to the pixels under certain spatial constraints. The method uses the network topology shown in Figure 1 and constraints defined between a pixel and its neighbors to accomplish the segmentation. The network topology (Figure 1) consists of m layers, one for each segment. There are nxn (the image dimension) neurons in each layer, each neuron representing an image pixel. The neurons with the same index across the layers hold the probabilities that the corresponding pixel belongs to the segment represented by each layer. Connections between a pixel and its neighbors are depicted in Figure 2. In this example, 8-neighborhood connectivity is chosen; the neighborhood connectivity order may vary depending on the application. The weights of these connections represent the constraints in this topology. After an initialization, the CSNN converges through a parallel and iterative process to a segmentation result which satisfies all the constraints. Whenever the CSNN converges to a result, the neuron in the correct layer approaches 1 while the neurons in the same column in the other layers reduce to 0. The label of the layer which approaches 1 is assigned to the corresponding pixel. The gray value distribution of the input image is used as the initial condition: the gray values are classified into a number of segment categories by means of a Kohonen self-organizing neural network. These categories constitute the initial labels (i.e., probabilities of belonging to a segment). The network weights are determined in a heuristic manner.
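The iterative relaxation just described can be sketched as follows. This is a schematic illustration of a CSNN-style update, not the exact formulation of [3]: the support term, the clipping to non-negative values and the per-pixel renormalization are our assumptions.

```python
def csnn_step(U, weights, neighbors, eta=0.1):
    """One relaxation step of a CSNN-style update (schematic sketch).
    U[l][i][j] is the probability that pixel (i, j) belongs to segment l;
    weights[l][m2] is the constraint between labels l and m2 (positive =
    excitatory, negative = inhibitory); neighbors(i, j, n) yields the
    neighboring pixel coordinates."""
    m_labels = len(U)
    n = len(U[0])
    new_U = [[[0.0] * n for _ in range(n)] for _ in range(m_labels)]
    for l in range(m_labels):
        for i in range(n):
            for j in range(n):
                # support gathered from the neighborhood under the constraints
                support = sum(weights[l][m2] * U[m2][ni][nj]
                              for (ni, nj) in neighbors(i, j, n)
                              for m2 in range(m_labels))
                new_U[l][i][j] = max(0.0, U[l][i][j] + eta * support)
    # renormalize so each pixel's label probabilities sum to 1
    for i in range(n):
        for j in range(n):
            s = sum(new_U[l][i][j] for l in range(m_labels)) or 1.0
            for l in range(m_labels):
                new_U[l][i][j] /= s
    return new_U

def neighbors8(i, j, n):
    # 8-neighborhood connectivity, as in the example of Figure 2
    return [(i + di, j + dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)
            if (di or dj) and 0 <= i + di < n and 0 <= j + dj < n]
```

Iterating this step drives each pixel's probability vector toward a single dominant label, which is then assigned to the pixel.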
These weights are determined so that a neuron excites those neurons that represent the same label, and inhibits those that represent significantly different labels.
The method used in the determination of the weights and the mathematical construction of the CSNN can be found in [3]. In summary, the major advantage of the CSNN is that segmentation is performed using image information constraints in a parallel manner.
Figure 1. The topology of the CSNN. Each layer represents a segment. The (i,j)th neuron in each layer holds the probability that the (i,j)th pixel belongs to the segment represented by the layer.
Figure 2. Connections between a neuron and its neighbors. The weights of these connections are interpreted as constraints. These weights are determined heuristically so as to excite the neurons with similar intensities and inhibit those with different intensities.
3. IMPROVEMENTS ON THE CSNN METHOD: BOUNDARY CONSTRAINT SATISFACTION NEURAL NETWORK

Convergence of the CSNN to a meaningful segmentation is time-consuming, and the error at convergence does not decrease beyond a certain value. This effect is shown in Figure 3: the convergence error is still of the order of 10 after 50 iterations for a 256 by 256 image. Actually, the algorithm lets the segments grow rapidly, but it hesitates in the assignment of pixels around the segment boundaries. This problem causes a lot of futile iterations. On the other hand, a neuron allocation problem arises for images such as 256 by 256: for a 256 by 256 image with 8 potential segments, 524288 (256x256x8) neurons are necessary. If the image size is reduced to meet the real-time constraints and to alleviate the capacity problem, then undesired smoothing in the segments becomes apparent, because small segments are absorbed by strong segments; the segment borders are also noisy in this case. To solve these problems, an algorithm called Boundary-Constraint Satisfaction Neural Network (B_CSNN) is proposed in this study. Since boundary uncertainties were a major handicap, a coarse boundary map of the input image is used. The weights between the boundary pixels and their neighbors are set to 0, so that contributions of the boundary pixels to their neighbors are precluded, as depicted in Figure 4. Consequently, the B_CSNN is observed to accomplish the segmentation process more consistently. The flow diagram of the B_CSNN algorithm is shown in Figure 6. While only image pixels are processed in the CSNN algorithm, the boundary constraints are also used in processing the image pixels in the B_CSNN algorithm. Therefore, the number of iterations and the convergence error of the B_CSNN are both reduced significantly (see Figure 3). The segmentation results for 256 by 256 images and 4 segments are compared in Figure 7.
One can notice that the segmentation boundary noise resulting from the CSNN is removed by the B_CSNN algorithm.
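The boundary-constraint modification itself is a simple adjustment of the connection weights, and might be realized as in the following sketch (a hypothetical data layout: `weight_map[(i, j)]` holding the outgoing connection weight of pixel (i, j) is our assumption, not the authors' implementation):

```python
def zero_boundary_weights(weight_map, edge_map):
    """Set to 0 the outgoing connection weights of boundary pixels, so that
    their contributions to their neighbors are precluded (the B_CSNN
    modification depicted in Figure 4)."""
    for (i, j) in list(weight_map):
        if edge_map[i][j]:          # (i, j) lies on the coarse boundary map
            weight_map[(i, j)] = 0.0
    return weight_map
```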
Figure 3. Convergence error with respect to the number of iterations of both algorithms for a 256 by 256 image. At the 23rd iteration step, the convergence error of the B_CSNN is under 1, while that of the CSNN is still about 10.
Figure 4. Adjustment of the connection weights using the edge map of the input image. The black dots indicate the edge pixels. The contributions of the boundary pixels to their neighbors are precluded by assigning 0 to their connection weights.
For 64 by 64 images, the segmentation results of both algorithms are compared in Figure 8. Notice the segment absorption and the noisy segment border problems evident in the CSNN algorithm, while these problems are mostly fixed by the B_CSNN. The convergence error with respect to the number of iterations of both algorithms is compared in Figure 5. After the improvements brought in by the B_CSNN, we will investigate extending the CSNN to multiresolution or pyramidal decompositions in order to include the spatial information content of the image.
Figure 5. Convergence error with respect to the number of iterations of both algorithms for a 64 by 64 image. At the 13th iteration step, the convergence error of the B_CSNN reaches 0, while that of the CSNN is still 2.
Figure 6. Flow diagram of the B_CSNN algorithm. Besides the original image, coarse boundary information is used to determine the segment boundaries.
REFERENCES:
1. R.M. Haralick, L.G. Shapiro, Computer and Robot Vision, Vol. 1, Addison-Wesley, 1992.
2. N.R. Pal, S.K. Pal, 'A Review on Image Segmentation Techniques', Pattern Recognition, Vol. 26, No. 9, pp. 1277-1294, 1993.
3. W.C. Lin et al., 'Constraint Satisfaction Neural Networks for Image Segmentation', Pattern Recognition, Vol. 25, No. 7, pp. 679-693, 1992.
A HIGH PERFORMANCE NEURAL MULTICLASSIFIER SYSTEM FOR GENERIC PATTERN RECOGNITION APPLICATIONS

Dimitris A. Mitzias and Basil G. Mertzios
Automatic Control Systems Laboratory, Department of Electrical and Computer Engineering, Democritus University of Thrace, 67 100 Xanthi, HELLAS
e-mail: mitzias/[email protected]

Abstract

A high performance NEural MUlticlassifier System (NEMUS) is proposed, which is characterized by a great degree of modularity and flexibility, and is very efficient for demanding and generic pattern recognition applications. The NEMUS is composed of two stages. The first stage comprises several classifiers that operate in parallel, while the second stage is a Decision-Making Network (DM-Net) that performs the final classification task, combining the outputs of all the classifiers of the first stage. In general, the inputs of each classifier are the features extracted by different Feature Extraction Methods and correspond to various levels of importance. The performance of the proposed NEMUS is demonstrated on a shape recognition task of 2-D digitized objects, considering various levels of shape distortion. Three different kinds of features, which characterize a digitized object, are used: (a) geometric features, (b) 1-D scaled normalized central moments and (c) the angles of a fast polygon approximation method.

1. Introduction

Pattern recognition applications are usually executed in two basic stages. In the first stage, each pattern is described by a set of features, using a Feature Extraction Method (FEM), according to the requirements of each particular task. In the second stage, the pattern is recognized using a classification procedure, which requires a set of input data that is usually expressed in the form of an input feature vector. A significant number of classifiers is available in the literature, based on deterministic and statistical techniques (e.g.
Euclidean distance, least mean square error, cross correlation, nearest neighbour rule and the leave-one-out algorithm) [1],[2] or on distributed processing techniques, where pretrained systems serve as classifiers (e.g. neural networks) [3]. The selection of the appropriate FEM depends on the specific conditions and requirements, in order to achieve the highest classification efficiency. To this end, it is essential in demanding applications to use a combination of different FEMs. The underlying idea is that multiple FEMs contribute to the classification procedure different features of the same pattern, which correspond to considerably different levels of importance and carry different and complementary information. Therefore, the actual contribution of each FEM cannot be explicitly determined. Thus, for multiclassifier applications, an answer should be given to the following questions: (a) how many and which FEMs should be used, (b) which kind of classifier presents the best performance for each particular FEM, and (c) what is the contribution of each classifier in a multiclassification scheme. A neural multiclassifier system, which is characterized by a great degree of modularity and flexibility, and is very efficient for demanding and generic pattern recognition applications, is proposed. The NEMUS is composed of two stages. The first stage comprises several classifiers that operate in parallel, while the second stage is a decision-making network that performs the final classification task, combining the outputs of all the classifiers of the first stage, so that the whole discrimination efficiency is optimized. The NEMUS gives a satisfactory answer to the above questions by the automatic selection of the contribution of each classifier's output to the multiclassification procedure. This practically means that a classifier is partially accepted or rejected in proportion to its ability to contribute to a particular task.
Thus, prior knowledge concerning the type and the number of the classifiers which should be used in a particular task is not required. The performance of the proposed NEMUS is demonstrated on a shape recognition task of 2-D digitized objects, considering various levels of shape distortion. Three different kinds of features, which characterize a digitized object, are used: geometric features, 1-D scaled normalized central moments and the angles of a fast polygon approximation method.

2. The Neural Multiclassifier System (NEMUS)

The proposed NEMUS is composed of a number (S) of classifiers at the first stage and a DM-Net at the second stage, as shown in Fig. 1. The kth classifier, k = 1,2,...,S, in the first stage operates with input the feature vector X_k, which is produced by the kth FEM. The second layer (DM-Net) of the
NEMUS performs the fusion of the outputs O_k of all the classifiers, so that the discrimination efficiency is optimized. The DM-Net is a neural network and consists of simple Neural Elements (NEs), which operate in parallel [4],[5]. The elements of the final output vector Y = [y_1, y_2, ..., y_m] of the multiclassifier are given by:

y_j = f(a_j),   a_j = Σ_{k=1}^{S} w_{k,j} o_{k,j} + θ_j    (1)
where f(.) is the sigmoid function, w_{k,j} is the connection weight between the jth output of the kth classifier and the jth NE of the DM-Net, and θ_j is the internal threshold value of the jth NE of the DM-Net.
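The fusion of Eq. (1) can be sketched directly; the following is a minimal hypothetical implementation (the data layout, with `classifier_outputs[k][j]` holding the jth output of the kth classifier, is our assumption):

```python
import math

def dm_net_output(classifier_outputs, W, theta):
    """Compute the DM-Net outputs of Eq. (1):
    y_j = f(sum_k w_{k,j} * o_{k,j} + theta_j), with f the sigmoid.
    W[k][j] is the weight from the jth output of the kth classifier to
    the jth NE; theta[j] is the internal threshold of the jth NE."""
    S = len(classifier_outputs)
    m = len(theta)
    f = lambda a: 1.0 / (1.0 + math.exp(-a))  # sigmoid activation
    return [f(sum(W[k][j] * classifier_outputs[k][j] for k in range(S))
              + theta[j])
            for j in range(m)]
```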
Figure 1. The schematic diagram of the NEMUS.

The training of the DM-Net is defined as the calculation of the connection weights w_{k,j}, k = 1,2,...,S, j = 1,2,...,m, of the DM-Net from the outputs of the classifiers of the first stage. These weights represent a measure of the discrimination efficiency of each classifier's output. After the training phase of the classifiers has been completed, the DM-Net is trained using an adaptation algorithm according to the type of the selected neural elements of the DM-Net. The classification efficiency, as well as the contribution of each classifier, depends on their ability to discriminate under the various conditions resulting from pattern variations, which usually appear in each particular pattern recognition application. Thus, the training of the DM-Net is achieved by presenting to the classifiers a set of distorted patterns, which are called Training Patterns and should represent a sufficient sample of pattern variations among the possible variations appearing in each particular task. Specifically, the jth NE operates with the jth output of each classifier as its inputs, and its weights w_{k,j}, θ_j are automatically determined by the adaptation algorithm, using the set of training patterns. Also, the NEs are independent of each other and they are trained in different numbers of adaptation cycles. The DM-Net is trained until a measure of the total output error of the DM-Net becomes smaller than a specified value. This measure can simply be given by the Mean Square Error (MSE) as:

MSE = (1 / (T m)) Σ_{t=1}^{T} Σ_{j=1}^{m} [d_j(t) - y_j(t)]^2    (2)

In (2), the term [d_j(t) - y_j(t)] represents the classification error of the jth output of the DM-Net, when the tth training pattern is presented to the NEMUS. The adaptation algorithms converge in a few adaptation cycles, and the computation time for each cycle is significantly low, due to the simple architecture of the DM-Net.
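The stopping criterion, the mean square error over T training patterns and m outputs, is a straightforward computation; a minimal sketch:

```python
def dm_net_mse(desired, actual):
    """Mean square error of Eq. (2):
    MSE = (1/(T*m)) * sum_t sum_j (d_j(t) - y_j(t))**2,
    where desired[t][j] = d_j(t) and actual[t][j] = y_j(t)."""
    T = len(desired)
    m = len(desired[0])
    return sum((desired[t][j] - actual[t][j]) ** 2
               for t in range(T) for j in range(m)) / (T * m)
```

Training would simply repeat adaptation cycles until this value drops below the specified threshold.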
3. Experimental Results

In this Section, we present a pattern recognition application in order to demonstrate the efficiency of the proposed multiclassification technique. The NEMUS is employed to discriminate between ten 2-D digitized objects, considering various levels of shape distortion. It is assumed that the shape variations are those resulting from three quantized artificial types of bounded distortions, which produce a finite number (10560) of shape variations. Three different 1-D typical FEMs are used, which are implemented on the region boundaries and are invariant to position, size and orientation of the shape. They also provide
different types of features corresponding to various levels of importance, and they are sensitive to different levels of shape distortion.
(a) Geometric features. The geometric features of a shape are extensively used in pattern classification applications [6],[7]. These features are useful in applications where the patterns are not very complicated and the desired discrimination efficiency is not very high. Three geometric features are used to produce a feature vector: the Normalized Inverse Compactness C̄, the Normalized Area Ā and the Normalized Length L̄ of the shape, which are determined as:

C̄ = 4πA / P^2,   Ā = A / (π d_m^2),   L̄ = ḡ / d_m    (3)
where P is the length of the shape's perimeter, A is the area of the shape, d_m is the maximum distance of the shape and ḡ is the mean of the distances from the boundary to the centroid of the shape.
(b) The Scaled Normalized Central (SNC) set of moments [8],[9]. Statistical moment-based methods are used in a large number of image analysis and pattern recognition applications. The features resulting from a moment-based method usually provide a good description of a shape and therefore have good classification properties. In this application, a special case of the 1-D SNC moments is used. The considered 1-D SNC moment of order k is defined as follows:

h̄_k = c_k h_k / m_0^β    (4)
where h_k are the 1-D central moments, m_0 is the zero-order geometric moment, c_k is the scaling factor corresponding to the moment h_k, which is used in order to avoid the exponential increase of the high-order moments, and β is the normalization factor.
(c) The Step by Step Polygon Approximation (SSPA) technique [10]. Polygon approximation techniques are often used in shape analysis and data reduction applications. The SSPA technique gives a satisfactory solution to the problem of the direct selection of a fixed number of vertices of the polygon which approximates the contour of a shape, especially in cases where time is a critical factor. Thus, a region boundary may be approximated by a polygon with a prespecified number of vertices. The angles of the extracted polygon are used as discrimination features.
In the considered application, a suitable version of the NEMUS is used as a classifier, having three Neural Networks (NNs) in the first stage, while three simple neural elements form the DM-Net. The following three feature vectors are used as inputs of the NNs of the first stage of the NEMUS:

G = [C̄, Ā, L̄],   H = [h̄_2, h̄_3, h̄_4],   A = [ā_1, ā_2, ..., ā_7]    (5)

where C̄, Ā and L̄ are the three geometric features, h̄_k are the 1-D SNC moments and ā_k are the normalized internal angles of the polygon approximating the contour of a shape. The NNs of the first stage are selected to be three-layer perceptrons, which are trained using the back-propagation algorithm to discriminate among the ten prototypes. The number of inputs of each NN equals the dimension of the corresponding feature vector G, H or A, while the number of outputs equals the number (10) of the original prototypes. Table 2 demonstrates the classification efficiency of the three independent classifiers (corresponding to the three different FEMs: geometric features, 1-D SNC moments and polygon angles), using a sample of 1000 randomly selected patterns.
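The three geometric features are simple to compute once the perimeter, area, maximum distance and mean boundary-to-centroid distance of the shape are available. A minimal sketch, with the caveat that the exact normalizations of the garbled original formulas are inferred from the surrounding text:

```python
import math

def geometric_features(P, A, d_max, g_mean):
    """Compute the three normalized geometric features:
    P      - length of the shape's perimeter
    A      - area of the shape
    d_max  - maximum distance of the shape
    g_mean - mean distance from the boundary to the centroid.
    The normalizations are our reading of the (garbled) original text."""
    C = 4 * math.pi * A / P ** 2        # normalized inverse compactness
    A_bar = A / (math.pi * d_max ** 2)  # normalized area
    L = g_mean / d_max                  # normalized length
    return C, A_bar, L
```

For a disk of radius 1 (P = 2π, A = π, d_max = 1), the inverse compactness evaluates to 1, its maximum, which is consistent with the usual compactness measure.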
After the training phase of the NNs has been completed, the DM-Net is trained with a set of training patterns using the back-propagation algorithm. A sample of 500 training patterns, randomly selected from the set of 10560 possible different versions produced by the ten prototypes, is used and the adaptation algorithm is applied for only 1000 iterations. For comparison purposes, the classification procedure is demonstrated using three different operations for the DM-Net:
(a). Each NN contributes to the multiclassification procedure with the same level of importance, i.e. the DM-Net performs only a linear summation of the outputs of the NNs;
(b). Each classifier contributes to the multiclassification procedure in proportion to its ability to discriminate among the set of training patterns. This is achieved by determining statistically the contribution of each NN; and
(c). The DM-Net is a Neural Network itself and determines the contribution of each single output of the NNs of the first stage by performing a fusion of the classifiers' outputs, using the set of training patterns.
Table 3 demonstrates the classification efficiency of the NEMUS (considered with the three versions of the DM-Net), for various combinations of the independent FEMs, in terms of the percentage of correctly classified patterns and of the classification error, using the sample of 1000 randomly selected patterns.
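The first two DM-Net operating modes can be sketched directly; mode (c) would replace the weighted sum with a small trained network. The output vectors below and the use of the Table 2 accuracies as statistical weights are illustrative assumptions, not the paper's exact weighting rule.

```python
import numpy as np

def fuse_equal(outputs):
    """(a) plain summation: every classifier contributes equally."""
    return np.sum(outputs, axis=0)

def fuse_weighted(outputs, accuracies):
    """(b) each classifier weighted in proportion to its measured
    discrimination ability on the training set (here, its accuracy)."""
    w = np.asarray(accuracies, dtype=float)
    return (w / w.sum()) @ np.asarray(outputs, dtype=float)

# Hypothetical output vectors of the three first-stage NNs for one
# pattern (10 classes in the paper; 3 here to keep the sketch short).
outputs = [[0.6, 0.3, 0.1],   # NN on geometric features G
           [0.2, 0.7, 0.1],   # NN on 1-D SNC moments H
           [0.1, 0.8, 0.1]]   # NN on polygon angles A
decision = int(np.argmax(fuse_weighted(outputs, [0.554, 0.841, 0.756])))
```

Weighting by the Table 2 accuracies lets the more reliable moment- and angle-based classifiers dominate the weaker geometric one.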
TABLE 2: The classification efficiency of the independent FEMs

Independent FEM       | Correctly classified (%) | Classification Error (MSE)
Geometric Features G  | 55.4                     | 0.0816
1-D SNC Moments H     | 84.1                     | 0.0302
Polygon Angles A      | 75.6                     | 0.0550

TABLE 3: The classification efficiency of NEMUS using several combinations of the three FEMs

Combined  | Correctly classified (%)  | Classification Error (MSE)
FEMs      | (a)    (b)    (c)         | (a)      (b)      (c)
G, H      | 81.3   93.8   96.0        | 0.0372   0.0234   0.0176
G, A      | 77.6   86.2   88.7        | 0.0413   0.0337   0.0273
H, A      | 93.6   93.0   97.4        | 0.0248   0.0369   0.0129
G, H, A   | 95.5   97.1   98.3        | 0.0274   0.0192   0.0098

4. Conclusions
A NEural MUlticlassifier System (NEMUS) is proposed for high-performance classification applications. The proposed multiclassifier is composed of two stages. The first stage is comprised of several classifiers that operate in parallel, while the second stage is a Decision-Making Network (DM-Net) that performs the discrimination task using a fusion of the outputs of all the classifiers of the first stage, so that the discrimination efficiency is optimized. The NEMUS is applied to a shape recognition task of 2D digitized objects under various levels of shape distortion. Three different kinds of features, which characterise a digitized object, are used: geometric features, 1-D scaled normalized central moments and the angles resulting from a fast polygon approximation method. NEMUS is suitable for generic classification applications, such as shape discrimination, signal detection and remote sensing, and has the following advantages and characteristics:
* Different types of classifiers may be used simultaneously. Thus, for each independent FEM, the most appropriate type of classifier may be used, in order to achieve the highest discrimination efficiency.
* The contribution of each FEM is stored as the synaptic weights and biases of the DM-Net, which are automatically determined using a set of training patterns. Thus, prior knowledge concerning the type and number of classifiers which should be used in a particular application is not required.

5. References
[1] K. S. Fu, Syntactic Pattern Recognition and Applications, Englewood Cliffs, NJ: Prentice-Hall, 1982.
[2] P. A. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach, London, U.K.: Prentice-Hall, 1982.
[3] R. J. Schalkoff, Pattern Recognition: Statistical, Structural and Neural Approaches, New York: John Wiley and Sons, 1992.
[4] R. P. Lippmann, "An introduction to computing with neural nets," IEEE Acoustics, Speech and Signal Processing Magazine, pp. 4-22, April 1987.
[5] D. E. Rumelhart and J. L. McClelland (Eds.), Parallel Distributed Processing, Cambridge, MA: MIT Press, 1986.
[6] L. Shen, R. M. Rangayyan and J. E. Leo Desautels, "Application of shape analysis to mammographic calcifications," IEEE Trans. on Medical Imaging, vol. 13, no. 2, pp. 263-274, 1994.
[7] R. C. Gonzalez and P. Wintz, Digital Image Processing, 2nd ed., Reading, MA: Addison-Wesley, 1987.
[8] B. G. Mertzios, "Shape discrimination in robotic vision using scaled normalized central moments," Proceedings of the IFAC Workshop on Mutual Impact of Computing Power and Control Theory, pp. 281-287, Prague, Czechoslovakia, September 1-2, 1992.
[9] B. G. Mertzios and D. A. Mitzias, "Fast shape discrimination with a system of neural networks based on scaled normalized central moments," Proceedings of the International Conference on Image Processing: Theory and Applications, pp. 219-223, San Remo, Italy, June 10-12, 1993.
[10] D. A. Mitzias and B. G. Mertzios, "Shape recognition with a neural classifier based on a fast polygon approximation technique," Pattern Recognition, vol. 27, no. 5, pp. 627-636, 1994.
Proceedings IWISP '96, 4-7 November 1996, Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
Application of a Neural Network for multifont Farsi character recognition using fuzzified Pseudo-Zernike moments

Mehdi Namazi, M.S. Student, and Karim Faez, Associate Professor
E.E. Dept., Amirkabir University of Technology, Tehran, Iran, 15914
Email: [email protected]

Abstract

In this paper an algorithm is developed for the recognition of printed Farsi characters with various fonts, irrespective of size, rotation and stroke. The system operates on an image of characters that can be obtained from a standard digital scanner (e.g. 300 dpi). The approach consists of four main stages: pre-processing, feature extraction, fuzzification and classification. In the pre-processing stage, the image is first binarized and then passed through a noise-reducing filter [1]. The image is finally thinned to compensate for any difference in stroke width (e.g. bold characters). In the feature extraction stage, the selected features are derived from a set of moments known as Pseudo-Zernike Moments (PZM). These moments offer several advantages over the regular moments that have been used in pattern recognition problems. In the third stage, the moments are fuzzified to a set of objects with a continuum of grades of membership. In the last step, input characters are classified by a feedforward neural network using the backpropagation learning method. In this paper, we have also compared the classification results using fuzzified moments and nonfuzzy moments as network input. The learning patterns are printed characters in one group of fonts, and the test patterns are another printed character group with fonts different from the learning group. This comparison shows that a neural network with fuzzy inputs (FNN) has better performance on unclear inputs.

Key words: Multifont character recognition, Farsi/Arabic characters, Fuzzified Pseudo-Zernike Moments (PZM), Fuzzy Neural Network.
I - Introduction

Dealing with uncertainty is a common pattern recognition problem, and fuzzy set theory has proved to be significantly important in pattern recognition problems [6]. Feedforward neural networks are usually trained with examples to learn the rules and functions of a real-world system through the error back-propagation algorithm. Fuzzy-neural networks (FNN) combine the learning capability of neural networks with the knowledge-representation capability of fuzzy sets. When the input pattern is clear, a NN gives a reliable answer. From the point of view of uncertainty, however, when the input pattern is not clear enough, the NN results are unreliable, whereas a FNN can easily classify these patterns and can learn the difference between similar patterns. The network learns the patterns using the backpropagation learning method. Note that we could use the feature moments directly as network inputs, but using fuzzified moments as network inputs has two advantages: a decrease in network learning convergence time and a decrease in system error. The structure of this paper is as follows: Section II discusses the properties of Farsi characters and introduces the selected character classes. Section III discusses Pseudo-Zernike Moments (PZM). In section IV the assigned fuzzy sets are introduced. Section V presents the experimental results. The conclusion is presented in section VI.
II - Farsi characters

In the Farsi language, there are 32 main characters. Depending on the position of each character in a word, it may have 2 to 4 different shapes (fig.1). We have only considered the main form of each character. Dots play an important role in Farsi characters. For example, as seen in fig.2, there are three different characters, but graphically they differ only in the number of dots below and above the character. To simplify, we neglect these dots and consider the characters in their main form without dots. Thus the number of classes reduces to 18 (see fig.3).
fig.2 Various Farsi characters having different dots
fig.1 Typical different forms of a Farsi character depending on its position in a word
fig.3 Simplified character classes

III - Pseudo-Zernike Moments

Zernike polynomials were first introduced in 1934 [2], and were later derived from the requirements of orthogonality and invariance. Zernike polynomials, being invariant with respect to rotations of the axes about the origin, are polynomials in the x and y variables. A related orthogonal set of polynomials in x, y and r, derived in [3], has properties analogous to those of the Zernike polynomials and is called the pseudo-Zernike polynomials; it differs from the Zernike set in that the real-valued radial polynomials are defined as:
    R_nl(r) = sum_{s=0}^{n-|l|} (-1)^s * (2n+1-s)! / [ s! (n-|l|-s)! (n+|l|+1-s)! ] * r^(n-s)
            = sum_{k=|l|}^{n} S_n|l|k * r^k
Here n = 0, 1, 2, ... and l takes on positive and negative integer values subject to |l| <= n only. By simple enumeration, this set of pseudo-Zernike polynomials contains (n+1)^2 linearly independent polynomials of degree <= n. In this research we have used PZM of order 5 [5],[4].

IV - Fuzzy sets

Pseudo-Zernike moments are normalised to vary between -1 and 1. We define five classes of objects with the membership functions shown in fig.4. These classes are:

Negative (N): for the numbers equal to or less than -0.5.
Negative Zero (NZ): for the numbers between -1 and 0.
Zero (Z): for the numbers between -0.5 and +0.5.
Positive Zero (PZ): for the numbers between 0 and +1.
Positive (P): for the numbers equal to or greater than +0.5.
It means that each moment is presented as a five-element row vector containing the membership values of the moment in the five sets. We name this vector FPZM (Fuzzy PZM). FPZM is defined as follows:

    FPZM = [N, NZ, Z, PZ, P]
These fuzzified moments are fed directly to the network inputs. For example, if a PZM moment is 0.885, the related FPZM is [0, 0, 0, 0.23, 0.77], so the number of network input nodes is five times that of a network with nonfuzzy inputs. Note that this set of fuzzy classes is not the only solution; some other fuzzy sets may give the system a better performance.
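The two computational steps of sections III and IV can be sketched together. The radial polynomial follows the factorial definition above (so R_nl(1) = 1 and the order-5 set contains (5+1)² = 36 polynomials). The triangular membership shapes, centred at -1, -0.5, 0, 0.5 and 1 with half-width 0.5, are an assumption chosen to reproduce the worked example (0.885 maps to [0, 0, 0, 0.23, 0.77]) and may differ from the exact curves of fig.4.

```python
from math import factorial

def pzm_radial(n, l, r):
    """Real-valued pseudo-Zernike radial polynomial R_{nl}(r), |l| <= n."""
    l = abs(l)
    return sum((-1) ** s * factorial(2 * n + 1 - s)
               / (factorial(s) * factorial(n - l - s) * factorial(n + l + 1 - s))
               * r ** (n - s)
               for s in range(n - l + 1))

def fuzzify(m):
    """Map a normalised moment in [-1, 1] to the FPZM vector [N, NZ, Z, PZ, P].
    Triangular membership shapes are an assumption (see lead-in)."""
    return [max(0.0, 1.0 - abs(m - c) / 0.5) for c in (-1.0, -0.5, 0.0, 0.5, 1.0)]
```

With order-5 moments this yields 36 PZM values per character and 36 × 5 = 180 fuzzified network inputs, matching the network size reported in section V.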
fig.4 Membership functions of the assigned fuzzy sets: a) Class "N" b) Class "NZ" c) Class "Z" d) Class "PZ" e) Class "P" f) Overlapping classes

V - Experimental results

The fuzzy sets are used as the inputs of a feedforward neural network with 180 inputs, 50 nodes in the hidden layer and 18 output nodes. We have used two groups of fonts (one sample for each character of each font):
Group 1 (4 fonts): ZAR, LOTUS, TRAFFIC, BADR
Group 2 (2 fonts): SADEH, B-LOTUS
(These are Farsi font names.) All of the images are passed through a noise-reducing system and then thinned to remove differences in stroke width. Two experiments were done. In the first experiment, group 1 is used as learning data and group 2 as test patterns. In the second experiment, group 2 is used as learning data and group 1 as test patterns. The results can be seen in table 1. The system is compared with a similar system without fuzzified moments (see fig.5).
fig.5 Fuzzy system used for classification (top) and nonfuzzy system used for comparison (bottom)

Table 1: Experimental results

Input             | Learning data | Test data | Number of errors | Number of epochs | Convergence time
Fuzzy moments     | group 1       | group 2   | 1/32             | 1588             | 3014
Numerical moments | group 1       | group 2   | 2/32             | 5185             | 4224
Fuzzy moments     | group 2       | group 1   | 10/64            | 382              | 510
Numerical moments | group 2       | group 1   | 18/64            | 913              | 401

Considering the table contents, it is clear that when the network is trained with more samples, there is a small decrease in system error with respect to the regular network, but the main gain is a smaller learning time and
fewer iterations (see fig.6). On the other hand, when the network is trained with few samples, the relative system error is high. In this case, the fuzzy neural network needs more convergence time but makes fewer errors than the regular one. In the highly trained system, the fuzzy system shows a 3% decrease in system error and a 28.6% decrease in convergence time, and in poorly trained systems it shows a 12% decrease in system error.
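The quoted percentages follow directly from the Table 1 entries; the arithmetic below uses the error counts and convergence times as read here (the column assignment in the reconstructed table is itself an assumption).

```python
# Relative improvements of the fuzzy network over the regular one.
err_fuzzy_hi, err_reg_hi = 1 / 32, 2 / 32        # highly trained system
time_fuzzy_hi, time_reg_hi = 3014, 4224
err_fuzzy_lo, err_reg_lo = 10 / 64, 18 / 64      # poorly trained system

error_drop_hi = 100 * (err_reg_hi - err_fuzzy_hi)                  # about 3%
time_drop_hi = 100 * (time_reg_hi - time_fuzzy_hi) / time_reg_hi   # about 28.6%
error_drop_lo = 100 * (err_reg_lo - err_fuzzy_lo)                  # about 12%
```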
Fig.6: Sum-square error vs. epoch for the fuzzy neural net (top) and the regular net (bottom)

VI - Conclusion

In this paper a set of fuzzy features based on Pseudo-Zernike moments has been used for size, rotation and translation invariant recognition of printed Farsi characters with various fonts. A fuzzy neural network is used as the classifier. It is shown that using fuzzy features as inputs for neural networks improves the system accuracy for uncertain inputs.

References
[1] R. Haralick and L. Shapiro, Computer and Robot Vision, vol. 1, Addison-Wesley, 1992.
[2] F. Zernike, Physica, vol. 1, p. 689, 1934.
[3] A. B. Bhatia and E. Wolf, Proc. Camb. Phil. Soc., vol. 50, pp. 40-48, 1954.
[4] M. H. S. Shahreza, K. Faez and A. Khotanzad, "Recognition of handwritten Farsi numerals by Zernike moments features and a set of class-specific neural network classifiers", ICSPAT, Dallas, Texas, USA, Oct. 18-21, 1994.
[5] C. H. Teh and R. T. Chin, "On image analysis by the methods of moments", IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), vol. 10, no. 4, July 1988.
[6] S. Sasi and J. S. Bedi, "Fuzzy-neuro identification system for hand-written characters", ACCV'95 Second Asian Conference on Computer Vision, December 5-8, Singapore, pp. II-753-757.
[7] H. K. Kwan and Y. Cai, "A fuzzy neural network and its application to pattern recognition", IEEE Trans. on Fuzzy Systems, vol. 2, no. 3, Aug. 1994, pp. 185-193.
[8] S. Sasi and J. S. Bedi, "Handwritten character recognition using fuzzy logic", 37th MWSCAS, Lafayette, USA, 1994, pp. 1-11.
Integrating LANDSAT and SPOT images to improve land cover classification accuracy

Alessandra Chiuderi
Space Applications Institute, Agriculture Information Systems, Joint Research Centre - 21020 Ispra (Va), Italy
In this paper the use of multi-sensor and multi-temporal data for land cover classification purposes is investigated. In particular, a multilayer Neural Network trained by means of the Back Propagation algorithm is employed for classification experiments on remotely sensed images. The data set employed is composed of two co-registered images of the agricultural area surrounding the city of Valladolid (Spain), acquired on two different dates, June and July, by two different satellites, SPOT and LANDSAT respectively, having different spectral and spatial resolutions; ground truth data were acquired during an in situ campaign carried out by the Spanish Ministry of Agriculture in 1993. Three different data sets, i) SPOT data, ii) LANDSAT data and iii) (SPOT+LANDSAT) data, are employed here for the same land cover classification task in order to investigate the role of data integration and to compare the results.
1. Introduction
Remote sensing (RS) is defined as the science of acquiring information on a given object without any physical contact with the object itself [1]. Even if this extremely wide definition includes every kind of means which allows long-distance information acquisition, in the present context we shall be concerned only with remotely sensed images acquired by space satellites. RS constitutes an extremely interesting application and research field as far as image processing is concerned: firstly, we are asked to deal with real data; secondly, the amount of data available is enormous; thirdly, there is a growing interest in RS image processing, as it can be considered one of the most powerful tools for Earth observation and change monitoring; and, last but not least, the amount of data concerning the same area acquired by different sensors, together with future developments in sensor technology, makes it mandatory to develop techniques which can both deal with different sources and select, among all the available data, the ones carrying more information for a given task [2]. The work presented here has been carried out within the MARS project of the Joint Research Centre of the European Commission [3], one of the widest projects in terms of RS applications. The aim of the MARS (Monitoring Agriculture with Remote Sensing) project is to provide decision support to the Commission as far as agricultural policies are concerned. Within this context, LANDSAT and SPOT images of 60 selected European sites acquired throughout the growing season, together with ground surveys, are available at the Joint Research Centre at Ispra.
2. Integrating different data sets

The acquisition of different images throughout the growing season allows crop monitoring and early estimation of land cover type, thereby increasing class separability. As a matter of fact, different cultures usually have different growing cycles, and the comparison between two successive images may make it possible to separate crops that are indistinguishable within a single image. It might be believed that class separability increases as the season advances, so that earlier images always carry less information and therefore lead to less accurate classifications; this is generally true, but not always.
In a mid-summer image, for instance, permanent cultures such as pastures or forages may be confused with maize, but if an earlier image of the same site is available, a simple comparison of the two should allow class separation, as maize has not yet appeared (the field would give a response similar to bare soil) while forages will instead show the typical strong reflectance in the mid-infrared. In this paper two images of Valladolid (Spain), acquired by two different satellites, SPOT and LANDSAT, at the beginning of June and in mid July respectively, are employed. The advantages offered by this data set are the following: SPOT, having 20 m spatial resolution, is more suitable for agriculture monitoring in a country such as Spain, whose landscape is characterised (as most of Europe) by small fields; LANDSAT, on the contrary, despite the coarser spatial resolution (30 m), also acquires data in the mid-infrared region of the electromagnetic spectrum, which is particularly important for vegetation response. The use of Neural Networks (NN) in remote sensing image processing is not new: starting from the late eighties, several authors have employed this technique as a useful and suitable processing tool, in particular for image classification, as illustrated in the interesting review paper [4]. It has also been shown that NN, not requiring any hypothesis on the data distribution, represent an extremely powerful tool when dealing with the integration of different data sets [5]. In this study, the same area was classified by employing only SPOT data, only LANDSAT data and successively both SPOT and LANDSAT data sets together. The results reported show how the integration of the two data sets increased the classification accuracy.

4. The data

The SPOT image, acquired on June 1st, 1993, comprises three bands: (0.50-0.59 µm), (0.61-0.68 µm) and (0.79-0.89 µm). The image was geo-corrected by means of the GRIPS software [6] and a subscene of 2000 lines and 2000 pixels was employed for these experiments. The LANDSAT TM image, composed of 6 bands ranging from the visible to the near- and mid-infrared, was acquired on July 17th, 1993. This image was coregistered to the SPOT image, resampled in order to have the same ground resolution, geo-corrected and successively a subset of 2000 lines and 2000 pixels was extracted for bands 2 (0.52-0.60 µm), 3 (0.63-0.69 µm), 4 (0.76-0.90 µm) and 5 (1.55-1.75 µm).
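Once the two images are coregistered and resampled to a common grid, the (SPOT+LANDSAT) integration amounts to stacking the bands of each pixel into a single feature vector for the network; a sketch (array names and the toy grid size are illustrative assumptions, the paper uses 2000 x 2000 subscenes):

```python
import numpy as np

# Hypothetical coregistered subscenes on a common grid, shape
# (rows, cols, bands); a small grid keeps the sketch light.
rows, cols = 64, 64
spot = np.zeros((rows, cols, 3), dtype=np.float32)     # SPOT bands 1-3
landsat = np.zeros((rows, cols, 4), dtype=np.float32)  # TM bands 2, 3, 4, 5

# Each pixel's stacked bands become the network input:
# 3 + 4 = 7 input neurons per pixel for the combined experiment.
stacked = np.concatenate([spot, landsat], axis=-1)
features = stacked.reshape(-1, stacked.shape[-1])
```

The SPOT-only and LANDSAT-only experiments use the same pipeline with 3 and 4 input neurons respectively.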
Figure 1: A Multi Layer Neural Network (input nodes, first and second hidden layers, connected by weights)
Ground truth data were acquired during an on-site campaign carried out by the Ministry of Agriculture in Spain in 1993, independently of the MARS project. The data set employed for the experiments reported in the next section counted 27080 pixels representing 13 different land cover classes; 18055 were used to train the NN, whereas 9025 were employed exclusively to evaluate the results obtained.

5. Experiments and Results
Three sets of experiments were performed on the data available: data extracted from the SPOT, LANDSAT and (SPOT+LANDSAT) images respectively were used to train a multilayer Neural Network (as shown in Figure 1) by means of the Error Back Propagation algorithm [7]. In each case the network was therefore constituted by a variable number of input neurons (according to the data set employed), a variable number of hidden nodes arranged into 1 or 2 layers and 13 output neurons corresponding to the 13 land cover classes listed in Table 1. Establishing which is the best architecture for a given task is not easy, as different architectures usually have different performances in terms of single per-class accuracy; the results reported here therefore refer to the "best" architecture in terms of average omission precision. In Table 1 the omission and commission accuracies on the test set obtained for the three different data sets are summarised.

Table 1: Classification accuracies (%) on the different data sets

               |      SPOT       |     LANDSAT     |  SPOT+LANDSAT
Class          | Omis.  | Comm.  | Omis.  | Comm.  | Omis.  | Comm.
Cereals        | 94.80  | 87.67  | 95.18  | 86.88  | 97.67  | 97.13
Sunflower      | 79.12  | 50.79  | 80.04  | 70.37  | 92.22  | 86.30
Potatoes       | 17.14  | 54.55  | 74.29  | 96.30  | 81.25  | 83.87
Sugar Beet     |  6.78  | 100.0  | 84.75  | 87.72  | 86.36  | 91.94
Forage         | 12.06  | 43.64  | 42.21  | 71.19  | 60.80  | 76.10
Set aside      | 36.56  | 55.63  | 83.07  | 95.27  | 93.08  | 88.74
Permanent      |  4.29  | 31.58  | 37.14  | 50.00  | 61.31  | 50.30
Woods          | 68.08  | 82.39  | 92.73  | 67.24  | 75.85  | 91.04
Built          | 46.09  | 48.30  | 58.97  | 80.00  | 74.31  | 83.40
Lands          | 27.87  | 34.96  | 24.83  | 65.92  | 72.27  | 57.03
Dry Pulses     | 75.95  | 75.95  | 29.11  | 53.49  | 79.01  | 98.46
Other Cereals  | 18.89  | 47.22  | 25.56  | 85.19  | 65.52  | 85.07
Water          | 100.0  | 100.0  | 81.82  | 100.0  | 100.0  | 100.0
Overall        |      71.75      |      80.70      |      88.66
It can be noticed that, generally speaking, the accuracies on the SPOT data set are lower than the corresponding ones for the other two data sets. In particular, for class 4 (sugar beet) and class 7 (permanent cultures, such as vineyards or olives) practically all pixels are classified into class 2 (sunflower), which therefore shows a quite low commission accuracy. This mis-classification could partially be due to the great difference in the number of pixels for each of the 3 classes: class 2, being so numerous (1092 samples versus 59 for class 4 and 140 for class 7), might over-train the network in its favour. It must also be said that, due to climatic conditions, sunflowers (class 2) at the beginning of June in southern Europe are usually not yet in bloom, giving therefore a quite confused signal (bare soil, spontaneous vegetation and sunflowers) which, combined with the high number of samples, can lead to high commission errors as far as class 2 is concerned, and low omission precision for all other classes.
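Assuming the usual remote-sensing conventions, omission accuracy is the producer's accuracy of a class and commission accuracy its user's accuracy; both follow from the confusion matrix:

```python
import numpy as np

def omission_commission(cm):
    """Per-class accuracies (in %) from a confusion matrix with true
    classes on rows and predicted classes on columns. Reading
    omission accuracy as producer's accuracy and commission accuracy
    as user's accuracy is the assumed convention here."""
    cm = np.asarray(cm, dtype=float)
    diag = np.diag(cm)
    omission = 100.0 * diag / cm.sum(axis=1)    # share of true pixels recovered
    commission = 100.0 * diag / cm.sum(axis=0)  # share of labelled pixels correct
    overall = 100.0 * diag.sum() / cm.sum()
    return omission, commission, overall

# Toy 2-class example: 90 of 100 true 'cereals' pixels are found,
# but 30 'sunflower' pixels are also labelled cereals.
om, com, acc = omission_commission([[90, 10], [30, 70]])
```

A class that absorbs pixels from other classes (like sunflower in the SPOT experiment) keeps a high omission accuracy while its commission accuracy drops.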
6. Conclusions

In this paper, the use of a multi-layer neural network for land cover classification purposes has been investigated. In particular, two data sets have been selected for the experiments reported here, all data referring to the same area, the agricultural region surrounding the city of Valladolid (Spain). The purpose of this paper was to evaluate the contribution of data integration as far as classification accuracy is concerned, and therefore to compare the results obtained on the three data sets outlined above. It is common opinion that neural networks represent a suitable tool for classification problems, especially when the mathematical modelling of the input data is difficult: as a matter of fact, such techniques, not requiring any hypothesis on the data distribution, are particularly useful in applications such as the one presented here. The results reported in the previous section highlight the importance of data integration and, in particular, of the use of multi-temporal data: the classification based on SPOT led to very poor results, both in terms of overall accuracy and in terms of average omission and commission precision, 45.2% and 62.51% respectively; the experiments carried out on LANDSAT data showed an omission precision of 62.28% and a commission precision of 77.67%, certainly due to the increased amount of information concerning the mid-infrared region of the electromagnetic spectrum; the (SPOT+LANDSAT) data set, on the contrary, scored an overall accuracy of 88.65% and omission and commission accuracies of 79.97% and 83.79% respectively. This dramatic difference cannot be due just to the higher number of components of the third data set.
As a matter of fact, it has been shown in [8] that high correlation between input channels decreases classification accuracy, and SPOT bands 1, 2 and 3 overlap with LANDSAT bands 2, 3 and 4; the increase in accuracy between the results on LANDSAT and on (SPOT+LANDSAT) should therefore definitely be due to the difference between the acquisition dates of the two images.

ACKNOWLEDGEMENTS

The author would like to thank Professor Vito Cappellini of the University of Florence for the helpful and encouraging discussions during the preparation of this paper.
REFERENCES
[1] Manual of Remote Sensing, 1983, R. N. Colwell (Ed.), American Society of Photogrammetry, Falls Church, VA.
[2] G. Wilkinson, A. Chiuderi: "Il telerilevamento alla fine del ventesimo secolo: una nuova sfida nel campo dell'informatica" - Proc. of the workshop Il telerilevamento ed i sistemi informativi territoriali nella gestione delle risorse ambientali - Trento, October 27, 1994, published by the Office for Official Publications of the European Communities - Luxembourg.
[3] F. J. Gallego, J. Delincé, C. Rueda: "Crop area estimates through remote sensing: stability of the regression correction", Int. J. Remote Sensing, 1993, Vol. 14, N. 18, pp. 3433-3445.
[4] J. D. Paola, R. A. Schowengerdt: "A review and analysis of backpropagation neural networks for classification of remotely-sensed multi-spectral imagery", Int. J. Remote Sensing, 1995, Vol. 16, N. 16, pp. 3033-3058.
[5] A. Chiuderi, S. Fini, V. Cappellini: "An Application of Data Fusion to Land Cover Classification of Remote Sensing Imagery: a Neural Network Approach" - Proc. of IEEE International Conference on Multisensor Fusion and Integration Systems - pp. 756-762 (1994).
[6] C. Casteras, G. Doyon, E. Martin, V. Rodriguez: "Corrections Geometrique et Atmospherique - Manuel Utilisateur", CISI-Geo Design, CCR (Ispra), RSO/DOT-CGA-MU, Ed. 2 (1994).
[7] P. D. Wasserman: Neural Computing, Theory and Practice, Van Nostrand Reinhold, 1989, New York.
[8] A. Chiuderi: "Improving the Counterpropagation network performances", Neural Processing Letters, 1995, Vol. 2, N. 2, pp. 27-30.
Classification of bottle rims using Neural Networks - an LMS approach

C. Teoh and J. Braham Levy
Control Systems Centre, UMIST, P.O. Box 88, Manchester M60 1QD, United Kingdom

Abstract

In the glass making industry, the quality of bottle rims is of great importance. Uneven or 'wavy' rims on the bottles can lead to the deterioration of the contents held in the bottles due to air ingress. Depending on what is to be held in the bottles, different limits of waviness can be tolerated. It is important, therefore, that bottles can be accurately classified according to their rim quality. Generally, the containers' quality is measured by applying pressure to a sealed container and observing the pressure profile as the air seeps out of the uneven rim. A means of using such data to classify bottles already exists through the use of a multi-layer perceptron (MLP) neural network with 100 inputs, a single 3-neuron hidden layer and a single-neuron output layer. A classification success rate of 95% had been recorded. However, because of the large number of inputs [2] and the complex algorithm used [4], this required neural networks that were impractical for real-time analysis. Applying such a neural network to classify bottles in real time may cause problems because of the amount of calculation involved. This paper describes a way of simplifying the classification neural network by first finding an LMS filter to model the pressure profile. By feeding the weights of the resulting filter (which are unique for each pressure profile) into the classification neural network, it was found that a much simpler neural network results. By choosing the length of the filter and the number of hidden neurons carefully, a 100% success rate for bottle classification can be achieved. Typically, this involved an MLP neural network with just 4 inputs, a 2-neuron hidden layer and a 6-neuron output layer.
Introduction A Dual Head Gauger (DHG) machine is usually used to measure the quality of glass bottle rims. At the heart of a DHG is a plunger which can move up and down at various speeds. A hole in the centre of the plunger allows air to be pumped through it. A bottle whose rim quality is to be measured will be placed below the plunger with its rim facing upwards. At the start of the DHG measurement cycle, the plunger will descend towards the bottle rim. Air will be pumped through the plunger at the same time. When the plunger comes into contact with the bottle rim, it will stop moving and the air which is now pumped into the bottle will cause the pressure within the bottle to build up. This increasing pressure will be measured by a pressure sensor within the DHG machine. After a fixed amount of time, the plunger will move up and away from the bottle rim. The air which was trapped in the bottle will be released and the measured pressure reduces rapidly.
The result is that for each bottle, the DHG machine will produce a pressure profile. Depending on the quality of the bottle measured, the pressure profile will be different. The quality of bottle rims is then determined by examining the pressure profiles. Bottles classified according to their rim quality are given classifications like wavy-40, wavy-50 etc., where wavy-40 indicates a waviness of 0.4 mm. In this study, we are required to classify each bottle into one of 6 different categories: good bottles (i.e. less than wavy-40) and wavy-40, -45, -50, -55 and -65 bottles.
The Pressure Profiles
Six different bottles, one from each classification, were used to obtain the pressure profiles used here. Using a DHG simulator, 30 separate pressure profiles were obtained from each bottle [3]. This was done by making the DHG perform 30 consecutive measurement cycles on each bottle and storing each pressure profile obtained in a separate file. Altogether, 180 pressure profiles were obtained (30 profiles from each of the 6 bottle types). The pressure profiles of the different bottles can be seen in Figure 1.
Figure 1: Plot of Pressure Profiles from the 6 Bottle Types
Each pressure profile is made up of 200 evenly spaced sampled points. The first point is obtained when the plunger starts its descent toward the bottle, and the last one is obtained soon after the plunger starts its ascent back to its home position. In each of the pressure profiles obtained, the DHG cycle time was set to 0.5 seconds. This makes the time between sample points approximately 2.5 ms. 10 of the profiles from each bottle type were used to train the classification neural network, and the remaining 20 profiles were used to test it.
It can be seen from the plots of the pressure profiles that although there were distinct boundaries between the classes good, wavy-40 and wavy-55, the classes wavy-45, wavy-50 and wavy-65 were very close together. It was anticipated, therefore, that any problems in the classification process would come from the latter 3 classes.
Review of Past Results
It had been reported [2] that by using a logistic-sigmoid classification neural network with 100 input points, 1 hidden layer of 3 neurons and a single-neuron output layer, a 95% success rate was obtained in the classification of the different bottles. The 100 input points were obtained from pressure readings around the peak of each pressure profile. The single output indicated the bottle classification according to its output value (between 0 and 1): the larger the output value, the less wavy the bottle. Although reasonably successful in classifying the bottles, this method required a large number of input points, and hence large amounts of computation. It would be useful if a small number of parameters could be used to represent each pressure profile and be fed into a similar MLP neural network to classify the bottles. In order to reduce the computational complexity, a method was sought whereby a set of parameters unique to the problem could be utilised. Initial success was found by using Principal Component Analysis. This reduced the input space from 100 points to 5 [4], with a consequent reduction in neural network complexity. The determination of the principal components, however, still required considerable computational effort. In an attempt to reduce this further, an approach utilising the LMS algorithm [6] was used. This produces an on-line estimate of a set of parameters for each pressure curve using the tap weights of the LMS filter.
The LMS Method
The LMS method of classifying bottles using their pressure profiles first requires a pre-processing LMS filter stage. In this pre-processing stage, each of the 100 sampled points of a pressure profile around its peak is passed through an LMS filter. The length of the filter determines the number of weights present in the filter. The idea is to adjust the weights in the filter so that when a consecutive time sequence of pressure samples is presented to the filter, it is able to "predict" the next pressure value in that sequence. All the weights of the LMS filter were initially set to zero. Using a fixed learning rate of 0.1, the weights were updated using the Widrow-Hoff rule as each set of inputs was presented to the filter. When the last set of inputs was reached, the weights would have converged to a value which was unique for each pressure profile. In a way, the 200 points of the pressure profile would have been "reduced" to just a handful. Once the weights of the filter have been obtained, they are fed into the neural network for classification.
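The pre-processing stage above can be sketched as follows. This is a minimal illustration using the zero initialisation, 4-tap length and fixed learning rate 0.1 stated in the text; the function name and the synthetic profile are ours, not the paper's.

```python
import math

def lms_profile_weights(profile, n_taps=4, mu=0.1):
    """Adapt an LMS one-step predictor over a pressure profile and return
    the final tap weights, which act as a compact signature of the curve."""
    w = [0.0] * n_taps
    for n in range(n_taps, len(profile)):
        x = profile[n - n_taps:n]                               # most recent samples
        e = profile[n] - sum(wi * xi for wi, xi in zip(w, x))   # prediction error
        w = [wi + mu * e * xi for wi, xi in zip(w, x)]          # Widrow-Hoff update
    return w

# A smooth synthetic "profile": for a slowly varying curve the four weights
# settle near 0.25 each, with small deviations characterising the curve,
# which is consistent with the narrow ranges reported in Table 1.
weights = lms_profile_weights([0.5 + 0.1 * math.sin(0.03 * n) for n in range(200)])
```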
A 4-input LMS filter was used. 10 pressure profiles from each of the 6 bottle types were presented to the filter. The resulting range within which each of the 4 weights fell is summarized in the table below.
              w1              w2              w3              w4
Good bottle   0.2517-0.2526   0.2407-0.2416   0.2300-0.2309   0.2197-0.2206
Wavy-40       0.2545-0.2550   0.2431-0.2436   0.2319-0.2324   0.2211-0.2217
Wavy-45       0.2546-0.2552   0.2418-0.2423   0.2292-0.2298   0.2170-0.2176
Wavy-50       0.2548-0.2552   0.2421-0.2426   0.2298-0.2304   0.2179-0.2186
Wavy-55       0.2544-0.2549   0.2427-0.2431   0.2312-0.2316   0.2201-0.2206
Wavy-65       0.2500-0.2508   0.2378-0.2385   0.2259-0.2265   0.2142-0.2149
Table 1: Weight Space Analysis

It can be seen that each type of bottle occupies its own characteristic weight space range. These ranges are observed to be quite narrow. It is also observed that there is considerable overlap in the weight 1 (w1) ranges among the different bottle types. The overlap in the other weights is much less.
The Classification Neural Network
Once the LMS filter weights had been found for a pressure profile, they were fed into a multilayer perceptron (MLP) neural network. The MLP used here had a single hidden layer with 2 neurons and an output layer with 6 neurons. The network is trained so that each of the 6 output neurons corresponds to a particular bottle type. Each bottle class would cause one of the outputs to be set to "1" while the rest of the outputs would be "0". Because of the continuous nature of the quality of bottle rims, however, it would not be realistic to expect an unclassified bottle to produce such a well-defined output. It was, therefore, envisaged that after the network had been trained, a post-processing stage consisting of a competitive transfer function would be used to "shape" the output. 10 curves from each bottle type were used for the training, while the remaining 20 were used to verify the effectiveness of the network. Training was done using the Levenberg-Marquardt algorithm from the neural network toolbox for MATLAB. The weights for the first layer of the neural network were initialized using the Nguyen-Widrow technique [5]. When the trained neural network was verified, it was found that all the pressure profiles not used in the training were classified correctly.
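The competitive post-processing stage can be sketched as a winner-take-all function: whichever of the 6 network outputs is largest is forced to "1" and the rest to "0". A minimal sketch (the function name is ours):

```python
def compete(outputs):
    """Winner-take-all "shaping" of the MLP outputs: the largest
    activation becomes 1.0, every other output becomes 0.0."""
    winner = max(range(len(outputs)), key=lambda i: outputs[i])
    return [1.0 if i == winner else 0.0 for i in range(len(outputs))]

# Six soft network outputs, one per bottle class:
shaped = compete([0.1, 0.7, 0.2, 0.05, 0.1, 0.3])
```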
Conclusion
Using the LMS filter as a pre-processor to a bottle-classifying neural network, it was possible to reduce the number of inputs to the neural network from 100 to just 4. It was found that the resulting MLP neural network was able to classify 100% of the profiles introduced to it.
While this method incurred a pre-processing stage as an overhead, this was more than compensated for by a much simpler neural network. The pre-processing is easily implemented and can be performed in real time.
References
1. Markatos P., Industrial bottle inspection machine: Control and monitoring, Design Exercise, Control Systems Centre, UMIST, 1995
2. Levy J. B. and Markatos P., A Classifier for Sealing Rims of Glass Containers, 2nd International Workshop on Image and Signal Processing, Budapest, November 1995
3. Veryard M., Acquisition and Analysis of Data from a Bottling Machine, Design Exercise, Control Systems Centre, UMIST, 1996
4. Levy J. B. and Stefanis E., Signal Analysis for a Bottle Inspection Machine using Neural Networks, IEE Control96, Exeter, Sept 1996
5. Nguyen D. and Widrow B., Improving the Learning Speed by Choosing Initial Values of the Adaptive Weights, Intl. Conference on Neural Networks, vol. 3, pp. 21-26, 1990
6. Widrow B., Adaptive Filters, in Aspects of Network and System Theory, (eds.) N. DeClaris and R. E. Kalman, Holt, Rinehart and Winston, pp. 563-590, 1970
Invited Session M: WAVELETS AND FILTER BANKS IN COMMUNICATIONS
DATA COMPRESSION, DATA FUSION AND KALMAN FILTERING IN WAVELET TRANSFORM
Q. Jin†, K.M. Wong††, Z.Q. Luo†† and E. Bosse†††
† Mitel Corp., Ottawa, Ont., Canada K2K 1X3
†† CRL, McMaster University, Hamilton, Ont., Canada L8S 4K1
††† DREV, Courcelette, Quebec, Canada
ABSTRACT
In this paper, we propose an optimum fusion and Kalman filtering algorithm using wavelet packet decomposition. The performance is evaluated and simulation results are presented.

1. INTRODUCTION
Let the system equation for the dynamic variable x(n) be given by

x(n) = A(n)x(n − 1) + v(n),  n = 0, 1, ...,   (1)
where A(n) is the N × N matrix, and v(n) is the N × 1 noise vector of zero mean whose correlation function is E[v(n)v†(m)] = R_v(n)δ_mn, with δ_mn being the Kronecker delta and † being the conjugate transpose. Suppose that there are K sensors, and the measurement vector at the nth discrete instant of the kth sensor is represented by

y_k(n) = C_k(n)x(n) + η_k(n),  k = 1, 2, ..., K,   (2)

where C_k(n) is an M × N matrix (M ≤ N), and η_k(n) is an M × 1 noise vector of zero mean whose correlation function is E[η_k(n)η_k†(m)] = R_{η_k}(n)δ_mn. It was proved [1,2] that for the above signal model, the optimum fusion algorithm for multi-sensor Kalman filtering is equivalent to fusing the measurement vectors y_k(n) first and then applying the Kalman filtering algorithm. The fused measurement vector is
ỹ(n) = [Σ_{k=1}^{K} W_k(n)C_k(n)] x(n) + Σ_{k=1}^{K} W_k(n)η_k(n),   (3)
where the optimum weighting matrices are [2]

W_k(n) = Φ_K^{−1}(n) (C_k(n)C_1^#(n))† R_{η_k}^{−1}(n),  with  Φ_K(n) = Σ_{k=1}^{K} (C_k(n)C_1^#(n))† R_{η_k}^{−1}(n) (C_k(n)C_1^#(n)),   (4)
where # represents the pseudo-inverse of a matrix. In this paper, we apply the optimum fusion algorithm in the wavelet transform domain. At each sensor location, the observed data y_k(n) are first decomposed using wavelet packet decomposition [3]. In most cases, the signal energy is concentrated in the low-pass band, which contains most of the information and is therefore transmitted. For the other bands, only the coefficients above a threshold are transmitted. In this way, we may save communication bandwidth. At the fusion centre, we first fuse the information in the low-pass band and then apply the Kalman filtering algorithm. Due to the down-sampling procedure, the sampling rate is reduced at each decomposition level. The Kalman filtering can now be implemented at a low rate, and the system complexity is reduced. As for the other bands, since most of the expansion coefficients are zero, we fuse the coefficients in the corresponding band using the optimum fusion algorithm with no Kalman filtering being applied. These fused coefficients are combined with the outputs from Kalman filtering in the low-pass band to form the estimated output by the wavelet packet reconstruction algorithm. A general system diagram is shown in Fig. 1.

2. SYSTEM MODELING
Let the sampled signal be denoted by s(n), (n = 0, ±1, ...). Then we can compute the sequences s₀(n) = Σ_k f(2n − k)s(k) and s₁(n) = Σ_k g(2n − k)s(k), where {f(k)} and {g(k)} are wavelet coefficients [3]. If the signals s₀(n) and s₁(n) are further decomposed using the same relation as above, we obtain four sequences s₀₀(n), s₀₁(n), s₁₀(n), s₁₁(n). If this decomposition is carried on iteratively, we obtain the components of the original signal s(n) decomposed at various levels, with the general component denoted by s_(b_m)(n), where b_m is a binary number having m digits corresponding to the mth level of decomposition.
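The per-band thresholding that saves communication bandwidth can be sketched as follows. This is a minimal illustration with a single scalar threshold per band; the names are ours, not the paper's.

```python
def compress_band(coeffs, gamma):
    """Zero out subband coefficients whose magnitude falls below the
    threshold gamma; only the surviving coefficients would be sent to
    the fusion centre. Also report the fraction of coefficients dropped,
    i.e. the bandwidth saved for this band."""
    kept = [c if abs(c) >= gamma else 0.0 for c in coeffs]
    saved = sum(1 for c in kept if c == 0.0) / len(coeffs)
    return kept, saved

kept, saved = compress_band([0.9, 0.05, -0.4, 0.01], gamma=0.1)
```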
In order to carry out the Kalman filtering, the system dynamic model and the measurement model must be known at the low-resolution decomposition level.
Assuming b_m = 000, x_(000)(n) is related to x(n) as

x_(000)(n) = Σ_i f_l(8n − i) x(i),   (5)

where the z-transform of f_l(n) is related to that of f(n) as F_l(z) = F(z)F(z²)F(z⁴). Using (1), the dynamic model for x_(000)(n) can be approximated as
x_(000)(n) = [A(8n)]⁸ x_(000)(n − 1) + v_(000)(n).   (6)
The noise for the system dynamic model is v_(000)(n), which is zero mean, and its covariance matrix R_{v(000)}(n, m) is approximately Σ_{i₁,i₂=0} [A(8n)]^{i₁} R_v(8n) [A(8n)]^{i₂} R_{f_l}(i₂ − i₁) δ_mn (where R_{f_l} is the correlation function of f_l). The measurement model for sensor k at the low-resolution space can be obtained from (2) as
y_{k(000)}(n) = C_k(8n) x_(000)(n) + η_{k(000)}(n),   (7)
where the measurement noise is zero mean with covariance matrix R_{η_k(000)}(n, m) = R_{η_k}(8n)δ_mn. With the signal dynamic model of (6) and the measurement model of (7), the Kalman filtering algorithm can be applied to estimate the signal x_(000)(n) from the observations of sensor k. Similar relations can also be derived for the other b_m.

3. OPTIMUM FUSION
In this paper, we assume that a three-level wavelet packet decomposition is applied, giving eight packet outputs, and that the Kalman filtering is only applied to the low-pass band y_{k(000)}(n). Based on the assumption that most of the expansion coefficients y_{k(b)}(n) (b ≠ 000) are very small and can be ignored, we may set a threshold γ_{k(b)} such that an element of y_{k(b)}(n) is set to zero when it is smaller than the corresponding element of γ_{k(b)} (b ≠ 000). In this way, we may save communication bandwidth by transmitting only those elements of y_{k(b)}(n) whose values are larger than γ_{k(b)}. At the receiver, we have a set of coefficients y_{k(b)}(n) (k = 1, ..., K; b = 000, 001, ..., 111), most of which are zero for b ≠ 000. Before the Kalman filtering and wavelet packet reconstruction algorithm, the measurements y_{k(b)} of the different sensors in the same decomposition band should be fused into a single measurement with a set of weighting matrices W_{k(b)}(n). In order to derive the optimum weighting matrix W_{k(b)}, we consider the bands b = 000 and b ≠ 000 separately. For b = 000, with the measurement model of (7), the optimum weighting matrices W_{k(000)}(n) can be obtained using the same relations as (4) and (3). With these weighting matrices, we can obtain the optimum fused measurement ỹ_(000)(n). Now, with the fused measurement model and the system dynamic model of (6), the standard Kalman filtering can be applied to calculate the estimated values x̂_(000)(n).
For b ≠ 000, most of the coefficients y_{k(b)}(n) are small enough to be ignored, and we have a set of new observations received at sensor k after the data compression is applied with a threshold γ_{k(b)}. The approximate optimum fused vector for b ≠ 000 is obtained as

z_(b)(n) = Σ_{k=1}^{K} W_k(8n) ([|y_{k(b)}(n)| − γ_{k(b)}]⁺ ⊗ y_{k(b)}(n)),   (8)

where ⊗ represents the direct (element-wise) product of two vectors or matrices. For an N-element vector x, the ith element of [x]⁺ is defined as

[x]_i⁺ = 1 if x_i ≥ 0,  [x]_i⁺ = 0 if x_i < 0.   (9)
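The fused estimate of Eq. (8), with its Eq. (9) indicator mask, can be sketched as follows. For simplicity this sketch uses a scalar weight per sensor, whereas the paper uses the weighting matrices W_k(8n); all names are illustrative.

```python
def fuse_band(ys, Ws, gammas):
    """Fuse one subband across sensors: mask each sensor's coefficients
    with the indicator [|y| - gamma]+ of Eq. (9) (1 where |y| >= gamma,
    0 otherwise), apply the sensor weight, and sum over sensors."""
    n = len(ys[0])
    z = [0.0] * n
    for y, W, g in zip(ys, Ws, gammas):
        for i in range(n):
            keep = 1.0 if abs(y[i]) - g >= 0 else 0.0   # Eq. (9) indicator
            z[i] += W * keep * y[i]                     # weighted, masked sum
    return z

# Two sensors, two coefficients each; the second coefficient of sensor 1
# falls below its threshold and is dropped from the fusion.
z = fuse_band(ys=[[0.8, 0.02], [0.6, 0.5]], Ws=[0.5, 0.5], gammas=[0.1, 0.1])
```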
After the thresholding, most of the z_(b)(n) are zero for b ≠ 000, and it is not efficient to apply Kalman filtering in these bands. The estimated values x̂_(b)(n) for b ≠ 000 can be approximately obtained as

x̂_(b)(n) ≈ [Σ_{k=1}^{K} W_k(8n) C_k(8n)]^# z_(b)(n).   (10)

With x̂_(b)(n) from all bands, we may apply the wavelet packet reconstruction algorithm to obtain x̂(n).
4. PERFORMANCE ANALYSIS
The estimation performance includes the estimation bias β_x(n) and the variance V_x(n). In the following, we keep the assumption that the Kalman filtering is applied only to the low-pass band ỹ_(000)(n). The estimation bias at each wavelet decomposition band is then derived as

β_{x(b)}(n) = 0 for b = 000;  β_{x(b)}(n) = x_(b)(n) − E[x̂_(b)(n)] for b ≠ 000.   (11)
For b ≠ 000, β_{x(b)}(n) is

β_{x(b)}(n) = [Σ_{k=1}^{K} W_k(8n)C_k(8n)]^# Σ_{k=1}^{K} W_k(8n) ([γ_{k(b)} − |C_k(8n)x_(b)(n)|]⁺ ⊗ C_k(8n)x_(b)(n)).
The average bias β̄_x can be obtained by the wavelet packet reconstruction algorithm based on β_{x(b)}(n). For the bands where data are compressed and no Kalman filtering is applied, the major error introduced is the bias due to the data compression. The error due to the measurement noise is very small. This is because the measurement noise is evenly distributed over the coefficients if the original measurement noise η_k(n) is independent and identically distributed. Since most of the coefficients (generally, more than 95%) are cut to zero by the compression, the remaining noise power is very small, and it is further reduced by the fusion algorithm in the subband. Therefore, the total contribution of the measurement noise to the final variance in the estimation of x̂(n) is very small in comparison with the one introduced by the low-pass band. For the low-pass band (b = 000), because the Kalman filtering is asymptotically unbiased, the major error will be the variance, especially when n is large. This variance can be obtained from the Kalman iterative algorithm [1,2], and is denoted V_x(000). The average variance for the signal estimation is V̄_x = V_x(000)/8.

5. SIMULATION RESULTS
We assume that the target is moving with a constant angular velocity ω₀ = 2π/300 in a perfect circle of radius r₀ = 50 centred at the origin, with abscissa and ordinate represented by the co-ordinates u₁ and u₂. The equations of motion of the target in discrete time are given by u₁(n) = r₀ cos(ω₀n), u₂(n) = r₀ sin(ω₀n). The state equation for the above movement can be found in [2]. We use 3 sensors to carry out the multi-sensor Kalman filtering, with the co-ordinate systems [u₁, u₂, α] being [0, 0, 0°], [100, 0, 30°] and [100, 100, 60°] respectively (α is the angle with respect to the reference co-ordinates). The noise covariance matrix is R_v(n) = σ_v² diag(1, .01, 1, .01).
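The target trajectory of the simulation example can be reproduced directly from the stated equations of motion; a short sketch (variable names are ours):

```python
import math

# Constant angular velocity w0 = 2*pi/300 on a circle of radius r0 = 50
# centred at the origin, as in the simulation setup.
w0, r0 = 2 * math.pi / 300, 50.0
u1 = [r0 * math.cos(w0 * n) for n in range(300)]  # abscissa
u2 = [r0 * math.sin(w0 * n) for n in range(300)]  # ordinate
```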
In each of the examples, we compare the performance of target tracking employing the multi-sensor Kalman filtering using the optimum measurement fusion developed in [2] with that employing the algorithm derived in this paper for data compression. The measurement noise vectors in all the sensors have covariance matrix given by R_{η_k} = 10 diag(1, 1). For the target tracking, we are interested only in the position of the target and not its velocity. Therefore, we only estimate x_s = [u₁, u₂]. The wavelet we use is the Daubechies orthogonal wavelet with length L = 6 [3]. In the first example, the noise vector in the state equation has a covariance matrix with σ_v² = 0.4. Fig. 2(a) shows the target trajectory and tracking trace employing the algorithm developed in this paper, and Fig. 2(b) shows the corresponding total mean-square error at each sampling instant. The mean-square error using the optimum algorithm developed in [2] and the theoretical estimation error are also plotted for comparison. It can be found that the algorithm developed in [2] converges to the theoretical optimal solution, while the algorithm developed in this paper offers a better performance than both of them. We also calculate the estimation bias and variance respectively as β̄_x = 0.72 and V̄_x = 0.78, which gives the total mean-square error ε̄_x = 1.50. This is almost the same as the measurement in Fig. 2(b), which shows that our analysis and approximation in the performance derivation are valid. It can be observed that in the above example, the algorithm developed in this paper offers even better performance than the optimum solution. The reason is that in this example, the target moves smoothly with a small variance σ_v². Therefore, in the wavelet packet decomposition, most of the signal energy is concentrated in the low-pass band x_(000)(n). When we use thresholds to cut most of the coefficients in the other bands to zero, we introduce a small bias β̄_x. At the same time, we also suppress 7/8 of the noise power. Therefore, with a small increase in the estimation bias, the estimation variance is greatly reduced. As a result, the total mean-square error is decreased. In our second example, all the simulation parameters are the same as in the first example except that the noise variance in the state equation is increased to σ_v² = 10. Fig. 2(c) shows both the target trajectory and the tracking trajectory, and Fig. 2(d) shows the mean-square error. Once again, the algorithm developed in [2] converges to the theoretical optimum solution. However, for this example, the performance of the algorithm developed in this paper is not as good as the optimum solution. This is because in this example the target moves much more irregularly due to the high noise variance σ_v² in the state equation. After the data compression, we have a relatively large bias β̄_x = 10.46, and the variance V̄_x = 0.83 is almost negligible compared with the bias. As a result, the total mean-square error ε̄_x = 11.29 is larger than the optimum value of 5.4.
Fig. 1: General system structure. Fig. 2: Simulation results.

6. CONCLUSIONS
In this paper, we have examined the problem of multi-sensor Kalman filtering using wavelet packet decomposition. After wavelet packet decomposition, most of the expansion coefficients in the high-frequency bands can be ignored. As a result, the data transmission rate is reduced and communication bandwidth is saved. The optimum fusion and the corresponding Kalman filtering algorithms in the transform domain are developed. Due to the reduced sampling rate in the transform domain, the computation load for Kalman filtering at the fusion centre is eased. The performance of this algorithm is analysed, and simulation results are shown for both the algorithm developed in this paper and the optimum fusion algorithm proposed in [2]. The performance varies with different target movements. However, the algorithm in this paper shows obvious advantages in both computation rate and communication bandwidth.

REFERENCES
[1] Willner, D., Chang, C.B. and Dunn, K.P., "Kalman Filtering Configurations for Multiple Radar Systems", Massachusetts Institute of Technology Lincoln Laboratory, Technical Note 1976-21, April 14, 1976.
[2] Jin, Q., Li, J.Y., Luo, Z.Q., Wong, K.M. and Yip, P., "Data Fusion and Data Compression for Multi-Sensor Kalman Filtering", CRL Report No. 296, McMaster University, March 1995.
[3] Wickerhauser, M.V., "INRIA Lectures on Wavelet Packet Algorithms", Numerical Algorithms Research Group, Dept. of Mathematics, Yale University, 1991.
Performance of wavelet packet division multiplexing in timing errors and flat fading channels
J. Wu, K. M. Wong, and Q. Jin
Communications Research Laboratory, McMaster University, Hamilton, Ontario, L8S 4K1, Canada.
Abstract Wavelet Packet Division Multiplexing (WPDM) is a multiple signal transmission scheme based on the orthogonality of wavelet packet basis functions. We investigate the performance of WPDM under two types of interference: (1) timing errors between the transmitter and the receiver; (2) the presence of fast flat fading. Pilot Symbol Assisted Modulation (PSAM) is applied to combat the irreducible error due to flat fading. The probability of error of a WPDM system with PSAM in flat fading channels is analysed. A performance comparison between WPDM and FDM is presented. It is shown that the pilot spacing for WPDM can be larger than the spacing for FDM. Therefore, WPDM wastes less energy in pilot symbols and gains capacity when compared with the FDM scheme.
1. Introduction
Wavelets and wavelet packets have attracted considerable attention recently among researchers in many fields. In the area of communications, wavelet packets are applied, among others, in multiplexing [1], spreading codes in CDMA [2], and coding waveforms in binary transmission [3]. In this paper, we apply wavelet packets to multiplexing and develop a scheme called Wavelet Packet Division Multiplexing (WPDM). From a scaling function φ₀₁(t), a wavelet packet can be formed [4] and represented as a binary tree. Let (ℓ, m) represent the mth node at the ℓth level. The basis functions at node (ℓ, m) are given by the following iterative formulae:
φ_{ℓ,2m−1}(t) = Σ_n h[n] φ_{ℓ−1,m}(t − nT_{ℓ−1}),
φ_{ℓ,2m}(t) = Σ_n g[n] φ_{ℓ−1,m}(t − nT_{ℓ−1}).
By choosing h[n] and g[n] such that φ₀₁(t) satisfies ⟨φ₀₁(t), φ₀₁(t − nT₀)⟩ = δ[n] (where δ[n] is the Kronecker delta), all φ_{ℓm}(t) in the leaf nodes of a complete tree, together with their translated versions, are orthogonal; i.e., ⟨φ_{ℓm}(t), φ_{λμ}(t − 2^ℓ kT₀)⟩ = δ[ℓ − λ]δ[m − μ]δ[k]. These orthogonal functions can be used to code the data sequences a_{ℓm}[n] of different users, even though the functions may overlap in both time and frequency. The multiplexing model is shown in Fig. 1.
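As an illustration of multiplexing on orthogonal wavelet packet functions, the following sketch performs one stage of discrete wavelet packet reconstruction and decomposition for two users. It uses Haar filters h = [1/√2, 1/√2], g = [1/√2, −1/√2] purely for simplicity (the paper's examples use Daubechies wavelets); the round trip recovers each user's symbols exactly.

```python
import math

C = 1.0 / math.sqrt(2.0)  # Haar filter coefficient

def wpdm_multiplex(a11, a12):
    """Combine two users' symbol streams into the root-level sequence
    a01[n] via one reconstruction stage (upsample by 2, filter, sum)."""
    s = []
    for x, y in zip(a11, a12):
        s.append(C * (x + y))   # even sample: h path plus g path
        s.append(C * (x - y))   # odd sample: h path minus g path
    return s

def wpdm_demultiplex(s):
    """Reverse operation: correlate with h and g, down-sample by 2."""
    a11 = [C * (s[2 * n] + s[2 * n + 1]) for n in range(len(s) // 2)]
    a12 = [C * (s[2 * n] - s[2 * n + 1]) for n in range(len(s) // 2)]
    return a11, a12

a11, a12 = [1.0, -1.0, 1.0, 1.0], [-1.0, -1.0, 1.0, -1.0]
r11, r12 = wpdm_demultiplex(wpdm_multiplex(a11, a12))
```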
Figure 1: The system model of WPDM.
Figure 2: BER vs. E_b/N₀ with the timing error = 3/20.
Let L denote the set of levels containing the terminals of a given tree, and M_ℓ denote the set of indices of the terminals at the ℓth level. The multiplexed signal in Fig. 1 can be expressed as s(t) = Σ_{ℓ∈L} Σ_{m∈M_ℓ} Σ_n a_{ℓm}[n] φ_{ℓm}(t − nT_ℓ). An equivalent expression for s(t) is

s(t) = Σ_n a₀₁[n] φ₀₁(t − nT₀),   (1)
where a₀₁[n] is obtained from a_{ℓm}[n] by wavelet packet reconstruction. Eq. (1) corresponds to an equivalent multiplexing model. At the receiver, the information sequence a_{ℓm}[n] can be recovered by correlating s(t) with the subcarrier function φ_{ℓm}(t). This is simply the reverse operation of multiplexing. Similarly, by the reverse operation of Eq. (1), i.e., correlating the corrupted s(t) with φ₀₁(t − nT₀), sampling the result at the correct instants, and passing the sample sequence through the wavelet packet decomposer, we can also recover the information sequences for each user. The scheme of assigning channels to different users based on waveform orthogonality in both dilation and translation is WPDM. In WPDM, there is considerable flexibility in choosing the wavelet packet functions. Advantages of WPDM include greater bandwidth efficiency and higher security compared to TDM and FDM systems. However, before WPDM can be promoted as a practical multiplexing technique, the performance of WPDM in various channels has to be fully investigated. In this paper, we analyse the performance of WPDM under two types of interference: (1) timing errors between the transmitter and receiver; (2) the presence of fading.
2. Timing error
In practice, the synchronization between the transmitter and the receiver is imperfect. This may result in intersymbol interference (ISI) and cross-talk in a multiplexing system. In this paper, the effects of timing error in a WPDM system are briefly summarized; for the detailed analysis, please refer to [1, 5]. The transmitted signal is s(t) in Eq. (1). Assume that at the receiver the correlating signal is φ₀₁(t − kT₀ − Δ), where Δ is the time discrepancy between the transmitter and the receiver. Then the output of the correlator is

ã₀₁[n] = Σ_k a₀₁[k] R_φ(kT₀ − nT₀ + Δ) + v[n],   (2)

where R_φ(τ) is the autocorrelation function of φ₀₁(t), and v[n] is due to Gaussian noise. We assume that the first-order derivative of R_φ(t) exists, and expand Eq. (2) by Taylor's formula with remainder,

ã₀₁[n] = a₀₁[n] + Δ Σ_k a₀₁[k] R′_φ(kT₀ − nT₀) + O(Δ²) + v[n].

Denote I₀₁[n] = Δ Σ_k a₀₁[k] R′_φ(kT₀ − nT₀). Observing I₀₁[n] = Δ (a₀₁[n] ∗ R′_φ(−nT₀)), we find that a wavelet minimizing Σ_n |R′_φ(−nT₀)|² gives the minimum interference energy. Therefore, good performance can be expected from such an optimum wavelet. The signal ã_{ℓm}[n] is obtained by passing ã₀₁[n] through the filter f_{ℓm}[n] and down-sampling by 2^ℓ; i.e.,
Denote I01[n] = A ~ k a~162 kT~ - nTo). Observing I[n] = A (a01[n]. R~c(-nTo)), we find that a wavelet minimizing ~"]~, IR~c(-nTo)l 2 gives the minimum interference energy. Therefore good performance can be expected from such an optimum wavelet. The signal #t,n[n] is obtained by passing ~01[n] through the filter ftm[n] and down-sampling by 2t; i.e.,
~[~] = y ~ f ~ [ k
-
2~]~o~[k] = ~ [ ~ ]
+ h~[~] + ~ [ ~ ] ,
(3)
k
where O(A 2) is ignored and Itm[n] = A ~-~xe:., ue.'.x ~']
[2tn - 2xJ] with Ji~[n] being a function of ft,.,.,[n], fxu[n], and R~(kTo). The term with Jf~,[n] represents the ISI and the remaining terms are cross-talk. In (3), Itm[n] and vtm[n] are independent additive noise, so the probability of error can be derived.
Fig. 2 illustrates the performance of the WPDM in the presence of timing errors when the 14th-order Daubechies wavelet and the corresponding optimum wavelet are in use, and for each user ±1 is transmitted. Both the theoretical derivation and the simulation results showed that the optimum wavelet has a lower probability of error than the Daubechies wavelet.
3. Flat fading channels
In mobile communications, fast flat fading degrades the system performance and introduces an irreducible bit error rate, or error floor. In order to suppress the error floors caused by fading, channel sounding techniques can be employed to compensate for the fading distortion. One of the commonly used channel sounding techniques is Pilot Symbol Assisted Modulation (PSAM) [6]. In PSAM, prespecified symbols are inserted periodically into the information sequence prior to modulation. Both pilot symbols and data symbols experience the same fading distortion as they pass through the channel. At the receiver, the received signal is split into two streams. One stream consists of the faded pilot symbols. Since the pilot symbols are known, an estimate of the channel distortion to the pilot signals can be extracted. This estimate may then be interpolated to form an estimate of the channel state, provided the channel does not change too fast. With this channel estimate, the channel distortion to the data stream can be compensated and better performance can be expected. Flat fading effects can be modelled as a multiplicative noise process on the transmitted signal. Furthermore, we assume the multiplicative noise has a Rayleigh-distributed amplitude and a uniformly distributed phase. Denote s_c(t) = s(t)cos ω₀t as the transmitted signal, where ω₀ is the carrier frequency, and s(t) is as in (1) with pilot symbols inserted in a₀₁[n] at n = kM. The low-pass equivalent of the output of a flat fading channel is given by
r(t) = u(t)s(t) + v_c(t),

where v_c(t) is AWGN with power spectral density N₀ in both real and imaginary components, and u(t) represents the fading, which is a complex zero-mean Gaussian process characterized by its power spectrum W(f) = σ_u² / (2π√(f_D² − f²)), with σ_u² being the variance and f_D the Doppler spread. At the receiver, passing r(t) through a filter matched to the transmitted pulse and sampling the output of the matched filter at the exact instants kT₀, we have r_s[k] = u[k]a₀₁[k] + v[k], where u[k] are samples of the fading process u(t) at t = kT₀ and v[k] = ∫ v_c(t)s(t − kT₀)dt. To carry out fading estimation at k = mM, r_s[mM] is simply divided by the known symbol a₀₁[mM]. The fading estimates at k = mM are therefore

ũ[mM] = r_s[mM]/a₀₁[mM] = u[mM] + v[mM]/a₀₁[mM].
Since the fading process is bandlimited, the fading distortion on the data symbols can be interpolated from the K nearest pilot samples ũ[mM]. In [6], a method was proposed using a Wiener filter to obtain estimates û[k] for −⌊LM/2⌋ < k < ⌊LM/2⌋ such that

û[k] = Σ_{i=−⌊K/2⌋}^{⌊K/2⌋} b*[i, k] ũ[iM],

where the coefficients b*[i, k] can be determined by minimizing the variance of the estimation error e[k] = u[k] − û[k]. Denote b[k] = [... b[−M, k] b[0, k] b[M, k] ...]ᵀ. An optimum estimate is achieved when the coefficient vectors b[k] satisfy the normal equation R b[k] = w[k], where the elements of R and w are given by

R_ik = E(ũ[iM]ũ*[kM]) = R_u((iM − kM)T₀) + N₀δ[i − k]/E[|a₀₁[iM]|²],
w_i[k] = E(u*[k]ũ[iM]) = R_u((iM − k)T₀),

and R_u(τ) is the autocorrelation function of u(t). Fading compensation is carried out by dividing each received symbol r_s[k] by the corresponding fade estimate û[k]. The compensated samples are given by

r̃[k] = r_s[k]/û[k].   (4)
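The PSAM pipeline around Eq. (4) can be sketched as follows. This is a minimal real-valued illustration, with simple linear interpolation standing in for the Wiener interpolator of [6], pilots assumed at both ends of the block, and all names being ours.

```python
def psam_compensate(r, pilot_idx, pilot_symbol=1.0):
    """Estimate the fading at each pilot position by dividing by the
    known pilot symbol, linearly interpolate the channel estimate
    between pilots, then divide every received sample by the estimate
    (the compensation step of Eq. (4))."""
    u_p = [r[i] / pilot_symbol for i in pilot_idx]   # fade estimates at pilots
    out = []
    for k in range(len(r)):
        # find the pilot interval containing k and interpolate linearly
        for j in range(len(pilot_idx) - 1):
            lo, hi = pilot_idx[j], pilot_idx[j + 1]
            if lo <= k <= hi:
                t = (k - lo) / (hi - lo)
                u_hat = u_p[j] + t * (u_p[j + 1] - u_p[j])
                break
        out.append(r[k] / u_hat)                     # compensated sample
    return out

# Linearly drifting (real) fade, pilots (+1) every 5 symbols:
fade = [0.5 + 0.05 * k for k in range(16)]
data = [1.0 if k in (0, 5, 10, 15) else (-1.0) ** k for k in range(16)]
received = [fade[k] * data[k] for k in range(16)]
compensated = psam_compensate(received, [0, 5, 10, 15])
```

Because the fade drifts linearly here, the interpolation is exact and the data symbols are recovered perfectly; with a band-limited Rayleigh fade the Wiener interpolator of the paper would be used instead.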
The above procedure can be used in any digital communications system. Here we analyse the performance of a WPDM system with φ₁₁(t) and φ₁₂(t) being employed by two users to transmit ±1 sequences. Passing r̃[k] through the wavelet packet decomposition filters, we obtain the estimates ã₁₁[n] and ã₁₂[n],

ã₁₁[n] = Σ_k r̃[k] h[k − 2n],  ã₁₂[n] = Σ_k r̃[k] g[k − 2n].   (5)
The analysis of the probability of error of ã₁₂[n] is similar to that of ã₁₁[n]; we will concentrate on ã₁₁[n]. Substitution of Eq. (4) into (5) results in

ã₁₁[n] = Σ_k [(u[k]a₀₁[k] + v[k])/û[k]] h[k − 2n] = a₁₁[n] + Σ_k [(e[k]a₀₁[k] + v[k])/û[k]] h[k − 2n].   (6)
Denote C[k] = (e[k]a₀₁[k] + v[k])/û[k], Z[k] = e[k]a₀₁[k] + v[k], and D[k] = 1/|û[k]|. Because of the circular symmetry of the random variables e[k], v[k] and û[k], C[k] can be expressed as C[k] = Z[k]D[k] without changing the distribution property of C[k]. As found in [7], the probability density functions (pdfs) of Z[k] and D[k] are

p_Z(z) = (1/(2πσ²)) exp(−|z|²/(2σ²))  and  p_D(x) = (1/(σ_û² x³)) exp(−1/(2σ_û² x²)),

respectively. Since Z[k] and D[k] are independent, we can obtain the pdf of C[k] as

p_C(y) = ∫₀^∞ (1/x²) p_Z(y/x) p_D(x) dx = ∫₀^∞ (1/(2πσ²σ_û² x⁵)) exp(−|y|²/(2σ²x²)) exp(−1/(2σ_û²x²)) dx,   (7)

where the substitution σ² = σ_v² + |a₀₁[k]|²σ_e² has been used. From Eq. (6), we find that it is the real part of C[k] that is of interest. Therefore, |y|² in (7) can be replaced by y², and p_C(y) can then be simplified as

p_C(y) = σ²σ_û / (2(σ_û²y² + σ²)^{3/2}).   (8)
Recall that the interference to ã₁₁[n] due to flat fading and additive Gaussian noise is I₁[n] = Σ_k C[k]h[k − 2n], where C[k] is a random variable with the pdf of Eq. (8). Although C[k] is related to a₀₁[k] through σ² = σ_v² + |a₀₁[k]|²σ_e², a₀₁[k] itself is a random variable. We may assume C[k] is a stationary process; by this assumption, I₁[n] is also stationary. Furthermore, the C[k] at different instants are independent. Therefore, the interference to each bit of ã₁₁[n] has the pdf

p_{I₁}(y) = (1/|h[0]|) p_C(y/h[0]) ∗ (1/|h[1]|) p_C(y/h[1]) ∗ ⋯,   (9)

where ∗ means convolution. If the orthogonal wavelet is used, i.e., g[n] = (−1)ⁿ h[1 − n], the interference to ã₁₂[n] due to fading and Gaussian noise has the same pdf as Eq. (9). A general closed-form expression for the pdf of I₁[n] cannot be obtained, since it depends on h[n]. The evaluation of the probability of error

P_e = Pr(ã₁₁[n] > 0 | a₁₁[n] = −1) = Pr(I₁[n] > 1) = ∫₁^∞ p_{I₁}(y) dy,   (10)
however, can be carried out numerically. Following the same procedure as with WPDM, we can derive the probability of error for an FDM system in flat fading channels. To compare the performance of the WPDM and FDM systems, we present an example in which only two users are present in either system. The wavelet packet used in WPDM is generated from the Daubechies scaling function of order 14. Let us assume f_D T = 0.05, where 1/T is the transmission rate of one user. For WPDM, T0 = T/2; therefore, f_D T0 = 0.025. Fig. 3 shows the theoretical and simulation results of BER vs. the pilot insertion rate M with SNR (signal to additive white Gaussian noise ratio) 20 dB. The theoretical results closely agree with the simulation results in both FDM and WPDM systems. It is also clear that for the same BER the pilot spacing for WPDM can be larger than the spacing for FDM. In other words, WPDM needs less frequent pilot symbols; therefore WPDM wastes less energy in pilot symbols and gains capacity when compared with the FDM scheme. In the given example, we assume M = 8 in FDM and M = 16 in WPDM for reliable communications. Then it is easily calculated that 12.5% of the capacity is used by pilot symbols in FDM, while only 6.25% is used in WPDM. With more users in the WPDM and FDM systems, f_D T0 is much smaller than f_D T; thus, a smaller percentage of capacity is spent on pilot symbols. However, T0 has to be sufficiently large to maintain the validity of the flat fading assumption, so f_D T0 cannot be decreased indefinitely.
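The numerical evaluation of Eqs. (9)-(10) can be organized as repeated convolution of scaled copies of the per-sample pdf on a grid, followed by a tail integration. The heavy-tailed `p_c` below and the helper names are illustrative assumptions standing in for Eq. (8); the single-tap case has a closed-form tail probability that checks the discretization:

```python
import numpy as np

def p_c(y, lam):
    """Heavy-tailed per-sample interference pdf (a Student-t-like form,
    assumed here as a stand-in for the paper's Eq. (8))."""
    return lam**2 / (2.0 * (y**2 + lam**2)**1.5)

def error_prob(h, lam, y_max=200.0, n=400001):
    """Build the pdf of I1[n] = sum_k C[k] h[k-2n] numerically as the
    convolution of scaled copies of p_c (Eq. (9)) and integrate the
    tail beyond 1 (Eq. (10))."""
    y = np.linspace(-y_max, y_max, n)
    dy = y[1] - y[0]
    pdf = None
    for hk in h:
        if abs(hk) < 1e-12:
            continue
        comp = p_c(y / abs(hk), lam) / abs(hk)   # pdf of C[k] * h[k]
        pdf = comp if pdf is None else np.convolve(pdf, comp, 'same') * dy
    return np.sum(pdf[y > 1.0]) * dy

# sanity check against the single-tap closed form
lam = 1.0
pe = error_prob([1.0], lam)
exact = 0.5 * (1.0 - 1.0 / np.sqrt(1.0 + lam**2))
print(abs(pe - exact) < 1e-3)  # True
```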
Figure 3: Probability of error vs. different pilot spacing M for WPDM and FDM. Eb/N0 = 20 dB.
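The capacity figures quoted in the example above follow directly from the insertion rate: with one pilot in every frame of M symbols, a fraction 1/M of the transmitted symbols is overhead.

```python
# One pilot symbol per frame of M symbols => a fraction 1/M of capacity
# is spent on pilots (12.5% for FDM with M = 8, 6.25% for WPDM with M = 16).
def pilot_overhead(M):
    return 1.0 / M

print(pilot_overhead(8), pilot_overhead(16))  # 0.125 0.0625
```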
4. Conclusion
The system model of WPDM was reviewed and the main analysis results under timing error were summarized. The ISI and cross-talk due to timing error were modelled as signals passing through interference filters which are functions of both the timing error and the wavelets in use. Based on the interference models, optimum wavelets minimizing the interference have been designed and the probability of error formulae derived. Both the theoretical derivation and simulation results showed that the optimum wavelets have a lower probability of error than Daubechies wavelets. To combat the irreducible error due to fast flat fading, PSAM was applied. The performance of a WPDM system in flat fading channels with PSAM was derived. For comparison, the performance of an FDM system in the same channel environment was also evaluated. The calculations and corresponding simulation results illustrated that the pilot spacing for WPDM can be larger than the spacing for FDM. In other words, WPDM needs less frequent pilot symbols and therefore gains capacity in flat fading channels compared to the FDM scheme.
References
[1] J. Wu, Q. Jin, and K. M. Wong, "Multiplexing based on wavelet packets," in Wavelet Applications II (H. H. Szu, ed.), vol. 2491 of Proceedings of the SPIE, pp. 315-326, Apr. 1995.
[2] R. E. Learned, H. Krim, B. Claus, A. S. Willsky, and W. C. Karl, "Wavelet-packet-based multiple access communication," in Wavelet Applications in Signal and Image Processing II (A. F. Laine and M. A. Unser, eds.), vol. 2303 of Proceedings of the SPIE, pp. 246-259, Oct. 1994.
[3] P. P. Ghandi, S. S. Rao, and R. S. Pappu, "On waveform coding using wavelets," in Proceedings of the 27th Asilomar Conference on Signals, Systems and Computers, (Pacific Grove, CA), pp. 901-905, 1993.
[4] R. R. Coifman and M. V. Wickerhauser, "Entropy-based algorithms for best basis selection," IEEE Transactions on Information Theory, vol. 38, pp. 713-718, Feb. 1992.
[5] J. Wu, K. M. Wong, Q. Jin, and T. N. Davidson, "Wavelet packet division multiplexing and wavelet packet design under timing error effects," submitted to IEEE Transactions on Signal Processing.
[6] J. K. Cavers, "An analysis of pilot symbol assisted modulation for Rayleigh fading channels," IEEE Transactions on Vehicular Technology, vol. VT-40, pp. 686-693, Nov. 1991.
[7] M. G. Shayesteh and A. Aghamohammadi, "On the error probability of linearly modulated signals on frequency-flat Ricean, Rayleigh, and AWGN channels," IEEE Transactions on Communications, vol. COM-43, pp. 1454-1466, Feb./Mar./Apr. 1995.
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
Time-Varying Wavelet Packet Division Multiplexing
T. N. Davidson and K. M. Wong, Communications Research Laboratory, McMaster University, Hamilton, Ontario, L8S 4K1, Canada. [email protected]
Abstract
Wavelet Packet Division Multiplexing (WPDM) is an emerging multiplexing scheme in which the properties of wavelet packet basis functions and their close relationships with perfect reconstruction filterbanks are exploited to provide higher capacity, flexibility and robustness to several adverse channel environments. Recently, a parameter 'hopping' scheme was incorporated into the WPDM scheme, offering potential performance and security improvements analogous to those of frequency hopped communication schemes. Whilst that Wavelet Packet Hopping (WPH) scheme provided a general hopping framework, it required intricate implementation. In this paper, we show that by giving up a little of the generality of the WPH scheme we can avoid these technical difficulties, whilst retaining the fundamental benefits of wavelet packet hopping.
1 Introduction
Wavelet Packet Division Multiplexing (WPDM) [1, 2] is an emerging multiplexing scheme in which the self- and mutual-orthogonality properties of wavelet packet basis functions [3] are exploited for multiplexing purposes. In contrast to the conventional Time Division Multiplexing (TDM) and Frequency Division Multiplexing (FDM) schemes, the waveforms used to represent the data symbols of each user overlap in both time and frequency. However, they are intrinsically orthogonal (they form a wavelet packet), so the symbols can be recovered using a simple correlator receiver. The fact that the waveforms overlap in time and frequency provides an increase in capacity over TDM and FDM, and substantial robustness to adverse channel environments [4, 5], whilst their close relationships with multi-rate filter banks provide particularly simple transmitter and receiver structures. In frequency-hopped communication schemes, the carrier frequency is 'hopped' between several frequencies in a pattern which is known by the receiver. In addition to offering the potential for improved (average) performance over that of the underlying time-invariant scheme, a frequency-hopped scheme also offers greater security. (An interceptor requires knowledge of the hopping pattern in order to decode the message signal.) In a previous paper, we examined a framework for 'hopping' the parameters of a WPDM scheme without any reduction in the data rate [6]. The resulting Wavelet Packet Hopping (WPH) scheme offered the potential for analogous (average) performance and security improvements over the WPDM scheme. Whilst the WPH framework provides general hopping patterns, it may require buffering at the receiver, with a consequent delay, increased computational resources in the so-called 'transition zones,' and intricate pulse shaping.
In this paper, we show that by giving up a little of the generality of the WPH scheme (we only allow 'branch hopping') we can avoid these technical difficulties, whilst retaining the fundamental benefits of wavelet packet hopping.
2 Wavelet Packet Division Multiplexing
In this section, we briefly review WPDM,¹ emphasising a notation that leads to a simple extension to the time-varying case. Given the impulse response, g0[n], of a unit-energy FIR filter of length L = 2K which is orthogonal to its even translates and satisfies some additional mild technical conditions [7], we can use one of a number of algorithms [7, 8] to find a (finite-duration) scaling function

φ(t) = Σ_k g0[k] φ(2t − kT0),

which is self-orthogonal at integer multiples of T0. Furthermore, we can form a (conjugate) quadrature mirror filter g1[n] = (−1)ⁿ g0[L − 1 − n], and subsequently define a family of functions φ_ℓm(t), ℓ ≥ 0, 1 ≤ m ≤ 2^ℓ, in

¹Connections with related communication schemes are mentioned elsewhere [2, 6].
the following tree-structured manner [3]:

φ_{ℓ+1,2m−1}(t) = Σ_k g0[k] φ_{ℓm}(t − kT_ℓ),    φ_{ℓ+1,2m}(t) = Σ_k g1[k] φ_{ℓm}(t − kT_ℓ),

where T_ℓ = 2^ℓ T0. For any given tree structure, the (finite-duration) functions at the 'leaves' or terminals of the tree form a wavelet packet. They satisfy

⟨ φ_{ℓm}(t − nT_ℓ), φ_{λμ}(t − kT_λ) ⟩ = δ[ℓ − λ] δ[m − μ] δ[n − k],
where ⟨·,·⟩ denotes the L2 inner product, and hence are a natural choice for multiplexing applications. WPDM is based on the use of such functions for waveform coding of the data streams (which may have different data rates) and the exploitation of close relationships with perfect reconstruction filter banks to obtain particularly simple realizations of the transmitter and the receiver. In the simple case where there are only two users, using φ11(t − nT1) and φ12(t − nT1) as coding waveforms for the data streams a11[n] and a12[n], respectively, the transmitted signal can be written as

s_c(t) = σ11ᵀ φ11(t) + σ12ᵀ φ12(t) = σ01ᵀ φ01(t),  (1)
where the nth elements of the (infinite-dimensional) vectors σ_ℓm and φ_ℓm(t) are a_ℓm[n] and φ_ℓm(t − nT_ℓ), respectively. That means that we can replace the two waveform coders employing φ11(t) and φ12(t) by a multirate filter bank and a single waveform coder employing φ(t) and operating at twice the rate. The relationships between the components in Equation 1 can be written as

σ01 = G P [σ11ᵀ, σ12ᵀ]ᵀ   and   [φ11ᵀ(t), φ12ᵀ(t)]ᵀ = P Gᵀ φ01(t),

respectively, where G is a doubly infinite block Toeplitz matrix with a transpose of the form [8]

       ⎡  ⋱                                 ⎤
Gᵀ  =  ⎢  ⋯  S_{K−1}  ⋯  S_1  S_0   0   ⋯  ⎥
       ⎢  ⋯   0   S_{K−1}  ⋯  S_1  S_0  ⋯  ⎥
       ⎣                 ⋱              ⋱  ⎦

with filters of length L = 2K,

S_i = ⎡ g0[2i]     g1[2i]   ⎤
      ⎣ g0[2i+1]   g1[2i+1] ⎦ ,

and P = Pᵀ = P⁻¹ is an interlacing operator. The receiver performs the correlations ν01[n] = ⟨r_c(t), φ(t − nT0)⟩, where r_c(t) is the received signal, which we stack in the vector ν01. The data is then recovered using

[σ̂11ᵀ, σ̂12ᵀ]ᵀ = P Gᵀ ν01.
Since GᵀG = I, if ν01 = σ01 then σ̂1m = σ1m, m = 1, 2, and the data is exactly recovered. The key step in the current extension to time-varying WPDM is to augment the unitary operator G by a simple time-varying switching structure, as we will now show.

3 Time-Varying Wavelet Packet Division Multiplexing
In the previous section, we highlighted the central role of a tree-structured orthonormal filter bank (transmultiplexer) in the WPDM scheme. Once the filter g0[n] and the tree-structure are chosen, all the coding waveforms elm(t) are defined. Therefore, we can 'hop' the parameters of the WPDM scheme, without compromising the data rate, if we can 'hop' the parameters of the filter bank in an appropriate way. In previous work [6] we provided a general framework to achieve 'hopping' of both the tree structure and the filter coefficients, based on a 'transition filter' approach to time-varying filter banks [9, 10]. In this paper, we focus on hopping just the branch structure of the tree, and hence we will refer to the current scheme as Branch-Hopped WPDM (BHWPDM). This provides a framework which retains the fundamental features of the general WPH scheme, but is far simpler to implement. As a simple motivating example, consider the three-user system in Figure 1, in which one user has twice the data rate of the other two. The dashed boxes represent 'switching' units which provide
Figure 1: A three-user BH-WPDM transmitter system in which user 3 has twice the data rate of users 1 and 2.
either a 'parallel' or 'cross' connection at each instant. The Fourier transforms of the waveforms assigned to each user for certain switch settings are given in Figure 2. If we flip the switches in a pattern which is known to the receiver, we obtain a more secure communication scheme. We may also be able to obtain an average performance improvement over any one of the underlying time-invariant systems. Moreover, since the switching units are memoryless, the switching is achieved without compromising the data rate.

Figure 2: The Fourier transforms of the equivalent waveforms in Figure 1 for Daubechies length-14 filters.

We can construct a model for the system in Figure 1 using a simple augmentation of the notation developed in Section 2. Let ũ_i be the data vector from user i, i = 1, 2, 3, and let φ̃_i(t) denote the vector of waveforms onto which the data from the ith user is coded. Then the transmitted signal can be written as
s_c(t) = Σ_{i=1}^{3} ũ_iᵀ φ̃_i(t) = σ01ᵀ φ01(t),

where

[φ̃1ᵀ(t), φ̃2ᵀ(t)]ᵀ = P A11 Gᵀ [I  0] P A01 Gᵀ φ01(t)   and   φ̃3(t) = [0  I] P A01 Gᵀ φ01(t).  (2)
Here G and P are the operators defined in Section 2, and A_ℓm is a (doubly infinite) block diagonal matrix with 2 × 2 blocks. Each block represents the state of the switch at the (ℓ, m)th node at a given instant, with

⎡ 1  0 ⎤                                                ⎡ 0  1 ⎤
⎣ 0  1 ⎦  representing a 'parallel' connection, and     ⎣ 1  0 ⎦  a 'cross' connection.

The receiver performs the correlations ν01[n] = ⟨r_c(t), φ(t − nT0)⟩, where r_c(t) is the received signal, which we stack in the vector ν01. The data is then recovered using

[ũ̂1ᵀ, ũ̂2ᵀ, ũ̂3ᵀ]ᵀ = ⎡ P A11 Gᵀ   0 ⎤ P A01 Gᵀ ν01.  (3)
                      ⎣     0      I ⎦
Since A_ℓmᵀ Gᵀ G A_ℓm = I, if ν01 = σ01 then ũ̂_i = ũ_i, i = 1, 2, 3, and the data is exactly recovered.
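A minimal numerical sketch of the switching idea (the pattern and names below are illustrative assumptions): each 2 × 2 block is its own inverse, so re-applying the same known pattern at the receiver undoes the hop, while a wrong pattern does not:

```python
import numpy as np

PARALLEL = np.eye(2)
CROSS = np.array([[0.0, 1.0], [1.0, 0.0]])

def switch(pairs, pattern):
    """Apply the 2x2 switching unit blockwise: one 'parallel' or
    'cross' connection per instant, as in the blocks of A_lm."""
    return np.array([(PARALLEL if p == 'p' else CROSS) @ x
                     for x, p in zip(pairs, pattern)])

pairs = np.array([[1.0, -1.0], [1.0, 1.0], [-1.0, 1.0],
                  [1.0, -1.0], [-1.0, -1.0], [1.0, -1.0]])
pattern = ['p', 'c', 'c', 'p', 'c', 'p']

hopped = switch(pairs, pattern)
# each block satisfies A^T A = I, so re-applying the known pattern
# at the receiver recovers the data exactly
print(np.allclose(switch(hopped, pattern), pairs))    # True
# guessing all-parallel (the wrong pattern) fails to recover it
print(np.allclose(switch(hopped, ['p'] * 6), pairs))  # False
```

Because the units are memoryless, no buffering or extra computation is introduced, which is the point of restricting WPH to branch hops.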
These techniques can be extended to deeper trees by simply cascading the operators in a manner which matches the shape of the tree (as we have shown above in a simple case). Thus, the discrete-time component of a branch hopping scheme can be implemented simply by adding a time-varying 2 × 2 switching unit to the upsamplers and filters attached to each node of the tree, as shown in Figure 1.² Note that in contrast to the general WPH framework developed previously [6], there are no buffering requirements, and there is no increase in the computational load of the scheme. The implementation of the continuous-time component of the BH-WPDM scheme is also simpler than in the general WPH case. In fact, the pulse shaping requirements are exactly the same as those of the underlying time-invariant scheme, since the elements of φ01(t) are simply translations of φ(t) by integer multiples of T0. These shapes can be computed 'off-line' and implemented using standard pulse shaping techniques, such as tapped-delay line filtering. The equivalent waveforms at other nodes can be simply calculated from φ01(t) using an equation of the form in Equation 2. In contrast, the general WPH scheme requires intricate design procedures to ensure desirable properties of the waveforms, and time-varying pulse-shaping techniques to implement them.
4 Conclusion
Incorporating parameter 'hopping' into a Wavelet Packet Division Multiplexing (WPDM) scheme, without compromising the data rate, offers greater security and the potential for improved (average) performance over the underlying time-invariant scheme, in a way analogous to that of frequency-hopped communication schemes over conventional frequency division multiplexing. However, in its general form, Wavelet Packet Hopping (WPH) is fraught with implementation difficulties [6]. In this paper, we have proposed a simple branch-hopped WPDM (BH-WPDM) scheme in which some of the generality of WPH is given up (we only allow 'branch' hops) in order to retain some of the desirable implementation attributes of the underlying WPDM schemes, such as computational efficiency, no buffering requirements and straightforward waveform construction. The quantification of the potential performance improvements of both the WPH and the BH-WPDM schemes is currently under investigation and will be reported in due course.
References
[1] J. Wu, Q. Jin, and K. M. Wong, "Multiplexing based on wavelet packets", in Szu [13], pp. 315-326.
[2] J. Wu, K. M. Wong, Q. Jin, and T. N. Davidson, "Wavelet packet division multiplexing and wavelet packet design under timing error effects", Preprint, Communications Research Laboratory, McMaster University, Hamilton, Ontario, Canada, June 1996.
[3] R. R. Coifman and M. V. Wickerhauser, "Entropy-based algorithms for best basis selection", IEEE Transactions on Information Theory, vol. 38, no. 2-Part II, pp. 713-718, Feb. 1992.
[4] J. Wu, K. M. Wong, Q. Jin, and T. N. Davidson, "Performance of wavelet packet division multiplexing in impulsive and Gaussian noise channels", in Unser et al. [14].
[5] J. Wu, K. M. Wong, and Q. Jin, "Wavelet packet division multiplexing", in Proceedings of the 3rd International Workshop in Signal and Image Processing, Manchester, England, Nov. 1996.
[6] T. N. Davidson and K. M. Wong, "Wavelet packet hopping", in Unser et al. [14].
[7] I. Daubechies, Ten Lectures on Wavelets, SIAM, Philadelphia, 1992.
[8] M. Vetterli and C. Herley, "Wavelets and filter banks: Theory and design", IEEE Transactions on Signal Processing, vol. 40, no. 9, pp. 2207-2232, Sept. 1992.
[9] C. Herley, J. Kovačević, K. Ramchandran, and M. Vetterli, "Tilings of the time-frequency plane: Construction of arbitrary orthogonal bases and fast tiling algorithms", IEEE Transactions on Signal Processing, vol. 41, no. 12, pp. 3341-3359, Dec. 1993.
[10] C. Herley and M. Vetterli, "Orthogonal time-varying filter banks and wavelet packets", IEEE Transactions on Signal Processing, vol. 42, no. 10, pp. 2650-2663, Oct. 1994.
[11] J. Mau, J. Valot, and D. Minaud, "Time-varying orthogonal filter banks without transition filters", in Proceedings of the 1995 International Conference on Acoustics, Speech, and Signal Processing, Detroit, MI, May 1995, vol. 2, pp. 1328-1331.
[12] R. A. Gopinath and C. S. Burrus, "Factorization approach to unitary time-varying filter bank trees and wavelets", IEEE Transactions on Signal Processing, vol. 43, no. 3, pp. 666-680, Mar. 1995.
[13] H. H. Szu, Ed., Wavelet Applications II, vol. 2491 of Proceedings of the SPIE, Apr. 1995.
[14] M. A. Unser, A. Aldroubi, and A. F. Laine, Eds., Wavelet Applications in Signal and Image Processing IV, vol. 2825 of Proceedings of the SPIE, 1996.

²Some closely related filter bank structures have been developed for signal analysis applications [11, 12], but the simplicity of the switching approach is particularly appealing in multiplexing applications.
CO-CHANNEL INTERFERENCE MITIGATION IN THE TIME-SCALE DOMAIN: THE CIMTS ALGORITHM

Sam Heidari* and Chrysostomos L. Nikias†
*Torrey Science Corporation, 10065 Barnes Canyon Road, San Diego, CA 92121. sheidari
†Signal and Image Processing Institute, University of Southern California, 3740 McClintock Avenue, EEB 400B, Los Angeles, CA 90089-2564
Abstract
In many communication systems, the problem of co-channel interference is encountered when, along with the signal of interest (SOI), one or more interfering signals are present in a common receiver. The SOI and the interference(s), which are correlated, possess similar characteristics and power, and share the same region of support in both the time and frequency domains. In this paper, we present the Co-channel Interference Mitigation in the Time-Scale domain (CIMTS) algorithm, which estimates the signal of interest (SOI) and the interfering signal from their superposition in the presence of additive noise. This method is inspired by the reconstruction of the interference from the null space of the SOI in the time-scale domain. Once the null space of the SOI is determined, the interfering signal is reconstructed via a set of linear operations. The SOI is then estimated by a simple subtraction of the estimated interference from the observations.
1 Introduction
The subject of co-channel interference mitigation has received much attention for many years [1, 2]. In such systems as mobile communications, radio networks and radar, the problem of co-channel interference is encountered when, along with the signal of interest (SOI), one or more interfering signals are present in a common receiver. Applications may include cellular technology and wireless multi-media. A particular application of interest is military communication systems operating in an intentionally hostile interference environment. It is known that co-channel interference often degrades the SOI more severely than additive noise or intersymbol interference. The performance of interference reduction systems which treat the interference as white additive noise will degrade significantly in the presence of co-channel interference. The conventional approach, where each signal (the SOI and the interfering signals) is demodulated as if it were the only one present, is not optimum in terms of error probability. This method is not suitable when the SOI and the interference are wideband. The maximum-likelihood and maximum a posteriori symbol detection methods [1] achieve a higher performance; however, these methods are computationally very expensive. The utilization of digital filter banks is not new in communications and digital signal processing applications [3, 4]. The new algorithm presented in this paper utilizes the Wavelet Transform (WT) [5] to estimate the SOI and the interfering signal from their superposition in the presence of additive noise. The WT has several properties which make it attractive for this particular problem. Two of the most important are linearity and low complexity (for the discrete wavelet transform). In this paper, we present and analyze the Co-channel Interference Mitigation in the Time-Scale domain (CIMTS) Algorithm [2] for MPSK signals. This method is inspired by the reconstruction of the interference from the null space of the SOI in the time-scale domain. The CIMTS algorithm enjoys the simplicity of the matched filter; however, it demonstrates a higher
performance. Furthermore, unlike the matched filter, the new method is near-far resistant, i.e. it is blind to dissimilarity of the signal energies. Section 2 is primarily concerned with the problem definition. In Section 3, the CIMTS algorithm is presented and analyzed. The algorithmic assessment is presented in Section 4.
2 Problem Definition
Usually, we are only interested in recovering the SOI; however, in certain applications, such as multi-access communication systems [1], it may be required to recover all the signals. The objective of this research is to recover the SOI and the interfering signal(s) from the observation of their superposition, which is embedded in additive noise. In this paper, it is assumed that only one interfering signal is present; however, the results may easily be extended to the case of many interfering signals. The problem is formulated as

r(t) = S(t) + I(t) + w(t)  (1)
where r(t) is the observation, S(t) is the SOI, I(t) is the interference, and w(t) is the white Gaussian noise. The SOI and the interference are both MPSK signals and given respectively by,
N-1 S(t) -- E Al,k exp(jOk )X[kTl,(k+l)Tx](t -+""rl) k=O
N~_I
I(t) - E
k=0
A2,k exp(--j(27ra ft +
Ck))XtkT~,(k+~)T~J(t+ r2)
(2)
(3)
where the signal S(t) is baseband and the signal I(t) has a very small modulation frequency δf, with 1/δf ≫ T2. The received energies of the SOI and the interfering signal for the kth time slot are A_{1,k} and A_{2,k}, respectively. It is assumed that the symbol durations of the signals S(t) and I(t) are given and are equal to T1 and T2, respectively. Furthermore, it is assumed that T2 = T1 + δT, where δT is small, i.e., 0 < |δT| ≪ T1. There are no other restrictions on T1 and T2; however, as shown below, if T1/T2 = N/M where N and M are coprime numbers, then the complexity will reduce significantly. In the following derivations, it is assumed that τ1 = τ2 = 0; however, the results can easily be extended to the cases where τ1 and τ2 are not equal to zero. The algorithm will fail when any individual member of the function set {χ_{[kT2,(k+1)T2]}(t + τ2)}, k = 0, ..., N − 1, is linearly dependent on the function set {χ_{[lT1,(l+1)T1]}(t + τ1)}, l = 0, ..., N − 1, since the interfering signal would then be zero in the null space of the SOI.
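A sampled version of the model in Eqs. (1)-(3) can be generated directly; the sampling rate, the QPSK phase alphabet, and the zero frequency offset used below are illustrative assumptions:

```python
import numpy as np

fs = 1000.0                       # samples per unit time (illustrative)
T1, dT = 1.0, 0.004
T2 = T1 + dT                      # slightly longer interferer symbols
N = 8
t = np.arange(0.0, N * T1, 1.0 / fs)

rng = np.random.default_rng(6)
theta = rng.integers(0, 4, N) * np.pi / 2        # SOI QPSK phases
phi = rng.integers(0, 4, N + 1) * np.pi / 2      # interferer phases

# piecewise-constant MPSK waveforms built from the indicator functions
S = np.exp(1j * theta[np.minimum((t / T1).astype(int), N - 1)])
I = 0.8 * np.exp(-1j * phi[np.minimum((t / T2).astype(int), N)])  # df taken as 0
w = 0.05 * (rng.standard_normal(t.size) + 1j * rng.standard_normal(t.size))
r = S + I + w                                    # Eq. (1): the observation

print(r.shape == t.shape)
```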
3 The CIMTS Algorithm
The idea of the algorithm lies in the reconstruction of the signal I(t) from the null space of S(t) in the time-scale domain. The following theorem can be utilized to identify the null space of a baseband MPSK signal in the time-scale domain. This is a fundamental theorem as it serves as the basis of the new CIMTS algorithm.

Fundamental Theorem: If ψ(t) is an arbitrary bounded-support wavelet function (i.e. for some c, ψ(t) = 0 if t ∉ [0, c]), then the Bounded Support Discrete Wavelet Transform (BSDWT) of a baseband MPSK signal will be zero at scale b = T1/cL,

WT_{S(t)}(cn, T1/cL) = ∫ S(t) ψ( (cL/T1) t − cn ) dt = 0  (4)

where S(t) is a baseband MPSK signal with symbol duration T1, and n and L are any integers. As a result of the Fundamental Theorem, we can establish the following fundamental proposition, which relates the BSDWT of the observed signal r(t), in the null space of the SOI S(t), to the interfering signal I(t), sampled at every T2 duration.
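The theorem is easy to verify numerically for the Haar wavelet (support [0, 1), so c = 1): each scaled translate ψ((L/T1)t − n) lies inside a single symbol interval, where the MPSK signal is constant, and the zero-mean wavelet integrates it away. The grid sizes below are illustrative assumptions:

```python
import numpy as np

def haar(x):
    """Haar mother wavelet, supported on [0, 1) (so c = 1)."""
    return np.where((x >= 0) & (x < 0.5), 1.0,
                    np.where((x >= 0.5) & (x < 1.0), -1.0, 0.0))

T1, L, N = 1.0, 4, 8
rng = np.random.default_rng(3)
theta = rng.integers(0, 4, N) * np.pi / 2            # QPSK symbol phases
t = np.linspace(0.0, N * T1, 80001)
dt = t[1] - t[0]
S = np.exp(1j * theta[np.minimum((t / T1).astype(int), N - 1)])

# Eq. (4): W_S(cn, T1/cL) = int S(t) psi((L/T1) t - n) dt, which vanishes
# because each wavelet sits inside one symbol where S(t) is constant
coeffs = np.array([np.sum(S * haar(L / T1 * t - n)) * dt
                   for n in range(N * L)])
print(np.max(np.abs(coeffs)) < 1e-3)  # True: the SOI is nulled at this scale
```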
Fundamental Proposition: There exists a linear relation between the sampled interfering signal and the BSDWT of the observation at scale b = T1/cL,

WT_{r(t)}(T1/cL) = Ā I + n,  (5)

where T1 is the symbol duration of the SOI, L is an arbitrary integer, I is the vector of the sampled interference signal,

I = [ I(0.5T2)  I(1.5T2)  ...  I((N′ − 0.5)T2) ]ᵀ  (6)

and

WT_{r(t)}(T1/cL) = [ WT_{r(t)}(0, T1/cL)  WT_{r(t)}(c, T1/cL)  ...  WT_{r(t)}(c(N − 1), T1/cL) ]ᵀ.  (7)
Since the frequency offset δf is unknown, the matrix Ā is only an approximation. However, if the modulation frequency of the interfering signal is zero, then the results in the fundamental proposition are exact. For the CIMTS algorithm, the element of the matrix Ā at the kth row and nth column is expressed as

ā_{k,n} = ∫ χ_{[kT2,(k+1)T2]}(t) ψ( (cL/T1) t − cn ) dt.  (8)
Using Eq. (5), the interfering signal is reconstructed via the Singular Value Decomposition (SVD),

Î = Ā† WT_{r(t)}(T1/cL)  (9)

where the matrix Ā† is the SVD inverse of the matrix Ā. The SVD method is a simple, numerically stable way of finding a generalized solution. In summary, the CIMTS algorithm can be decomposed into two steps. Step one is the projection of the received signal onto the null space of the SOI in the time-scale domain (separation of the interfering signal from the SOI). Step two is the reconstruction of the interfering signal from the null space of the SOI (recovering the interfering signal in the presence of ISI). If the processing is done for fixed block-size data, T1 and T2 are time-invariant, and T1/T2 = N/M where N and M are co-prime numbers, then the matrix Ā and its pseudoinverse are calculated only once. Hence, by combining these two steps, the complexity of the algorithm will reduce significantly. Denote the pseudoinverse of the matrix Ā by Λ = Ā†,

Λ = ⎡ λ_{0,0}     λ_{0,1}     ⋯   λ_{0,N−1}   ⎤
    ⎢ λ_{1,0}     λ_{1,1}     ⋯   λ_{1,N−1}   ⎥
    ⎢    ⋮           ⋮               ⋮       ⎥
    ⎣ λ_{N−1,0}   λ_{N−1,1}   ⋯   λ_{N−1,N−1} ⎦ ;  (10)

then, the ith element of the vector Î is given as
Î_i = ∫ r(t) Σ_{n=0}^{N−1} λ_{i−1,n} ψ( (cL/T1) t − cn ) dt = ∫ r(t) h_R^{i−1}(t) dt = ∫ ( I(t) + w(t) ) h_R^{i−1}(t) dt,  (11)

where h_R^{i−1}(t) = Σ_{n=0}^{N−1} λ_{i−1,n} ψ( (cL/T1) t − cn ). Note that h_R^{i−1}(t) for each i is calculated only once. Using the Haar Wavelet Transform and the matrix Ā derived in [2], the receiver function h_R^{i−1}(t) is given as

h_R^{i−1}(t) = ( ψ( (L/T1) t − (i − 1) ) − ψ( (L/T1) t − i ) ) / (8 δT).  (12)
Therefore, the estimate of the ith bit of the interference is given as Î_i = ±A_{2,i} + n. Furthermore, the variance of the random noise, n = ∫ w(t) h_R^{i−1}(t) dt, is given as σ_n² = (N0/2) ∫ ( h_R^{i−1}(t) )² dt, where δT = T2 − T1. Assuming that the interference is a BPSK signal, the probability of error of the ith transmitted bit is given as follows:

P{E_i} = Q( A_{2,i} / σ_n )  (13)

where Q(a) = (1/√(2π)) ∫_a^∞ exp(−v²/2) dv.
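The reconstruction step of Eq. (9) can be sketched in a few lines; the matrix below is a generic full-rank stand-in for Ā (an assumption), and the sizes and noise level are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 16
# stand-in for the BSDWT matrix of Eq. (8): any well-conditioned
# square matrix illustrates the pseudoinverse step
A = np.eye(N) + 0.1 * rng.standard_normal((N, N))

I_true = rng.choice([-1.0, 1.0], N)              # sampled BPSK interference
w = A @ I_true + 0.01 * rng.standard_normal(N)   # Eq. (5): WT of the observation

# Eq. (9): reconstruct via the SVD pseudoinverse, a numerically stable
# generalized solution even when A is ill-conditioned or rectangular
I_hat = np.linalg.pinv(A) @ w
print(np.allclose(np.sign(I_hat), I_true))  # True: all interference bits recovered
```

When the block structure is fixed, `np.linalg.pinv(A)` is computed once and reused, which is exactly the complexity saving noted above.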
Figure 1: The BEP as a function of SIR.
4 The Algorithmic Assessment
In all the reported cases, the Bit Error Probability (BEP) was calculated via 500 Monte-Carlo runs, processing the data in blocks,

BEP = (Number of bit errors) / (Number of bits).  (14)

It is assumed that the means of the signals are known. Fig. 1 shows the error probability distribution for each symbol of the processed data block. It is clear that the probability of error is higher at the ends of the block, where the signature waveforms of the interfering signal and the SOI are more correlated. The BEP of the algorithm is examined as a function of the Signal-to-Interference Ratio (SIR), and a comparison is made to the performance of the matched filter. The algorithm is studied using the WT at scale T1. The symbol durations of the SOI and the interference are equal to T1 = 1 and T2 = 1.004, respectively. As shown in Fig. 1, the CIMTS algorithm is near-far resistant and its performance is not a function of the energy ratio of the signals. This could have been intuitively concluded by examining Eq. 5, where the energy of the signal is recovered along with the transmitted data.
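The Monte-Carlo estimate in Eq. (14) is the usual error-counting loop. The sketch below applies it to a plain BPSK matched-filter detector in AWGN (an assumption, chosen because its BEP has the known closed form Q(1/σ) to check against):

```python
import numpy as np
from math import erfc, sqrt

def monte_carlo_bep(detect, snr_db, n_runs=500, block=64, seed=0):
    """Estimate the bit error probability as (number of bit errors) /
    (number of bits) over independent Monte-Carlo blocks (Eq. (14))."""
    rng = np.random.default_rng(seed)
    sigma = 10.0 ** (-snr_db / 20.0)
    errors = bits = 0
    for _ in range(n_runs):
        tx = rng.choice([-1.0, 1.0], block)
        rx = tx + sigma * rng.standard_normal(block)
        errors += np.sum(detect(rx) != tx)
        bits += block
    return errors / bits

# sanity check against the BPSK matched-filter bound Q(1/sigma)
bep = monte_carlo_bep(np.sign, snr_db=6.0)
q = 0.5 * erfc(10.0 ** (6.0 / 20.0) / sqrt(2.0))
print(abs(bep - q) < 5e-3)  # close to theory
```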
References
[1] S. Verdu, "Minimum probability of error for asynchronous Gaussian multiple-access channels," IEEE Trans. Inform. Theory, vol. 32, pp. 85-96, 1986.
[2] S. Heidari and C. L. Nikias, "Co-channel interference mitigation in the time-scale domain: The CIMTS algorithm," to appear in IEEE Trans. on Signal Processing.
[3] D. L. Donoho, "Nonlinear wavelet methods for recovery of signals, densities, and spectra from indirect and noisy data," Proceedings of Symposia in Applied Mathematics, pp. 173-205, 1993.
[4] T. Q. Nguyen, "Partial spectrum reconstruction using digital filter banks," IEEE Trans. on Signal Processing, vol. 41, 1993.
[5] I. Daubechies, Ten Lectures on Wavelets. Philadelphia, Penn: SIAM, 1992.
Design and Performance of DS/SS Signals Defined by Arbitrary Orthonormal Functions
Jeffrey C. Dill, Ph.D., Dept. of Electrical and Computer Engr., Ohio University, Athens, OH 45701. Phone: 614-593-1585, email: [email protected]
William W. Jones, Ph.D., STM Wireless, Inc., One Mauchly, Irvine, CA 92718. Phone: 714-789-2681, [email protected]
Abstract A unified framework is presented for the design and performance of anti-jam DS/SS signals defined by arbitrary orthonormal functions with various wavelet functions being of most interest. In this framework, the receiver is generically viewed as a projection operation followed by a filter weighting strategy. A theoretical analysis is presented comparing the optimum weighting with conventional uniform and excision weighting. Based on these results, we are led to conclude that the DS/SS signal should be designed such that the receiver processing localizes the interference and then excises those dimensions deemed corrupted. Finally, wavelet-based waveforms are examined which localize in time, frequency and both time and frequency. Based on our theoretical conclusions, a waveform characterized by time and frequency dimensionality appears to be an effective solution against disparate interference types. I. Introduction In recent years, a number of new spread spectrum modulation formats have been defined which can be characterized by a set of orthonormal functions upon which the PN sequence modulates. For example, in multicarrier modulation (MCM), the PN sequence modulates a set of complex exponentials with proper frequency separation to insure orthogonality [ 1]. Using M-band wavelets, a waveform similar to MCM is given by
m(t) = \sqrt{\frac{E}{M}} \sum_{j} a_j \left[ c_0\, \phi_0\!\left(\frac{t}{T} - j\right) + \sum_{n=1}^{M-1} c_n\, \psi_n\!\left(\frac{t}{T} - j\right) \right] \qquad (1)
where {a_j} is the data sequence with symbol energy E and duration T, {c_n} is the chip sequence, M is the dimensionality, and φ_0 and {ψ_n} are the M-band scaling and wavelet functions, respectively. This waveform has been designated spread spectrum M-band wavelet modulation (SS-MWM) [2]. In spread spectrum multiscale modulation (SS-MSM) [2], the orthonormal functions are the scaling function and dyadic scalings of the wavelet function, specifically,
m(t) = \sqrt{\frac{E}{M}} \sum_{j} a_j \left[ c_0\, \phi\!\left(\frac{t}{T} - j\right) + \sum_{n=1}^{\log_2 M} 2^{\frac{n-1}{2}} \sum_{k=0}^{2^{n-1}-1} c_{2^{n-1}+k}\, \psi\!\left(2^{n-1}\Big(\frac{t}{T} - j\Big) - k\right) \right] \qquad (2)
where φ and ψ are now the dyadic scaling and wavelet functions, respectively. Indeed, classical DS/SS can be placed into this general framework. Namely, the orthonormal functions are time translations of the ISI-free chip pulse, as seen in the following
m(t) = \sqrt{\frac{E}{M}} \sum_{j} a_j \sum_{k=0}^{M-1} c_k\, \phi\!\left(M\Big(\frac{t}{T} - j\Big) - k\right) \qquad (3)
Traditionally φ is an NRZ chip pulse, but more generally, it can be any scaling function. Each of these signals can be viewed as having the following generic definition
m(t) = \sqrt{\frac{E}{M}} \sum_{j} a_j \sum_{k=0}^{M-1} c_k\, \beta_k\!\left(\frac{t}{T} - j\right) \qquad (4)
where β_k identifies the k-th orthonormal function defining the waveform. Although (4) is a convenient analytical representation, the physical manner in which dimensionality, and hence processing gain, is achieved is quite different between the waveforms. As a consequence, their performance against different interference models is quite different.
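As a concrete instance of the generic definition (4), the following sketch builds a classical DS/SS symbol from M time-shifted rectangular chip pulses and recovers it by projection. The sample counts, the PN sequence, and the omission of the sqrt(E/M) scaling are illustrative simplifications, not taken from the paper.

```python
import numpy as np

# Classical DS/SS as an instance of (4): the orthonormal functions beta_k
# are M time-shifted, unit-energy rectangular (NRZ) chip pulses.
# M, the samples-per-chip count and the PN sequence are illustrative.
M = 8                 # waveform dimensionality (chips per symbol)
ns = 16               # samples per chip
rng = np.random.default_rng(0)
chips = rng.choice([-1.0, 1.0], size=M)   # PN chip sequence {c_k}
a = 1.0                                   # one data symbol a_j

# Build the orthonormal basis beta_k and check <beta_j, beta_k> = delta_jk.
basis = np.zeros((M, M * ns))
for k in range(M):
    basis[k, k * ns:(k + 1) * ns] = 1.0 / np.sqrt(ns)
gram = basis @ basis.T
assert np.allclose(gram, np.eye(M))

# Transmit m = a * sum_k c_k beta_k, then project back onto the basis
# (the "sampled matched filtering" of Section II) and remove the PN.
m = a * chips @ basis
proj = basis @ m                          # projection coefficients
soft = np.dot(chips, proj)                # uniform weighting, w = U
print(soft)                               # a * M = 8.0
```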
The purpose of the present paper, then, is to make a general examination of DS/SS signals defined on arbitrary orthonormal functions as given in (4). This unified viewpoint will allow us to draw general conclusions concerning the design and performance of DS/SS signals under different interference conditions. Once this is accomplished, a more detailed examination of each of the waveforms defined in (1)-(3) will be provided.
II. Receiver Processing
We begin by generically viewing the receiver as a projection operation onto the orthonormal functions defining the particular waveform, and then develop an optimal filter which operates on the coefficients of these projections. The received signal can thus be written as
r(t) = m(t) + j(t) + n(t) \qquad (5)
where m(t) can be any of the waveforms defined in (1)-(3), j(t) is the additive interference and n(t) is the AWGN. The receiver projects the received signal upon the basis functions defining the signal and then multiplies these coefficients with the appropriate chip. This projection operation is equivalent to sampled matched filtering. The projection coefficients associated with the a_j symbol after PN removal can be expressed as the M-dimensional column vector

\mathbf{r} = a_j \sqrt{\frac{E}{M}}\, \mathbf{U} + \mathbf{J} + \mathbf{N} \qquad (6)
where U is the ones vector. Applying a weighting to this vector, the soft decision output is
h_j = \mathbf{w}^T \mathbf{r} \qquad (7)
Note, in the classical DS/SS system employing the waveform in (3), the weighting is uniform, i.e., W = U . Since BER performance is proportional to SNR, we will choose the weights which achieve the maximum SNR. This problem was solved in the field of sensor arrays [3]. In particular the optimum weights are
\mathbf{w}_{\mathrm{opt}} = \mu\, R_u^{-1} \mathbf{U} \qquad (8)
where μ is an arbitrary constant and R_u = R_J + R_N is the correlation matrix of the combined disturbance in (6). The optimum SNR is given by
\mathrm{SNR}_{\mathrm{opt}} = \frac{E}{M}\, \mathbf{U}^T R_u^{-1} \mathbf{U} \qquad (9)

when the weighting is given by (8).
III. Performance Analysis
Assuming the thermal noise is AWGN, its correlation matrix is given by R_N = (N_0/2) I, where I is the identity matrix. To characterize the interference, we assume the interference and the PN sequence are statistically independent. Assuming further that the PN sequence decorrelates the components in J, R_J is a diagonal matrix whose k-th entry along the diagonal is α_k E_J, where α_k is the fractional interference energy in the k-th dimension and E_J = P_J T is the total interference energy in a data symbol period. With these results, the optimum weights become
\mathbf{w}_{\mathrm{opt}} = \mu \left[ \frac{1}{\alpha_0 P_J T + N_0/2},\ \ldots,\ \frac{1}{\alpha_{M-1} P_J T + N_0/2} \right]^T \qquad (10)
Under typical interference conditions, the interference is confined to a fraction ρ = K/M of the available waveform dimensions. Letting N_K denote the set of indices corresponding to where interference energy is present, the optimum SNR becomes

\mathrm{SNR}_{\mathrm{opt}} = \frac{E}{M} \sum_{k \in N_K} \frac{1}{\alpha_k P_J T + N_0/2} + \frac{2E}{N_0}(1-\rho) \qquad (11)
When the interference power is equal among the dimensions, i.e., α_k = 1/M, the optimum weights correspond to a uniform weighting. This is easily verified with an appropriate selection of μ. In general, the SNR with uniform weighting is given by

\mathrm{SNR}_U = \frac{1}{\dfrac{P_J T}{E M} + \dfrac{N_0}{2E}} \qquad (12)
Another suboptimum strategy is termed excision. Considering (10), if the interference is very large in any particular dimension, then the weight associated with that dimension approaches zero; the filter weighting is then effectively excising this dimension. In this case, the SNR becomes

\mathrm{SNR}_{\mathrm{exc}} = \frac{2E}{N_0}(1-\rho) \qquad (13)
Comparing (11) and (13), we see that these two weighting strategies yield comparable performance when the interference is localized to a minimum number of waveform dimensions, independent of the interference level. To understand what level of localization is required, we compare (11)-(13) as a function of the fraction of corrupted waveform dimensions for the case where M = 1000 and E/N_0 = 10 dB. Figure 1 illustrates the detection SNR under a low interference condition, that is, an interference-to-signal power ratio (ISR) of 10 dB. From these results, we see that the uniform weighting strategy approaches the optimum when a large number of dimensions are corrupted. But when the interference is localized to less than 18% of the total dimensions, the excision weighting yields superior performance, even approaching the optimum as the fraction diminishes. This behavior is even more pronounced at a moderate level of interference (25 dB), as illustrated in Figure 2. Here, as much as 50% of the dimensions can be removed while maintaining excellent performance. Consequently, we are led to conclude that the DS/SS waveform should be designed such that the interference is concentrated in a minimum number of waveform dimensions, and then to apply an excision strategy. Further performance results, including bit error probability, can be found in [2].
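The tradeoff described above can be reproduced numerically from (11)-(13). The 25 dB ISR value below and the assumption that the interference energy is split evenly over the K corrupted dimensions (α_k = 1/K) are illustrative choices, not specified by the paper.

```python
import numpy as np

# Detection SNR of the three weighting strategies, per (11)-(13),
# with E/N0 = 10 dB and M = 1000 as in the text. The 25 dB ISR and the
# even split alpha_k = 1/K over corrupted dimensions are assumptions.
M = 1000
E = 1.0
N0 = E / 10.0                  # E/N0 = 10 dB
Ej = E * 10 ** (25.0 / 10)     # interference energy E_J = P_J T (ISR = 25 dB)

def snrs(rho):
    """Return (SNR_opt, SNR_uniform, SNR_excision) for corrupted fraction rho."""
    K = max(int(rho * M), 1)
    opt = (E / M) * K / (Ej / K + N0 / 2) + (2 * E / N0) * (1 - K / M)
    uni = 1.0 / (Ej / (E * M) + N0 / (2 * E))
    exc = (2 * E / N0) * (1 - K / M)
    return opt, uni, exc

for rho in (0.05, 0.5, 0.9):
    o, u, e = snrs(rho)
    print(f"rho={rho:.2f}: opt={10*np.log10(o):.1f} dB, "
          f"uni={10*np.log10(u):.1f} dB, exc={10*np.log10(e):.1f} dB")
# Excision beats uniform weighting when the interference is localized
# (small rho); uniform weighting wins when most dimensions are corrupted.
```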
IV. Discussion
As examples, we consider the projection of two contrasting types of interference on the functions defining classical DS/SS, SS-MSM and SS-MWM. The interference types are the time-domain impulse and the single-frequency tone. A simple tiling diagram [2] clearly demonstrates that classical DS/SS localizes the impulse, but the tone is spread across all coordinates. Conversely, SS-MWM localizes the tone, but the impulse is distributed across all of its coordinates. With SS-MSM, both the impulse and the tone are localized, as expected from dyadic wavelet theory. Thus, based on our theoretical conclusions, SS-MSM with time-frequency excision appears to be an effective solution against disparate interference conditions. Finally, recent developments in wavelet packets, which can be placed into the framework presented in this paper, indicate that superior interference localization with super-symbol tuning can be achieved [4]. But this improved localization comes with certain practical drawbacks. Namely, the channel conditions must be known, which is questionable in a dynamic hostile environment. Further, this knowledge must be made available synchronously to both the transmitter and receiver, which greatly complicates the potential for adaptive processing. In conclusion, it appears that SS-MSM not only represents a good tradeoff in localizing diverse interference sources, but is also a good tradeoff in transceiver complexity, since the waveform itself is not required to be adaptive to provide effective interference mitigation.
V. References
[1] S. Kondo and L. Milstein, "On the use of multicarrier direct sequence spread spectrum systems," Proc. 1993 IEEE MILCOM Conference, October 11-14, Boston, MA, pp. 52-56.
[2] W. Jones, "A unified approach to orthogonally multiplexed communication using wavelet bases and digital filter banks," Ph.D. Dissertation, Ohio University, August 1994.
[3] R. Compton, Jr., Adaptive Antennas: Concepts and Performance, Prentice Hall, 1988.
[4] A. Lindsey, "Generalized orthogonally multiplexed communication via wavelet packet bases," Ph.D. Dissertation, Ohio University, June 1995.
Figure 1. Detection SNR versus fraction of dimensions corrupted, ISR = 10 dB, E/N_0 = 10 dB.

Figure 2. Detection SNR versus fraction of dimensions corrupted, ISR = 25 dB, E/N_0 = 10 dB.
COFDM, MC-CDMA, and Wavelet-Based MC-CDMA

KyungHi Chang and XuDuan Lin

ETRI, Taejon 305-600, KOREA
[email protected]
Abstract--This tutorial paper addresses various aspects of multicarrier modulation (MCM) techniques and compares their performances. Parameters for the conceptual design of orthogonal frequency division multiplexing (OFDM) and multicarrier CDMA (MC-CDMA) systems are given based on communication-link performance. The evolution of these conventional MCM systems results in wavelet-based MC-CDMA systems, which are proposed in this paper. With all the advantages of the conventional MC-CDMA, the wavelet-based MC-CDMA systems provide not only higher bandwidth efficiency than the conventional MC-CDMA systems, but also new dimensions for anti-fading and interference immunity through the suitable choice of the wavelet functions and the wavelet frequency bands. By these results, wavelet-based MC-CDMA can be a feasible candidate for applications in future public land mobile telecommunication systems (FPLMTS)/international mobile telecommunications (IMT)-2000 and mobile multimedia.
1 Introduction
For the third-generation mobile communication system, i.e., FPLMTS/IMT-2000, handling high data rates in wireless radio environments is one of the key issues. The near-term target data rate for FPLMTS/IMT-2000 is 2 Mbps, but for the mobile multimedia environment the rate should be above 2 Mbps, e.g., eventually up to 20 Mbps. A common air interface (CAI) standard based on single-carrier transmission may be troublesome due to the complex equalizers required for high-rate data under multipath propagation. Moreover, a single-carrier transmission system with an adaptive time-domain equalizer does not perform well against intersymbol interference (ISI) on channels with a very long impulse response, due to the physical limitation on the number of filter taps. Technologies similar to the current IS-95 DS-CDMA cannot be good candidates either, due to the small spreading gain and the ultra-high-speed requirement at the demodulation stage for the transmission and reception of high-rate data, respectively. The receiver of the DS-CDMA system, which utilizes time diversity, needs high-speed rake processing to increase the resolution of the reception. Usual MCM systems enjoy the advantage of narrowband communication, which enables the transmission of high-rate data [1]. The modulator and the entire bank of correlators in the demodulator of MCM systems are nothing but a single IFFT and FFT block, respectively. The IFFT and FFT are computed once per N samples, i.e., once per symbol interval, where N is the number of carriers. Due to the remarkable achievements in VLSI technology, the FFT has been the favorite choice for the implementation of MCM; today, even 0.18 um fabrication technology has been developed. In this paper, starting from the conceptual design of the OFDM and MC-CDMA systems, three novel MC-CDMA systems, based on wavelet orthogonality and other properties, are proposed.
The basic advantages and performances of the three novel wavelet-based MC-CDMA systems (wavelet-based MC/BPSK-CDMA, wavelet-based MC/QPSK-CDMA, and fractal MC-CDMA) are investigated and compared with the conventional MCM systems. Their potential applications in broadband wireless communications to improve bandwidth efficiency and combat fading and narrowband interference are also discussed.
2 Coded Orthogonal Frequency Division Multiplexing
OFDM, which is one variety of the MCM technique, has been known since [2] and [3], and it was used in military HF communication systems. The OFDM signal at the m-th transmitter is represented as
s_m(t) = b_m \cdot p_{T_b}(t - kT_b) \cdot \cos(2\pi f_c t + \varphi), \qquad (1)
where b_m is obtained from the input a_m, p_{T_b} designates the transmit filter impulse response on the interval [0, T_b], f_c is a carrier frequency, and φ is a carrier phase. Here, the k-th output sample of the j-th IDFT block, b_m[k, j], becomes

b_m[k, j] = \frac{1}{N} \sum_{i=0}^{N-1} a_m[i, j]\, e^{j 2\pi k i / N}, \qquad k = 0, 1, \ldots, N-1, \qquad (2)
where a_m[i, j] represents the i-th input symbol of the j-th IDFT block. Then the resulting real and imaginary parts of (2) are concatenated and transmitted, or only the real part of (2) is transmitted [4]. In the latter case, the receiver needs to sample twice as fast as the transmitter and must perform a 2N-point DFT operation; thus, a tradeoff between efficient channel usage and receiver complexity does exist. The basic principles of the OFDM rely on transforming the serial bit stream to parallel bits and on splitting the information to be transmitted over a moderate number of RF carriers. This serial-to-parallel conversion can make the pulse duration larger than the time delay spread T_d of the channel, sometimes with the help of a guard interval in the time domain, thus decreasing ISI. That is, the bandwidth of a transmitted narrowband signal is smaller than the coherence bandwidth BW_c of the channel, which results in a flat fading channel and so alleviates the channel equalization problem in the time domain. The conversion also makes possible the high data rates of multimedia transmission and increased robustness against impulse noise in the time domain: time-domain impulse interference is averaged out over the entire FFT block. An additional degree of freedom in the OFDM system is the independently selected signal constellations used for the different carriers, in accordance with the channel attenuation, interference and impulse noise at the corresponding frequencies. Due to its nature, however, the OFDM system is vulnerable to frequency-domain impulse interference or tone interference. Moreover, different data symbols are transmitted on different subcarriers in the OFDM. Thus, powerful channel coding is essential in the OFDM. Channel coding with frequency interleaving allows the link to transmit the original data on separated carriers, so that a carrier under fading can be recovered with the help of other unfaded carriers.
That is, an implicit form of frequency diversity exists. Time interleaving is also utilized in the OFDM system. Concatenated coding can be adopted to obtain high coding gains with moderate decoding complexity. For the two levels of forward error correction (FEC), the inner code uses a convolutional code or trellis-coded modulation (TCM), and a Reed-Solomon (RS) code is usually employed as the outer code [5]. TCM coding only increases the constellation size and uses this additional redundancy to trellis-code the signal without bandwidth expansion. Here, one role of the outer code is to handle the burst errors generated by the inner code. Besides, the turbo code, which is an all-convolutional concatenated code, offers good performance and reasonable decoding complexity [6]. Adaptive loading of each carrier is desired so that the bit error rates (BER) in all the sub-channels are equal, to achieve the maximum transmission rate. A simplified block diagram of the OFDM transmitter employing concatenated FEC coding is shown in Fig. 1. With a good choice of the channel coding, the coded OFDM (COFDM) system can also enjoy implicit time-domain diversity under a channel of appropriate time dispersion, e.g., 1.5 MHz of BW_c for outdoor propagation in the European digital audio broadcasting (DAB) project [7]. For digital video broadcasting (DVB), a real-time implementation of a TV COFDM system supports a bit rate of 21 Mbps in a frequency-selective channel [6], [8]. Under the name of discrete multitone (DMT), the OFDM has also become a standard multiple access scheme for asymmetric digital subscriber line (ADSL) services [8]. In contrast to the nice behavior of the COFDM under the time-dispersive channel, the COFDM system under a frequency-dispersive channel, caused by rapid time variations of the channel, easily loses the orthogonality of subcarriers, which results in increased BER. Other frequency-nonlinear
Figure 1: Block diagram of the OFDM transmitter employing concatenated FEC coding.
characteristics, including instability of oscillators and nonlinear amplification, also cause interchannel interference (ICI). The linearity requirement on the amplifier is tighter than in other MCM systems. Therefore, for each branch of the COFDM receiver, an equalizer against the frequency-dispersive flat fading channel is necessary [9], which increases the overall complexity of the receiver. The COFDM system with the frequency-domain equalizer can handle channels with a larger impulse response than the single-carrier transmission system can, and the frequency-domain equalizer takes the form of a complex multiplier bank at the FFT output in the receiver. To avoid ICI, even though the carrier spectra overlap, the carrier spacing is selected so that the carriers are located at the zero-crossing points, achieving frequency-domain orthogonality. This situation is just the frequency-domain correspondent of ISI cancellation in the time domain. The design of orthogonal waveforms on each subcarrier [6] and the design of interdependent subcarrier waveforms to minimize the peak-to-average power ratio of the total signal [10] may be conflicting goals. Nonlinear distortion and carrier synchronization issues, inherent in the OFDM system, are the most difficult to handle; in particular, the carrier synchronization problem is not solved at all by frequency-domain equalization. Multiplexing/multiple access in the COFDM system takes the form of OFDM/FDMA, so the COFDM is more adequate for broadcasting purposes. For cellular use, techniques such as dynamic channel allocation are required; dynamic channel allocation alleviates the need for frequency planning. There is another approach to utilizing the OFDM technique in the cellular network, namely the MC/DS-CDMA system [11]. It is a mixed version of the OFDM and the DS-CDMA, but it may lose the inherent advantage of the OFDM in the time domain.
Moreover, its validity for application in the mobile multimedia environment, which needs the transmission of high-rate data, becomes weak. Even though the MC/DS-CDMA approach seems to increase the capacity of a wireless cellular network by the use of spreading codes, the overall performance increase of the cellular network, compared to the COFDM, is not so promising due to the mentioned drawbacks. Variants of the above concept are also introduced in [12]-[14] with comparisons to the DS-CDMA. At the least, the above structures outperform the DS-CDMA, especially for large numbers of users.
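The IDFT modulator of (2) and the frequency-domain equalizer discussed in this section (a complex multiplier bank at the FFT output, enabled by a guard interval) can be sketched together. N, the prefix length, the BPSK-like symbols, and the channel taps below are illustrative assumptions.

```python
import numpy as np

# With a cyclic-prefix guard interval longer than the channel impulse
# response, linear convolution becomes circular, so the channel reduces
# to one complex multiplication per subcarrier at the FFT output
# (zero-forcing equalization on a noiseless channel).
N, cp = 64, 8
rng = np.random.default_rng(2)
a = rng.choice([-1.0, 1.0], N) + 1j * rng.choice([-1.0, 1.0], N)
h = np.array([1.0, 0.5, 0.2])            # 3-tap multipath channel (< cp)

x = np.fft.ifft(a)                       # OFDM modulator: one IDFT block
x_cp = np.concatenate([x[-cp:], x])      # prepend guard interval
y = np.convolve(x_cp, h)[cp:cp + N]      # channel; receiver drops prefix

H = np.fft.fft(h, N)                     # per-subcarrier channel response
a_hat = np.fft.fft(y) / H                # the complex multiplier bank

assert np.allclose(a_hat, a)             # noiseless channel: exact recovery
```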
3 Multi-Carrier CDMA
The MC-CDMA is a digital modulation technique where a single data symbol is transmitted on multiple narrowband subcarriers, with each subcarrier encoded with a phase offset of 0 or π based on a spreading sequence. The code rate of the spreading sequence in the MC-CDMA system is the same as the rate of the incoming bit stream, so there is no actual spreading, except by the subcarriers. So the concept of MC-CDMA is not the same as that of MC/DS-CDMA. This modulation scheme is also a multiple access technique in the sense that different users use the same set of subcarriers but with a different spreading sequence that is orthogonal to the sequences of all other users [15]. That is, there exists a double orthogonality, by the spreading code and by the multicarrier. Therefore, cochannel interference (CCI) is reduced by the use of the spreading code, and ICI by the multicarrier. The same data on different subcarriers in the MC-CDMA system guarantees an explicit form of frequency diversity, hence a frequency-domain rake. Thus, there is no strict requirement of a linear amplifier; instead, a power-efficient broadband amplifier may be sufficient. Multiplexing/multiple access in the MC-CDMA system
takes the form of MC-CDM/CDMA; that is, the MC-CDMA is more suitable for use in cellular applications than the COFDM. The MC-CDMA signal at the m-th transmitter can be represented as

s_m(t) = \sum_{i=0}^{N-1} c_m[i]\, a_m[k]\, p_{T_b}(t - kT_b) \cos\!\left(2\pi f_c t + 2\pi i \frac{F}{T_b} t\right), \qquad (3)
where c_m[i] is a chip from the m-th spreading sequence of length N, a_m[k] is the k-th input data symbol for the m-th user, p_{T_b} is a unit-amplitude pulse that is non-zero on the interval [0, T_b], f_c is a carrier frequency, and F describes the spacing between subcarrier frequencies. With F = 1, the structure of the signal is exactly that of the OFDM. The spreading codes employed so far are ordinary PN codes or Walsh-Hadamard codes. A desirable code for the MC-CDMA system should have a large number of sequences in its class, with a sequence length equal to the number of subcarriers N. Compared to the COFDM, the distinguishing advantage is the explicit form of frequency diversity, but the main drawback is the bandwidth efficiency. The current MC-CDMA imposes less hardware burden than the COFDM, since no frequency-domain equalizer is necessary; instead, only a very simple form of optimum gain combining is needed. From the ISI point of view, there is a compromise between the channel code rate r in the COFDM and the number of subcarriers N in the MC-CDMA. Hence, except for the bandwidth efficiency, the overall performance of the MC-CDMA is better than that of the COFDM in time- and frequency-dispersive channels. To combat frequency-selective fading, the DS-CDMA and the MC-CDMA employ time and frequency diversity, respectively. For a flat fading channel of moderate bandwidth, the MC-CDMA may still show good performance, whereas the DS-CDMA may be unable to cope, owing to the choice of the preset threshold values for power control in the microcontroller. Compared to the COFDM, the only disadvantage of the MC-CDMA, namely bandwidth efficiency, can be alleviated by the use of wavelets. The excellent time and frequency locality of wavelets can be devoted to increasing the bandwidth efficiency of the MC-CDMA system. An additional degree of freedom to combat the effect of the fading channel is another advantage.
The tradeoff is an increase in the complexity of the modulator/demodulator. Therefore, it can be a feasible approach for applications such as FPLMTS/IMT-2000 and mobile multimedia.
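The double orthogonality described in this section (orthogonal subcarriers plus orthogonal spreading codes across users) can be sketched in discrete baseband. The Walsh-Hadamard codes, two users, and N = 8 subcarriers below are illustrative assumptions.

```python
import numpy as np

# Discrete-baseband MC-CDMA sketch of (3): each user repeats ONE data
# symbol on all N subcarriers, phase-coding subcarrier i with chip c_m[i].
N = 8
H2 = np.array([[1.0, 1.0], [1.0, -1.0]])
H = H2
while H.shape[0] < N:            # Sylvester construction of Walsh-Hadamard
    H = np.kron(H, H2)
codes = H[:2]                    # spreading sequences for 2 users
data = np.array([1.0, -1.0])     # one BPSK symbol per user

# Transmit: sum both users on each subcarrier, then one IFFT.
freq = codes.T @ data            # subcarrier amplitudes
tx = np.fft.ifft(freq)

# Receiver of user 0: FFT back to subcarriers, despread with its code.
rx = np.fft.fft(tx).real
d0 = codes[0] @ rx / N           # code orthogonality cancels user 1
assert abs(d0 - data[0]) < 1e-9
```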
4 Wavelet-Based MC-CDMA
Wavelets have been a very hot topic in recent years. Their applications range from function approximation, multiresolution signal representation, and image compression to signal processing and other fields. The popularity of wavelets is primarily due to the interesting structure they provide based on dilation and translation. A few investigators have begun to exploit those features of wavelets that suggest their application in communications [16]-[21].

4.1 Wavelets and Related Properties
Let ψ(t) be a mother wavelet waveform; then a complete orthogonal set of daughter wavelets ψ_{a,b}(t) can be generated from ψ(t) by dilation by a factor a = 2^j and shift by an amount b = 2^j k. We denote

\psi_{j,k}(t) = 2^{-j/2}\, \psi\!\left(\frac{t - 2^j k}{2^j}\right). \qquad (4)

It can be shown that the dilated and translated wavelets are orthogonal to each other:

\langle \psi_{j,k}(t), \psi_{m,n}(t) \rangle = \int_{-\infty}^{\infty} \psi_{j,k}(t)\, \psi_{m,n}(t)\, dt = \delta_{j-m}\, \delta_{k-n}. \qquad (5)
For any given wavelet ψ_{j,k}(t), there exists a corresponding scaling function φ_{j,k}(t), also generated from a mother scaling function φ(t). The scaling functions satisfy the following relations [18]:

\langle \phi_{j,k}(t), \phi_{j,n}(t) \rangle = \delta_{k-n} \qquad (6)
401
Figure 2: The m-th transmitter model for the wavelet-based MC/BPSK-CDMA system.

For any j < m,
\langle \phi_{j,k}(t), \psi_{m,n}(t) \rangle = 0. \qquad (7)
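The orthonormality relation (5) can be verified numerically with discrete Haar wavelets. Haar is used here only because it can be written down exactly, and the dyadic sampling grid below is an illustrative discretization, not part of the paper.

```python
import numpy as np

# Discrete Haar daughters psi_{j,k}(t) = 2^{-j/2} psi((t - 2^j k)/2^j)
# on a dyadic grid of L samples; the Gram matrix of the whole family
# should be the identity, matching the orthonormality relation (5).
L = 64

def haar_psi(j, k):
    u = (np.arange(L) - 2**j * k) / 2**j
    out = np.where((u >= 0) & (u < 0.5), 1.0,
          np.where((u >= 0.5) & (u < 1.0), -1.0, 0.0))
    return out / np.sqrt(2**j)           # unit energy at every scale

fam = np.array([haar_psi(j, k) for j in (1, 2, 3) for k in range(L // 2**j)])
gram = fam @ fam.T                       # all pairwise inner products
assert np.allclose(gram, np.eye(len(fam)))
```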
These relations are the basis of the wavelet transform applications in communications. There exist many families of wavelets and scaling functions. In communications applications, it is usually required that the wavelet be smoother than the simplest Haar wavelet and provide better temporal as well as spectral localization.

4.2 Wavelet-Based MC-CDMA
By using the self- and cross-orthogonality of the scaling functions φ and the wavelet functions ψ, we now propose novel wavelet-based MC-CDMA systems. In our wavelet-based MC-CDMA systems, there exist three levels of orthogonality: the subcarrier frequencies are orthogonal to each other, the wavelets and scaling functions are orthogonal to each other, and the spreading sequences are also orthogonal to each other. The wavelet-based MC/BPSK-CDMA signal for the m-th transmitter can be described as follows:
s_m(t) = \sum_{i=0}^{N-1} \left\{ \frac{c_m[i]\, a_m[k]}{\sqrt{T_b}}\, \phi\!\left(\frac{t - kT_b}{T_b}\right) + \frac{c_m[i]\, b_m[k]}{\sqrt{T_b}}\, \psi\!\left(\frac{t - kT_b}{T_b}\right) \right\} \cos\!\left(2\pi f_c t + 2\pi i \frac{F}{T_b} t\right), \qquad (8)
where T_b is a power of 2, and a_m[k] and b_m[k] are two independent data symbols at the k-th bit interval. Shown in Fig. 2 is a model of the wavelet-based MC/BPSK-CDMA transmitter for the m-th user. At the receiver, assuming there are M active users and the channel is noiseless, the received signal is

r(t) = \sum_{m=0}^{M-1} \sum_{i=0}^{N-1} \left\{ \frac{c_m[i]\, a_m[k]}{\sqrt{T_b}}\, \phi\!\left(\frac{t - kT_b}{T_b}\right) + \frac{c_m[i]\, b_m[k]}{\sqrt{T_b}}\, \psi\!\left(\frac{t - kT_b}{T_b}\right) \right\} \cos\!\left(2\pi f_c t + 2\pi i \frac{F}{T_b} t\right). \qquad (9)

Assume that m = 0 corresponds to the desired signal. In the 0-th receiver, there are N passband filters, with the i-th one centered at the frequency f_c + iF/T_b, so the received signal r(t) is first converted back to baseband in each i-th branch of the receiver:

r_i(t) = \sum_{m=0}^{M-1} \left\{ \frac{c_m[i]\, a_m[k]}{\sqrt{T_b}}\, \phi\!\left(\frac{t - kT_b}{T_b}\right) + \frac{c_m[i]\, b_m[k]}{\sqrt{T_b}}\, \psi\!\left(\frac{t - kT_b}{T_b}\right) \right\}. \qquad (10)

Now the signal r_i(t) is filtered separately by two matched filters with impulse responses T_b^{-1/2} φ((JT_b - t)/T_b) and T_b^{-1/2} ψ((JT_b - t)/T_b), respectively, where T = JT_b is the duration of φ and ψ, and the filter outputs are sampled at t = nT_b, which results in the following variables:
y_i(nT_b) = r_i(t) * T_b^{-1/2}\, \phi\!\left(\frac{JT_b - t}{T_b}\right)\Big|_{t=nT_b} = \sum_{m=0}^{M-1} c_m[i]\, a_m[n-J], \qquad (11)
and

z_i(nT_b) = r_i(t) * T_b^{-1/2}\, \psi\!\left(\frac{JT_b - t}{T_b}\right)\Big|_{t=nT_b} = \sum_{m=0}^{M-1} c_m[i]\, b_m[n-J]. \qquad (12)
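The separation performed by the two matched-filter branches of (11) and (12) rests only on the orthogonality of φ and ψ. A minimal single-subcarrier sketch with a discrete Haar φ/ψ pair (an illustrative choice, not the paper's preferred smooth wavelets) is:

```python
import numpy as np

# Two independent symbols ride on the scaling function phi and the
# wavelet psi of the same interval; the receiver separates them by the
# orthogonality <phi, psi> = 0 (discrete Haar pair, unit energy).
phi = np.array([1.0, 1.0]) / np.sqrt(2)   # scaling function
psi = np.array([1.0, -1.0]) / np.sqrt(2)  # wavelet, orthogonal to phi
a0, b0 = 1.0, -1.0                        # the two data symbols

baseband = a0 * phi + b0 * psi            # per-branch baseband signal
y = np.dot(phi, baseband)                 # phi-matched filter, as in (11)
z = np.dot(psi, baseband)                 # psi-matched filter, as in (12)
assert abs(y - a0) < 1e-9 and abs(z - b0) < 1e-9
```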
Then y_i(nT_b) is multiplied by c_0[i], and summing over i gives

u(n) = \sum_{i=0}^{N-1} c_0[i]\, y_i(nT_b) = a_0[n-J]. \qquad (13)
Similarly, we have

v(n) = \sum_{i=0}^{N-1} c_0[i]\, z_i(nT_b) = b_0[n-J]. \qquad (14)
Therefore, we recover the data symbols a_0[n-J] and b_0[n-J] for n = 0, ±1, .... Now we generalize the wavelet-based MC/BPSK-CDMA system to the following wavelet-based MC/QPSK-CDMA system. The transmitted signal of the m-th user in the wavelet-based MC/QPSK-CDMA system is:
s_m(t) = \sum_{i=0}^{N-1} \Big\{ \Big[ \frac{c_m[i]\, a_m[k]}{\sqrt{T_b}}\, \phi\!\Big(\frac{t-kT_b}{T_b}\Big) + \frac{c_m[i]\, b_m[k]}{\sqrt{T_b}}\, \psi\!\Big(\frac{t-kT_b}{T_b}\Big) \Big] \cos\!\Big(2\pi f_c t + 2\pi i \frac{F}{T_b} t\Big) + \Big[ \frac{c_m[i]\, a'_m[k]}{\sqrt{T_b}}\, \phi\!\Big(\frac{t-kT_b}{T_b}\Big) + \frac{c_m[i]\, b'_m[k]}{\sqrt{T_b}}\, \psi\!\Big(\frac{t-kT_b}{T_b}\Big) \Big] \sin\!\Big(2\pi f_c t + 2\pi i \frac{F}{T_b} t\Big) \Big\}, \qquad (15)
where the sequences {a_m[k]}, {b_m[k]}, {a'_m[k]} and {b'_m[k]} are four independent data symbol sequences, usually taking the values ±√E. At the receiver, the in-phase and quadrature signals are first separated by the orthogonality of cos(2πf_c t + 2πi(F/T_b)t) and sin(2πf_c t + 2πi(F/T_b)t) for i = 0, 1, ..., N-1; then the separated in-phase and quadrature signals s_{I,i}(t) and s_{Q,i}(t) can be demodulated by separate matched filters with φ and ψ as the impulse responses, followed by sampling and hard-decision devices. Assuming T_b = 2^j, it can easily be seen that the above wavelet-based MC-CDMA systems use only a single wavelet frequency band, corresponding to the j in ψ_{j,k}(t) and φ_{j,k}(t) for k = 0, ±1, .... So in each branch i we can form 'near-baseband' signals by summing several single-frequency-band wavelet-modulated signals, and we use the resulting 'near-baseband' signals to replace the corresponding baseband signals in the wavelet-based MC/QPSK-CDMA system. We thus obtain the following fractal MC-CDMA system:
s_m(t) = \sum_{i=0}^{N-1} \sum_{j \in U} \Big\{ \big[ c_m[i]\, a_{m,j}[k]\, \phi_{j,k}(t) + c_m[i]\, b_{m,j}[k]\, \psi_{j,k}(t) \big] \cos\!\Big(2\pi f_c t + 2\pi i \frac{F}{T_b} t\Big) + \big[ c_m[i]\, a'_{m,j}[k]\, \phi_{j,k}(t) + c_m[i]\, b'_{m,j}[k]\, \psi_{j,k}(t) \big] \sin\!\Big(2\pi f_c t + 2\pi i \frac{F}{T_b} t\Big) \Big\}, \qquad (16)
where {a_{m,j}[k]}, {b_{m,j}[k]}, {a'_{m,j}[k]} and {b'_{m,j}[k]} are four independent data symbol sequences for the j-th band. U is a subset of integers, such as U = {1 - M, 2 - M, ..., 0}, and it can be chosen according to the channel characteristics.
Table 1: Variation of bandwidth efficiency with different wavelet waveforms.

    Waveform                           BE (bits/sec/Hz)
    Full-width rectangular pulse       0.57 n
    Daubechies wavelet (order 4)       0.65 n
    Daubechies wavelet (order 6)       0.71 n
    Daubechies wavelet (order 8)       1.43 n
    Daubechies wavelet (order 10)      1.48 n
    Battle-Lemarie wavelet             1.74 n
4.3 Performance Analysis
In this section, we discuss the advantages and performance of our wavelet-based MC-CDMA systems compared with the conventional MC-CDMA used in wireless communication systems. As in the conventional MC-CDMA system [15], the wavelet-based MC-CDMA systems address the issue of how to spread the signal bandwidth without increasing the adverse effect of the delay spread. A wavelet-based MC-CDMA or fractal MC-CDMA signal is composed of N narrowband subcarrier signals, each of which has a symbol duration much larger than the delay spread T_d, so it will not experience the increased susceptibility to delay spread and ISI that the DS-CDMA system does. Since the parameter F can be chosen to determine the spacing between subcarrier frequencies, a smaller spreading factor N than that required by the DS-CDMA can be used, so that not all of the subcarriers are located in a deep fade in frequency. Then frequency diversity is achieved. In addition, the mother wavelet function and the set of wavelet frequency bands U can be chosen according to the characteristics of the channel. Thus, two new dimensions to improve the system performance are obtained. If the effects of the channel are included in ρ_{m,i} and θ_{m,i}, and n(t) is AWGN, the received signal for the wavelet-based MC/QPSK-CDMA can be represented as follows:

r(t) = \sum_{m=0}^{M-1} \sum_{i=0}^{N-1} \rho_{m,i} \Big\{ \Big[ \frac{c_m[i]\, a_m[k]}{\sqrt{T_b}}\, \phi\!\Big(\frac{t-kT_b}{T_b}\Big) + \frac{c_m[i]\, b_m[k]}{\sqrt{T_b}}\, \psi\!\Big(\frac{t-kT_b}{T_b}\Big) \Big] \cos\!\Big(2\pi f_c t + 2\pi i \frac{F}{T_b} t + \theta_{m,i}\Big) + \Big[ \frac{c_m[i]\, a'_m[k]}{\sqrt{T_b}}\, \phi\!\Big(\frac{t-kT_b}{T_b}\Big) + \frac{c_m[i]\, b'_m[k]}{\sqrt{T_b}}\, \psi\!\Big(\frac{t-kT_b}{T_b}\Big) \Big] \sin\!\Big(2\pi f_c t + 2\pi i \frac{F}{T_b} t + \theta_{m,i}\Big) \Big\} + n(t) \qquad (17)
Then, by comparing the wavelet-based MC-CDMA demodulation processes with the conventional MC-CDMA demodulation processes [15], it can be shown that both the wavelet-based MC-CDMA system and the conventional MC-CDMA system possess the same BER under the above channel condition. For other fading channels, however, a suitable choice of wavelets provides another way to combat the distortion of the transmitted signals and improve the system performance. Under the assumption of an AWGN channel, the BERs of the wavelet-based MC/BPSK-CDMA and the fractal MC-CDMA systems can also be shown to equal the BERs of the corresponding conventional BPSK and QPSK systems, respectively. The bandwidth efficiency (BE) of a modulation system is defined as
BE = Total bit rate / Bandwidth   (bits/sec/Hz).    (18)
Assuming a 99% power bandwidth, based on the results in [18], the variation of BE with several different wavelet waveforms is shown in Table 1. Here, n = 1 and n = 2 correspond to BPSK and QPSK, respectively. Consequently, for the wavelet-based MC-CDMA systems, significantly higher bandwidth efficiencies can be obtained, compared with the conventional MC-CDMA system, by introducing compactly supported orthogonal wavelets.
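As a quick illustration of Eq. (18), the sketch below computes BE from a total bit rate and an occupied bandwidth; all numeric values are invented for illustration and are not the values of Table 1 or the paper.

```python
def bandwidth_efficiency(total_bit_rate_bps, bandwidth_hz):
    """BE as defined in Eq. (18), in bits/sec/Hz."""
    return total_bit_rate_bps / bandwidth_hz

# Hypothetical system: N subcarriers, n bits per symbol of duration Tb,
# transmitted in an assumed 99% power bandwidth B (all values invented).
N, n, Tb, B = 64, 2, 1e-4, 1.5e6
total_rate = N * n / Tb            # total bit rate in bits/sec
be = bandwidth_efficiency(total_rate, B)
```

A more compact wavelet waveform shrinks the 99% power bandwidth B for the same bit rate, which is exactly how the compactly supported wavelets raise BE.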
5 Conclusions
In this tutorial paper, we compare the performance of various MCM techniques, such as OFDM and MC-CDMA, with an emphasis on the proposed wavelet-based MC-CDMA systems. The proposed wavelet-based MC-CDMA systems possess all the desirable characteristics of the conventional MC-CDMA system, e.g., frequency diversity and small ISI. In addition to those advantages, the wavelet-based MC-CDMA systems provide not only higher bandwidth efficiency than the MC-CDMA systems, but also new dimensions for anti-fading and interference immunity through the suitable choice of the wavelet functions and the wavelet frequency bands. Based on these results, the wavelet-based MC-CDMA systems are a feasible candidate multiplexing/multiple access technique for use in FPLMTS/IMT-2000 and mobile multimedia applications.
References [1] J.A.C. Bingham, "Multicarrier modulation for data transmission: An idea whose time has come," IEEE Commun. Magazine, pp. 5-14, May 1990. [2] M.L. Doelz, E.T. Helad, and D.L. Martin, "Binary data transmission techniques for linear systems," Proc. IRE, vol. 45, pp. 656-661, May 1957. [3] H.F. Harmuth, "On the transmission of information by orthogonal time functions," AIEE Trans. Commun. Electron., vol. 79, pp. 248-255, July 1960. [4] S.B. Weinstein and P.M. Ebert, "Data transmission by frequency-division multiplexing using the discrete Fourier transform," IEEE Trans. Commun. Tech., vol. 19, pp. 628-634, Oct. 1971. [5] Y. Wu and B. Caron, "Digital television terrestrial broadcasting," IEEE Commun. Magazine, pp. 46-52, May 1994. [6] B. Le Floch, M. Alard, and C. Berrou, "Coded orthogonal frequency division multiplex," Proc. IEEE, vol. 83, pp. 982-996, June 1995. [7] M. Alard and R. Lassalle, "Principles of modulation and channel coding for digital broadcasting for mobile receivers," EBU Technical Review, no. 224, pp. 168-190, Aug. 1987. [8] H. Sari, G. Karam, and I. Jeanclaude, "Transmission techniques for digital terrestrial TV broadcasting," IEEE Commun. Magazine, pp. 100-109, Feb. 1995. [9] L.J. Cimini, Jr., "Analysis and simulation of a digital mobile channel using orthogonal frequency division multiplexing," IEEE Trans. Commun., vol. 33, pp. 665-675, July 1985. [10] A.E. Jones, T.A. Wilkinson, and S.K. Barton, "Block coding scheme for reduction of peak to mean envelop power ratio of multicarrier transmission schemes," Electronics Letters, vol. 30, pp. 2098-2099, Dec. 1994. [11] L. Vandendorpe, "Multitone spread spectrum communication systems in a multipath Rician fading channel," in Proc. IZSDC, Mar. 1994, pp. 440-451. [12] S. Kaiser, "OFDM-CDMA versus DS-CDMA: Performance evaluation for fading channels," in Proc. IEEE ICC, June 1995, pp. 1722-1726. [13] S. Kondo and L.B. Milstein, "Performance of multicarrier DS CDMA Systems," IEEE Trans. 
Commun., vol. 44, pp. 238-246, Feb. 1996. [14] E.A. Sourour and M. Nakagawa, "Performance of orthogonal multicarrier CDMA in a multipath fading channel," IEEE Trans. Commun., vol. 44, pp. 356-367, Mar. 1996. [15] N. Yee and J.P. Linnartz, "Multi-carrier CDMA in an indoor wireless radio channel," Memo. No. UCB/ERL M94/6, Electronics Research Lab., UC-Berkeley, Feb. 1994. [16] M.A. Tzannes and M.C. Tzannes, "Bit-by-bit channel coding using wavelets," in Proc. IEEE GLOBECOM, Dec. 1992, pp. 684-688. [17] R. Orr, C. Pike, and M. Bates, "Covert communications employing wavelet technology," in Proc. IEEE Asilomar Conf. on Signals, Systems and Computers, Nov. 1993, pp. 523-527. [18] P.P. Gandhi, S.S. Rao, and R.S. Pappu, "On waveform coding using wavelets," in Proc. IEEE Asilomar Conf. on Signals, Systems and Computers, Nov. 1993, pp. 901-905. [19] M. Medley, G. Saulnier, and P.K. Das, "Applications of wavelet transform in spread spectrum communications systems," in SPIE Proc. Wavelet Applications, vol. 2242, pp. 54-68, Apr. 1994. [20] K.H. Chang, X.D. Lin, and H.J. Li, "Wavelet-based multi-carrier CDMA for PCS," in Proc. IEEE ICASSP, May 1996, pp. 1443-1446. [21] K.H. Chang, X.D. Lin, and M.G. Kyeong, "Performance analysis of wavelet-based MC-CDMA for FPLMTS/IMT-2000," in Proc. IEEE ISSSTA, Sep. 1996, to be published.
SIGNAL DENOISING THROUGH MULTIFRACTALITY W. Kinsner and A. Langi Department of Electrical and Computer Engineering, Signal and Data Compression Laboratory, University of Manitoba, Winnipeg, Manitoba, Canada R3T-5V6, email: {kinsner|langi}@ee.umanitoba.ca and TRLabs (Telecommunications Research Laboratories) 10-75 Scurfield Boulevard, Winnipeg, Manitoba, Canada R3Y-1P6 ABSTRACT This paper presents a new framework for signal denoising based on multifractality, and demonstrates its practicality with several examples. Signal denoising is concerned with the separation of noise from a signal, and then with reducing the noise without altering the signal significantly. This paper demonstrates that a multifractal measure can be used to guide the process of noise reduction so that the fractal spectrum is preserved in the signal. INTRODUCTION
Denoising is critical in many signal applications in which noise contamination reduces the performance of signal processing. For example, signal analysis often results in incorrect characterization due to noise [ScGr91]. In signal compression, contaminated signals are often difficult to compress because their entropy values are very high [LaKi96a]. Unfortunately, proper denoising is difficult because neither the signal nor the noise is known. Although the concept of denoising is not new theoretically, it is now entering a practical phase due to several recent developments in the areas of wavelets, contextual prediction, and multifractality. Current denoising algorithms are based on preserving selected characteristics of signals that do not occur in noise, such as regularity, smoothness, predictability, power spectrum density, and linearity [Dono92], [KoSc93], [CoMW92]. Although such algorithms perform well for classes of relatively smooth signals, they fail to apply well to noise-like signals (i.e., signals with a noise-like appearance) such as image textures or speech consonants. We have developed an approach based on singularity preservation as the denoising criterion for regular as well as noise-like signals. This approach was prompted by previous work on singularity characterization using wavelets, indicating that singularities can represent regular and noise-like signals faithfully, i.e., signals reconstructed from wavelet-detected singularities are perceptually indistinguishable from the original ones [MaHw92]. In particular, multifractal measures of signals (e.g., a spectrum of singularities or the Rényi generalized dimensions [Kins94]) can be used to characterize singularities [LaKi96b], [FKPG96], [Lang96]. Hence, denoising schemes should preserve signal multifractality. Furthermore, the removed parts must have the multifractal characteristics of noise. EXAMPLES OF DENOISED IMAGES
This paper shows examples of applying the measures in various image denoising schemes (i.e., wavelet shrinkage [Dono92] and prediction [KoSc93]), as well as some high-quality high-bit-rate image compression schemes (e.g., the Joint Photographic Experts Group, JPEG, standard), to
Fig. 1. Comparison of (a) a 512x512 aerial ortho image and (b) the denoised image, using wavelet shrinkage at a level suitable for a 2.03:1 lossless compression ratio.

demonstrate the relation between the measure and the perceptual reconstruction quality, as well as the practicality of the framework. In one example, we have denoised an aerial ortho image to enable a compression ratio (CR) of at least 2:1. The importance of this example is that the image was almost incompressible (1.06:1) from Shannon's entropy point of view. This was achieved by wavelet shrinkage, in which an image is first transformed into a wavelet domain, the wavelet coefficient values are then shrunk according to a soft thresholding, and the image is reconstructed from the shrunk coefficients. Increasing the thresholding level results in an increase in the lossless compression ratio of the denoised image. Figure 1 compares the original and the denoised images at a thresholding level of 0.011 for a 2.03:1 CR. Although the reconstructed image is smoother (with a 35.5 dB peak signal-to-noise ratio, PSNR), all sharp edges are still preserved, which makes denoising superior to classical filtering techniques that tend to blur edges (i.e., the high-frequency parts of the image are altered). In another example, we have used prediction for denoising [LaKi96a], as shown in Fig. 2.
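The shrinkage pipeline just described (transform, soft-threshold the detail coefficients, reconstruct) can be sketched with a single-level 2-D Haar transform standing in for the paper's multi-level wavelet transform; the Haar choice and the threshold value are our assumptions, not the paper's.

```python
import numpy as np

def soft_threshold(x, t):
    """Donoho-style soft thresholding: shrink coefficients toward zero by t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def haar_shrink(img, t):
    """One-level 2-D Haar wavelet shrinkage. The three detail subbands are
    soft-thresholded; the approximation subband is kept intact. With t = 0
    the transform/inverse pair reconstructs the image exactly."""
    a = (img[0::2, 0::2] + img[1::2, 0::2] + img[0::2, 1::2] + img[1::2, 1::2]) / 2
    h = (img[0::2, 0::2] - img[1::2, 0::2] + img[0::2, 1::2] - img[1::2, 1::2]) / 2
    v = (img[0::2, 0::2] + img[1::2, 0::2] - img[0::2, 1::2] - img[1::2, 1::2]) / 2
    d = (img[0::2, 0::2] - img[1::2, 0::2] - img[0::2, 1::2] + img[1::2, 1::2]) / 2
    h, v, d = (soft_threshold(c, t) for c in (h, v, d))
    out = np.empty_like(img, dtype=float)
    out[0::2, 0::2] = (a + h + v + d) / 2
    out[1::2, 0::2] = (a - h + v - d) / 2
    out[0::2, 1::2] = (a + h - v - d) / 2
    out[1::2, 1::2] = (a - h - v + d) / 2
    return out
```

Raising t removes more of the small (noise-dominated) coefficients, which is why the lossless compression ratio of the denoised image grows with the thresholding level.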
Fig. 2. Comparison of (a) a 256x256 aerial ortho image and (b) a denoised image using prediction suitable for a 2.22:1 lossless compression ratio, and (c) the residual image (enhanced for visual presentation).
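The multifractal measures used in the evaluation below (the Rényi generalized dimensions Dq and, via a Legendre transform, the singularity spectrum f(α)) can be sketched as follows. This crude box-counting estimator is our own illustration and is not the accurate computation schemes of [Kins94].

```python
import numpy as np

def renyi_dimensions(p_image, qs, scales=(2, 4, 8, 16)):
    """Rough box-counting estimate of the Renyi generalized dimensions D_q
    of a normalized 2-D measure (pixel intensities summing to 1):
    D_q is the slope of log(sum_i p_i^q)/(q-1) versus log(eps)."""
    dims = []
    for q in qs:
        logs_eps, logs_sum = [], []
        for n in scales:
            h, w = p_image.shape
            # mass captured by each (h/n x w/n) box at this scale
            p = p_image.reshape(n, h // n, n, w // n).sum(axis=(1, 3)).ravel()
            p = p[p > 0]
            logs_eps.append(np.log(1.0 / n))
            if abs(q - 1.0) < 1e-9:               # information dimension limit
                logs_sum.append(np.sum(p * np.log(p)))
            else:
                logs_sum.append(np.log(np.sum(p ** q)) / (q - 1.0))
        dims.append(np.polyfit(logs_eps, logs_sum, 1)[0])
    return np.array(dims)

def legendre_spectrum(qs, Dq):
    """Singularity spectrum f(alpha) from D_q via the Legendre transform of
    tau(q) = (q - 1) D_q: alpha = dtau/dq, f(alpha) = q*alpha - tau(q)."""
    tau = (qs - 1.0) * Dq
    alpha = np.gradient(tau, qs)
    return alpha, qs * alpha - tau
```

For a monofractal measure Dq is constant and the Legendre transform collapses f(α) to a single point, which is exactly the signature the paper uses to identify the residual image as noise-like.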
This contextual predictive scheme removes noise while preserving image predictability. The approach results in a PSNR of more than 49.9 dB at a 2.22:1 CR and preserves image perceptual quality (i.e., the original and denoised images are perceptually indistinguishable). The removed part of the original image (called the residual image) has noise characteristics, as demonstrated in Fig. 2c, which is amplified to the maximum range from 0 to 255. It is seen that the enhanced image contains no trace of the original image. We have verified experimentally that the prediction-based denoising preserves image multifractality (as measured by the Rényi generalized dimension), while high-quality lossy compression schemes such as JPEG do not. This constitutes the novelty of this paper. Figure 3a compares the Rényi generalized dimensions Dq of the original, denoised, and residual images, as well as the JPEG 1 (CR of 2.08:1) and JPEG 2 (1.87:1) images [Brad94]. The Dq plots of the original and denoised images coincide, while those of the JPEG schemes deviate at low q. Using a Legendre transform, we can also calculate the singularity spectra f(α), with similar results (see Fig. 3b). The f(α) curves of the original and denoised images also coincide, while those of the JPEG images deviate at high singularities. This indicates that the JPEG schemes
Fig. 3. Multifractal measures of the original and various denoised images (2.22:1 CR prediction, 2.08:1 CR JPEG 1, and 1.87:1 CR JPEG 2 schemes): (a) the Rényi generalized dimensions Dq, (b) spectra of singularities f(α), and (c) a zoomed-in region of the f(α), showing that while the singularity spectra of the original and denoised images coincide, those of the JPEG images deviate at high singularities α.
alter high singularity components of the original image. Figure 3c shows the discrepancy clearly in a zoomed-in plot at a high singularity region. It is important to notice that the multifractal measure is a clear indicator of the noise-like nature of the residual image, which has a single fractal dimension, as demonstrated by either the flat dashed Dq line in Fig. 3a or, alternatively, a single point on the f(α) curve in Fig. 3b. The high performance of the prediction-based denoising has prompted us to implement it in a commercial application (compressing otherwise incompressible aerial ortho images, each 25 Mbytes in size) [LaKi96a]. CONCLUSIONS Denoising of signals appears to be a very important development in signal preprocessing for compression and other feature extraction procedures. Multifractality provides a framework for denoising through a multifractal measure of denoising quality. Such a framework can cover both regular and noise-like signals. The approach has become practical through our accurate schemes to compute the Rényi generalized dimension and the spectra of singularities. This framework can be extended to other signal processing applications. REFERENCES
[Brad94] J. Bradley, XV v.3.10a (a Unix program). Available at [email protected], 1994. [CoMW92] R. R. Coifman, Y. Meyer and M. V. Wickerhauser, "Wavelet analysis and signal processing" in Wavelet and Their Applications, M. Ruskai (ed.), Boston: Jones and Bartlett, pp. 153-178, 1992. [Dono92] D. L. Donoho, "De-noising via soft-thresholding", Technical Report, Department of Statistics, Stanford University, 1992, 37p. (Available through ftp from: ftp://playfair.stanford.edu/pub/donoho) [FKPG96] M. Farge, N. Kevlahan, V. Perrier, and E. Goirand, "Wavelets and turbulence," Proceedings of the IEEE, vol. 84, no. 4, pp. 639-669, April 1996. [Kins94] W. Kinsner, "Fractal dimensions: Morphological, entropy, spectrum, and variance classes" Technical Report, DEL94-4, Department of Electrical and Computer Engineering, University of Manitoba, 146 pp, April 1994. [KoSc93] E. J. Kostelich and T. Schreiber, "Noise reduction in chaotic time-series data: A survey of common methods" Physical Review E, vol. 48, no. 3, pp. 1752-1763, September 1993. [LaKi96a] A. Langi and W. Kinsner, "Compression of aerial ortho images based on image denoising" in Proc. NASA/Industry Data Compression Workshop 1996, (Snowbird, Utah; 4 April 1996), A.B. Kiely and R.L. Renner (eds), pp. 81-90. (Available from the Jet Propulsion Laboratory, California Institute of Technology, as JPL Publication 96-11. Contact: Dr. Aaron B. Kiely, [email protected]) [LaKi96b] A. Langi and W. Kinsner, "Singularity processing of nonstationary signals" in Proc. IEEE Canadian Conf. Elect. and Comp. Eng., ISBN 0-7803-3143-5 (Calgary, Alberta; 26-29 May, 1996) pp. 687-691. [Lang96] A. Langi, "Wavelet and fractal processing of nonstationary signals" Ph.D. Thesis, Department of Electrical and Computer Engineering, University of Manitoba, 1996, 456 pp. [MaHw92] S. Mallat and W. L. Hwang, "Singularity detection and processing with wavelets" IEEE Trans. Inform. Theory, vol. 38, no. 2, pp. 617-643, 1992. [ScGr91] T. Schreiber and P. 
Grassberger, "A simple noise-reduction method for real data," Physics Letters A, vol. 160, pp. 411-418, 1991.
Application of Multirate Filter Bank to the Co-Existence Problem of DS-CDMA and TDMA Systems Shinsuke Hara, Takahiro Matsuda and Norihiko Morinaga Graduate School of Engineering, Osaka University, Osaka, Japan E-Mail: [email protected]
Abstract - In this paper, we discuss the co-existence problem of DS-CDMA and TDMA systems where both systems share the same frequency band to improve the spectral efficiency. We propose a complex multirate filter bank (CMRFB) based adaptive notch filtering technique for the DS-CDMA systems, which can observe the received signal with different frequency resolutions at the same time, and easily form the most suitable notch filter for rejecting the TDMA signal.
I. Introduction
The DS-CDMA (Direct Sequence-Code Division Multiple Access) system has the attractive feature of being able to share a frequency band with narrowband communication systems without intolerable degradation of either system's performance. A DS-CDMA overlay has been suggested to improve the
spectral efficiency as well as to share the frequency band with existing narrowband systems [1]. The spread spectrum signal causes little damage to the narrowband signal due to its low spectral profile. On the other hand, it is inherently resistant to narrowband interference, because the despreading operation has the effect of spreading the narrowband energy over a wide bandwidth. However, it has been demonstrated that the performance of a spread spectrum system in the presence of a narrowband signal can be enhanced significantly through the use of active narrowband interference suppression prior to despreading [2]. The Fast Fourier Transform (FFT) based adaptive notch filtering technique [3] first observes the received signal, composed of a desired spread spectrum signal and some undesired narrowband interference, in the frequency domain through the FFT, and then rejects the frequency band containing the interference component by forming a notch filter. Among the narrowband interference rejection techniques, this technique is attractive in terms of hardware complexity; however, it has to divide the whole received frequency band into many narrow bands with the same bandwidth. This can increase the computational cost and distort the spread spectrum signal. We do not have to observe and divide the frequency band where there is no narrowband interference. In this paper, we propose a complex multirate filter bank (CMRFB) based adaptive notch filtering technique to solve the co-existence problem of CDMA and TDMA systems. We show the principle of the CMRFB based adaptive notch filtering technique, and discuss the bit error rate (BER) performance for both CDMA and TDMA systems.
II. Complex Multirate Filter Bank
Fig. 1(a) shows a complex multirate filter bank (CMRFB) in a DS-CDMA receiver, which is composed of an analysis filter bank and a synthesis filter bank.
At the first stage, a down-converted discrete-time received signal r(n) is passed through a pair of FIR digital filters (analysis filters A0(z) and A1(z)) with frequency responses as shown in Fig. 1(e). The filtered signals can be decimated by two, because they are approximately band-limited (lowpass and highpass, respectively). The analysis filters can be used recursively at any filter output. Fig. 1(f) shows the frequency response after the fourth stage in Fig. 1(a), where we can see four types of bandpass filters with different pass bandwidths. The decimated subband signals are recombined in the corresponding synthesis filter bank, composed of expanders and synthesis filters S0(z) and S1(z). Since multirate systems have mainly been discussed with real filters [4], they can deal only with the positive frequency component of the input signal. In quasi-coherent detection systems, however, the down-converted signal processed in the baseband has positive and negative frequency components.
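A real-coefficient Haar pair can stand in for the complex filters A0, A1, S0, S1 to illustrate the filter-decimate / expand-recombine structure; the Haar choice is our assumption (the paper's filters are 32- or 12-tap complex designs).

```python
import numpy as np

def analysis(x):
    """Two-channel analysis stage: Haar lowpass/highpass followed by
    decimation by two (filter outputs are approximately band-limited)."""
    lo = (x[0::2] + x[1::2]) / np.sqrt(2)
    hi = (x[0::2] - x[1::2]) / np.sqrt(2)
    return lo, hi

def synthesis(lo, hi):
    """Expand by two and recombine so that synthesis(*analysis(x)) == x,
    i.e. the pair satisfies perfect reconstruction."""
    x = np.empty(2 * len(lo))
    x[0::2] = (lo + hi) / np.sqrt(2)
    x[1::2] = (lo - hi) / np.sqrt(2)
    return x
```

Applying `analysis` recursively to any of its outputs builds the tree of Fig. 1(a), producing subbands with different pass bandwidths, as in Fig. 1(f).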
Fig. 1 Complex multirate filter bank and adaptive notch filtering

Therefore, we need to design the multirate filter bank with complex filters. In this case, the perfect reconstruction condition is written as

S0(z) = -jA0(z),    (1)
S1(z) = jA1(z) = jA0(-z),    (2)

where A0(z), A1(z), S0(z) and S1(z) are the frequency responses in terms of the z-transform.
III. Adaptive Notch Filtering Technique
When the received signal is composed of (wideband) DS-CDMA and (narrowband) TDMA signals as shown in Fig. 1(b), the hatched filter output in Fig. 1(a) contains mainly the (undesired) TDMA signal component. Therefore, by setting the corresponding synthesis filter input to zero in the synthesis filter bank, we can easily reject the narrowband interference (the TDMA signal) (see Fig. 1(c)). The CMRFB based notch filtering technique does not divide the frequency band where there is no narrowband interference, and it can easily form the most suitable notch filter for rejecting the narrowband interference. This results in less distortion of the wideband DS-CDMA signal and a lower computational cost in forming the adaptive notch filter (it also saves mobile battery energy). The CMRFB technique is applicable to the DS-CDMA receiver in both the base station and the mobile terminal, and furthermore, it is also effective for the TDMA group demodulator in the base station. Because of the phase linearity of the complex multirate filter bank, we can directly use the rejected analysis filter output to demodulate the (phase-modulated) TDMA signal (see Fig. 1(d)). When 1 DS-CDMA system and N frequency-multiplexed TDMA systems share the same frequency band, in order to support all the systems we usually require 1 base station for the DS-CDMA system and N base stations for the TDMA systems, each of which can handle a single multiplexed signal.
However, employing the complex multirate filter bank based technique, we can integrate the N+1 base stations into one intelligent base station which can handle both of the multiplexed signals simultaneously. This system can be a solution to the co-existence problem of DS-CDMA and TDMA systems.
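The notch operation itself — split the band, zero the subband holding the interference, resynthesize — can be sketched with the same kind of real Haar tree. This is a stand-in for the CMRFB: the full-tree split and the leaf indexing are our simplifications.

```python
import numpy as np

def notch_via_subband(x, depth, target):
    """Split the band `depth` times into 2**depth leaf subbands, zero the
    leaf with index `target` (the one holding the narrowband interference),
    and resynthesize. With `target` outside 0..2**depth-1, nothing is
    zeroed and the input is perfectly reconstructed."""
    def analysis(v):
        return (v[0::2] + v[1::2]) / np.sqrt(2), (v[0::2] - v[1::2]) / np.sqrt(2)

    def synthesis(lo, hi):
        v = np.empty(2 * len(lo))
        v[0::2] = (lo + hi) / np.sqrt(2)
        v[1::2] = (lo - hi) / np.sqrt(2)
        return v

    def recurse(v, d, idx):
        if d == 0:
            return np.zeros_like(v) if idx == target else v
        lo, hi = analysis(v)
        return synthesis(recurse(lo, d - 1, 2 * idx),
                         recurse(hi, d - 1, 2 * idx + 1))

    return recurse(x, depth, 0)
```

Zeroing one of the 2^depth leaves removes a band of width 1/2^depth of the input bandwidth, mirroring the 1/2^K notch obtainable from a K-stage CMRFB discussed below.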
IV. Numerical Results and Discussions
A. System Model
Fig. 2 shows the co-existence problem of CDMA and TDMA systems discussed in this paper. The TDMA system is based on a QPSK/coherent demodulation scheme, and a root Nyquist filter with a roll-off factor of 0.5 is used for baseband pulse shaping in the transmitter and receiver. The same modulation/demodulation scheme and Nyquist filter are used in the CDMA system, and Gold codes with a processing gain of 31 are used for spectrum spreading. The complex multirate filter bank in the CDMA receiver is constructed with a polyphase-implemented [4] 32-tap or 12-tap complex filter obtained by modifying the real filters in [5]. Fig. 1(e) shows the frequency response of the 32-tap complex filter. We assume an additive white Gaussian noise (AWGN) channel, and define E(C/T) and E(T/C) as the CDMA-to-TDMA and TDMA-to-CDMA signal energy ratios, respectively, and B(C/T) and B(T/C) as the CDMA-to-TDMA and TDMA-to-CDMA bandwidth ratios, respectively.

Fig. 2 Co-existence of CDMA and TDMA systems
B. Bit Error Rate of CDMA System with TDMA Signal
Fig. 2 shows the power spectrum of the received signal composed of some CDMA components and 1 TDMA component. The center frequency of the TDMA signal is located at 27/128 Hz, which corresponds to that of a notch filter formed by the 6-stage CMRFB. Therefore, the 6-stage CMRFB can perfectly reject the TDMA signal with B(C/T)=64. Fig. 3 shows the BER of the CDMA system for E(C/T)=-5dB. Without the notch filtering, the BERs are almost the same for different values of B(C/T). This means that the BER depends not on B(C/T) but on E(C/T). The notch filtering drastically improves the BER. The 4-stage CMRFB can perfectly reject the TDMA signal with B(C/T)=16, the 5-stage CMRFB the TDMA signal with B(C/T)=32, and the 6-stage CMRFB the TDMA signal with B(C/T)=64.

Fig. 3 Bit error rate of CDMA system with 1 TDMA signal
Note that it is desirable to form a notch filter as narrow as possible, yet wide enough to reject the narrowband interference, because the notch filter rejects a part of the energy of the CDMA signal as well as the narrowband interference (the loss of energy is proportional to the bandwidth of the notch filter). When we use a K-stage CMRFB, we can form notch filters with bandwidths as narrow as 1/2^K of the received frequency bandwidth.

Fig. 4 Bit error rate of CDMA system without notch filtering

Therefore, the BER
improves as the number of stages increases. Figs. 4 and 5 show the BERs of the CDMA system for 1, 4, 8 and 16 users without and with the notch filtering for 1 TDMA signal, respectively, where we assume B(C/T)=64 and E(C/T)=-5dB. When there is 1 TDMA signal in the received frequency band, without the notch filtering, the BER severely degrades as the number of CDMA users increases. On the other hand, with the notch filtering, the BER performance can be improved.

Fig. 5 Bit error rate of CDMA system with notch filtering

C. Bit Error Rate of TDMA System with CDMA Signal
Fig. 6 shows the BER of the TDMA system when the received signal is composed of 1 TDMA component and 1 CDMA component (see Fig. 3(b)). As E(T/C) decreases, the BER degrades. The simulation result for the energy penalty agrees well with the result calculated using a Gaussian approximation for the CDMA signal. This means that, from the viewpoint of the TDMA system, we can indeed treat the CDMA signal as Gaussian noise.

V. Conclusions
In this paper, we have discussed the co-existence problem of CDMA and TDMA systems, and proposed a complex multirate filter bank (CMRFB) based adaptive notch filtering technique for the CDMA system. We have shown the principle of the CMRFB based adaptive notch filtering technique, and discussed the bit error rate performance for both CDMA and TDMA systems with and without the proposed technique. The CMRFB based technique can observe the received signal, composed of a desired wideband signal and an undesired narrowband interference, with different frequency resolutions at the same time, and easily form the most suitable notch filter for rejecting the interference.

Fig. 6 Bit error rate of TDMA system with 1 CDMA signal

References
[1] L. B. Milstein et al., "On the Feasibility of a CDMA Overlay for Personal Communications Networks," IEEE Jour. on Sel. Areas in Commun., vol. 10, pp. 655-668, May 1992. [2] H. V. Poor and L. A. Rusch, "Narrowband Interference Suppression in Spread Spectrum CDMA," IEEE Personal Communications, vol. 1, no. 3, pp. 14-27, Third Quarter 1994. [3] L. B. Milstein, "Interference Rejection Techniques in Spread Spectrum Communications," Proc. of the IEEE, vol. 76, pp. 657-671, June 1988. [4] P. P. Vaidyanathan, Multirate Systems and Filter Banks, Prentice-Hall, 1993. [5] V. K. Jain and R. E. Crochiere, "Quadrature Mirror Filter Design in the Time Domain," IEEE Trans. on Acoust. Speech Signal Proc., vol. 32, pp. 353-361, Apr. 1984.
Session N: EDGE DETECTION
Multiscale Edges Detection by Wavelet Transform for Model of Face Recognition *Fan YANG, *Michel PAINDAVOINE, **Hervé ABDI *University of Burgundy, LIESIB, 6 Boulevard Gabriel, 21000 DIJON, FRANCE, email: [email protected] **University of Texas, U.S.A.
Abstract One way to store and recall face images uses the linear auto-associative memory. This connectionist model is used in conjunction with a pixel-based coding of the faces. Image processing using the Wavelet transform can be applied to multiscale edge detection. In this paper, we describe a learning technique for the auto-associator based on the Wavelet transform; a 17% improvement in face recognition performance has been obtained in comparison with the standard learning.
1 Introduction
As noted, the linear auto-associator is a particular case of the linear associator. The goal of this network is to associate a set of stimuli with itself; it can be used to store and retrieve face images, and it can also be applied as a pre-processing device to simulate some psychological tasks, such as categorizing faces according to their gender [1]. The auto-associator functions as a pattern recognition and pattern completion device in that it is able to reconstruct learned patterns when noisy or incomplete versions of the learned input patterns are used as "stimuli". A learning technique based on the Wavelet transform can improve recognition capability when the pattern images are very noisy. In the second part, the basic features of the classical auto-associative memory are briefly described. In the third part, we propose a learning technique for the auto-associator using the multiscale edges of face images, and a comparison is made between the results of different edge detection operators. The experimental results concerning face recognition of different types are presented in the fourth part.
2 Model description
First, the faces to be stored are coded as vectors of pixel intensities, digitizing each face to form a pixel image and concatenating the rows of the image to form an I*1 vector Xk. Each element of Xk represents the gray level of the corresponding pixel. Then, each element of the face vector Xk is used as input to a cell of the auto-associative memory. The number of cells of the memory is equal to the dimension of the vector Xk. Each cell in the memory is connected to every other cell. The output of a given cell for a given face is simply the sum of its inputs weighted by the connection strengths between itself and all of the other cells. The intensity of the connections is represented by an I*I matrix W. In order to improve the performance of the auto-associator, the Widrow-Hoff learning rule is used, which corrects the difference between the response of the system and the expected response by iteratively changing the weights in W as follows:

W(t+1) = W(t) + η(Xk - W(t)Xk)Xk^T

where η is a small learning constant and k is randomly chosen. The Widrow-Hoff learning rule can be analyzed in terms of the eigenvectors and eigenvalues of the matrix of stimuli X (the set of K faces) [2]:
W(t) = P{I - (I - ηΛ)^t}P^T

with: Λ: diagonal matrix of the eigenvalues of XX^T; P: matrix of the eigenvectors of XX^T.

With η smaller than 2λmax^(-1) (λmax being the largest eigenvalue), this procedure converges toward:
W(∞) = PP^T. The notation in terms of eigenvectors and eigenvalues makes it possible to work with matrices of small dimension. Thus, the matrix W of dimension I*I can be computed as W = PP^T, with the matrix P of dimension I*L (L being the number of eigenvectors with a non-zero eigenvalue, L ≤ min{I, K}). For example, we have used an auto-associator for face recognition in which I is equal to 33975 and L is equal to 40 or 200.
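A toy-sized sketch of both results above — the iterative Widrow-Hoff rule and its closed-form limit W(∞) = PP^T — with assumed dimensions (I = 16, K = 3) far smaller than the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)
I, K = 16, 3                          # assumed toy sizes, not the paper's I=33975
X = rng.standard_normal((I, K))       # K stored "face" vectors as columns

# Widrow-Hoff updates: W <- W + eta * (x_k - W x_k) x_k^T, k chosen at random
W = np.zeros((I, I))
eta = 0.02                            # must satisfy eta < 2 / lambda_max
for _ in range(2000):
    x = X[:, rng.integers(K)]
    W += eta * np.outer(x - W @ x, x)

# Closed-form limit W(inf) = P P^T, with P the eigenvectors of X X^T having
# nonzero eigenvalue, obtained here from the thin SVD of X
U, s, _ = np.linalg.svd(X, full_matrices=False)
P = U[:, s > 1e-10]
W_inf = P @ P.T                       # projector onto the span of the faces
```

Computing P from the K-column matrix X rather than from the I*I matrix XX^T is what makes the paper's I = 33975 case tractable with only L = 40 or 200 eigenvectors.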
3 New technique of learning using the multiscale edges
The standard learning for the auto-associative memory consists in presenting a series of face images to the input of the model as stored patterns. The auto-associator trained with this method does not give satisfactory results in the case of noisier stimuli. The contour gives the first strong impression for recognition [3]. We have therefore introduced the edges of the face images into the auto-associator during learning. In the domain of image processing, many algorithms have been proposed to extract edges; they come in two classes: gradient operators and optimal detectors. The Sobel operator uses a [3*3] mask, which gives satisfactory results for images without noise. The Canny-Deriche filter is an optimal detector whose implementation can be realized in a second-order recursive form. The Wavelet transform allows the detection of multiscale edges and is used to detect all the details in an image by modifying the scale. We choose here the optimized Canny-Deriche filter (third-order recursive) as the Wavelet function for edge detection:

f(x) = ksx e^(-ms|x|) + e^(-ms|x|) - e^(-s|x|)

with k = 0.564 and m = 0.215. A method allowing a direct implementation of the Wavelet transform has been applied, using a convolution between the image and the edge detection filter for different scales (s = 2^j) to obtain the edge images [4]. During the learning of the auto-associator, for each face, a pre-processing step has been applied to extract the edges of the face image. Then, not only the face image but also the edge images have been presented to the input of the auto-associator as patterns. Fig. 1 displays the responses of the memories trained in the different ways. The top panels present: 1a) a stimulus corrupted with additive random noise, 1b) the response of the model trained only with the face images, and 1c) the desired response. The bottom panels show: 1d) the response of the model trained with the addition of edge images from the Sobel operator, 1e) the response of the model trained with the addition of edge images from the Canny-Deriche filter, and 1f) the response of the model trained with the addition of multiscale edge images from the Wavelet transform (scales s = 1, 2, 4, 8).
Figure 1: Response of the models
Figure 2: Correlation of the models
Clearly, the standard method gives poor results for this noisy stimulus. Using the edge images detected with the different techniques, from the Sobel operator to the wavelet transform, the quality of recognition improves gradually. This quality can be measured by computing the cosine (correlation) of the angle between the vector Ok (response of the model) and Tk (desired response). Fig. 2 shows the correlations of the auto-associators trained in the different manners.
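The correlation measure used in Fig. 2 is straightforward to compute; a small sketch (the function name is ours):

```python
import numpy as np

def recognition_quality(response, target):
    """Cosine of the angle between the model response Ok and the
    desired response Tk, used as the recognition-quality measure."""
    o = np.ravel(response).astype(float)
    t = np.ravel(target).astype(float)
    return float(np.dot(o, t) / (np.linalg.norm(o) * np.linalg.norm(t)))

clean = np.array([1.0, 2.0, 3.0, 4.0])
noisy = clean + np.array([0.3, -0.2, 0.1, -0.4])
print(recognition_quality(clean, clean))  # 1.0 for a perfect response
print(recognition_quality(noisy, clean))  # below 1.0 for a noisy response
```

Being a cosine, the measure is invariant to the overall intensity scale of the response, so it compares pattern shape rather than amplitude.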
Experimental results
We have applied this new learning technique using multiscale edge images to store a set of 40 Caucasian faces (20 males and 20 females). Fig. 3 displays the responses of two memories, one trained with the standard learning and the other trained with the new wavelet-transform technique. The stimuli are corrupted by additive Gaussian noise, (from left to right) 1) signal-to-noise ratio SNR = 1, 2) SNR = 3/5, 3) SNR = 3/8, and 4) SNR = 3/13.
Figure 3: The top panels show 4 stimuli, the middle panels the responses produced by the auto-associator trained with the new learning technique, and the bottom panels the responses of the auto-associator trained with the standard learning.
Figure 4: Stimuli and responses of the models.
Fig. 4 shows the results of these two memories for new faces (from top to bottom): 1) a new face similar to the set of learned faces (a Caucasian face), and 2) a new face different from the set of learned faces (a Japanese face). The auto-associator trained with the standard learning is not able to give distinguishable responses. Better results are obtained with the model trained with the new technique. Fig. 5 displays the mean correlation functions of these two memories: (5a) with 10 Caucasian faces whose noise-free versions were learned, (5b) with 10 new faces similar to the learned faces, and (5c) with 10 new Japanese faces (- - new technique, — standard method).
Figure 5: Mean correlation functions (panels 5a, 5b and 5c; x-axis: noise magnitude, 0-100; y-axis: correlation, approximately 0.55-0.95).
5 Conclusion
We have proposed a learning technique based on the wavelet transform for auto-associative memories which improves the performance of face recognition when the stimuli are noisy. The greater the noise, the greater the improvement: a 17% improvement of the correlation in comparison with the standard learning was obtained for the noisiest faces. Considering the amount of computation required, we will implement this auto-associator in parallel on several DSPs (TMS320C40). We also hope to apply this technique to other applications such as character recognition.
References
[1] D. Valentin, H. Abdi and A.J. O'Toole, "Categorization and identification of human face images by neural networks: A review of the linear autoassociative and principal component approaches," Journal of Biological Systems, 2, 1994.
[2] H. Abdi, Les Réseaux de neurones, Presses Universitaires de Grenoble, Grenoble, 1994.
[3] X. Jia and M.S. Nixon, "Extending the feature vector for automatic face recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 17, December 1995.
[4] S. Mallat and S. Zhong, "Characterization of signals from multiscale edges," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 14, July 1992.
Edge Detection by Rank Functional Approximation of Grey Levels
J.P. ASSELIN de BEAUVILLE, D. BI, F.Z. KETTAF
Laboratoire d'Informatique - Université de Tours, E.3.I. - Ecole d'Ingénieurs en Informatique pour l'Industrie, 64 avenue Jean Portalis, Technopôle Boîte N°4, 37913 Tours Cedex 9 - France. E-mail: asselin@univ-tours.fr
Abstract: In this paper, a new method of edge detection based on rank functional approximation is proposed. This approach regards the edge as a local discontinuity of the grey levels, and this discontinuity is extracted by approximating the local grey levels with a linear rank function. The proposed method is robust against noise and can adapt to many edge models (step edge, ramp edge, roof edge, ...). In addition, a new method for selecting the edge position is also proposed, which reduces the thickness of the detected edge to only 1 pixel.
Key words: Edge detection, image analysis, pattern recognition, rank statistics, median filter.
I. Introduction
In light intensity images, edges are usually regarded as discontinuities of the grey levels, and edge detection is often implemented in two steps. The first step extracts the discontinuities of the grey levels and the second step thresholds the amplitude of the discontinuities so as to decide the correct edge position. In traditional methods, the discontinuity is extracted by differentiating the grey levels in certain directions. Some examples of these methods are the Sobel gradient, the Prewitt gradient, etc. These gradients are easily calculated, but they are too sensitive to noise and their responses are not the same for different edge directions. In addition, these methods do not consider the choice of the threshold. Departing from the traditional methods, Marr and Hildreth [1] proposed the zero-crossing of the second derivative of a Gaussian filter. This operator can precisely detect the edge at different scales and can minimize the errors of the edge positions in both the spatial and frequency domains. In this method, the image is first smoothed by a Gaussian filter with a given scale; then the second derivative of the Gaussian filter is used to find the position of the edge from the zero-crossing output of the filter.
The threshold method proposed by Marr and Hildreth consists in accumulating the detected edge positions at different scales. This method is robust against noise, but it often produces false edges, especially at the corners of objects. Canny [2] first formulated three criteria, leading to many new mathematical schemes, such as the Deriche scheme [3], the Shen scheme [4], the Kittler scheme [5], etc. These approaches regard the discontinuities as different profile models, such as the ideal step edge model, the ramp edge model, the roof edge model, etc. The operators for detecting edges are obtained by optimizing the three criteria of Canny for the different edge models. Edge detection is implemented by first filtering the image and then detecting the discontinuities with the derivative of the operators; the edge position is decided by the nonmaximum suppression and hysteresis thresholding proposed by Canny. The mathematical schemes often give better results, because they combine the advantage of multiscale edge detection, like that of Marr and Hildreth, with precise edge positions owing to nonmaximum suppression and hysteresis thresholding. The problems with these methods are that they require too many calculations and that they consider the edge models in only one dimension. Another type of edge detection approach is functional approximation, such as the functional approximation in two directions proposed by T. Pavlidis [6], the facet model functional approximation proposed by Haralick [7], the surface functional fitting method proposed by Nalwa [8], and full-plane functional fitting methods such as that of Zhou [9]. In this type of method, the edge is regarded as the discontinuity of a surface. With this conception, the edges correspond to special distributions of grey levels in two dimensions. Owing to the approximation of the surface with a function, these methods are robust against noise and can adapt to different edge models.
Considering that all the preceding functional approximation methods use a two-dimensional function or a one-dimensional function in two or more directions [6], their calculations are complicated. For this reason, we have proposed a new approach which uses only a one-dimensional linear function. In our method, we first choose a window of the desired size; next we arrange the pixels in increasing order according to their grey levels; then we use a one-dimensional linear function to approximate the distribution of the rank-ordered grey levels; and finally we decide the position of the edge by using a local and a global threshold. In a wxw (= K) window, if we arrange the pixels in increasing order according to their grey levels, we obtain a distribution of the type shown in Figure 1, where K is the number of pixels in the window. The rank of each pixel is obtained by ordering the grey levels: rank 1 corresponds to the pixel with the minimum grey level and rank K to the pixel with the maximum grey level. A linear function with two parameters can be used to fit the distribution of the grey levels by their ranks. It is evident that the slope of the straight line represents the rate of change of the grey levels in the window. Intuitively, if there is an edge in the window, the changes of intensity will be significant and the slope will be large. So the slope can represent the discontinuity of the grey levels in the window. One advantage of this method is that it reduces the two-dimensional problem to one dimension. Within the considered window, the profile of the edge may be a step edge, a ramp edge, or a line edge, but the distribution of the rank-ordered grey levels is always increasing. For all edge
types, the slope of the function is invariant to the edge direction; this is another advantage of the rank functional approximation. Because the discontinuity detected by this method is invariant to the edge direction, it can be regarded as an isotropic edge detection operator. The edge position is selected by thresholding the discontinuities. The threshold method proposed in this paper consists of two parts: the first part is a local threshold, calculated by using edge geometries in a 3x3 window; this threshold gives a very thin edge (one pixel wide). The second part is a global threshold, chosen empirically, which controls the number of edges to be detected. In the next section, we give the description of the functional approximation. Then we describe the thresholding method and the implementation of the algorithm in sections III and IV respectively. For visual comparison, the results of the proposed algorithm and those of Canny's and Deriche's methods on the same images are given in section V.
II. Rank functional approximation
In the literature, rank-ordered grey levels are often used to filter an image, as in the median filter, the Wilcoxon filter, etc., but for image analysis only little research has been done with rank-ordered grey levels. Zamperoni [10] uses the difference between the distributions of rank-ordered grey levels of two regions to detect the edges of textured images. Bovik [12] also used rank-ordered grey levels to filter images and to detect edges, but he used order statistics. These methods model the edge as a step edge and detect it by calculating the differences of the ordered grey levels between two regions in a window. Because the relative positions of the two regions in the window may be horizontal, vertical, or diagonal, this leads to many masks to be considered and to a large amount of calculation.
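The direction-invariance noted above is easy to check numerically: rank-ordering a window's grey levels discards the spatial arrangement, so any statistic of the ordered values is the same for all edge orientations. A small sketch with illustrative values:

```python
import numpy as np

def ranked(window):
    """Grey levels of a window sorted into increasing order (ranks 1..K)."""
    return np.sort(window, axis=None)

# A step edge between grey levels 1 and 10, in two orientations.
horizontal = np.array([[1, 1, 1],
                       [10, 10, 10],
                       [10, 10, 10]])
vertical = horizontal.T

# Rank ordering yields the same distribution for both orientations, so
# any function fitted to the ranked values is direction-invariant.
print(np.array_equal(ranked(horizontal), ranked(vertical)))  # True
```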
Unlike them, Kim [11] proposed a method to detect edges by subtracting the minimum rank-ordered grey level from the maximum rank-ordered grey level in a K-neighbourhood. This method is very simple but very sensitive to noise, because it does not consider the contributions of the non-extremal rank-ordered grey levels, and it does not discuss how to decide the position of the edge. As noted in the first section, we arrange the pixels in the window in increasing order according to their grey levels; this projects all the edge models onto a single one-dimensional rank-ordered grey-level distribution. So in our method, we only need to consider one mask. By using functional approximation to detect the discontinuities of the rank-ordered grey levels, we obtain an algorithm that is robust against noise. Supposing that the size of the window is wxw = K pixels, the grey levels of the K pixels in the window are y(i), i = 1...K, where i is the pixel number. After arranging the K pixels in increasing order, we get a vector Y = (y_1, y_2, ..., y_K)^T such that y_1 ≤ y_2 ≤ ... ≤ y_K.
Replacing Y by aR + bI, where R = (1, 2, ..., K)^T and I = (1, 1, ..., 1)^T, and setting the derivatives of the squared fitting error to zero:

∂[(aR + bI - Y)^T (aR + bI - Y)]/∂a = 0
∂[(aR + bI - Y)^T (aR + bI - Y)]/∂b = 0

we obtain:

a = [ Σ_{i=1..K} (i - (K+1)/2) y_i ] / [ (K³ - K)/12 ]    (2)

b = (1/K) Σ_{i=1..K} y_i - a (K+1)/2    (3)

Because the denominator in (2) depends only on the size of the window, we may rewrite a as

a = a'/C,  where  a' = Σ_{i=1..K} (i - (K+1)/2) y_i = Σ_{i=1..K} C_i y_i  and  C = (K³ - K)/12.

The values of C_i = i - (K+1)/2 are symmetric around i = (K+1)/2, so a' is a function of the differences of the symmetrically ordered grey levels on the two sides of the median rank (K+1)/2. This means that a' is a good operator to detect the edge in every direction in the window. Figure 2 shows a step edge with different directions in a 3x3 window; the value of a' is invariant to the direction of the edge. From (3), we can see that b is equal to the mean of the grey levels in the window minus [(K+1)/2]a. This means that b can smooth the noise (by averaging the grey levels in the window) while preserving the edge (by subtracting [(K+1)/2]a). In order to exploit this property of b, the variance of b, denoted Var(b), is selected owing
to its isotropic property and its sensitivity to the discontinuity of the grey levels. Finally, we use the product of a and Var(b) as our isotropic edge detection operator.
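The closed-form fit a = a'/C, b = mean(y) - a(K+1)/2 can be checked against a generic least-squares line fit; a sketch (function names are ours):

```python
import numpy as np

def rank_fit(window):
    """Fit y_i ~ a*i + b to the rank-ordered grey levels of a window,
    using the closed forms a = a'/C and b = mean(y) - a*(K+1)/2."""
    y = np.sort(window, axis=None).astype(float)
    K = y.size
    i = np.arange(1, K + 1)
    a_prime = np.sum((i - (K + 1) / 2.0) * y)   # a' = sum of C_i * y_i
    C = (K**3 - K) / 12.0
    a = a_prime / C
    b = y.mean() - a * (K + 1) / 2.0
    return a, b

window = np.array([[3, 7, 7], [3, 3, 7], [7, 7, 3]])
a, b = rank_fit(window)
# The closed form agrees with an ordinary least-squares line fit:
slope, intercept = np.polyfit(np.arange(1, 10), np.sort(window, axis=None), 1)
print(round(a - slope, 10), round(b - intercept, 10))  # 0.0 0.0
```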
[Figure 2 content: step edges between a black and a white region in different orientations, together with the corresponding distributions of rank-ordered grey levels. In the K = 9 example, a' is the same for all edge directions: a' = 4*(10-1) + 3*(10-1) + 2*(10-1) + 1*(10-1).]
Figure 2. Illustration of the edge with different directions
III. The thresholding
We suppose that the ideal output of the isotropic edge detection operator (a*Var(b)) produces an image which consists of roof edges. The location of an edge is the roof position in this image. For each pixel, we select a 3x3 window around it; the basic edge geometries in the window are shown in Figure 3, and the other edge geometries can be obtained by rotating the basic geometries in steps of 45°. In all cases, only 3 pixels are located at the roof of the edge image, and these 3 pixels (in the edge position) have to be the 3 largest discontinuities in the window. To decide whether a pixel is in an edge position, we compare its discontinuity with the 3 largest discontinuities in the window. If the discontinuity of the pixel under consideration (the centre pixel) is greater than or equal to the third largest discontinuity in the window, it is regarded as an edge pixel; if not, it is eliminated. This is similar to the nonmaximum suppression method, but it is very easily implemented. To control the number of edges, we also use a global threshold, chosen empirically; a pixel is finally chosen as an edge pixel if its discontinuity is greater than both the local and the global thresholds.
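The local-threshold decision can be sketched as follows (the ≥ comparison with the third largest value is our reading of the text):

```python
import numpy as np

def is_edge_pixel(disc_window, global_threshold):
    """Keep the centre pixel of a 3x3 discontinuity window only if its
    discontinuity is among the 3 largest in the window (the roof of the
    edge) and exceeds the empirically chosen global threshold."""
    d = np.asarray(disc_window, dtype=float)
    centre = d[1, 1]
    third_max = np.sort(d, axis=None)[-3]   # 3rd largest value in the window
    return bool(centre >= third_max and centre > global_threshold)

w = np.array([[0.1, 0.9, 0.2],
              [0.2, 0.8, 0.1],
              [0.1, 0.7, 0.1]])
print(is_edge_pixel(w, global_threshold=0.5))   # True: centre is 2nd largest
print(is_edge_pixel(w, global_threshold=0.85))  # False: below global threshold
```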
Figure 1. Rank ordered grey levels
Figure 3. The edge geometries in a 3x3 window
IV. Implementation of the algorithm
Our method is a parallel edge detection approach. It is implemented in two steps. The first step calculates the parameters (a, b) of the rank function for each pixel in a given sized window (K--wxw). K corresponds to what scale of the edge we want to detect. The second step calculates the discontinuity in each pixel position by multipling a with the variance of b. The last step locates the edges with the help of the two thresholds. The algorithm is given as follows: Step 0 Initialization
Step 1 Step 2
Step 3
V. Results
image size => Number of lines, Number of columns. size of window for calculating a, b, a*Var(b), => (K=wxw). size of window for calculating local thresthold => (Ka=sxs). percent of the discontinuities average Sp. Calculation of aij, bij, for all the pixels (i,j) in the image
Calculationof discontinuity: @ Calculate Var(bij) for all pixels in the image with a wxw window (l) Calculate and record aiflVar(bij) for all pixels in the image Calculate the average(noted by E) of aiflVar(bij) for all pixels. | Calculate the global threshold SAb(=Sp*E). Localization of the edge: For i = 1 to NbLine do For j= 1 to NbColumn do 9 Find the third maximum discontinuities Sr in a Ka window. (~ Ifaij*Var(bij) > Sc and aij*Var(bij) >Sabthen rij=255. else rij=0; Endlf Record edge information ri,i. End do End do
Figure 4 shows two real scene images of 256x256 pixels with 256 grey levels. The first image mainly contains step edges and ramp edges, whereas the second contains step and roof ones. Figure 5 shows the results of our algorithm and those of Canny's and Deriche's methods for visual comparison. Note that our algorithm gives thinner edges (for example, the edges of the woman's arms) and also straighter lines (for example, the borders of the table in the office). The algorithm is implemented in the C language on a SUN SPARCstation. For wxw = 3x3, the calculation time is 15 seconds per image. For wxw = 5x5, the calculation time is 50 seconds.
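As a rough illustration of the whole pipeline (steps 0-3 of section IV), here is a compact, unoptimised Python sketch; the parameter names and the border handling (border pixels are simply skipped) are our choices, not the paper's:

```python
import numpy as np

def rank_edge_detect(image, w=3, s=3, Sp=1.0):
    """Sketch of the algorithm: rank-fit (a, b) per pixel, use a*Var(b)
    as the discontinuity, then apply the local (3rd-max in an sxs
    window) and global (Sp * mean discontinuity) thresholds."""
    img = np.asarray(image, dtype=float)
    n, m = img.shape
    h, K = w // 2, w * w
    ranks = np.arange(1, K + 1)
    C = (K**3 - K) / 12.0
    a = np.zeros_like(img)
    b = np.zeros_like(img)
    for r in range(h, n - h):                    # Step 1: a_ij, b_ij
        for c in range(h, m - h):
            y = np.sort(img[r-h:r+h+1, c-h:c+h+1], axis=None)
            a[r, c] = np.sum((ranks - (K + 1) / 2.0) * y) / C
            b[r, c] = y.mean() - a[r, c] * (K + 1) / 2.0
    disc = np.zeros_like(img)
    for r in range(h, n - h):                    # Step 2: a * Var(b)
        for c in range(h, m - h):
            disc[r, c] = a[r, c] * np.var(b[r-h:r+h+1, c-h:c+h+1])
    S_ab = Sp * disc.mean()                      # global threshold
    hs = s // 2
    edges = np.zeros_like(img, dtype=np.uint8)
    for r in range(hs, n - hs):                  # Step 3: localization
        for c in range(hs, m - hs):
            local = np.sort(disc[r-hs:r+hs+1, c-hs:c+hs+1], axis=None)
            if disc[r, c] >= local[-3] and disc[r, c] > S_ab:
                edges[r, c] = 255
    return edges

# A vertical step edge: detected pixels cluster on the transition columns.
img = np.zeros((12, 12))
img[:, 6:] = 100.0
out = rank_edge_detect(img)
print(sorted(set(int(c) for c in np.nonzero(out)[1])))  # edge columns near the step
```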
Figure 4. Original images
Figure 5. Results of edge detection References
[1] D. MARR and E. HILDRETH, "Theory of edge detection," Proc. Roy. Soc. London, 1980, pp. 187-207.
[2] J.F. CANNY, "A computational approach to edge detection," IEEE Trans. PAMI 8, 1986, pp. 679-698.
[3] R. DERICHE, "Using Canny's criteria to derive an optimal edge detector recursively implemented," Int. J. Comput. Vision, 1987.
[4] Jun SHEN and Serge CASTAN, "An optimal linear operator for step edge detection," Computer Vision, Graphics and Image Processing, Vol. 54, No. 1, 1992, pp. 112-133.
[5] M. PETROU and J. KITTLER, "Optimal edge detectors for ramp edges," IEEE Trans. PAMI 13, 1991, pp. 483-491.
[6] T. PAVLIDIS, "Segmentation of pictures and maps through functional approximation," Computer Graphics and Image Processing, 1972, pp. 360-372.
[7] R.M. HARALICK, "Digital step edges from zero crossing of second directional derivatives," IEEE Trans. PAMI 6, 1984, pp. 58-68.
[8] V.S. NALWA and T.O. BINFORD, "On detecting edges," IEEE Trans. PAMI 8, 1986, pp. 699-714.
[9] Y.T. ZHOU, V. VENKATESWAR and R. CHELLAPPA, "Edge detection and feature extraction using a 2-D random field model," IEEE Trans. PAMI 11, 1989, pp. 84-95.
[10] P. ZAMPERONI, "Feature extraction by rank-vector filtering for image segmentation," Int. Journal of Pattern Recognition and Artificial Intelligence, Vol. 2, 1988, pp. 301-319.
[11] W. KIM and L. YAROSLAVSKII, "Rank algorithms for picture processing," Computer Vision, Graphics and Image Processing, 35, 1986, pp. 234-258.
[12] A.C. BOVIK, T.S. HUANG and D.C. MUNSON, "Edge-sensitive image restoration using order-constrained least squares methods," IEEE Trans. ASSP 33, 1985, pp. 1253-1263.
Fuzzy Logic Edge Detection Algorithm
Sakari Murtovaara 1), Esko Juuso 1) and Raimo Sutinen 2)
1) Control Engineering Laboratory, University of Oulu, Linnanmaa, FIN-90570 Oulu, Finland. Phone: +358 81 553 1011, Fax: +358 81 553 2304, E-mail: {sakari.murtovaara|esko.juuso}@oulu.fi
2) ABB Industry Oy, Tyrnäväntie 14, FIN-90400 Oulu, Finland. Phone: +358 81 374 555, Fax: +358 81 374 486
Abstract
In this project, fuzzy logic is applied to edge detection. The performance of a recovery boiler is strongly affected by the geometry of the char bed, and therefore the operation of the boiler can be improved if this geometry is known. With infrared fire-room cameras, not only can the bed be displayed to the operator, but from the image it is also possible to calculate the geometry parameters of the char bed once the edges of the bed have been detected. The system utilises the information coming from the recovery boiler. The image processing analysis tries to find the contour of the bed. The image of the contour may contain pseudo pixels and gaps; e.g. caked liquor solids on the walls may cause erroneous pixels to appear in the contour. The aim of this project is to further improve the edge detection and the image processing. A new algorithm searching for the contour of the char bed has been developed. The present algorithm is based on membership functions of the contour obtained from history data; it filters out fast changes of the contour. The extended algorithm takes the neighbouring points into account by examining the new contour: if the distance between the forecasted pixel and the pixel obtained from image processing exceeds tunable limits, the algorithm removes these pixels. This further improves the efficiency of the algorithm and gives more accuracy. This project is part of the national technology programme financed by TEKES (Adaptive and Intelligent Systems Applications) and is done in co-operation with ABB Industry Oy.
Keywords: Fuzzy logic, Image processing, Edge detection and Recovery boiler.
Introduction
The behaviour of the char bed in a recovery boiler is extremely difficult to monitor using conventional instrumentation. The char bed height depends on operating variables such as liquor temperature and the primary/secondary air ratio, as well as air pressure. Digital image processing offers techniques to expand and improve the supervision and control of the burning. [1] The shape and position of the char bed in the recovery boiler, as well as the temperature distribution of the bed, are important control objects when boiler efficiency is to be maximised and emissions minimised. Visibility in the visible light region is limited. By using infrared fire-room cameras, the char bed can be displayed to the operator. The effect of changing operating variables (liquor temperatures, air pressure, air flow, etc.) can be seen on the monitor. The effects of any plugging of the liquor nozzles and slagging of the air ports can also be detected. A camera gives the most immediate information about the burning process, and a clear image can help the operator to identify the beginning of transients much earlier than by other means. [2, 3] In this paper, we discuss a new edge detection algorithm which will further improve the recognition of the char bed. By using fuzzy logic we can simplify this process and increase flexibility in the supervisory control of the burning process.
Image processing
The image processing is divided into two main parts: processing of the incoming image and analysis of the pre-processed image. In this context, analysis means searching for the contour of the bed and calculating the numerical information describing the bed. The image processing part digitises the camera image and performs different kinds of neighbourhood operations: 10 consecutive frames are averaged to reduce noise and decrease the influence of instantaneous disturbances, dirt around the camera opening is masked away, edges in the image are enhanced by differentiation, and the image is thresholded so that only the enhanced edges remain. The result of the digital image processing is another, "improved" image. [4] The analysis section takes this pre-processed image and searches for the pixels that form the contour of the bed. First, a search window is fixed to speed up the calculations. Within the defined search area, contour pixels are searched for according to the following principles:
• non-zero pixels are searched for downwards from the search window boundary in each column;
• if at least two pixels are found on top of each other, or a contour pixel was found in nearby previous columns, the pixel is assumed to belong to the contour;
• the locations of the contour pixels are stored in a table, which then represents the instantaneous contour of the char bed. [5]
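The contour-search principles above can be sketched as follows; the exact neighbourhood tolerances are not specified in the text, so the values used here are assumptions:

```python
import numpy as np

def find_contour(binary, top=0):
    """Scan each column downwards from the search-window boundary and
    accept the first non-zero pixel that is either on top of another
    non-zero pixel or supported by a contour pixel already found in a
    nearby column (tolerances here are illustrative assumptions)."""
    rows, cols = binary.shape
    contour = {}                      # column -> contour row
    for c in range(cols):
        for r in range(top, rows - 1):
            if binary[r, c] == 0:
                continue
            stacked = binary[r + 1, c] != 0
            nearby = any(abs(contour[cc] - r) <= 1
                         for cc in (c - 1, c - 2) if cc in contour)
            if stacked or nearby:
                contour[c] = r
                break
    return contour

img = np.zeros((6, 5), dtype=int)
img[3, :] = 1
img[4, :] = 1        # a horizontal, two-pixel-thick edge band
img[1, 2] = 1        # an isolated erroneous pixel
print(find_contour(img))  # noise pixel rejected; contour found at row 3
```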
In the recovery boiler application, the following features are analysed: the instantaneous contour of the bed, the height of the bed, the horizontal position of the top of the bed, the cross-sectional area of the bed, and figure parameters describing the shape of the bed.
Fuzzy logic in edge detection
The aim of this project is to further improve the edge detection and the image processing. A new algorithm searching for the contour of the char bed has been developed. It generates and updates membership functions for each contour point on the basis of history data (Fig. 1). Then it defuzzifies the resulting fuzzy numbers into a new contour (Fig. 2). Defuzzification is based on the centre of average of the membership functions (Fig. 1). According to the tests, this algorithm filters out fast changes of the contour.
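A minimal sketch of centre-of-average defuzzification over the contour history; approximating the membership degree of each observed position by its relative frequency is our assumption, not necessarily the paper's construction:

```python
import numpy as np

def defuzzify_contour(history):
    """Centre-of-average defuzzification per column: each contour row
    observed in the history votes with its membership degree, here
    approximated by its relative frequency (an assumption)."""
    history = np.asarray(history, dtype=float)  # shape: (n_frames, n_columns)
    new_contour = []
    for col in history.T:
        values, counts = np.unique(col, return_counts=True)
        mu = counts / counts.sum()              # membership degrees
        new_contour.append(float(np.sum(mu * values)))  # centre of average
    return new_contour

# 10 history contours for 3 columns; one frame contains a fast jump.
hist = [[20, 22, 21]] * 9 + [[20, 40, 21]]
print(defuzzify_contour(hist))  # the jump in column 1 is damped
```

The single outlier in column 1 moves the defuzzified contour only slightly, which is the fast-change filtering effect described above.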
Fig. 1. The centre of average calculation.
Fig. 2. Calculation of new contour.
By extending the algorithm we can further improve its efficiency. The extended algorithm takes the neighbouring points into account by examining the new contour: if the distance between the forecasted pixel and the pixel obtained from image processing exceeds tunable limits, the algorithm removes these pixels. The evaluation of the method will be continued with a very large set of material in the Matlab environment, and after those tests the implementation will be transferred to the application software. A suitable number of contours in the history data is 10, since the effective changes in the state of the burning process are slow; a typical time constant may be of the order of minutes. By changing the number of contours in the history data we can affect how quickly the system adapts to movements in the char bed. The reliability of the search for the contour of the char bed can be improved by developing a fuzzy method for image thresholding (adaptivity), changing the thresholding parameters according to the intensity of the image. A fuzzy control method has been outlined for the image thresholding to stabilise the image processing conditions.
Conclusions
According to the tests, the present algorithm filters out fast changes of the contour of the char bed. In the recovery boiler, the changes are very slow, and therefore the algorithm improves the search for the contour. The present algorithm already increases the flexibility of the supervisory control, and the extended algorithm can further improve its efficiency. The adaptation of the system can be tuned by changing the number of contours in the history data. In digital image processing, the dynamics of the phenomenon can also be utilised on the basis of successive images.
References
[1] R. Lilja, "Pattern recognition in analysis of furnace camera pictures," in Pattern Recognition Applications, The Soviet-Finnish Symposium, Tbilisi, USSR, September 27 - October 2, 1987, 12 p.
[2] S. Murtovaara and E. Juuso, "Fuzzy logic in digital image processing for recovery boiler control," in Proc. of TOOLMET'96 - Tool Environments and Development Methods for Intelligent Systems, April 1-2, 1996, Oulu, Finland, Report A No. 4, May 1996, Univ. of Oulu, Control Eng. Lab., pp. 199-204.
[3] M. Ollus, R. Lilja, J. Hirvonen, R. Sutinen and S. Kallo, "Burning process analyzing by using image processing technique," in IFAC 3rd MMS Conference, June 14-16, 1988, Vol. 1, Oulu, Finland, 1988, pp. 274-281.
[4] R. Sutinen, R. Huttunen, M. Ollus and R. Lilja, "A new analyzer for recovery boiler control," Pulp & Paper Canada, pp. T83-T86, 1992 (1991).
[5] T. Hosti, "Digital image processing for recovery boiler control," Master's thesis, Univ. of Oulu, Dept. of Process Eng., Oulu, Finland, 1992, 57 p.
Topological Edge Finding
Mark Mertens*, Hichem Sahli and Jan Cornelis
Vrije Universiteit Brussel (VUB), Dept. ETRO-IRIS, Pleinlaan 2, B-1050 Brussels, Belgium
Abstract
In this paper we describe a new automatic approach which calculates a polygonal image model for arbitrary images. It is part of a framework for image modeling [1]. To cope with the wide range of images, the method has to be topological, avoiding a high sensitivity to the exact pixel values. Part of this requirement can be fulfilled by using a distribution-free, nonparametric estimator as the gain function. This gain function is the subject of the paper. We found that it results in very accurate edge representations and that it is robust against noise.
Introduction
We describe an edge detection approach in which edge finding and the polygonalisation of curves are tackled jointly in one optimisation framework. This is achieved by formulating a gain function which evaluates the quality of postulated lines, assumed to be coincident with the edges in the image. Edges are detected by finding the lines that maximise the gain function, which measures the dissimilarity between the regions on opposite sides of the postulated line. We represent a postulated line segment by the internal data structure of an agent in a multi-agent framework [1]. The agents find the line features in the image by moving towards them in representation space, maximising their value of the gain function. The result is an emergent configuration of line segments, globally coinciding with the edges in the image. The advantages for feature extraction are (1) the robustness and accuracy of the proposed gain function, (2) the merging of detection and representation of features, and (3) the easy interpretation of the extracted line features (e.g. for object-based progressive transmission). This edge finding approach is highly homogeneous, which facilitates its incorporation in different image processing applications.
Our new edge finding method
The problem with classical edge finding, when trying to determine the correct amounts of differentiation and smoothing [2],[3], has always been the choice of window size. We cannot use information that is too local, since we need a clear identification of the regions on both sides of the edge - in our case we use line segments instead of points around which we construct our windows - but we cannot use a global method either, since an edge is a localised characteristic and global methods will tend to merge different edge parts. In particular, the use of a fixed window for the whole image is not a good choice, since some parts of the image need a coarse and others a fine detection. This can be described by the Heisenberg uncertainty principle [4],[5]. The optimal solution is the use of windows which are adapted to the actual shapes of the objects appearing in the image. This seems to be circular reasoning, since we want to use these windows to find the objects in the first place. Starting from basic modeling principles, we state the problem as a prediction-verification problem. Can we determine the descriptive parameters of the boundary and verify its existence? We can calculate both if we recognise the fact that in a discrete image a boundary can always be faithfully represented as a chain of line segments, solving a first topological problem. We can then postulate (predict) and verify (fig. 1) the existence of a particular line segment with 4 descriptive parameters, namely the coordinates of its starting point (x0, y0), a length l and a slope α, in the object boundary. This strategy can be automated for most types of image. We propose a one-step solution for all the classical edge-finding problems of calculating the likelihood that a pixel is part of an edge, thinning, linking, and polygonalisation for representation.
[Fig. 1 block diagram: the input image and postulated lines (prediction) feed a dissimilarity gain function (verification); an optimisation procedure maximises the gain, yielding maximum-gain line segments coinciding with the image edge segments.]
Fig. 1: Block diagram of our prediction/verification optimisation approach to edge finding and representation. We use our edge verification criterion as a gain function which has to be maximised by moving a postulated line segment through the image. When the gain is maximum, the postulated line segment coincides with the edge segment with the same parameters (x0, y0, l, α) in a boundary. The details of the developed prediction/verification approach (fig. 1) will not be elaborated in this paper; they are described in [1]. For clearness and conciseness we will focus upon our edge definition and the resulting verification gain function, its properties and its relation to classical approaches.
* The research of Mark Mertens was sponsored by the IWT.
Definition of edges and regions
We define regions as connected sets of pixels having a particular, a priori unknown, statistical distribution of numerical values (e.g. colours, grey values, texture measures...), which we shall simply call "colours". An edge is defined as the 8-connected, single-pixel-width set of pixels "optimally" separating two regions with different distributions. We conjecture that the exact distributions are irrelevant and that we only need to establish a first-order difference criterion to determine an edge, so we have a distribution-free and nonparametric method. Notice that we define the edge as a locus of change, but a change between regions and not between numerical "intensity" values. The problem now is to extract a sufficient number of pixels from the regions on both sides of the edge, so that the separability, expressed by the gain function, can always be considered reliable (fig. 2). This issue is not raised in classical edge finding techniques. Theoretically the maximum gain value will be obtained when the postulated line segment coincides exactly with the edge segment of the object. Practically the maximum could shift a little due to numerical errors or a large amount of noise. We will show however that our method is inherently noise insensitive while also giving good localisation, in contrast with methods based on differentiation.
Fig. 2: Rectangular window with fixed width w, associated to the postulated line segment S(x0, y0, l, α).
Gain function
For each postulated line segment, we sample the pixels in its associated window, which gives the possibility to select an optimum set of representative pixels on both sides, and evaluate the following verification gain function:

G(S) = (1 / 2lw) Σ_{i ∈ N} | C_i^{R1}(S) − C_i^{R2}(S) |    (1)

The gain G is calculated for a postulated line segment S (shown dashed in fig. 2) as the sum, over all colours i ∈ N in the space of possible colours, of the absolute values of the difference of the number of pixels with colour i in R1 (namely C_i^{R1}(S)) and in R2 (namely C_i^{R2}(S)), normalised by the window size 2lw. Hence we obtain a first-order topological dissimilarity measure for the two regions. When the two "colour" distributions have almost no overlap and the postulated line segment coincides with an edge-line segment in the image, the G-value (eq. 1) will be approximately one, corresponding to the maximum possible gain value GM = 1. So all G-values above 1 − ε (where ε is a threshold value that is determined from the noise in the image or specified by the wishes of the user) will be retained as valid object-representing line segments.
Results obtained with our gain function
To illustrate the detection accuracy and the robustness of the function G(S), an image representing part of a rectangle of grey value 96 on a background of grey value 160, with and without added uniformly distributed noise between plus and minus 64, is used (fig. 3).
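As a small numerical illustration, the gain function (1) can be evaluated on a synthetic two-region image. This is only a sketch: the vertical-line window geometry, the sampling of w pixels on each side, and all helper names are our assumptions, not the authors' code.

```python
import numpy as np

def gain(img, x0, y0, length, w):
    """Topological dissimilarity gain of eq. (1) for a vertical postulated
    line segment starting at (x0, y0), sampling w pixels on each side
    (window geometry and 1/(2*l*w) normalisation are our reading)."""
    r1 = img[y0:y0+length, x0-w:x0]      # region left of the line
    r2 = img[y0:y0+length, x0:x0+w]      # region right of the line
    # first-order "colour" histograms of the two half-windows
    colours = np.union1d(np.unique(r1), np.unique(r2))
    c1 = np.array([(r1 == c).sum() for c in colours])
    c2 = np.array([(r2 == c).sum() for c in colours])
    return np.abs(c1 - c2).sum() / (2 * length * w)

# synthetic image: grey value 96 region next to a 160 background
img = np.full((40, 40), 160, dtype=np.uint8)
img[:, :20] = 96
print(gain(img, 20, 5, 20, 2))   # line exactly on the edge -> 1.0
print(gain(img, 10, 5, 20, 2))   # line inside one region   -> 0.0
```

Sliding the line one pixel off the edge gives a gain of 0.5 here, consistent with the linear fall-off G(d) = 1 − d/w discussed for the x-scan.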
Fig. 3: Part of a rectangle with (3B) and without (3A) added uniform noise, for examination of the gain estimator (1). A typical 1D section through the 4D representation manifold (G as a function of x0, y0, l, α) will look like the curve in fig. 4. As shown in fig. 4, four characteristics are of interest to us. Let GM be the maximum theoretically achievable gain (GM = 1), GN the value for the "noise", G0 the optimum gain value, which occurs when the postulated line S coincides with an image edge-line in the noisy image, u the parameter (x0, y0, l, α) value where G0 occurs and T the true parameter value of the image edge-line. We then define:
- The clearness C, which is the difference between G0 and GN.
- δ = GM − G0.
- The accuracy error A = |u − T|.
- σ, the width of the high-value peak.
Note that δ depends on the noise and should not be too big, or the correct detection of the edge-line becomes questionable.
Fig. 4: A typical 1D section through the 4D representation manifold. The parameter is x, y, l or α.
a) Gain cross section obtained by varying the parameter x. Figure 5 shows a scan over x of a vertical line of length l = 20 and associated width w = 2 across the edge. From theoretical arguments we would expect a linear function, since with each step l pixels are moved to the other side (cf. fig. 3) of the postulated line segment. Theoretically, the gain variation G(d) for 0 ≤ d ≤ w is

G(d)|_{0 ≤ d ≤ w} = 2l(w − d) / 2lw = 1 − d/w
Fig. 5: Scan of a vertical line (l = 20, w = 2) across the edge in the images of fig. 3A and fig. 3B. The accuracy error A is zero in both cases and δ is only 5%, which means that despite the large amount of noise we have a good detection (accurate and robust). A Canny edge detector [2] with a sigma of 5, applied on the image of fig. 3B, could also find the edge, but only with an intensity of 0.54 and an accuracy error A of 1 pixel. Furthermore, due to the isotropic Gaussian filtering it behaves badly near the corner, making it round. b) Gain cross section obtained by varying the parameter l.
Fig. 6: Incrementation of the length l of the postulated line (l = 10-60) in images 3A & B (w = 2). When we make the postulated line longer than the image edge-line, we see the gain going down hyperbolically (fig. 6A), according to the formula G(l)|_{l > λ} = 2λw / 2lw = λ/l. In our case the length λ of the edge is 15 pixels. In the noisy image we see the same tendency, but the gain value starts lower, since the noise creates some equally coloured pixels in both regions, even when the postulated line segment accurately represents the edge-line. The accuracy error A in the noisy case, which we can define as the point where the gain starts to drop hyperbolically, is approximately 4 pixels, as seen from fig. 6B. c) Gain cross section obtained by varying the parameter α
Fig. 7: Rotation of the postulated line from 0 degrees (horizontal) to 100 degrees in images 3A & B (l = 40, w = 2). For angles, which are essentially variables in the continuous plane, the maximum achievable accuracy depends on the length of the postulated line. From fig. 7A, we see that for a length of 40 pixels we can clearly estimate the angle within an accuracy of 1 degree. The accuracy error A is zero degrees for the noisy version, since the maximum is found at the same place as for the noiseless image; δ is 22.5% and C is still large enough (0.2) to reveal a similar trend in G(α) as in the ideal case (fig. 7A). In a continuous edge image and for a continuous sampling window, the function G(α) can be derived in closed form:
Fig. 8: 2D section through 4D manifold for an image area around the corner in fig. 3A
Fig. 9: Result of applying G to a noisy image of the Arc de Triomphe.
Fig. 8 gives us a clearer view of the shape of the 4D manifold. Here we scan over angles of 0 to 100 degrees for different orthogonal distances x from the corner (fig. 3). We point out that the optimum α0(x) shifts from 90 to almost 60 degrees when the location x is wrong, but there is only one clear maximum with a gain equal to one. Remark that on the left side of the figure we see the growing gain corresponding to the horizontal side of the corner. Fig. 9 shows that for a low-quality real image, we still find e.g. the first sloping side of the Arc with a δ of only 7%.
Conclusion
In this paper we propose a method in which there is no distinction between edge detection and representation, immediately yielding intermediate-level features. This leads to high uniformity, which makes the method easy to incorporate in higher-level strategies. The approach also recasts the fundamental theory behind edge finding. We find that a fundamentally different detection strategy, linked to a different definition of edges, finds the same edges as a Canny edge finder does, even a little more accurately. We plan further experiments with the same gain function (1) on texture segmentation, for which Canny's edge detector is not designed.
References:
1) M. Mertens: A Topological Edge Finding Gain Function. Technical Report IRIS-TR-0039, Electronics Department, Vrije Universiteit Brussel, Belgium, 1996.
2) J. Canny: A Computational Approach to Edge Detection. IEEE Trans. on Pattern Analysis and Machine Intelligence, 8, no. 6, pp. 679-698, 1986.
3) R. Klette & P. Zamperoni: Handbook of Image Processing Operators. John Wiley and Sons, 1996, p. 238.
4) R. Wilson & G. H. Granlund: The Uncertainty Principle in Image Processing. IEEE Trans. on Pattern Analysis and Machine Intelligence, 6, no. 6, pp. 758-767, 1984.
5) R. Wilson and M. Spann: Image Segmentation and Uncertainty. Research Studies Press Ltd, 1988.
Session O: VIDEO CODING II: MOTION ESTIMATION
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
Automatic Parallelization of Full 2-D Block Matching for Real Time Motion Compensation and Mapping into Special Purpose Architectures
Nectarios Koziris, George Papakonstantinou and Panayotis Tsanakas
National Technical University of Athens, Dept. of Electrical and Computer Engineering, Computer Science Division, Zografou Campus, Zografou 15773, Greece
e-mail: papakon@cs.ece.ntua.gr, tel: +301-7722494, fax: +301-7722496
Abstract: The most important issue in video encoding is motion compensation within a frame sequence. Block matching techniques are used by various algorithms [5], [6] to estimate the motion in successive frames. Full-search 2-D block matching is the most widely used algorithm for video encoding in all standards (H.261, MPEG 1-2, HDTV etc.). It provides the best SNR, since it uses exhaustive search to find the best matching candidate block in terms of MAD or MSE. Its main disadvantage is that it requires a large number of computations, which must be performed for every candidate frame. Consequently, real-time application of the 2DFS algorithm for video compression requires the use of parallel architectures to handle the large amount of computation. This paper presents the application of automatic loop parallelization techniques to derive scalable or fixed systolic arrays with optimal performance in terms of total computation time. We transform the BMA algorithm into an equivalent form, in order to apply the automatic parallelization method. Starting from the dependence index space, we map the matching algorithm to 2-D or 3-D systolic arrays and we propose an alternative mapping for fixed-size architectures.
I. BLOCK MATCHING MOTION ESTIMATION CONCEPTS
Block-matching motion estimation/compensation is used to remove the temporal redundancy within a frame sequence, thus resulting in significant bit-rate reductions for digital image encoding methods. Nevertheless, it requires a large amount of computation and heavy memory bandwidth.
In block-matching techniques, the candidate frame for encoding is partitioned into square blocks of n×n pixels. These blocks, resulting from the segmentation of the current and previous frames, are called current and previous blocks, respectively. For each current block, the best matching previous block is sought within a search area surrounding the position of the reference block:
Figure 1. Block Matching Concepts. The search is limited to a maximum displacement m/2 in both directions around the position of the reference block (search area). The number of candidate blocks is (m+1)².
The 2DFS algorithm is the following:

MAD_(k,l)(x,y) = S(x,y) = Σ_{i=0}^{M−1} Σ_{j=0}^{N−1} | F_n(k+i, l+j) − F_{n−1}(k+x+i, l+y+j) |

v(k,l) = (x,y) | min MAD_(k,l)(x,y),  −m/2 ≤ x,y ≤ m/2
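The 2DFS criterion above can be sketched directly as an exhaustive search; this is an illustrative, unoptimised implementation, with the block size N, search range m and border handling chosen by us for the example.

```python
import numpy as np

def full_search_bma(cur, prev, k, l, N=8, m=14):
    """Full-search block matching (2DFS) for the N x N current block at
    (k, l): evaluate the MAD of every displacement within +/- m/2 and
    return the minimising motion vector (candidates falling outside the
    previous frame are simply skipped in this sketch)."""
    block = cur[k:k+N, l:l+N].astype(np.int32)
    best, vec = np.inf, (0, 0)
    for x in range(-m//2, m//2 + 1):
        for y in range(-m//2, m//2 + 1):
            r, c = k + x, l + y
            if r < 0 or c < 0 or r + N > prev.shape[0] or c + N > prev.shape[1]:
                continue   # candidate block outside the frame
            cand = prev[r:r+N, c:c+N].astype(np.int32)
            mad = np.abs(block - cand).mean()
            if mad < best:
                best, vec = mad, (x, y)
    return vec, best

# toy frames: a bright square moves 2 pixels to the right between frames
prev = np.zeros((32, 32), dtype=np.uint8); prev[8:16, 8:16] = 200
cur = np.zeros((32, 32), dtype=np.uint8);  cur[8:16, 10:18] = 200
vec, mad = full_search_bma(cur, prev, 8, 10)
print(vec, mad)   # -> (0, -2) 0.0
```

The minimising displacement (0, −2) points back to where the block came from in the previous frame, with a MAD of zero for this noiseless example.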
In order to apply the automatic methodology of [7], we transform the above summation into a four-dimensional loop, where the dependencies between the different loop iterations/statements are better shown. This transformation reveals all possible parallelism and therefore facilitates the optimal parallelization of the BMA algorithm:

For x = 0 ... m
  For y = 0 ... m
    For i = 0 to M−1
      For j = 0 to N−1
        MAD_(k,l)(x,y,i,j) = MAD_(k,l)(x,y,i,j−1) + |F_n(k+i, l+j) − F_{n−1}(k+x+i, l+y+j)|
      end j
      MAD_(k,l)(x,y,i,N−1) = MAD_(k,l)(x,y,i,N−1) + MAD_(k,l)(x,y,i−1,N−1)
    end i
    MAD_(k,l)(x,y,M−1,N−1) = MAD_(k,l)(x,y,M−1,N−1) + MAD_(k,l)(x,y−1,M−1,N−1)
  end y
  MAD_(k,l)(x,m,M−1,N−1) = MAD_(k,l)(x,m,M−1,N−1) + MAD_(k,l)(x−1,m,M−1,N−1)
end x

It can easily be noticed that the computation of the MADs for all candidate blocks in the previous frame F_{n−1} is a four-dimensional nested loop over the indices y, x, i, j. In order to exploit all the inherent parallelism of this nested for-loop, we will apply methods for automatic parallelization of loops used in parallel compilers [3], [11], [12]. These techniques will subsequently map the parallel algorithm into a systolic structure, according to the methodology and the respective tool presented in [7].
II. AUTOMATIC TECHNIQUES FOR MAPPING THE FULL SEARCH BLOCK MATCHING ALGORITHM ONTO SYSTOLIC ARRAYS
Systematic methods for parallelization of for-loops were proposed by Moldovan [11], Shang [12] and Andronikos et al. [1]. These methods are based on the decomposition of the algorithm into basic modules containing single assignment statements. Since the nested loop is assumed to have depth n, every indexed variable in any assignment statement is n-dimensional. Since n indices are used, the algorithm is defined over an n-D index space. Every point inside this integer index space corresponds to an instance (i_1, ..., i_n) of the nested loop. This simple algorithmic model can be graphically displayed by an n-D index space graph of computation nodes and data dependence arcs.
From such a representation of the computation space, various architectures can be derived.
a) n-D Index Spaces:
The simplest and most straightforward implementation is the assignment of every computation node of the n-D index space to a distinct PE (Processing Element). This leads to time-optimal execution of the nested loop, but requires too many processors, leading to poor processor utilization. It is equivalent to using a distinct processor for every loop instance. A more efficient design with higher processor utilization can be achieved if each PE executes the operations of multiple computation nodes. The BMA (Block-Matching Algorithm) is defined over a four-dimensional index space due to its four indices x, y, i, j. We propose various decomposition techniques according to the total number of cells and the dimension of the target systolic array. It is obvious that the BMA nested loop can be decomposed into two parts, which are defined over two-dimensional index spaces. The first one is spanned by the indices i and j and consists of the accumulation of the sum MAD(x,y). In the second, which is defined over x and y, the minimum search and the selection of the displacement vector are performed. Note: for simplicity we consider block (0,0) and M = N = 5.
[Figure: the 2-D index space of the i,j loop (M = N = 5), showing the computation nodes, the dependence arcs and the optimal time hyperplane drawn across the nodes.]
In the 2-D index space the dependence vectors are d1 = [0 1] and d2 = [1 0]. After applying the algorithm proposed in [2], the optimal time hyperplane is Π = [1 1].
[Figure: the 3-D index space of the i,j,y loop, with the optimal time hyperplane Π = [1 1 1] drawn across the computation nodes.]
In the 3-D index space the dependence vectors are d1 = [0 1 0], d2 = [1 0 0] and d3 = [0 0 1]. After applying the algorithm proposed in [2], the optimal time hyperplane is Π = [1 1 1].
b) Resulting Architectures:
As the authors propose in [1], the resulting architectures are (n−1)-dimensional if the initial nested loop has n dimensions. In section IIa, two representative index spaces were presented; the corresponding array architectures obtained by applying the method of [11], [7] are 1-D and 2-D. Note that the proposed architectures are optimal in terms of total parallel execution time, since the automatic technique chooses, for each index space, the time hyperplane Π which minimizes

t_parallel = min_Π ⌈ (max Π·j1 − min Π·j2 + 1) / disp(Π) ⌉

where j1, j2 are points of the index space.
Fig 2. The proposed Systolic Array implementing the i,j loop. Fig 3. The proposed Systolic Array implementing the i,j,y loop.
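The schedule length above can be evaluated by brute force over a small index space; in this sketch we take disp(Π) to be the minimum of Π·d over the dependence vectors d (the usual definition of the hyperplane displacement), which is our reading rather than a quotation of the tool in [7].

```python
import itertools, math

def parallel_time(shape, hyperplane, deps):
    """Number of time steps of the linear schedule t(j) = round of Pi.j
    for a rectangular index space of the given shape, i.e. the
    t_parallel expression above evaluated for one candidate Pi."""
    pts = list(itertools.product(*[range(s) for s in shape]))
    dot = lambda a, b: sum(p * q for p, q in zip(a, b))
    disp = min(dot(hyperplane, d) for d in deps)
    assert disp > 0, "hyperplane must strictly respect every dependence"
    hi = max(dot(hyperplane, j) for j in pts)
    lo = min(dot(hyperplane, j) for j in pts)
    return math.ceil((hi - lo + 1) / disp)

# 2-D i,j space (M = N = 5) with d1 = [0 1], d2 = [1 0] from the text
print(parallel_time((5, 5), (1, 1), [(0, 1), (1, 0)]))          # -> 9
# 3-D i,j,y space with d1 = [0 1 0], d2 = [1 0 0], d3 = [0 0 1]
print(parallel_time((5, 5, 5), (1, 1, 1), [(0, 1, 0), (1, 0, 0), (0, 0, 1)]))  # -> 13
```

For the optimal hyperplanes Π = [1 1] and Π = [1 1 1] this gives 9 and 13 time steps respectively for the M = N = 5 example.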
In Fig. 2, the architecture derived from the inner i,j nested loops is shown. If we want to serially perform the outer x,y loops, we can simply add another cell, which successively performs the x,y additions, and a minimum cell, as in Fig. 4.
Fig 4. The proposed Systolic Array implementing the i,j loop in parallel and the x,y outer loops sequentially (the added cell accumulates MAD_(k,l)(x,y) and feeds a minimum cell).
In Fig. 3, the architecture derived from the inner i,j,y nested loops is shown. If we want to serially perform the outer x loop, we can simply add another cell, which successively performs the x additions (summations over y,i,j), and respective minimum cells for the x loop, as in Fig. 5.
Fig 5. The proposed Systolic Array implementing the i,j,y loop in parallel and the x outer loop sequentially.
Finally, a 3-D array which can perform all loops in parallel, as quickly as possible, can be derived by extending the above techniques. Mathematical performance evaluation proves the efficiency and real-time response of the proposed architectures.
III. CONCLUSIONS
In this paper we have proposed systematic methods for mapping the BMA algorithm into special purpose parallel architectures. We transformed the BMA into an equivalent form and applied automatic loop parallelization techniques. The resulting architectures are optimal in both time and space requirements. This method enables the automatic production of special purpose hardware for real time motion compensation applications.
REFERENCES:
[1] T. Andronikos, N. Koziris, Z. Tsiatsoulis, G. Papakonstantinou and P. Tsanakas, "Lower Time and Processor Bounds for Efficient Mapping of Uniform Dependence Algorithms into Systolic Arrays," to appear in the Journal of Parallel Algorithms and Applications, vol. 12, 1-2, 1996.
[2] T. Andronikos, N. Koziris, Z. Tsiatsoulis, G. Papakonstantinou and P. Tsanakas, "Lower Time Processor Bounds for Uniform Dependence Algorithms," First ECPD International Conference on Advanced Robotics and Intelligent Automation, Jan. 1995.
[3] A. Darte and Y. Robert, "Constructive Methods for Scheduling Uniform Loop Nests," IEEE Trans. Parallel Distrib. Syst., vol. 5, no. 8, pp. 814-822, Aug. 1994.
[4] M.J. Chen, L.G. Chen, T.Z. Chiueh, "One-Dimensional Full Search Motion Estimation Algorithm For Video Coding," IEEE Trans. on Circuits and Systems for Video Technology, vol. 4, no. 5, pp. 504-509, Oct. 1994.
[5] H. Gharavi and M. Mills, "Blockmatching Motion Estimation Algorithms - New Results," IEEE Trans. on Circuits and Systems, vol. 37, no. 5, pp. 649-651, May 1990.
[6] H.M. Jong, L.G. Chen, T.D. Chiueh, "Parallel Architectures for 3-Step Hierarchical Search Block-Matching Algorithm," IEEE Trans. on Circuits and Systems for Video Technology, vol. 4, no. 4, pp. 407-415, Aug. 1994.
[7] N. Koziris, G. Papakonstantinou and P. Tsanakas, "Automatic Loop Mapping and Partitioning into Systolic Architectures," 5th Panhellenic Conference on Informatics, Athens, 1995.
[8] T. Komarek and P. Pirsch, "Array Architectures for Block Matching Algorithms," IEEE Trans. Circuits and Systems, vol. 36, no. 10, pp. 1301-1308, Oct. 1989.
[9] P.-Z. Lee and Z.M. Kedem, "Synthesizing Linear Array Algorithms from Nested For Loop Algorithms," IEEE Trans. Comp., vol. 37, no. 12, pp. 1578-1598, Dec. 1988.
[10] B. Liu and A. Zaccarin, "New Fast Algorithms for the Estimation of Block Motion Vectors," IEEE Trans. on Circuits and Systems for Video Technology, vol. 3, no. 2, pp. 148-157, Apr. 1993.
[11] D.I. Moldovan and J.A.B. Fortes, "Partitioning and Mapping Algorithms into Fixed Size Systolic Arrays," IEEE Trans. Comput., vol. C-35, no. 1, pp. 1-11, Jan. 1986.
[12] W. Shang and J.A.B. Fortes, "Time Optimal Linear Schedules for Algorithms with Uniform Dependencies," IEEE Trans. Comput., vol. 40, no. 6, pp. 723-742, June 1991.
[13] L. de Vos and M. Stegherr, "Parameterizable VLSI Architectures for the Full-Search Block Matching Algorithm," IEEE Trans. on Circuits and Systems, vol. 36, no. 10, pp. 1309-1316, Oct. 1989.
[14] K.M. Yang, M.T. Sun, L. Wu, "A Family of VLSI Designs for the Motion Compensation Block-Matching Algorithm," IEEE Trans. on Circuits and Systems, vol. 36, no. 10, pp. 1317-1325, Oct. 1989.
New Search Region Prediction Method For Motion Estimation D. H. Ryu*, C. R. Kim*, T. W. Choi**, J. C. Kim** * ETRI, Korea ** Dept. of Electronics Engineering, Pusan National University, Korea
Abstract
This paper presents a new search region prediction method for motion estimation, using neural-network vector quantization (VQ). A major advantage of formulating VQ with neural networks is that the large number of adaptive training algorithms developed for neural networks can be applied to VQ. The proposed method reduces the computation, because the number of search points is smaller than in conventional methods, and reduces the bits required to represent motion vectors. The results of computer simulation show that the proposed method provides better PSNR than other block matching algorithms.
I. INTRODUCTION
Techniques for compressing motion pictures play a very important role in video conferencing, video phone and high-definition television (HDTV). Since the temporal correlation as well as the spatial correlation is very high in a moving picture, a high compression ratio can be achieved using motion compensated coding (MCC) technology. Motion compensated coding consists of a motion compensating part, based on precise motion estimation, and a prediction error encoding part [1][2]. Estimation of the motion information is an important problem in image sequence coding, and much research exists. Motion estimation techniques can be roughly classified into the pel recursive algorithm (PRA) and the block matching algorithm (BMA). For the BMA-based motion compensated prediction coding method, the amount of information for motion vectors and prediction error must be as small as possible. The amount of motion vector information differs according to the coding technique and transmission rate [3]. In this paper, we propose a method for estimating motion vectors in a video sequence. The proposed method predicts the search region by a vector quantization using neural networks and evaluates the distortion only for the predicted points. The remainder of the paper is organized as follows: Section II reviews the conventional block matching algorithms and their problems. Section III describes the proposed method, which predicts the motion region using neural network vector quantization and detects the motion vector. Section IV presents the results of the computer simulation for a sequence of images; it also discusses comparisons with other algorithms. Finally, Section V addresses the conclusions of this paper.
II. MOTION VECTOR ESTIMATION USING BLOCK MATCHING ALGORITHM
Block matching algorithms are utilized to estimate the motion of a block of pels, say of size M×N, in the present frame in relation to pels in the previous frame. This block of pels is compared with a corresponding block within a search region in the previous frame (Fig. 1). The process of the BMA can be described as follows. First, an image is divided into fixed-size subimages. A best match in the previous frame can then be sought by maximizing the cross correlation. We define a function D(.) to evaluate these measures for locating the best match:

D(i,j) = (1/MN) Σ_{m=1}^{M} Σ_{n=1}^{N} G( U(m,n) − Ur(m+i, n+j) ),  −p ≤ i,j ≤ p    (1)

where G(.) is a nonlinear function to evaluate the error power, U is the block of size M×N in the current frame, Ur is a search area of size (M+2p)(N+2p) in the previous frame, and p is the maximum displacement allowed. The displacement is the (i,j) minimizing D(i,j). Though motion vector detection schemes using the BMA have been widely utilized, they have many problems. For instance, the BMA assumes that all the pels within a block have a uniform motion, because the motion vector is detected block-by-block. This assumption is reasonably satisfied only for a small block (8×8 or 16×16). However, a smaller block size increases the number of blocks and yields a higher transmission rate by increasing the number of motion vectors to be transmitted [4][5].
Fig. 1. Motion detection by block matching algorithm
Fig. 2. Structure of FSCL VQ
III. MOTION VECTOR ESTIMATION USING NEURAL NETWORKS
The performance of motion vector detection can be increased because motion vectors usually have high spatiotemporal correlation. In the sections below, we propose a new motion vector estimation technique using these correlations.
A. Search Region Prediction Using VQ
Vector quantization is a quantization technique which capitalizes on any underlying structure in the data being quantized. The space of the vectors to be quantized is divided into a number of regions, and a reproduction vector is calculated for each region. Given any data vector to be quantized, the region in which it lies is determined and the vector is represented by the reproduction vector for that region. More formally, vector quantization is defined as the mapping of arbitrary data vectors to an index m. Thus, VQ is a mapping of a k-dimensional vector space to a finite set of symbols M:

VQ: x = (x1, x2, ..., xk) → m    (2)

where m ∈ M and the set M has size M. Assuming a noiseless transmission or storage channel, m is decoded as x. The collection of all possible reproduction vectors is called the codebook. In general, constructing the codebook requires knowing the probability distribution of the input data. Typically, however, this distribution is not known, and the codebook is constructed through a process called training: a set of data vectors that is representative of the data that will be encountered in practice is used to determine an optimal codebook [6].
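The mapping (2) can be sketched as a nearest-codeword lookup; the squared Euclidean distortion and the small 4-entry codebook below are illustrative choices of ours, not the paper's design.

```python
import numpy as np

def vq_encode(x, codebook):
    """Map a k-dimensional data vector x to the index m of its nearest
    codeword (the VQ mapping of eq. 2, here with squared Euclidean
    distortion as the distance measure)."""
    d = ((codebook - x) ** 2).sum(axis=1)
    return int(d.argmin())

# hypothetical 4-codeword codebook in 2-D
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
m = vq_encode(np.array([0.9, 0.2]), codebook)
print(m, codebook[m])   # -> 1 [1. 0.]
```

Only the index m is transmitted; the decoder looks the reproduction vector up in its copy of the codebook.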
B. Quantization of Motion Vectors Using Neural Networks
As can be seen from the discussion above, the training and encoding processes are computationally expensive. Moreover, most of the algorithms currently used for VQ design, e.g. the LBG algorithm, are batch-mode algorithms and need access to the entire training data set during the training process. Also, in many communication applications, changes in the communication channel mean that a codebook designed under one condition is inappropriate for use in another condition. Under these circumstances, it is much more appropriate to work with adaptive VQ design methods, even if they are suboptimal in a theoretical sense. Another benefit of formulating vector quantization using neural networks is that a number of neural network training algorithms, such as competitive learning (CL), the Kohonen self-organizing feature map (KSFM) and frequency-sensitive competitive learning (FSCL), can be applied to VQ. In this paper, we use FSCL networks, which overcome a drawback of CL. Assume that the neural network VQ is to be trained on a large set of training data, and that the weight vectors w_i(n) are initialized with random values. The algorithm for updating the weight vectors is as follows. The input vector is presented to all of the neural units and each unit computes the distortion between its weight and the input vector. The unit with the
smallest distortion is designated as the winner, and its weight vector is adjusted towards the input vector. Let w_i(n) be the weight vector of neural unit i before the input x is presented, and define

z_i = 1 if d(x, w_i(n)) < d(x, w_j(n)) for all j ≠ i, else z_i = 0,  i = 1, ..., M    (3)

The new weight vectors w_i(n+1) are computed as

w_i(n+1) = w_i(n) + ε (x − w_i(n)) z_i    (4)
In the above equation, the parameter ε is the learning rate, and it is typically reduced monotonically to zero as the learning progresses. A problem with this training procedure is that it sometimes leads to neural units which are underutilized. To overcome this problem, the FSCL algorithm has been suggested. In the FSCL network, each unit keeps a count of the number of times it has been the winner, and a modified distortion measure for the training process is defined as follows:

d'(x, w_i) = d(x, w_i(n)) · u_i(n)    (5)

where u_i(n) is the total number of times that neural unit i has been the winner during the training. The winning neural unit at each step of the training process is the unit with the minimum d'. The architecture of FSCL is shown in Fig. 2 [7].
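The FSCL update rules (3)-(5) can be sketched as follows. The initialisation from random data points, the constant learning rate, the number of epochs and the synthetic training vectors are all our assumptions for the example, not the paper's settings.

```python
import numpy as np

def fscl_train(data, M, lr=0.05, epochs=10, seed=0):
    """Frequency-sensitive competitive learning: the winner is the unit
    minimising d(x, w_i) * u_i, where u_i counts past wins (eq. 5), and
    only the winner's weight moves towards the input (eq. 4)."""
    rng = np.random.default_rng(seed)
    w = data[rng.choice(len(data), M, replace=False)].astype(float)
    u = np.ones(M)                              # win counts
    for _ in range(epochs):
        for x in data:
            d = ((w - x) ** 2).sum(axis=1)      # distortion d(x, w_i)
            i = int((d * u).argmin())           # fairness-modified winner (eq. 5)
            w[i] += lr * (x - w[i])             # move winner towards x (eq. 4)
            u[i] += 1
    return w

# train a 4-codeword codebook on synthetic 2-D "motion vector" clusters
data = np.vstack([np.random.default_rng(1).normal(c, 0.1, (50, 2))
                  for c in [(-2, 0), (2, 0), (0, 2), (0, -2)]])
print(np.round(fscl_train(data, 4), 1))
```

The win-count factor u_i penalises frequently winning units, pulling underutilized codewords towards the data, which is exactly the drawback of plain competitive learning that FSCL addresses.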
Fig. 3. Block diagram of suggested motion vector estimation
Fig. 4. Initial and output value of codebook
C. Motion Vector Estimation by Search Region Prediction
Fig. 3 shows the block diagram of the proposed motion vector estimation method using the neural-network vector quantizer. First, we find motion vectors using the full search method on the training images, and then train the codebook of the neural-network vector quantizer using these motion vectors. Second, a motion vector can be estimated using the codebook as the motion prediction region: the codewords in the codebook represent the motion vectors for the input image sequences. Since the codebook is used as the search region for estimating the motion vectors, the number of search points and the computation can be reduced compared with the full search BMA. In addition, the information required to transmit the motion vectors can be reduced. For example, the number of possible motion vectors for BMA with a ±7 search range is 225, which requires about 8 bits per motion vector for fixed-length encoding. Therefore, we are compressing the number of motion vectors from 225 to 25, or from 8 bits to about 5 bits per vector. The computational cost is also improved because the number of search points is reduced. We design the codebook with the neural-network vector quantizer utilizing the FSCL learning algorithm, using the above motion vectors as the training input data. Fig. 4 shows the initial codebook and the output codebook, which has 25 codewords.
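The resulting estimation step can be sketched as follows: the MAD is evaluated only at the displacements stored in the trained codebook, and the winning codeword index is what would be transmitted. The 4-entry codebook and toy frames below are hypothetical stand-ins for the trained 25-codeword codebook.

```python
import numpy as np

def codebook_motion_search(cur, prev, k, l, codebook, N=8):
    """Estimate the motion vector for the N x N block at (k, l) by
    evaluating the MAD only at the codebook displacements (a sketch of
    the proposed search-region prediction; out-of-frame candidates are
    skipped)."""
    block = cur[k:k+N, l:l+N].astype(np.int32)
    best, idx = np.inf, 0
    for m, (dx, dy) in enumerate(codebook):
        r, c = k + dx, l + dy
        if r < 0 or c < 0 or r + N > prev.shape[0] or c + N > prev.shape[1]:
            continue
        mad = np.abs(block - prev[r:r+N, c:c+N].astype(np.int32)).mean()
        if mad < best:
            best, idx = mad, m
    return idx, best

# hypothetical 4-entry codebook of frequent displacements
codebook = [(0, 0), (0, -2), (2, 0), (-1, 1)]
prev = np.zeros((32, 32), dtype=np.uint8); prev[8:16, 8:16] = 200
cur = np.zeros((32, 32), dtype=np.uint8);  cur[8:16, 10:18] = 200
idx, mad = codebook_motion_search(cur, prev, 8, 10, codebook)
print(idx, codebook[idx], mad)   # -> 1 (0, -2) 0.0
```

Only len(codebook) matches are computed per block, instead of (2p+1)² for the full search, and the transmitted index needs only ⌈log2 len(codebook)⌉ bits.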
IV. EXPERIMENTAL RESULTS AND DISCUSSIONS
As an experiment, the SIF version of the Flower Garden sequence was used as the test sequence. The size of a SIF sequence is half that of its CCIR 601 version in both dimensions. The block size for the BMA was 8×8. Since MPEG recommends a search region of ±15 pels in both the horizontal and vertical directions, we choose a search region of ±7 pels in both spatial directions, because the size of a SIF sequence is half that of its CCIR 601 version in both dimensions. We choose a codebook size of 64 motion vectors. As shown in Table 1, the number of possible motion vectors for the three-step search (TSS), which is known to show better performance, is 27, and the number of possible motion vectors for the BMA is 225, which requires about 8 bits per motion vector for fixed-length encoding.
As mentioned in the previous section, the proposed method reduces not only the number of search points but also the computation. For BMA with a ±2 search region, the number of matches required is 25. Since the codebook size of the proposed method is 25, the number of computations required is almost equal to that of BMA with a ±2 search region. In this section, we compare the performance of the proposed method with that of BMA with a ±2 search region. In this experiment, we used the peak signal to noise ratio (PSNR) as the objective quality measure, defined as follows:

PSNR = 10 \log_{10} \frac{255^2}{MSE}    (6)

MSE = \frac{1}{MN} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} [F(m,n) - F'(m,n)]^2    (7)
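The two quality measures can be computed directly from their definitions (Eqs. 6-7); a small Python sketch for 8-bit frames:

```python
import math

def mse(F, G):
    # mean squared error between two equally sized frames (Eq. 7)
    M, N = len(F), len(F[0])
    return sum((F[m][n] - G[m][n]) ** 2
               for m in range(M) for n in range(N)) / (M * N)

def psnr(F, G):
    # peak signal-to-noise ratio for 8-bit imagery (Eq. 6)
    e = mse(F, G)
    return float('inf') if e == 0 else 10.0 * math.log10(255.0 ** 2 / e)
```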
Table 1 shows the average PSNR over 30-frame image sequences for the proposed method, the full search BMA and the TSS. The proposed method does not perform better in PSNR than BMA with a ±7 search region, but it is about 1.5 dB better than BMA with a ±2 search region, which requires a similar fixed length rate for the motion vectors. Fig. 5 shows the PSNR graph for the 30-frame image sequences.

Table 1. Performance comparison
Fig. 5. PSNR
V. CONCLUSION

The motion estimation method plays an important role in moving image transmission systems. System performance depends on how accurately the motion vectors are estimated. Though a number of motion estimation methods have been suggested, to detect the motion vectors more accurately the full search method, which matches all points in the search area, must be used, but it requires much computation and hardware complexity. In this paper, we found motion vectors using the full search BMA from the initial image sequences, and trained FSCL neural networks to design the codebook using these motion vectors. We used this codebook as the motion estimation region. This method exploits the spatial correlation of motion vectors in image sequences, and therefore reduces the search area and the bits required to transmit the motion vectors, and increases the compression rate. The computer simulations show that the proposed method is superior to the TSS by about 1.5 dB. The proposed method is also robust to noise, because it has a motion vector smoothing effect during the vector quantization process.
REFERENCES
[1] A. K. Jain, "Image data compression: A review," Proc. IEEE, vol. 69, no. 3, pp. 384-389, Mar. 1981.
[2] A. N. Netravali and J. O. Limb, "Picture coding: A review," Proc. IEEE, vol. 68, no. 3, pp. 366-406, Mar. 1980.
[3] A. N. Netravali and J. D. Robbins, "Motion compensated television coding: Part I," Bell Syst. Tech. J., vol. 58, no. 3, pp. 631-670, Mar. 1979.
[4] K. Iinuma, T. Koga, K. Niwa, and Y. Iijima, "A motion-compensated interframe codec," in Proc. Image Coding, SPIE, vol. 594, pp. 194-201, 1985.
[5] Y. Y. Lee and J. W. Woods, "Motion vector quantization for video coding," IEEE Trans. Image Processing, vol. 4, no. 3, Mar. 1995.
[6] A. Gersho and R. M. Gray, Vector Quantization and Signal Compression, Kluwer, 1991.
[7] S. C. Ahalt, A. K. Krishnamurthy, P. Chen and D. E. Melton, "Competitive learning algorithms for vector quantization," Neural Networks, vol. 3, no. 3, pp. 277-290, 1990.
Motion Estimation by Direct Minimisation of the Energy Function of the Hopfield Neural Network

Leszek Ciepliński† and Czesław Jędrzejek

Franco-Polish School of New Information and Communication Technologies (EFP), P.O. Box 31, 60-854 Poznań, Poland
Abstract

We present a modification of generalised motion estimation based on the Hopfield neural network. It builds on the approach of Skrzypkowiak and Jain, in which they extend block based motion estimation by allowing for more than one motion vector and use the Hopfield neural network to minimise the error. We use direct minimisation of the error function instead of the standard iterative algorithm. This leads to much better motion estimation without an increase (and, usually, with a significant decrease) of the computational effort for sequential versions of both algorithms.

1 Introduction
Motion estimation and compensation is a very important part of any video compression method (see e.g. [1]). It allows for a substantial reduction of the bitrate required for encoding a video sequence by exploiting the temporal redundancy existing between consecutive frames. The most popular and effective approach to motion estimation is based on the so-called block matching algorithms (BMA), and it is used in the ISO and ITU-T video coding standards [2, 3]. In this paper we present an approach to generalised motion estimation where the predicted block is a superposition of several blocks from the previous frame. It is based on the work of Skrzypkowiak & Jain [4, 5, 6], in which they use a modified Hopfield neural network to minimise the motion compensation error. We start from the formula for the motion compensation error (multiplied by 1/2 for convenience)
E = \frac{1}{2} \|F - vG\|^2 = \frac{1}{2} \sum_{x=1}^{L} \sum_{y=1}^{L} \Big( f_{xy} - \sum_{i=1}^{D} v_i g^i_{xy} \Big)^2    (1)
where F is the present block, G is a vector of blocks from the previous frame, and v is a vector of coefficients. L is the block size in pixels and D is the number of neurons in the net (equal to the number of blocks from the previous frame considered as candidates for the motion estimation). By comparing this to the energy function of the Hopfield neural network E_{HNN}

E_{HNN} = -\frac{1}{2} \sum_{i=1}^{D} \sum_{j=1}^{D} T_{ij} v_i v_j - \sum_{i=1}^{D} I_i v_i    (2)
(which is a simplified but equivalent version of the network from [4], as explained in [7]) we obtain the following formulas for the weights

T_{ij} = -\sum_{x=1}^{L} \sum_{y=1}^{L} g^i_{xy} g^j_{xy}    (3)

and biases

I_i = \sum_{x=1}^{L} \sum_{y=1}^{L} f_{xy} g^i_{xy}    (4)
of the Hopfield network. In the formulas above, f_{xy} is the (x, y) pixel intensity from the current block and g^i_{xy} is the (x, y) pixel intensity from the i-th block of the previous frame. As explained in [4, 7], the connection weights T_{ij} are non-zero for i = j, so that the energy function given by Eq. 2 is not guaranteed to converge to a local minimum. This problem is solved by checking the energy change after each attempted step and accepting only those steps leading to an energy decrease. Such a network was used by Skrzypkowiak & Jain first as an alternative to the block matching algorithm [4] and then to approximate a block in the present frame with a linear combination of blocks from the previous frame [5, 6]. It has been shown that this approach can give a quality improvement from 2 to 4 dB [5] and can also be used for rate-constrained motion estimation [7]. In the following, we first discuss some of the drawbacks of this approach and then apply the straightforward minimisation of the motion compensation error in order to compare it with the Hopfield neural network approach.

†Currently with the VSSP Group, Department of Electronic and Electrical Engineering, University of Surrey, Guildford, Surrey, GU2 5XH, United Kingdom.
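Under the sign convention of Eqs. (2)-(4) as printed above, the weights and biases can be assembled as in this illustrative sketch (the block layout and function names are assumptions, not the authors' code):

```python
def build_network(F, G):
    # Hopfield weights and biases from the current block F (L x L) and
    # the D candidate blocks G[i] (each L x L), following Eqs. (3)-(4):
    # T[i][j] = -sum_xy g^i * g^j and I[i] = sum_xy f * g^i.
    # Note the non-zero self-connections T[i][i].
    L = len(F)
    def dot(A, B):
        return sum(A[x][y] * B[x][y] for x in range(L) for y in range(L))
    T = [[-dot(gi, gj) for gj in G] for gi in G]
    I = [dot(F, gi) for gi in G]
    return T, I
```

T is symmetric by construction, which is what the direct minimisation in the next section relies on.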
2 Method and Results
Two main problems with the Hopfield neural network algorithms for motion estimation are the dependence of the results of the network evolution on the model parameters (more precisely, the step size) and the relatively poor performance for large search ranges. The first drawback was demonstrated in [7], and the second is illustrated in Table 1.

range      0       1       2       3       4       6       8       10
PSNR - ES  34.198  36.005  36.223  36.327  36.336  36.352  36.363  36.369
PSNR - NN  34.171  37.688  38.254  38.739  38.893  38.999  39.091  39.191

Table 1: Motion estimation results for exhaustive search (ES) and neural network (NN) with different search ranges for "missa"

All the simulation results are for two frames selected from the "missa" and "salesman" sequences at CIF resolution¹. The fifth frame (previous frame) is always used for prediction of the sixth frame (present frame). The neural network parameters are the same as given in [7], where they were found sufficient to obtain convergence. It is seen that increasing the search range above some threshold (about 3-4) results in a very small increase in motion estimation quality. This applies to both the exhaustive search and the Hopfield neural network. In the following we describe a method of direct minimisation of the energy function of the neural network and compare the results obtained to those presented above. To find the global minimum of the energy function of the Hopfield network given in Eq. 2 we use the fact that it is a quadratic function of the vector variable v and that T_{ij} is symmetric in the indices i, j. We look for the minimum by first differentiating the formula with respect to v_k
\frac{\partial E}{\partial v_k} = -\frac{1}{2} \sum_{i=1}^{D} \sum_{j=1}^{D} T_{ij} \Big( \frac{\partial v_i}{\partial v_k} v_j + v_i \frac{\partial v_j}{\partial v_k} \Big) - \sum_{i=1}^{D} I_i \frac{\partial v_i}{\partial v_k},    (5)

which gives

\frac{\partial E}{\partial v_k} = -\sum_{i=1}^{D} T_{ik} v_i - I_k,    (6)

and then comparing the derivative to zero

-\sum_{i=1}^{D} T_{ik} v_i - I_k = 0.    (7)

As a result we obtain the following formula for v^{min}

v^{min} = -T^{-1} I,    (8)
where T^{-1} is the matrix inverse of T. This means that instead of performing a simulation of the neural network we can find the minimum of its energy function by multiplying the inverse of the weight matrix by the bias term. In practice, there exist effective methods for solving the linear equation 7 without calculating the matrix inverse. In Figs. 1 and 2 we present the results obtained with LU decomposition [8] and compare them to the Hopfield neural network and block matching with exhaustive search. It is seen that the new method performs much better in terms of estimation quality than the neural network, especially for large search ranges. Another advantage of the direct minimisation of the energy function is that there are no arbitrary parameters in this approach to be tuned. This is very different from the "classical" network evolution, where the final result heavily depends on the parameters and especially on the step size, as demonstrated in [7]. To obtain reliable results a small step size is necessary, which leads to very long execution times as compared to the direct solution

¹The sequences were obtained from the ftp site ftp.ipl.rpi.edu.
Figure 1: Comparison of exhaustive search, neural network, and matrix inversion results for "missa"
Figure 2: Comparison of exhaustive search, neural network, and matrix inversion results for "salesman"

of Eq. 7. More precisely, the solution of Eq. 7 requires O(D^3) multiply/add operations [8] for each block, where D is the size of the search region, i.e. has the same meaning as in Eq. 1. The complexity of the iterative neural network evolution may be estimated by O(ID(D + a)), where I is the number of iterations needed for convergence and a is the number of operations required by the energy change calculation. Typically the number of iterations needed to reach a minimum of the energy function is of the order of hundreds, and for many blocks even thousands. This is much more than D for reasonable search ranges, which means that the network evolution is more computationally expensive than matrix inversion. These considerations are illustrated by the sample execution times presented in Table 2 on the next page. The first column (total) shows the total time used by the program including initialisation, and the second one (solve) is the part spent on solving the equations². It should be noted that both algorithms are much more computationally expensive than the exhaustive search for a single motion vector, which requires O(D) operations.

²It is seen that the conjecture made in [7] that most of the time is spent on minimisation of the energy function is no longer true for the network parameters used in the presented simulations. It was based on simulations with the first parameter set from [7].
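The direct solution of Eq. 7 can be sketched with Gaussian elimination, a stand-in for the LU decomposition of [8] (partial pivoting added for numerical stability; this is an illustration, not the authors' code):

```python
def minimise_energy(T, I):
    # Solve sum_i T[i][k] v[i] = -I[k] (Eq. 7), i.e. v_min = -T^{-1} I
    # (Eq. 8), without forming the inverse explicitly.
    D = len(T)
    # augmented system [T | -I]; T is symmetric, so rows and columns agree
    A = [list(T[k]) + [-I[k]] for k in range(D)]
    for col in range(D):
        # partial pivoting: bring the largest remaining entry to the diagonal
        piv = max(range(col, D), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, D):
            f = A[r][col] / A[col][col]
            for c in range(col, D + 1):
                A[r][c] -= f * A[col][c]
    # back substitution
    v = [0.0] * D
    for k in range(D - 1, -1, -1):
        v[k] = (A[k][D] - sum(A[k][c] * v[c] for c in range(k + 1, D))) / A[k][k]
    return v
```

This costs O(D^3) operations per block, matching the complexity estimate quoted above.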
range    time (s) - NN        time (s) - MI
         total      solve     total     solve
         24.95      17.02     2.54      0.01
         150.48     118.09    13.69     0.41
         371.12     310.56    31.80     1.77
         1007.73    880.79    89.13     8.90
         2562.10    2126.14   414.71    78.19
Table 2: Execution times for the Hopfield neural network (NN) and matrix inversion (MI) for "missa"

The main disadvantage of the presented approach is that it is more difficult to control the number of significant motion vectors than in the case of the standard neural network. It is not possible to extend the algorithm from [7], because the solution obtained by matrix inversion can contain negative coefficients. It also seems that the solution is much more sensitive to any modifications of the energy function. We are currently investigating other ways of dealing with this problem.

3 Conclusions
The presented method performs much better than both the block matching algorithm with exhaustive search and the Hopfield neural network. With similar computational complexity it gives better quality, and it performs much better than the neural network when the number of candidate blocks for motion estimation is increased. Further work is currently in progress towards a rate-constrained version of this method.

This work has been partially supported by the European Community ACTS AC077 grant "Scalar" and by the Polish KBN grant 8 T11E 035 10.

References
[1] Tziritas G. and Labit C., Motion Analysis for Image Sequence Coding, Elsevier, Amsterdam, 1994.
[2] Bhaskaran V. and Konstantinides K., Image and Video Compression Standards, Kluwer Academic Publishers, Boston, 1996.
[3] Video Coding for Narrow Telecommunication Channels at < 64 kbit/s, Draft ITU-T Recommendation H.263, July 1995.
[4] Skrzypkowiak S.S. and Jain V.K., Neural Networks Based Motion Vector Computation and Application to MPEG Coding, in: Proc. Int. Conf. on Image Proc. 2, 1994, 985-990.
[5] Skrzypkowiak S.S. and Jain V.K., Affinity-Weighted Neural Network Motion Estimation for MPEG Coding, Digital Signal Processing 5, 1995, 149-159.
[6] Skrzypkowiak S.S. and Jain V.K., Affine Motion Estimation Using a Neural Network, in: Proc. Int. Conf. on Image Proc. 1, 1995, 418-421.
[7] Ciepliński L. and Jędrzejek C., Block-Based Rate-Constrained Motion Estimation Using Hopfield Neural Network, Neural Network World 6, 1996, 285-297.
[8] Press W.H., Teukolsky S.A., Vetterling W.T., and Flannery B.P., Numerical Recipes, Cambridge University Press, Cambridge, 1994.
A modified MAP-MRF motion-based segmentation algorithm for image sequence coding

Daniel Gatica-Pérez, Francisco García-Ugalde and Víctor García-Garduño. Universidad Nacional Autónoma de México. Facultad de Ingeniería. División de Posgrado. Departamento de Ingeniería Eléctrica. Apdo. Postal 70-256. México, D.F., 04510. México. e-mail: [email protected], [email protected], [email protected].
Abstract

We present a motion-based segmentation algorithm using a markovian-bayesian framework. The goal is to assign each pixel in the image to one out of several regions characterized by a motion model parameter vector. The algorithm uses a dense displacement vector field estimated by a pel-recursive method as the main information to guide the pixel fusion process, but includes other data sources to enhance the segmentation field estimation. Markov Random Field (MRF) theory provides a framework in which all this information, along with the physical properties of the desired solution, can be incorporated. Optimization of the solution is performed with Iterated Conditional Modes (ICM) procedures. The algorithm is planned to be the first stage of a region-based sequence coding system. Results on a real digital TV sequence are shown.

1 Introduction.
Motion estimation and segmentation have been regarded as two key domains in image sequence processing and computer vision, and have found applications in a variety of fields, including scene interpretation [4] and digital video coding [5]. In particular, motion-based segmentation (partition of an image describing a viewed scene into regions that undergo different motion, where each pixel is assigned to the region that represents its motion most adequately) represents a basic task for extracting high level information from the time-varying intensity of a sequence and for enhancing the motion measurement process, and has been studied by several authors [3], [4], [5], [8]. In this paper, we present a probabilistic approach to segment a sequence from a previously estimated motion vector field, based on Markov Random Field (MRF) theory and bayesian estimation using the maximum a posteriori (MAP) criterion. This algorithm is planned to be the first stage of a region-based sequence coding algorithm. Motion-based segmentation is known to be a problem in which simple spatial clustering techniques do not
work well (for example, in the presence of pure divergent motion). In this case, a model Θ_M of both the motion and the structure of the regions in the scene has to be introduced. Thus, when an estimated displacement or velocity vector field is available, the goal of the segmentation process is to assign each pixel in the image to one out of several regions (characterized by a motion parameter vector), depending on the agreement between each estimated motion vector and the assumed model. This represents a qualitative change from a local motion description to a regional one. The obtained regions can then be associated to different regions of the same object, or to different objects in the scene. In addition, motion estimation and segmentation are two interdependent problems, in the sense that we need one in order to obtain the other with accuracy (estimation-segmentation ambiguity). Due to the ill-posedness of motion estimation, an exact displacement field does not exist, and errors usually occur at or around motion boundaries. Better results in a motion-based segmentation can be attained if an indicator of the quality of the vector field is known.
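The pixel-to-region assignment just described can be sketched as follows; the four-parameter model form used here is an assumed standard translation/divergence/rotation parameterisation for illustration only, not necessarily the authors' exact model:

```python
def model_vector(theta, x, y):
    # displacement predicted at (x, y) by a simplified linear motion
    # model theta = (tx, ty, k, th) with gravity center (xg, yg);
    # k models divergence, th models rotation (assumed standard form)
    tx, ty, k, th, xg, yg = theta
    return (tx + k * (x - xg) - th * (y - yg),
            ty + th * (x - xg) + k * (y - yg))

def assign_pixel(d, x, y, models):
    # label the pixel with the region whose motion model best explains
    # the estimated displacement d = (dx, dy) at (x, y)
    def err(theta):
        mx, my = model_vector(theta, x, y)
        return (d[0] - mx) ** 2 + (d[1] - my) ** 2
    return min(range(len(models)), key=lambda r: err(models[r]))
```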
2 The model.
We use a MAP-MRF framework for the solution of the motion segmentation problem. This approach provides a way of introducing different sources of information and of characterizing their mutual interactions in one single model, thus allowing for the incorporation of the physical properties of the desired solution. It also enables us to develop formal solutions through the formulation of specific energy functions in terms of an optimality principle (MAP criterion). MRF image modeling, combined with bayesian estimation, has been shown to be useful in a broad number of problems in computer vision, including image restoration [6], edge detection, monochrome and color image segmentation [7], and motion estimation and segmentation [3], [8]. Our proposed method is an extension of the scene segmentation algorithm using global optimization proposed by Murray and Buxton [8]. It uses a dense displacement vector field estimated by a Wiener-based pel-recursive method [2] as the main information to guide the motion vector fusion process, but includes other data sources (image intensity, non-compensated pixels, intensity edges) as additional observations that enhance the final segmentation.

We formulate the motion-based segmentation as an estimation problem: simultaneously find the label fields (\hat{e}, \hat{l}) that maximize the a posteriori probability density function (pdf) of the labels given the observed data

(\hat{e}, \hat{l}) = \arg\max_{e,l} p(e, l / d_x, d_y, i, p, g)    (1)

where:

• e is the desired segmentation label field, which has associated a four-parameter simplified linear motion model Θ_{MLS} = (t_x, t_y, k, θ) that can describe combined translational, rotational, and divergent motions of planar surfaces parallel to the image plane,

d_x(x, y) = t_x + k (x - x_g) - θ (y - y_g)
d_y(x, y) = t_y + θ (x - x_g) + k (y - y_g)    (2)

where (x_g, y_g) is the gravity center of each surface.

• l is a binary motion discontinuity line field, an auxiliary field introduced to improve the segmentation process (motion boundaries).

• d_x, d_y are the observed horizontal and vertical components of the displacement field. They are estimated using a Wiener-based pel-recursive motion estimation algorithm, which has been used in the past for predictive motion-compensated image sequence coding.

• p is a binary non-compensated pixel field, which is a useful by-product of any pel-recursive motion estimation algorithm. A non-compensated pixel is one in which the convergence criterion (minimization of the reconstruction error (DFD)) is not satisfied. In the case of pel-recursive methods, non-compensated pixels usually appear in noisy regions and around motion boundaries [1], so they can be considered as a simple partial confidence measure of the motion estimation quality. Field p represents a way of switching between displacement and more reliable information (gray level) when the motion field is not accurately estimated.

• i is the image intensity field (gray-level information).

• g is the binary intensity edges field, and favors the occurrence of motion boundaries only when strong spatial gradients are present.

Fields e, d_x, d_y, p and i are defined over a lattice S of pixel sites s, and fields l and g are defined over a lattice S_l of line (interpixel) sites s_l (figure 1).

From this coupled MRF formulation, some previously reported algorithms can be derived:

• Murray and Buxton include in their algorithm e, l and a velocity field v [8].

• Chang et al. only consider e, d, and i [3].

Figure 1: (a) Definition of pixel sites s and line sites s_l. MRF neighborhood system including pixel and line sites: (b) vertical; (c) horizontal.

Using the Bayes rule, the last equation can be expressed as

(\hat{e}, \hat{l}) = \arg\max_{e,l} p(d_x, d_y, i / e, l, p, g) p(e, l / p, g)    (3)

because the denominator in Bayes' expression does not depend on the desired labels. The term p(d_x, d_y, i / e, l, p, g) is the conditional pdf of the observed data given the segmentation, and p(e, l / p, g) is the a priori pdf of the segmentation. Now, if we introduce some physical assumptions about the relation between labels and observations, the MAP estimate can be restated as

(\hat{e}, \hat{l}) = \arg\max_{e,l} p(d_x, d_y / e, p) p(i / e, p) p(e / l) p(l / g)    (4)

If we model the two first probability terms on the right side of equation (4) (corresponding to the observations) assuming conditionally independent gaussian random fields, and characterize the other two terms (corresponding to the labels) by Gibbs distributions [6], it can
6 References.
[1] N. Baaziz and C. Labit. Multiconstraint Wiener-based motion compensation using wavelet pyramids. IEEE Transactions on Image Processing, Vol. 3, No. 5, pp. 688-692. September 1994.
[2] J. Biemond, L. Looijenga, D.E. Boekee and R. Plompen. A pel-recursive Wiener-based displacement estimation algorithm. Signal Processing, Vol. 13, No. 4, pp. 399-412. December 1987.
[3] M. M. Chang, A. M. Tekalp and M. I. Sezan. Motion-field segmentation using an adaptive MAP criterion. Proc. of the ICASSP, 1993, Vol. 5, pp. 33-36.
[4] E. François. Interprétation qualitative du mouvement à partir d'une séquence d'images. Ph.D. Thesis, Université de Rennes I, France, June 1991.
[5] V. García Garduño. Une approche de compression orientée-objets par suivi de segmentation basée mouvement pour le codage de séquences d'images numériques. Ph.D. Thesis, Université de Rennes I, France, May 1995.
[6] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the bayesian restoration of images. IEEE Trans. Pattern Anal. Machine Intell., Vol. PAMI-6, No. 6, pp. 721-741. November 1984.
[7] S. Z. Li. Markov Random Field Modeling in Computer Vision. Tokyo, Springer-Verlag, 1995.
[8] D.W. Murray and B.F. Buxton. Scene segmentation from visual motion using global optimization. IEEE Trans. Pattern Anal. Machine Intell., Vol. PAMI-9, No. 2, pp. 220-228. March 1987.
Figure 2: Wiener-based pel-recursive estimated displacement field d.
Figure 3: MAP motion-based segmentation algorithm. (a) Segmentation field e; final number of regions: 28. (b) Superimposition of frame 7 of Interview and segmentation field e. (c) Synthetic displacement field obtained from the segmentation mask.
be proved that maximizing the a posteriori pdf is equal to minimizing a so-called energy function U(e, l, d_x, d_y, i, p, g), which has the form
U(e, l, d_x, d_y, i, p, g) = α U_d(d_x, d_y, e, p) + β U_i(i, e, p) + γ U_e(e, l) + σ U_l(l, g)    (5)
where α, β, γ and σ are weighting terms. In [3], the DFD information is used for determining the weighting factors of the energy terms in the markovian model. As we have shown, we include it in one of the input data fields of our model (p), so the weighting terms are determined in a different way. Finally, we want to estimate
(\hat{e}, \hat{l}) = \arg\min_{e,l} U(e, l, d_x, d_y, i, p, g)    (6)
The definition of each energy term reflects the knowledge about the problem and the characteristics desired for the segmentation:

• U_d(d_x, d_y, e, p) is a measure of the errors between the motion model Θ_{MLS}, the displacement vector field and the segmentation. When motion has not been correctly estimated (according to field p), this energy term is not considered. In this way, we reduce the introduction of noisy data into the segmentation.

• U_i(i, e, p) promotes a local spatial coherence property in the segmentation. It is used just when the motion estimates are not reliable.

• U_e(e, l) is responsible for modeling the interactions between the segmentation and line fields (segmentation continuity except at motion boundaries).

• U_l(l, g) penalizes the introduction of discontinuity lines, and shapes the motion boundaries (geometrical properties of the segmentation).
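As an illustration of how field p gates the data term, the following sketch evaluates a U_d-like term for purely translational region models (a deliberate simplification of the four-parameter model; all names are illustrative):

```python
def data_energy(d, e, p, models):
    # U_d-style term: squared error between the estimated displacement
    # field d[y][x] = (dx, dy) and the prediction of the region model
    # models[e[y][x]] = (tx, ty); pixels flagged non-compensated
    # (p[y][x] = 1) are skipped, so unreliable motion estimates do not
    # pollute the segmentation
    total = 0.0
    for y in range(len(d)):
        for x in range(len(d[0])):
            if p[y][x]:          # convergence failed here; skip
                continue
            mx, my = models[e[y][x]]
            dx, dy = d[y][x]
            total += (dx - mx) ** 2 + (dy - my) ** 2
    return total
```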
3 The segmentation algorithm.
The global optimization of the solution uses an iterative deterministic relaxation procedure: a modified Iterated Conditional Modes (ICM) method based on an instability table [4] is employed to overcome the great computational cost required by simulated annealing. ICM methods minimize the local energy ΔU_s at each pixel
s = (x, y) of the image. Our minimization scheme considers two phases in each iteration [5]: one for the optimization of the segmentation field, and the other for the optimization of the line field. The complete motion-based segmentation algorithm includes four stages: (a) initialization, (b) numbering and labeling of each region in the image, (c) motion model parameter estimation in each region, and (d) optimization of the label fields. These steps are repeated until the method reaches the maximum number of iterations allowed, or until the segmentation becomes stable (convergence). An advantage of our algorithm is that the number of regions in the image is not fixed through the segmentation process.
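The ICM relaxation loop can be sketched as below; the Potts-style local energy is a toy stand-in for the full energy of Eq. (5), used only to make the sweep-until-stable structure concrete:

```python
def icm_segment(local_energy, labels, num_regions, max_iter=10):
    # ICM: visit every pixel, give it the label with minimum local
    # energy given its current neighbours, and stop when a full sweep
    # changes nothing (convergence) or max_iter is reached
    H, W = len(labels), len(labels[0])
    for _ in range(max_iter):
        changed = False
        for y in range(H):
            for x in range(W):
                best = min(range(num_regions),
                           key=lambda r: local_energy(labels, x, y, r))
                if best != labels[y][x]:
                    labels[y][x] = best
                    changed = True
        if not changed:
            break
    return labels

def make_potts_energy(obs, beta=0.6):
    # toy local energy: a data term (disagreement with an observed label
    # field) plus a Potts smoothness term over the 4-neighbourhood
    H, W = len(obs), len(obs[0])
    def energy(labels, x, y, r):
        data = 0 if r == obs[y][x] else 1
        smooth = sum(r != labels[ny][nx]
                     for nx, ny in ((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1))
                     if 0 <= nx < W and 0 <= ny < H)
        return data + beta * smooth
    return energy
```

With this toy energy, an isolated noisy label is flipped to agree with its neighbours, which mirrors the smoothing role of the prior terms.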
4 Results.
Results obtained on a real digital television sequence (Interview¹) are presented in figures 2 and 3. This sequence includes global motion (camera pan) and an object with different motion (the woman to the right). The displacement field estimated with the Wiener-based algorithm is shown in figure 2. The segmentation obtained using this field and the proposed algorithm is shown in figures 3a and 3b. It can be seen that the woman in motion is correctly segmented, even if there are some small inner regions that do not disappear. Some other tiny remaining regions can be fused by further processing. The estimated motion discontinuities have been well fitted to the real ones. In figure 3c, we show the displacement field synthesized from the segmentation mask and the assumed motion model. This field appears to be ideal for coding purposes. The algorithm has been employed with other sequences (synthetic and real), and good results have been obtained. It converges after approximately 5 iterations in all cases.
5 Conclusion.
We have derived a method to segment image sequences from a coupled MRF formulation that includes some previously reported algorithms. Our proposed approach produced satisfactory results. Nevertheless we think that, due to the motion estimation-segmentation interdependence, the results can be improved if an iterative scheme (pel-recursive motion estimation and MRF motion-based segmentation) is applied. This strategy could be combined with a different minimization method.

¹Courtesy of CCETT-France
UNSUPERVISED MOTION SEGMENTATION OF IMAGE SEQUENCES USING ADAPTIVE FILTERING
Olaf Pichler (1), Andreas Teuner (2), and Bedrich J. Hosticka (2)
(1) Chair of Microelectronic Systems, Dept. of El. Eng., University of Duisburg, 47057 Duisburg, Germany
(2) Fraunhofer Institute of Microelectronic Circuits and Systems, 47057 Duisburg, Germany
Abstract

In this paper we present a new method for feature extraction in unsupervised motion segmentation of image sequences. The method is based on the multichannel filtering of the input image sequence using adapted Gabor filters. We use the histogram of a quantised two-dimensional parameter space resulting from a three-dimensional orientation analysis to select the main Gabor filter parameters, namely the azimuthal and elevational angles. The three-dimensional orientation analysis consists essentially of the eigenvalue and eigenvector computations of inertia tensors in the three-dimensional frequency space. The feature images are obtained from the complex valued filtered sequences using a simple magnitude computation and are subsequently evaluated employing a multichannel segmentation algorithm. The performance of the algorithm is demonstrated using two segmentation examples that are based on artificial as well as real image sequences.
Motion representation in the frequency domain

For an image s(x, y) of an object moving at constant velocity \vec{v} = (v_x, v_y) it can be written that

s(x, y, t) = s(x - v_x t, y - v_y t).    (1)

Fourier transformation of Eq. (1) yields

S(ω_x, ω_y, ω_t) = S(ω_x, ω_y) \cdot δ(ω_x v_x + ω_y v_y + ω_t)    (2)

with S(ω_x, ω_y) being the Fourier transform of s(x, y). Eq. (2) implies that the spectrum S(ω_x, ω_y, ω_t) is equal to zero except in the plane defined by the argument of the Dirac function. The plane equation is then defined by

\vec{ω} \cdot \vec{n} = 0,  with \vec{ω} = (ω_x, ω_y, ω_t) and \vec{n} = (v_x, v_y, 1).    (3)

It can be seen that the three-dimensional spectrum S(ω_x, ω_y, ω_t) is obtained by parallel projection of the two-dimensional spectrum S(ω_x, ω_y) into the Dirac plane defined by the velocity vector \vec{v}. Hence, velocity is not a coordinate in the three-dimensional frequency domain but an orientation.
3-D orientation analysis based on inertia tensors

Considering the three-dimensional spectrum S(\vec{ω}) as a mass density distribution, it becomes obvious that the normal vector \vec{n} of the plane can be determined by computing the rotation axis intersecting the origin of coordinates for which the moment of inertia, defined as

J = \int_{-\infty}^{\infty} d^2(\vec{ω}, \vec{ω}_n) |S(\vec{ω})|^2 d\vec{ω},    (4)

is at its maximum (\vec{ω}_n: vector in the direction of the rotation axis; d(\vec{ω}, \vec{ω}_n): distance between \vec{ω} and the rotation axis). For spectra that are not an exact plane, \vec{ω}_n^{max} with J(\vec{ω}_n^{max}) = max is the best approximation for the non-ideal orientation. The moment of inertia of a body with respect to an arbitrary rotation axis \vec{ω}_n can be written using the inertia tensor \mathbf{J} as [1, 2]

J = \vec{ω}_n^T \mathbf{J} \vec{ω}_n,    (5)
which has the following elements:

J_{ii} = \sum_{j \neq i} \int_{-\infty}^{\infty} ω_j^2 |S(\vec{ω})|^2 d\vec{ω}    (6)

J_{ij} = - \int_{-\infty}^{\infty} ω_i ω_j |S(\vec{ω})|^2 d\vec{ω},  i \neq j.    (7)
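A discrete sketch of these tensor elements, together with a small Jacobi eigensolver and the confidence measure discussed below (all names and the sampling scheme are illustrative, not the authors' implementation):

```python
import math

def inertia_tensor(samples):
    # discrete stand-in for Eqs. (6)-(7): samples are (omega, power)
    # pairs with omega = (w_x, w_y, w_t); the diagonal sums the squared
    # *other* coordinates, the off-diagonal is -w_i * w_j times power
    J = [[0.0] * 3 for _ in range(3)]
    for w, p in samples:
        for i in range(3):
            for j in range(3):
                if i == j:
                    J[i][i] += (w[(i + 1) % 3] ** 2 + w[(i + 2) % 3] ** 2) * p
                else:
                    J[i][j] -= w[i] * w[j] * p
    return J

def eigenvalues_sym3(A, sweeps=30):
    # cyclic Jacobi rotations for a symmetric 3x3 matrix
    A = [row[:] for row in A]
    for _ in range(sweeps):
        for p in range(3):
            for q in range(p + 1, 3):
                if abs(A[p][q]) < 1e-12:
                    continue
                th = 0.5 * math.atan2(2 * A[p][q], A[q][q] - A[p][p])
                c, s = math.cos(th), math.sin(th)
                G = [[float(i == j) for j in range(3)] for i in range(3)]
                G[p][p] = G[q][q] = c
                G[p][q], G[q][p] = s, -s
                # A <- G^T A G zeroes the (p, q) entry
                GtA = [[sum(G[k][i] * A[k][j] for k in range(3))
                        for j in range(3)] for i in range(3)]
                A = [[sum(GtA[i][k] * G[k][j] for k in range(3))
                      for j in range(3)] for i in range(3)]
    return sorted(A[i][i] for i in range(3))

def orientation_confidence(J):
    # conf = J_max / (J_mid + J_min); equals 1 for an ideal plane
    jmin, jmid, jmax = eigenvalues_sym3(J)
    return jmax / (jmid + jmin)
```

For a spectrum confined exactly to the plane ω_t = 0, the largest eigenvalue equals the sum of the other two (perpendicular axis theorem), so the confidence reaches its upper bound.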
Based on Parseval's theorem it can be shown that for small windowed areas of the input sequence these elements can be easily computed from the partial derivatives in the space-time domain followed by lowpass filtering [1]. The eigenvalues of \mathbf{J} are the moments of inertia with respect to the principal axes and, hence, the eigenvector with the greatest eigenvalue is \vec{ω}_n^{max}. A measure of confidence for the orientation is given by

conf = \frac{J_{max}}{J_{mid} + J_{min}},  0.5 < conf < 1,    (8)

where J_{max}, J_{mid}, and J_{min} are the maximum, middle, and minimum eigenvalues of \mathbf{J}, respectively. Thus the inertia tensor model yields the orientation of a local surrounding of the input sequence as well as a measure of confidence for it.

3-D Gabor filters
The transfer function H(f_x, f_y, f_t) of a three-dimensional Gabor filter is given by

H(f_x, f_y, f_t) = 4\sqrt{w_x w_y w_t} \cdot \exp[-π ((w_x (f'_x - F))^2 + (w_y f'_y)^2 + (w_t f'_t)^2)]    (9)

where w_x, w_y, and w_t are the widths in the x, y, and t direction, respectively, and F is the center frequency. The apostrophe denotes the coordinate rotation by φ_a in the azimuthal and by φ_e in the elevational direction. Following the two-dimensional case [3], the radial bandwidth B is defined as

B = \log_2 \frac{w_x F + \sqrt{\ln 2 / π}}{w_x F - \sqrt{\ln 2 / π}},    (10)

and the azimuthal and elevational bandwidths Ω_a and Ω_e are

Ω_a = 2 \arctan\Big( \frac{\sqrt{\ln 2 / π}}{w_y F} \Big)  and  Ω_e = 2 \arctan\Big( \frac{\sqrt{\ln 2 / π}}{w_t F} \Big).    (11)
The radial bandwidth is measured in octaves, whereas the azimuthal and elevational bandwidths are measured in degrees.

Filter parameter selection
In order to extract motion features for image sequence analysis using Gabor filters the filters must match the planes in Fourier space that are created by objects moving at different velocities in a single scene. Hence, it is clear that these filters have to be 'flat' in elevational direction in order to reduce crosstalk between filter channels adapted to different velocities. Since the filter characteristics may be degraded due to quantisation effects if they are chosen too 'flat' we set the elevational bandwidth to ~e = 1.5 o for eight frame long sequences. The choice of the azimuthal bandwidth is much less critical and simulations show that f2a = 45 o is a good choice in many cases with respect to crosstalk and discriminative power of the resulting features. Since all velocity planes are planes intersecting the origin of coordinates 0Eq. (3)) their intersections are straight lines likewise through the origin. Thus, this suggests the use of filters with linearly increasing width for increasing center frequency, or in other words, with constant bandwidths, as defined by Eqs. (10) and (11). To reduce crosstalk resulting from overlapping transfer functions near the origin of coordinates we have chosen B = 0.75 octaves and F = 0.75 FNyquist (FNyquis t denotes the minimum of the two Nyquist frequencies in direction to the spatial coordinates). In order to extract motion features from an image sequence in an unsupervised manner we need the main Gabor filter parameters which are the angles of azimuthal and elevational rotation q~, and q~e. To obtain these parameters first we carry out a 3-D orientation analysis, as described above, and then compute the azimuthal and elevational angles for each eigenvector corresponding to a maximum eigenvalue. Afterwards this two-dimensional parameter space is quantised at predefined resolutions in both directions and its histogram is evaluated and sorted.
The filter parameters φ_a and φ_e are chosen from the sorted histogram in descending order with the following restrictions: 1) For the evaluation of the histogram, only those eigenvectors are taken into consideration whose measure of confidence is greater than 0.95. 2) The histogram classes including the elevational angles of 0° and 90° remain unconsidered, because 0° is physically impossible and must, therefore, result from a virtual velocity, and 90° is equivalent to a zero velocity and, hence, the direction has no physical meaning. 3) All selected filter parameters φ_a and φ_e must have a user-defined distance to each other in order to keep the redundancy of the extracted features small. Feature extraction itself is performed simply by filtering the sequence to be analysed using the adapted Gabor filters, followed by computation of the magnitude of the complex output. The resulting feature image stack is finally used as input data for the multichannel segmentation algorithm proposed in [4].
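The restricted histogram selection described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the function name, the quantisation resolution, and the minimum mutual distance values are assumptions:

```python
import numpy as np

def select_filter_angles(phi_a, phi_e, conf, n_filters=3, res=10.0,
                         min_dist=15.0, conf_thresh=0.95):
    """Pick (azimuth, elevation) Gabor parameters from orientation estimates."""
    # 1) keep only eigenvectors with a high confidence measure
    keep = conf > conf_thresh
    # 2) discard the physically meaningless elevations 0 and 90 degrees
    keep &= (phi_e > 0.0) & (phi_e < 90.0)
    a, e = phi_a[keep], phi_e[keep]
    # quantise the 2-D (azimuth, elevation) parameter space and build a histogram
    bins = {}
    for ka, ke in zip((a // res).astype(int), (e // res).astype(int)):
        bins[(ka, ke)] = bins.get((ka, ke), 0) + 1
    # 3) walk the histogram in descending order, enforcing a minimum distance
    chosen = []
    for (ka, ke), _ in sorted(bins.items(), key=lambda kv: -kv[1]):
        ca, ce = (ka + 0.5) * res, (ke + 0.5) * res   # class centre
        if all(np.hypot(ca - pa, ce - pe) >= min_dist for pa, pe in chosen):
            chosen.append((ca, ce))
        if len(chosen) == n_filters:
            break
    return chosen
```

With two clusters of high-confidence orientation estimates, the two most populated histogram classes are returned as filter angles.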
Experimental results

The performance of the proposed algorithm has been verified using two segmentation examples of eight-frame-long image sequences. The first sequence is an artificial one and contains three objects with the same surface structure moving at velocities of identical magnitude (1 pixel per frame) but in different directions, namely to the left, to the right, and downwards. The horizontally moving objects overlap one another (Figure 1(a)). From Figures 1(b) to (d) it can be seen that the first three Gabor filters adapted by the proposed algorithm show excellent object matching, i.e. each filter is individually matched to one of the moving objects. Hence, the resulting segmentation, as depicted for the fourth frame in Figure 1(e), is nearly perfect. The second example is an eight-frame-long crossroads scene shot out of a standing car, featuring two cars and a bike crossing the road (Figure 2(a)). Although the camera was slightly vibrating during shooting and the brightness of the frames was not exactly constant, the segmentation into three object classes yields a good figure-ground separation as well as the differentiation between the slower bicycle and the faster cars (Figure 2(b)). In addition, the flashing left-turn indicator of the waiting car in front of the camera car has been correctly suppressed. Figures 2(c) and (d) show the masking of the fourth frame by the pixels belonging to the grey and black labeled object classes, respectively. The shapes of the segmented regions appear extended in the direction of motion, which is a direct consequence of the uncertainty principle between the space-time and the three-dimensional Fourier domain. Figure 2(e) shows the confidence measure stemming from the 3-D orientation analysis, wherein conf = 0.5 corresponds to black and conf = 1.0 to white. Binarisation of this image with the threshold conf = 0.95 yields Figure 2(f).
Only the eigenvectors belonging to white pixels in this image are considered valid for further analysis. It is important to realize that a velocity analysis using the 3-D orientation analysis alone is impossible because of the high percentage of non-valid eigenvectors; the total number of valid eigenvectors is merely sufficient to select the Gabor filter parameters. This finding is supported by the images of the azimuthal and elevational angles derived from the 3-D orientation analysis in Figures 2(g) and 2(h), which are noisy to such an extent that they are useless without the aid of the confidence measure. The examples clearly demonstrate that the proposed algorithm is able to perform unsupervised analysis of image sequences containing objects moving at velocities that differ in magnitude and/or direction.
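The 3-D orientation analysis used throughout — the inertia tensor built from space-time partial derivatives (Eqs. (6) and (7) via Parseval's theorem), its eigen-decomposition, and the confidence measure of Eq. (8) — can be sketched as below. This is a simplified, whole-sequence illustration under assumed array names; the paper computes it per windowed region with lowpass filtering:

```python
import numpy as np

def orientation_confidence(seq):
    """seq: (t, y, x) image sequence; returns (conf, axis of maximum inertia)."""
    gt, gy, gx = np.gradient(seq.astype(float))   # space-time partial derivatives
    g = [gt, gy, gx]
    # scatter matrix S_ij = sum over the region of (d_i s)(d_j s)
    S = np.array([[np.sum(gi * gj) for gj in g] for gi in g])
    # inertia tensor (up to a constant factor, irrelevant for the ratio below):
    # J_ii = sum_{j != i} S_jj and J_ij = -S_ij, i.e. J = trace(S) I - S
    J = np.trace(S) * np.eye(3) - S
    evals, evecs = np.linalg.eigh(J)              # ascending eigenvalues
    conf = evals[2] / (evals[1] + evals[0])       # Eq. (8)
    return conf, evecs[:, 2]                      # eigenvector of J_max
```

For a texture translating at 1 pixel per frame, the spectrum is (nearly) a plane, so the confidence approaches 1 and the principal axis couples the temporal and horizontal directions.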
Figure 1: (a) Fourth frame out of an eight-frame-long image sequence containing three objects moving at velocities of identical magnitude (1 pixel per frame) but different directions (to the left, to the right, and downwards). (b) to (d) Feature images of the fourth frame for the first three adapted Gabor filters, where each filter matches one of the moving objects, as found by the proposed algorithm. (e) Segmentation result for the fourth frame with four object classes.
Figure 2: (a) Fourth frame out of an eight-frame-long crossroads scene shot out of a standing car, with two cars and a bike crossing. (b) Segmentation result for the fourth frame containing three object classes, obtained by using the proposed algorithm. (c) Masking of the fourth frame by the pixels belonging to the grey labeled object class of (b). (d) Masking of the fourth frame by the pixels belonging to the black labeled object class of (b). (e) Image of the confidence measure stemming from the 3-D orientation analysis (conf = 0.5 = black, conf = 1.0 = white). (f) Binarisation of (e) with threshold conf = 0.95. (g) Image of the azimuthal angle derived by 3-D orientation analysis. (h) Image of the elevational angle derived by 3-D orientation analysis (white = low velocity, black = high velocity).

References
[1] B. Jähne: Digital Image Processing, Springer, 1994.
[2] H. Goldstein: Klassische Mechanik, 8th Edition, Aula, 1985.
[3] A. C. Bovik, M. A. Clark, and W. Geisler: "Multichannel Texture Analysis Using Localized Spatial Filters", IEEE Transactions on PAMI, Vol. 12, No. 1, pp. 55-73, January 1990.
[4] O. Pichler, A. Teuner, and B. J. Hosticka: "A Multichannel Algorithm for Image Segmentation with Iterative Feedback", Fifth International Conference on Image Processing and its Applications, 4-6 July 1995, Edinburgh (Scotland).
[5] A. Teuner, O. Pichler, and B. J. Hosticka: "Unsupervised Texture Segmentation of Images Using Tuned Matched Gabor Filters", IEEE Transactions on Image Processing, Vol. 4, No. 6, pp. 863-870, June 1995.
[6] A. Teuner, O. Pichler, and B. J. Hosticka: "Unüberwachte Selektion und Abstimmung von dyadischen Gaborfiltern zur Textursegmentierung", DAGM-Symp. Mustererkennung '94, 21-23 September 1994, Vienna (Austria).
[7] J. K. Aggarwal and N. Nandhakumar: "On the Computation of Motion from Sequences of Images - A Review", Proceedings of the IEEE, Vol. 76, No. 8, pp. 917-935, August 1988.
[8] D. J. Heeger: "Model for the extraction of image flow", J. Opt. Soc. Am. A, Vol. 4, No. 8, pp. 1455-1471, August 1987.
DEVELOPMENT OF A MOTION COMPENSATED CODING SYSTEM FOR AN ENHANCED WIDE SCREEN TV

Takahiro HAMADA and Shuichi MATSUMOTO
KDD R&D Laboratories, Visual Communications Group, Japan
2-1-15 Ohara Kamifukuoka, Saitama 356, Japan
TEL(FAX): +81-492-78-7428(7439) E-mail: [email protected]

Abstract

EDTV II is a Japanese standard system for enhanced wide-screen NTSC broadcasting. We developed a motion compensated EDTV II coding system for the purpose of transmitting an EDTV II signal of contribution quality to television studios. The algorithm for this system has been recommended as the Japanese standard EDTV II coding scheme.

1. Introduction

There is no doubt that television is one of the leading information media throughout the world. Television broadcast standards unique to different areas were developed and have been used for more than 30 years during the spread of television, as shown in Table 1. As television continually widened its role, not only in journalism but also in entertainment fields such as sports, viewers started to demand higher picture quality on a wider monitor than was available on existing TV receivers. Taking viewer demands into account, intensive and far-ranging efforts were made to provide standardized higher-quality television at a world-wide level, and several TV broadcast standards were established in Europe, the U.S. and Japan. Two approaches were taken: one compatible and one not compatible with current TV standards. HDTV is designed to be totally free from backward compatibility with conventional TV standards, so that much higher picture quality can be obtained on a wide-screen monitor with approximately twice the number of pixels and scanning lines. The other approach is a wide-screen TV, mainly designed to widen the aspect ratio from 4:3 to 16:9 with backward compatibility.
This means that programs of a wide-screen TV can be enjoyed on current existing receivers, and enjoyed even more than before on specific receivers with a wide screen, whereas HDTV programs require viewers to completely replace their current receivers. A current key issue for HDTV and wide-screen TV is how to transmit these TV signals from events, such as the Olympic Games, to TV studios, in particular by using digital compression technology (high-efficiency coding). In these transmissions, very high quality must be achieved, at the grade of contribution quality, for the purpose of post-processing and editing at TV studios. Much research and development of high-efficiency coding has been conducted, and DCT (Discrete Cosine Transform) based schemes such as ITU-T J.81 and MPEG-2 have been established. These schemes, however, are designed on the premise that the input signal is a component TV signal in which the luminance and chrominance signals can be handled separately, like an HDTV signal. A wide-screen TV, however, uses the same composite signals as the current NTSC or PAL systems. A color encoding-decoding process is therefore required for applying these schemes to wide-screen TV, but this process fatally affects the picture quality. To resolve this problem, we developed a WHT (Walsh Hadamard Transform) based coding system for the Japanese standard wide-screen TV (EDTV II). In this system, a composite TV signal can be directly coded without a color encoding-decoding process, and sophisticated coding control of the EDTV II signal is achieved.

Table 1 Television standards

                             | U.S.              | Japan                     | Europe
Current TV                   | NTSC (Analog AM)  | NTSC (Analog AM)          | PAL, SECAM (Analog AM)
HDTV (Broadcasting)          | ATV (MPEG-2+VSB)  | Hivision (MUSE+Analog FM) | DVB (MPEG-2+QPSK, COFDM)
Wide screen TV (Broadcasting)| -                 | EDTV II (Analog AM)       | PALplus (Analog AM)

2. Current status of digital compression

A digital compression scheme has been standardized as MPEG-2. A block diagram of this DCT-based coding scheme is shown in Fig. 1. In this scheme, the current frame to be coded is subtracted from neighboring frames (predictive frames) on a block basis, and the motions for each block in the predictive frames are detected and compensated (Motion Compensation). DCT is then applied to the difference signal, which is subjected to quantization, and the quantizer outputs are then coded and sent. There are two applications for digital compression of TV signals. One application is transmission
from an event arena to a TV studio, and the other application is broadcasting from the TV studio to the end users (Fig. 2).

[Fig. 1 Block diagram of DCT-based MPEG-2 (DCT: Discrete Cosine Transform; Q: Quantization; FM: Frame memory; MC: Motion Compensation; VLC: Variable Length Coding)]
[Fig. 2 Transmission to and broadcasting from a TV studio]

In MPEG-2, the coding parameters for each application are defined by a 4:2:2 profile and a main profile, as shown in Table 2.

Table 2 MPEG-2 parameters

Application        |            | Transmission  | Broadcasting
MPEG-2 profile     |            | 4:2:2 profile | Simple/Main profile
Chroma sampling    |            | 4:2:2         | 4:2:0
H/V/T+             | High       | Under study   | 1920/1152/60
                   | High-1440  | Under study   | 1440/1152/60
                   | Main       | 720/608/30    | 720/576/30
Bit-rate++         | High       | Under study   | 80 Mbps
                   | High-1440  | Under study   | 60 Mbps
                   | Main       | 50 Mbps       | 15 Mbps

+ : H, V, T mean maximum horizontal pels, vertical lines and frame numbers
++ : Maximum bit-rate

Both DVB and ATV have adopted fully digital broadcasting using the MPEG-2 main profile and will be in service from 1997. Hivision in Japan had already started broadcasting service in 1991, using MUSE+analog FM before MPEG-2 was standardized, and several proprietary systems such as HDC 45 [1] were used for transmission applications. An MPEG-2 high profile will be employed for all HDTV standards in transmission applications. In both PALplus and EDTV II, broadcasting is conducted by an analog AM ground wave, the same as with the current PAL and NTSC, respectively. For digital compression, however, a specific coding scheme must be designed, since MPEG-2 can hardly be used for these signals. We therefore developed the motion compensated EDTV II coding scheme for a transmission application, as described in section 4.
3. EDTV II signal [2]

Two types of wide-screen TV, PALplus and EDTV II, are compared in Table 3.

Table 3 Comparison of PALplus and EDTV II

                                       | PALplus              | EDTV II
Aspect ratio                           | 16:9                 | 16:9
Compatibility to a current TV          | Letter box           | Letter box
Active image area lines                | 432                  | 360
Black area lines                       | 144                  | 120
Horizontal resolution                  | 0-5.5 MHz            | 0-4.2 MHz
Vertical resolution (active image area)| 0-430 lines/height   | 0-360 lines/height
Vertical high resolution helper (VH)   | 430-576 lines/height | 360-480 lines/height
Vertical-Temporal helper (VT)          | No                   | 180-360 lines/height
Horizontal helper (HH)                 | No                   | 4.2-6.0 MHz
Scanning                               | 625 interlace        | 525 interlace / 525 progressive

[Fig. 3 EDTV II signal format]

These two signals have many common formats, but PALplus has only a vertical high resolution helper signal (VH). This is because a PAL signal originally has a higher resolution, of up to 5.5 MHz, than does NTSC at 4.2 MHz. Fig. 3 shows the EDTV II signal format, focusing on three types of helper signals. The 180-line active image area in each field is a 16:9 aspect ratio signal maintaining compatibility with a conventional 4:3 NTSC monitor (Fig. 4). Two types of helper signals, VT and VH, are multiplexed in the remaining 30 lines above and below it. Here, VT is a Vertical-Temporal helper signal, which converts the interlaced active image area into a progressive scanning format, and VH is a Vertical high resolution helper signal, which enhances the vertical frequency of the active image area from 360 lph to 480 lph. A third type, HH, is a horizontal helper signal, which is multiplexed into the "Fukinuki hole" present in the active image area and which enhances the horizontal frequency of this area from 4.2 MHz to 6 MHz. These enhancement processes are conducted in EDTV II decoding, by which a higher quality wide-screen TV signal is obtained.
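The line budget behind the EDTV II letter-box format can be checked with a few lines of arithmetic. This is a sketch; the only input is the 480 visible NTSC lines per frame:

```python
# A 16:9 image letter-boxed onto a 4:3 raster occupies (4/3)/(16/9) = 3/4
# of the picture height, so of the 480 visible NTSC lines per frame:
visible = 480
active = visible * 3 // 4        # 360 lines carry the 16:9 image
black = visible - active         # 120 lines are left for helper signals
per_field = active // 2          # 180 active lines per interlaced field
helper_per_field = black // 2    # 60 helper lines per field (30 above + 30 below)
print(active, black, per_field, helper_per_field)
```

These values match the EDTV II column of Table 3 and the 180-line active area per field described above.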
4. EDTV II Coding System

4.1 System configuration

Fig. 4 shows the system configuration. A total of 768 pixels × 496 lines is extracted from the EDTV II signal in a pre-processing unit as the coding target area. An 8×8 intraframe Hadamard transform is used as the transform scheme, and motion compensated interframe DPCM is implemented in the Hadamard transform domain.

[Fig. 4 System configuration]
4.2 Composite Motion Compensation [3]

Fig. 5 shows the general concept of composite motion compensation. In this figure, let the color sub-carrier phase of the coding block be a reference (0°), expressed as (4). The phase is then shifted from this reference by 90°, 180° and 270°, respectively, as the position of the reference block is moved by one sample (3), two samples (1) and three samples (2). Movement by four samples returns the phase back to 0°. These four types of phase shift along the two-dimensional (MVx, MVy) axes are also shown in the same figure. The figure also shows phase compensation by permutations and polarity changes of the two Hadamard coefficients located in the center of the 8×8 WHT block, for every phase shift (1), (2) and (3) in the figure. This technique greatly improves the correlation between the coding and reference blocks.

4.3 Adaptive Coding Control

The four types of stripes on the EDTV-II frame are shown in Fig. 6. The following adaptive coding (quantization) control is performed on each stripe to minimize degradation of the decoded EDTV-II picture quality. 1) ID and control signal area: nearly lossless coding (quantization step size = 1). 2) Boundary area: high SNR (step size = 3), because an imperfect junction of the above and below black areas is easily recognized as a streaky noise in the center of the EDTV-II decoded image. 3) Active image area: adaptive NTSC coding control considering human visual perception.
4) Black area: uses the average step size of the active image area.

[Fig. 5 Composite motion compensation]
[Fig. 6 Stripe classifications for adaptive control]
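The composite motion compensation of section 4.2 can be sketched as follows. With the NTSC color sub-carrier sampled at four samples per cycle, a horizontal displacement of one sample corresponds to a 90° phase shift; the compensation then rotates the quadrature pair of central Hadamard coefficients by permutation and polarity change only. The coefficient mapping below is a plausible illustration of such a rotation, not the exact table from the paper:

```python
def subcarrier_phase_shift(mv_x):
    """Phase shift (degrees) of the NTSC color sub-carrier caused by a
    horizontal motion vector of mv_x samples (4 samples per sub-carrier cycle)."""
    return (mv_x % 4) * 90

def compensate_pair(c1, c2, phase):
    """Rotate the quadrature pair (c1, c2) of central WHT coefficients using
    only permutations and polarity changes -- no multiplications needed."""
    return {0:   (c1, c2),
            90:  (c2, -c1),
            180: (-c1, -c2),
            270: (-c2, c1)}[phase]
```

A motion vector of five samples, for example, needs the same compensation as one sample (5 mod 4 = 1, i.e. a 90° shift), and applying the 90° step four times returns the coefficients unchanged.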
4.4 Coding performance

We evaluated the picture quality of several EDTV-II test sequences. The bit rate was fixed at 20 Mbps, and the objective/subjective evaluation results obtained at this bit rate are as follows. The SNR (Signal to Noise Ratio) of the control signal and boundary areas is constant and more than 50 dB, which shows the effect of the coding control of section 4.3. The SNR of the main and black areas depends on the sequence, and even the lowest SNR is higher than 45 dB. In the subjective evaluations, original sequences and system outputs were decoded by an EDTV II decoder and displayed on a 16:9 wide-screen TV monitor, as shown in Fig. 4. System outputs gave steady EDTV II decoding because of the constantly high SNR of the control signal, and degradations were difficult to find even in the lowest-SNR sequence. This demonstrates that this system can transmit an EDTV II program with contribution quality at the bit rate mentioned above (equivalent to one-half of a 36 MHz transponder).

5. Conclusion
We have developed a WHT-based EDTV II coding system in which composite motion compensation and a stripe-based, four-type adaptive coding control are achieved. Utilizing this technology will permit transmission of an EDTV II signal of contribution quality at a bit-rate of 20 Mbps (two EDTV II programs can be transmitted on one 36 MHz transponder). The algorithm for this system has been recommended as the Japanese standard EDTV II coding scheme. The stripe-based adaptive coding control will probably also prove effective for PALplus. For motion compensation, however, more than two coefficients will be required on the WHT in order to compensate the phase shifts of the PAL color sub-carrier, but the performance will still probably be greater than in the case of DCT. We intend to make further studies of efficient color sub-carrier compensation algorithms for motion compensated PALplus coding.

References:
[1] S. Matsumoto et al.: "Development of High Compression HDTV Digital Codec", Fifth IEE Conference on Telecommunications, 26-29 March 1995, pp. 100-104.
[2] "Enhanced Wide-Screen NTSC TV Transmission System", Question ITU-R 42-2/11.
[3] T. Hamada et al.: "NTSC Interframe Direct Coding Scheme by Composite Motion Compensation on Hadamard Transform", IEICE Transactions B-I, Vol. J77-B-I, No. 7, pp. 475-482, July 1994.
Session P: BIOMEDICAL APPLICATIONS
BRAIN EVOKED POTENTIALS MAPPING USING THE DIFFUSE INTERPOLATION
Djaaffar Bouattoura
Paul Gaillard
Université de Technologie de Compiègne, Département Génie Informatique, BP 529, F-60205 Compiègne cedex, France. Tel. (33) 44 23 44 23 - Fax. (33) 44 23 44 77. Email: Dj.Bouattoura@utc.fr
Pierre Villon
Université de Technologie de Troyes, Laboratoire de Modélisation et Sûreté des Systèmes
François Langevin
Université de Technologie de Compiègne, U.R.A. CNRS 1505 LG2MS
Université de Technologie de Compiègne, U.R.A. CNRS 858 BIM
ABSTRACT

In this paper, an interpolation method based on the diffuse approximation is applied to represent the evoked potentials distribution over the skull. This method retains most of the attractive features of the finite-element method but does not require explicit elements. The human head is assumed here to be a single-layer sphere with homogeneous conductivity, with the Ary eccentricity transformation applied in order to approximate the more realistic 3-shell model. The patterns shown in the computed maps, in both simulation and application tests, suggest the ability of the proposed method to extract coherent information from measured data. In the application protocol, visual evoked potentials are used to test this method.

Keywords: Evoked potentials, topographic mapping, diffuse approximation.

1. INTRODUCTION

Topographic mapping is a widely used tool in neural electrophysiology to obtain a representation of the brain electrical activity over the skull [1, 2]. In evoked potentials (EP's) or electroencephalogram (EEG) mapping, the distribution of the potential can be displayed as maps using isopotential lines (contours) or flats (surfaces). Yan [3] analysed the human head by the finite-element method, investigating the effects of the eye orbit structure. Abboud [4] proposed a method based on finite-volume discretization for calculating the potential distribution in a 4-layer spherical volume conductor due to a dipole current source. The reasons for the success of the finite-element method are well known: the local character of the approximations, the ability to deal with complex geometrical domains, and the existence of a large set of approximation schemes adapted to various problems but embedded in a unified formulation.
However, this method presents two main drawbacks: first, the approximate solutions provided by this approach have limited regularity; second, generating adequate discretization meshes is a difficult task, in particular for complex tridimensional domains. In the present paper, we describe a new method based on the diffuse approximation [5] for calculating the potential distribution (mapping) on a human head model due to brain evoked activity. This method retains most of the attractive features of the finite-element method but provides smoother approximations and requires only sets of discretization points (nodes), without explicit elements.

2. HUMAN HEAD MODELLING

The human head is generally approximated by a sphere in which the electrical properties are represented by concentric shells of different conductivities [6, 7, 8]. The most frequently used model is the one which assumes the human head to be a homogeneous sphere. Moreover, Ary [9] showed that the potential distribution due to a source in a homogeneous sphere was similar to that of a source in a 3-shell sphere with a correction in the eccentricity of the source. In the present paper, we consider the human head as a homogeneous sphere with radius ρ and conductivity σ. The surface is represented by a mesh of quadrangular patches [10] formed by joining adjacent points with straight lines. In order to simulate the potential distribution, we consider an eccentric dipolar source inside the spherical model [11].

3. DIFFUSE INTERPOLATION

The basic idea of the diffuse approximation is to give an approximate value of a given function s, at each point p_e (evaluation point) of the considered surface, starting from the knowledge over N measurement nodes
Figure 1: Electrode configuration used in the visual evoked potentials recording procedure according to the standard 10-20 system (N = 18).
(electrodes) (Fig. 1). This evaluation is based on the measurements obtained from the $k$ ($k < N$) nearest nodes $n_i$ ($i = 1, \ldots, k$) of the evaluation point $p_e$, in which the contribution of each node is given by a weighting function. At any evaluation point $p_e$, the approximate value of $s$ is given by

$$\hat{s}_{p_e} = \begin{bmatrix} 1 & x_{p_e} & y_{p_e} & z_{p_e} \end{bmatrix} a,$$   (1)

where the coefficient column vector $a$ is obtained by a local weighted least squares procedure. It consists in minimizing the following expression:

$$J(a) = \frac{1}{2} \sum_{i=1}^{k} w(n_i, p_e) \left\{ s_{n_i} - \begin{bmatrix} 1 & x_{n_i} & y_{n_i} & z_{n_i} \end{bmatrix} a \right\}^2,$$   (2)

where $s_{n_i}$ represents the value of the function $s$ at each node $n_i$, and $w(n_i, p_e)$ is a positive weighting function which quickly decreases as the distance between the node $n_i$ and the evaluation point $p_e$ increases. This function can be defined as

$$w(n_i, p_e) = w_{ref}(d_i),$$   (3)

such that

1. $w_{ref}(0) = 1$,
2. $\mathrm{supp}(w_{ref}) = [-1, 1]$,
3. $0 \le w_{ref} \le 1$,
4. $w_{ref} \in C^m(-1, 1)$,

and where $d_i$ represents the distance between node $n_i$ and the evaluation point $p_e$, normalized by the attenuation radius $d_{k+1}$, which is given by the distance between $p_e$ and its $(k+1)$th nearest node. $m$ is the length of vector $a$ ($m = 4$ in our case). Let $P_i = \begin{bmatrix} 1 & x_{n_i} & y_{n_i} & z_{n_i} \end{bmatrix}$; equation (2) then becomes

$$J(a) = \frac{1}{2} \sum_{i=1}^{k} w(n_i, p_e) \left( s_{n_i} - P_i a \right)^2.$$   (4)
The estimate of $a$ is obtained by

$$\frac{\partial J(a)}{\partial a} = A a - b = 0.$$   (5)

Thus

$$a = A^{-1} b,$$   (6)

with

$$A = \sum_{i=1}^{k} w(n_i, p_e)\, P_i^T P_i \quad \text{and} \quad b = \sum_{i=1}^{k} w(n_i, p_e)\, s_{n_i} P_i^T.$$

The rank and conditioning of matrix $A$ depend only on the number of nodes belonging to the neighbourhood of $p_e$. A necessary condition to get a non-singular matrix $A$ is the existence of at least $m$ nodes in the neighbourhood of $p_e$ ($N \ge k \ge m$). Without additional hypotheses, the diffuse approximation does not exactly satisfy the interpolation property

$$\hat{s}_{n_i} = \begin{bmatrix} 1 & x_{n_i} & y_{n_i} & z_{n_i} \end{bmatrix} a = s_{n_i}.$$   (7)

In practice, the diffuse approximation can locally be made an exact interpolation at each node $n_i$ by suitably modifying the weighting function. For that, we select weighting functions which are infinite at the nodes $n_i$, for example

$$\tilde{w}_{ref}(d) = \frac{w_{ref}(d)}{1 - w_{ref}(d)},$$   (8)

such that
1. $\tilde{w}_{ref}(0) \to \infty$,
2. $\mathrm{supp}(\tilde{w}_{ref}) = [-1, 1]$,
3. $\tilde{w}_{ref} \in C^m(-1, 1)$.

The definition of $\tilde{w}_{ref}$ is equivalent to the introduction of an interpolation constraint in the minimization problem. The displayed results were computed by using the following weighting window:

$$w_{ref}(d) = \begin{cases} \frac{1}{2}\left[1 + \cos(\pi d)\right] & \text{if } -1 < d < 1, \\ 0 & \text{elsewhere.} \end{cases}$$   (9)

Figure 2: Comparison between the reference map and its interpolated forms in 18, 25 and 35 electrode configurations. The dipole orientations are radial (a) and tangential (b), both with 80% eccentricity.
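The whole evaluation procedure — Eqs. (1) to (6) with the cosine window of Eq. (9) — can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code; the function names and the choice k = 8 are assumptions:

```python
import numpy as np

def w_ref(d):
    """Cosine weighting window of Eq. (9)."""
    return np.where(np.abs(d) < 1, 0.5 * (1 + np.cos(np.pi * d)), 0.0)

def diffuse_approx(nodes, s, pe, k=8):
    """Diffuse approximation of s at evaluation point pe.

    nodes: (N, 3) coordinates of the measurement nodes (electrodes)
    s:     (N,) measured values at the nodes
    pe:    (3,) evaluation point
    """
    d = np.linalg.norm(nodes - pe, axis=1)      # distances to all nodes
    order = np.argsort(d)
    near = order[:k]                            # the k nearest nodes
    w = w_ref(d[near] / d[order[k]])            # normalize by (k+1)th distance
    P = np.hstack([np.ones((k, 1)), nodes[near]])   # rows P_i = [1 x y z]
    A = (P * w[:, None]).T @ P                  # A = sum w P_i^T P_i, Eq. (6)
    b = (P * w[:, None]).T @ s[near]            # b = sum w s_i P_i^T
    a = np.linalg.solve(A, b)
    return np.array([1.0, *pe]) @ a             # Eq. (1)
```

Because the basis [1 x y z] contains all linear functions, any function that is exactly linear in the coordinates is reproduced exactly by the weighted least squares fit.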
4. EXPERIMENTAL PROCEDURE

4.1. Simulation

Many maps were generated with configurations based on 18, 25 and 35 electrodes. The study consists in the comparison of a reference map, corresponding to a dipolar source potential distribution, with a map built by interpolating the extracted sample set. We consider the correlation coefficient $T_r$ (resemblance rate) as a comparison criterion:

$$T_r = \left( 1 - \frac{\sigma_e}{\sigma_r} \right) \times 100\%,$$   (10)
where $\sigma_e$ and $\sigma_r$ represent respectively the standard deviations of the interpolation error and of the reference map. The interpolation error is given by the deviation between simulated and interpolated values. We have modified the dipole orientation from the radial orientation (Fig. 2.a) to the tangential one (Fig. 2.b). The correlation coefficient increases gradually, with the minimal value corresponding to the radial position. This is due to the fact that, in the tangential orientation, the potential distribution is more widespread. From the three selected configurations, we can observe that maps based on 25 and 35 electrodes are better than those with 18 electrodes, in both the radial and tangential orientations. We therefore propose the use of configurations including between 25 and 35 electrodes.

4.2. Application

Visual evoked potentials (VEP's) were recorded on healthy subjects (aged 20-25). Single sweep recordings were performed via 18 channels (electrodes) whose impedances were less than 1 kΩ. VEP's were elicited via a checkerboard of 10 × 10 black and white squares at 100% contrast, 60 cd/m² luminance and a 2 Hz reversal frequency. Each single sweep was amplified with a gain of 3·10⁴ and band-pass filtered (1-100 Hz). A/D conversion was performed at a 1 kHz sampling frequency. After conventional averaging over 200 sweeps, the average responses of the 18 channels were used to build maps via the diffuse interpolation algorithm. Figure 3.a gives the time modality with selected latencies, while figure 3.b gives the space modality. The electrical activity of the salient polarity is highly concentrated over the occipital cortex, as is well known. This kind of potential distribution, whose morphology is in some way simpler, represents a good test for the proposed method in order to evaluate its behaviour in regions where the VEP distribution is almost non-existent (frontal cortex).
As shown, the resulting maps give a satisfactory description of the potential distribution over the head model.
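The conventional averaging step used above relies on the evoked response being time-locked to the stimulus while the background EEG is approximately zero-mean across sweeps, so its amplitude shrinks like 1/√N. A minimal sketch with synthetic data (the waveform shape and noise level are assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
n_sweeps, n_samples = 200, 500                     # 200 sweeps, 500 ms at 1 kHz
t = np.arange(n_samples) / 1000.0
evoked = 5.0 * np.exp(-((t - 0.1) / 0.02) ** 2)    # stimulus-locked component
# each single sweep = evoked response + independent background "EEG" noise
sweeps = evoked + 10.0 * rng.standard_normal((n_sweeps, n_samples))
average = sweeps.mean(axis=0)                      # conventional averaging
residual = average - evoked                        # leftover background noise
# residual std should be roughly 10 / sqrt(200) ~ 0.7 of the raw noise's 10
print(round(residual.std(), 2))
```

With 200 sweeps, the single-sweep noise (here ten times the peak is buried under) is attenuated by a factor of about 14, which is why the averaged 18-channel responses are clean enough to interpolate.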
Figure 3: (a) Multiple raw traces of the average VEP responses (18 channels) according to the standard 10-20 system. Dotted vertical lines point out the selected latencies at which maps were generated. (b) Topographic maps at the selected latencies. The most important latencies, N75 and P100, are respectively at 91 ms and 132 ms. The lower part of the patterns represents the occipital cortex.
5. C O N C L U S I O N The proposed method has shown to be able to give a satisfactory dynamic topographic description of the brain visual evoked potential distribution from responses obtained via conventional averaging. In order to simulate potential distribution, we have considered a single eccentric dipolar source inside a homogeneous sphere with radius p and conductivity or. We have observed that maps with configurations based on 25 and 35 electrodes are better than that with 18 electrodes both in the radial and tangential orientations. In the application stage, only 18 leads were used to perform acquisition procedure. The patterns shown in the obtained maps suggest the ability of the proposed method to extract coherent information, from data obtained via different electrodes, even if the used configuration did not allow to show correctly the efficiency of this method. 6. R E F E R E N C E S [1] F. H. DUFFY, J. L. BURCHFIEL,and C. T. LOMBROSO. Brain electrical activity mapping (beam): a method for extending the clinical utility of eeg and evoked potential data. Ann. Neurol., 5:309-321, I979. [2] B. N. CUFFIN and D. COHEN. Comparison of the magnethoencephalogram and electroencephalogram. Electroenceph. Clin. Neurophysiol., 47:132-146, 1979. [3] Y. YAN, P. L. NUNEZ, and R. T. HART. Finite-element model of the head: Scalp potentials due to dipole sources. Med. Biol. Eng. Comput., 29:475--481, 1991. [4] S. ABBOUD, Y. ESHEL, S. LEVY, and M. I~OSENFELD. Numerical calculation of the potential distribution due to dipole sources in a spherical model of the head. Comput. Biomed. Resea., 27:441-455, 1994. [5] B. NAYROLES, G. TOUZOT, P. VILLON, and A. RICARD. Generalizing the finite element method: diffuse approximation and diffuse elements. Computational Mechanic, 10(5):1-12, 1992. [6] R. N. KAVANAGH, T. M. DARCEY, D. LEHMANN, and D. H. FENDER. Evaluation of methods for threedimensional localization of electrical sources in the human brain. 
IEEE Trans. Biomed. Eng., BME-25:421-429, 1978. [7] B. J. ROTH, M. BALISH, A. GORBACH, and S. SATO. How well does a three-sphere model predict positions of dipoles in realistically shaped heads? Electroenceph. Clin. Neurophysiol., 87:175-184, 1993. [8] B. N. CUFFIN. Effects of head shape on EEGs and MEGs. IEEE Trans. Biomed. Eng., BME-37:44-52, 1990. [9] J. P. ARY, S. A. KLEIN, and D. H. FENDER. Location of sources of evoked scalp potentials: correction for skull and scalp thicknesses. IEEE Trans. Biomed. Eng., BME-28:447-452, 1981. [10] F. ZANOW and M. J. PETERS. Individually shaped volume conductor models of the head in EEG source localisation. Med. Biol. Eng. Comput., 33:582-588, 1995. [11] F. N. WILSON and R. BAYLEY. The electric field of an eccentric dipole in a homogeneous spherical conducting medium. Circulation, 1:84-92, 1950.
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K.
B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
COMPUTER-AIDED DIAGNOSIS: DETECTION OF MASSES ON DIGITAL MAMMOGRAMS
Arturo J. Méndez (1), Pablo G. Tahoces (2), María J. Lado (1), Miguel Souto (1), Juan J. Vidal (1) (1) Department of Radiology, University of Santiago de Compostela, Spain (Complejo Hospitalario Universitario de Santiago) (2) Department of Electronics and Computing, University of Santiago de Compostela, Spain
Technical area: Medical Applications
Contact author: Arturo J. Méndez, San Francisco 1, Departamento de Radiología, Facultad de Medicina, 15704 Santiago de Compostela, Spain. Telephone: (81)-570982. FAX: (81)-583345. e-mail: mrarthur@uscmail.usc.es
This work has been supported by the Fondo de Investigaciones Sanitarias de la Seguridad Social (Spain) under Grant no. 94/1708 and by the Consellería de Sanidade da Xunta de Galicia (Proyecto de Investigación del Programa Gallego de Detección Precoz del Cáncer de Mama).
I. INTRODUCTION
While breast cancer is a major cause of death for women, adequate screening programs and independent double reading may reduce these rates, thus improving breast cancer detection [1, 2, 3]. However, the large volume of mammograms generated in a screening program and the need for each mammogram to be read by two radiologists [4] encouraged several investigators to point out the possibility of using computers as a second reader [5, 6]. Two general approaches have been explored in mammographic mass detection: single-image segmentation and bilateral image subtraction. In the first case, several techniques that incorporate knowledge about the lesions have been employed [6]. The second approach uses bilateral subtraction of corresponding left-right matched image pairs and is based on the symmetry generally found between both images, asymmetries indicating possible masses [7, 8]. We present a computerized method to automatically detect masses in digital mammograms by employing bilateral subtraction. Size and eccentricity tests were used to eliminate false-positives. Detection performance was evaluated using free-response receiver operating characteristic (FROC) analysis [9, 10].
II. MATERIALS AND METHODS
A. Acquisition of digital mammograms
76 pairs of mammograms with a biopsy-proven mass were digitized with a Konica laser film scanner (Konica Corp, Tokyo, Japan) at a resolution of 2000 x 2600 pixels (87.5 µm/pixel) with 10 bits precision. After digitization, each digital mammogram was subsampled by a factor of 5 to obtain an image of 400 x 520 pixels. A DEC VAX 4000 computer (Digital Equipment Corp, Maynard, MA) running the VMS operating system was used for all calculations. The computer programs were written in IDL (Research Systems, Inc., Boulder, CO).
B. Detection algorithm
An original algorithm was designed to segment the breast region and to detect the nipple [11]. To detect the breast border, both a thresholded and a smoothed version of the original image were calculated. The breast was divided into three regions (Figure 1), and a tracking algorithm was applied to the mammogram to detect the border. There is a dependency between each region (I, II and III) and the tracking process: in region I the algorithm searches for the breast border from left to right; in region II the algorithm searches for the border from top to bottom; and finally, in region III, the algorithm searches for the border from right to left. To detect the nipple we have used a method that combines the maximum height of the breast and the maximum of the gradient in the direction of a straight line connecting each point of the border with a point inside the breast.
Figure 1.- Breast divided into three regions.
Once the breast border was detected, the images were corrected to avoid differences in brightness between left and right mammograms due to the recording procedure. Given the breast image f(x,y), the corrected breast image is n(x,y) = (f(x,y) - μ)/σ, where μ is the mean gray level and σ the standard deviation of the grey levels belonging to the breast. To align right and left breast
images, the left image was both displaced and rotated, the latter to compensate for possible misalignments due to mammogram acquisition. The correlation coefficient between the right image and the rotated left image was calculated for angles ranging between -5 and 5 degrees. The maximum value of the correlation coefficient corresponds to the best alignment. After the corrected images were aligned, bilateral subtraction and linear stretching to a full contrast of 1024 gray-levels were performed, and a threshold was applied to obtain a binary image with the information of suspicious areas. The asymmetries were extracted by region growing. Size and eccentricity tests were applied to eliminate false-positives.
C. Evaluation of Performance
The effect of the subtlety of the masses on the performance was evaluated. An expert radiologist was asked to record one of five levels of subtlety of the mammographic appearance of these masses: level 1, obvious mass; level 2, relatively obvious mass; level 3, subtle mass; level 4, very subtle mass; and level 5, extremely subtle mass. Detection performance was also evaluated using free-response receiver operating characteristic (FROC) analysis [9, 10]. The receiver operating characteristic (ROC) curve shows the trade-off between true-positive and false-positive fractions as the observer varies the decision threshold. The area under the ROC curve, Az, is usually used as an objective measure of scheme performance to compare CAD schemes. The correct diagnosis of a mass requires its correct localization, and multiple observer responses per image must be allowed. ROC analysis cannot take location information into account and does not allow multiple responses per image, so FROC analysis was used. FROC analysis permits an arbitrary number of abnormalities per image, and the observer can indicate both the rating levels and the locations of the abnormalities.
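The brightness correction and the alignment-by-correlation step can be sketched in a few lines of NumPy. This is an illustrative rendering, not the authors' code: the function names, the boolean breast mask input, and the threshold value are our own choices, and the search here is over small integer displacements only (the paper additionally searches rotations between -5 and 5 degrees):

```python
import numpy as np

def normalize_breast(image, mask):
    """Brightness correction n = (f - mu) / sigma over the breast region."""
    vals = image[mask]
    out = np.zeros_like(image, dtype=float)
    out[mask] = (vals - vals.mean()) / vals.std()
    return out

def align_and_subtract(right, left, max_shift=3, thresh=2.0):
    """Search small integer displacements of the left image, keep the one with
    the highest correlation to the right image, then subtract and threshold
    to obtain a binary map of suspicious (asymmetric) areas."""
    best_c, best_img = -np.inf, left
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(np.roll(left, dy, axis=0), dx, axis=1)
            c = np.corrcoef(right.ravel(), shifted.ravel())[0, 1]
            if c > best_c:
                best_c, best_img = c, shifted
    return (right - best_img) > thresh
```

In practice the binary map would then be passed to region growing and the size and eccentricity tests described above.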
The area under the AFROC (alternative FROC) curve, A1, is the natural index of performance for measuring FROC observer performance.
III. RESULTS
Figure 2.a shows the FROC curve. Each true-positive rate was determined after application of the size and eccentricity tests previously described. The mean number of false-positive detections per image was calculated by dividing the total number of false-positive detections by the total number of mammograms. The fitted AFROC curve yielded an A1 value of 0.527 (st dev = 0.059).
"
i
I
!
I
~o Z
Z 0,8
8
Z
;9
~
0,6
60
~ (st d e v = 0 . 0 5 9 )
A1=0.527
0,4
~I 0,2
~
40
~
20
.
T"
..
I 0
, 0
MEAN
,
,
i 2
NUMBER
,
i OF
,
i 4
,
FALSE
,
,
i 6
,
,
POSITIVES
i
I 8
i
PER
i
OBVIOUS
i
..IF"
.
I R. OBVIOUS
-,.
I SUBTLE
"V"
I
6
~
2
~
0
~
:
,,-
I
VERY SUBTLE EXT.SUBTLE
10 IMAGE
SUBTLETY
Figure 2.- (a) FROC curve. (b) True-positive fractions and mean numbers of false-positive detections per image as a function of mass subtlety (obvious, relatively obvious, subtle, very subtle, extremely subtle).
The effect of the subtlety of the masses on the performance of the method was also analyzed (Figure 2.b). The true-positive fractions tend to be lower in the more subtle categories. An 80% true-positive rate at an average of 2.4 false-positive responses per image was obtained on the whole data set. Examples of detected masses are shown in Figure 3.
Figure 3.- Examples of detected masses (arrows): (a) an obvious mass, and (b) a subtle mass with three false-positives.
IV. DISCUSSION
Although the true-positive rate needs to be improved, we have obtained a low average number of false-positives per image, indicating that our method may continue to improve to a point where clinical implementation becomes feasible. Masses were missed for various reasons: merging with normal tissues, inaccurate alignment, or inaccurate mammogram acquisition. A nonlinear subtraction method, which involves linking multiple subtracted images obtained from different threshold values, together with border and size tests to eliminate false-positive detections, was used by Yin et al. [8]. Lau et al. determined asymmetry between breasts using a combination of several asymmetry measures, and suspicious areas were selected by thresholding (percentile method) and an area test (two-stage thresholding) [7]. We have characterized the asymmetries by using a thresholding (after the brightness correction and subtraction of mammograms) and eccentricity and size tests to eliminate false-positives, instead of the more complex methods of linking multiple subtracted images (used by Yin et al. [8]) or asymmetry measurements (used by Lau et al. [7]). Our results suggest that a method to detect masses on the basis of bilateral subtraction could be potentially useful to aid radiologists in mammographic screening.
REFERENCES
[1] C. C. Boring, T. S. Squires, T. Tong and S. Montgomery, "Cancer statistics, 1994," Ca-A Cancer Journal for Clinicians 44, 7-26 (1994). [2] S. A. Feig, "Decreased breast cancer mortality through mammographic screening: Results of clinical trials," Radiology 167, 659-665 (1988). [3] E. L. Thurfjell, K. A. Lernevall and A. A. S. Taube, "Benefit of independent double reading in a population-based mammography screening program," Radiology 191, 241-244 (1994). [4] R. E. Bird, T. W. Wallace and B. C. Yankaskas, "Analysis of cancers missed at screening mammography," Radiology 184, 613-617 (1992). [5] P. G. Tahoces, J. Correa, M. Souto, L. Gómez and J. J. Vidal, "Computer-assisted diagnosis: The classification of mammographic breast parenchymal patterns," Phys. Med. Biol. 40, 103-117 (1995). [6] W. P. Kegelmeyer, J. M. Pruneda, P. D. Bourland, A. Hillis, M. W. Riggs and M. L. Nipper, "Computer-aided mammographic screening for spiculated lesions," Radiology 191, 331-337 (1994). [7] T. K. Lau and W. Bischof, "Automated detection of breast tumors using the asymmetry approach," Computers and Biomedical Research 24, 273-295 (1991). [8] F. F. Yin, K. Doi, M. L. Giger, C. J. Vyborny and R. A. Schmidt, "Comparison of bilateral-subtraction and single-image processing techniques in the computerized detection of mammographic masses," Invest. Radiol. 28, 473-481 (1993). [9] D. P. Chakraborty and L. H. L. Winter, "Free-response methodology: Alternate analysis and a new observer-performance experiment," Radiology 174, 873-881 (1990). [10] D. P. Chakraborty, "Maximum likelihood analysis of free-response receiver operating characteristic (FROC) data," Med. Phys. 16, 561-568 (1989). [11] A. J. Méndez, P. G. Tahoces, M. J. Lado, M. Souto, J. L. Correa and J. J. Vidal, "Automatic detection of breast border and nipple in digital mammograms," Computer Methods and Programs in Biomedicine 49, 253-262 (1996).
Model order determination of ECG signals using rational function approximations

Joseph S. Paul*, V. Jagadeesh Kumar*, M. R. S. Reddy+
Indian Institute of Technology Madras, INDIA

Abstract
In this paper, a new pole-zero order determination procedure is proposed for ECG signals. The method is based upon the singular value decomposition of the reduced rank approximation of the data matrix obtained by linear filtering of the DCT coefficients of the ECG signal. The energy characteristics of the reduced rank components are found to be highly order dependent. In this paper, we have made use of this property to identify the order of the underlying model. The NMSE of the waveforms reconstructed with the orders estimated using the proposed method is found to be lower than that obtained using statistical methods such as the AIC.
1 Introduction
Modelling a general signal with a fixed order rational function is possible only when the original signal happens to be representable as a linear combination of p exponentials. This representation in terms of damped sinusoids is valid only for those signals which possess the minimum phase property. It is observed that, for ECG signals, a large proportion of the energy is concentrated in the QRS and T waves, which are not present at the beginning of the beat. This feature renders the ECG waveform nonminimum phase. The algorithms that can model such nonminimum phase signals are very few [1]. However, the discrete cosine transform (DCT) coefficients of an ECG beat consist of damped oscillations and so can be conveniently represented as a combination of damped sinusoids [2, 3]. In addition, the DCT coefficients exhibit spectral features with distinct peaks and valleys indicative of the system poles and zeroes, in contrast to the spectrum of the direct time domain ECG beat, which appears to be a monotonically decreasing featureless function [4]. In the present work, we have considered N DCT coefficients of a single ECG beat to be the impulse response of an LTI system. It is observed that the DCT coefficients of an ECG beat decay to zero for n > N/2. Further, since the rational function approximations for ECG signals are built on the assumption that the component waves are composed of a group of second order biphasic fractions [2], with the numerator and denominator polynomials of the same order, we have considered the case in which the impulse response data is characterised by a set of 2p coefficients. In this paper, we have considered a linear algebraic approach based on the singular value decomposition (SVD) of a reduced rank approximation of the data matrix to arrive at an optimal value of the parameter p characterising the impulse response function. In section 2 we develop the theory, and the simulation results that illustrate the performance of the new method are discussed in section 3 of the paper.

* The authors are with the Dept. of Electrical Engg. + The author is with the Division of Biomedical Engg. (corresponding author).
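For reference, the DCT coefficients that play the role of an impulse response here can be computed directly from a sampled beat. The following naive O(N^2) NumPy sketch of an unscaled type-II DCT is our own helper (the paper does not specify its normalisation convention):

```python
import numpy as np

def dct2e(y):
    """Unscaled type-II DCT: C(k) = sum_n y(n) cos(pi k (2n+1) / (2N)),
    producing the damped oscillatory sequence the rational model is fit to."""
    y = np.asarray(y, dtype=float)
    N = len(y)
    n = np.arange(N)
    return np.array([np.sum(y * np.cos(np.pi * k * (2 * n + 1) / (2 * N)))
                     for k in range(N)])
```

An FFT-based DCT would of course be used for long records; the direct sum keeps the definition visible.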
2 Theory
Assuming the system to be of p'th order, the input-output equation is given by

y(n) = -\sum_{i=1}^{p} a_i\, y(n-i) + \sum_{j=0}^{p} b_j\, x(n-j), \quad 0 \le n \le N-1 \qquad (1)

where y(n) denotes the output and x(n) represents the input to the system. The equivalent transfer function representation is given by

H(z) = \frac{b_0 z^p + \sum_{i=1}^{p} b_i z^{p-i}}{z^p + \sum_{i=1}^{p} a_i z^{p-i}} \qquad (2)
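With a unit impulse as input, the difference equation yields the model's impulse response directly. A minimal NumPy sketch (the function name and test coefficients are ours), using the sign convention in which the denominator of the transfer function is z^p + a_1 z^(p-1) + ... + a_p:

```python
import numpy as np

def impulse_response(a, b, N):
    """y(n) = -sum_{i=1..p} a[i-1] y(n-i) + b[n] for n <= p, with x the unit
    impulse; a = [a_1..a_p], b = [b_0..b_p]."""
    p = len(a)
    y = np.zeros(N)
    for n in range(N):
        acc = b[n] if n <= p else 0.0   # sum_j b_j x(n-j), x = unit impulse
        for i in range(1, min(p, n) + 1):
            acc -= a[i - 1] * y[n - i]
        y[n] = acc
    return y
```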
With x(n) as the unit impulse and y(n) representing the N point DCT coefficients of an ECG beat, the system difference equationcan be recast into a matrix form as
Yp where, y-p
--[Yl,Y2,
...yp]t
Yp
(3)
bp
ON-p-lXp
and yp = [Yp+l,Yp+2, ...Y(N-I)] t
for
1 < j _< N -
yj,
1. The matrices
?p and Yp are given by
?p
=
The vectors
Y0
0
...
0
Yl
y0
...
0
Yp-1
Yp-2
...
YO
.
.
.
.
and
pXp
Yr =
Yp
Yp-1
...
Yl
Yp+l
Yp
...
Y2
YN-2
YN-3
...
.
.
.
.
YN-p-1
N - - p - 1Xp
ap and bp represent the coefficients of the numerator and the denominator polynomials
of the rational function representation H(z) with ap = [ a l , a 2 , . . . , % ]
~
and bp = [ b l , b 2 , . . . , b p ]
t.
From
equation (3), we have Ypap = - y p and bp - y~p + lYpap. Since N > p + 1, equation (3) represents an overdetermined system of linear equations and hence in general has an infinite number of solutions.The usual approach is to find a so that the least square error IYpap + ypl 2 = (Ypap + yp)t (Ypap + yp) is minimised. This minimisation yields ap = _ ( y p y p ) - i (y;yp)
(4)
Thus to build up an order recursion, we rewrite the equation (3) as Yp+~ap.k = - y p + ~
(5)
However, since the DCT coefficients of an ECG beat are zero for n > N/2, the above equation can be conveniently represented in terms of Y_p as

-y_{p+k} = Y_{p,k}\, a_{p+k} = (y_{p,k-1},\, y_{p,k-2},\, \ldots,\, y_{p,0},\, Y_p)\, a_{p+k} \qquad (6)

We have used the notation y_{j,k} to denote a vector whose first entry is y_{j+k+1} and which has N - j - 1 elements. Hence, for a given p and k, the data matrix Y_{p,k} = (y_{p,k-1}, y_{p,k-2}, \ldots, y_{p,0}, Y_p) is an (N-p-1) \times (k+p) matrix. This data matrix has full column rank and hence k+p nonzero singular values. The characteristics of this data matrix are investigated with the help of an example in which we have constructed the matrix using the DCT coefficients of the typical ECG waveform illustrated in Fig. 1. Fig. 2 shows the profile of the singular values of the matrix Y_{p,k} for three different combinations of p and k. The curves show an identical variation for all combinations of p and k, and are practically unaffected by the values of either p or k. The data matrix Y_{p,k} therefore does not yield any useful information from which a criterion for order determination could be devised in the framework of SVD. In order to overcome this problem, we have constructed a reduced rank approximation of the data matrix by computing a sequence of filtered estimates for the vectors y_{p,k} as

\tilde{y}_{p+1} = Y_p\, Y_p^{\dagger}\, y_{p+1}, \qquad \tilde{y}_{p+k} = (\tilde{y}_{p+k-1}, \ldots, \tilde{y}_{p+1}, Y_p)\,(\tilde{y}_{p+k-1}, \ldots, \tilde{y}_{p+1}, Y_p)^{\dagger}\, y_{p+k} \qquad (7)

so that the reduced rank matrix \tilde{Y}_{p,k} = (\tilde{y}_{p+k}, \tilde{y}_{p+k-1}, \ldots, \tilde{y}_{p+1}, Y_p) is of rank p only. The SVDs of the data matrices yield

Y_{p,k} = U_{p,k}\, \Sigma_{p,k}\, V_{p,k}^t \qquad (8)

and

\tilde{Y}_{p,k} = \tilde{U}_{p,k}\, \tilde{\Sigma}_{p,k}\, \tilde{V}_{p,k}^t \qquad (9)

where \Sigma_{p,k} = \mathrm{diag}(\sigma_{1,k}, \sigma_{2,k}, \ldots, \sigma_{p+k,k}) and \tilde{\Sigma}_{p,k} = \mathrm{diag}(\tilde{\sigma}_{1,k}, \tilde{\sigma}_{2,k}, \ldots, \tilde{\sigma}_{p,k}, 0, \ldots, 0). The last k entries of \tilde{\Sigma}_{p,k} are zeroes. Fig. 3 shows the profile of the singular values of the matrix \tilde{Y}_{p,k} for the same combinations of p and k as in Fig. 2. Unlike in Fig. 2, the curves in Fig. 3 show a marked dependence on the values of p and k. This indicates that the reduced rank matrix serves as the more appropriate tool for solving the order identification problem in the framework of SVD. In order to extract the order information from the singular values of the reduced rank matrix, we define the matrices \Sigma_p and \tilde{\Sigma}_p, the j'th columns of which are given by \sigma_{p,j} and \tilde{\sigma}_{p,j} respectively, where \sigma_{p,j} = (\sigma_{1,j}, \sigma_{2,j}, \ldots, \sigma_{p+k,j})^t and \tilde{\sigma}_{p,j} = (\tilde{\sigma}_{1,j}, \tilde{\sigma}_{2,j}, \ldots, \tilde{\sigma}_{p+k,j})^t. The average energy of the i'th component in the j'th column of \Sigma_p and \tilde{\Sigma}_p is given by

E_p(i,j) = \frac{\sigma_{i,j}^2}{\sum_{l=1}^{p+k} \sigma_{l,j}^2}, \quad \text{for } 1 \le i \le p+k \qquad (10)

and

\tilde{E}_p(i,j) = \frac{\tilde{\sigma}_{i,j}^2}{\sum_{l=1}^{p+k} \tilde{\sigma}_{l,j}^2}, \quad \text{for } 1 \le i \le p \qquad (11)

For a given p and j, it is observed that \tilde{E}_p(i,j) shows identical patterns for all i, 1 \le i \le p. At the true model order, the variation of \tilde{E}_p(i,j) is a maximum. To identify the model order, it is therefore enough to observe \tilde{E}_p(i,j) for any i in the range 1 \le i \le p. Fig. 4 shows the variation of \tilde{E}_p for p = 6. The true order of the model is observed to be at j = 10 with a scale factor of 2, for which the functional \tilde{E}_p has a maximum. In contrast to the reduced rank matrix, the average energy function of the full rank data matrix for p = 6, as shown in Fig. 5, has a featureless variation without any maxima.
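The key property behind the reduced-rank construction is that projecting each appended column onto the column space of the matrix built so far keeps the rank at p, so the augmented matrix acquires exactly k zero singular values. The following NumPy fragment is our own simplified rendering of that idea, not the paper's code:

```python
import numpy as np

def reduced_rank_augment(Yp, extra_cols):
    """Append filtered (projected) columns in the spirit of eq. (7): each new
    column is replaced by its orthogonal projection onto the column space of
    the matrix built so far, so the rank never grows beyond rank(Yp)."""
    M = Yp
    for y in extra_cols:
        y_filt = M @ np.linalg.pinv(M) @ y   # projection onto span(M)
        M = np.column_stack([y_filt, M])
    return M
```

An SVD of the result then exposes k (numerically) zero singular values, which is what the average-energy criterion exploits.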
Fig. 1(a): A typical ECG beat. Fig. 1(b): DCT coefficients for the beat in Fig. 1(a).
Fig. 2: Profile of the singular values for the full rank data matrix.
Fig. 3: Profile of the singular values for the reduced rank data matrix.
Fig. 4: Variation of the average energy in the principal component for the reduced rank data matrix.
Fig. 5: Variation of the average energy in the principal component for the full rank data matrix.
3 Results and Observations
We have carried out the order determination of a set of fifty different ECG waveforms using the proposed method and compared the results with the AIC. To illustrate the salient features of the proposed method, we have considered three representative ECG beats shown in Figs. 6(a), (b) and (c). The average energy curves obtained from the reduced rank matrix corresponding to the ECG's in Fig. 6 are depicted in Figs. 7(a), (b) and (c) respectively. For comparison with the results obtained using our method, the AIC measures for the three ECG beats of Fig. 6 are shown in Figs. 8(a), (b) and (c). The table below shows the NMSE's for the three waveforms evaluated with the model orders estimated using the algebraic and the statistical (AIC) methods. It is seen that the statistical techniques predict a lower model order than that indicated by the algebraic technique. Further, it is also observed that the algebraic method yields a lower NMSE.

Table 1: Comparison of the results obtained using the algebraic and statistical methods

Figure      Order        Order          NMSE         NMSE
reference   (algebraic)  (statistical)  (algebraic)  (statistical)
(a)         12           11             0.0804       0.0848
(b)         9            8              0.0849       0.1127
(c)         14           6              0.1010       0.4239
Fig. 6(a), (b), (c): Typical ECG beats.
Fig. 7(a), (b), (c): Variation of the average energy in the principal component for the reduced rank data matrix, for the beats of Fig. 6.
Fig. 8(a), (b), (c): AIC measures plotted against the model order, for the beats of Fig. 6.

4 Conclusion
In this paper, we have presented an algebraic approach for determining the model orders of ECG beats. The proposed method rests on the assumption that the DCT coefficients computed from the samples of an ECG beat can be approximated using rational functions having an equal number of coefficients in the denominator and the numerator polynomials. We have shown that the data matrices constructed using the DCT coefficients cannot be directly employed to establish a criterion for order determination in the framework of SVD. A reduced rank approximation of the data matrix obtained by linear filtering of the DCT coefficients, however, is found to possess a well ordered structure for the singular values. Each of these singular values is attributed to a rank one signal, the average energy variation of which is found to exhibit identical patterns for a given size of the data matrix. Further, it is shown that the multiplicity of the zero singular values associated with the reduced rank data matrix represents the order of the underlying model. The true order is then identified as the one which maximises the average energy functional.
References
[1] I. S. N. Murthy, M. R. Rangaraj, J. K. Udupa and A. K. Goyal, "Homomorphic analysis and modelling of ECG signals," IEEE Trans. Biomed. Eng., vol. BME-26, pp. 330-344, 1979.
[2] I. S. N. Murthy and G. S. S. Durga Prasad, "Analysis of ECG from pole-zero models," IEEE Trans. Biomed. Eng., vol. 39, pp. 740-751, July 1992.
[3] M. Ramasubba Reddy, "Digital models for analysis and synthesis of ECG," Ph.D. Thesis, IISc, Bangalore, August 1990.
[4] N. V. Thakor, J. G. Webster and W. J. Tompkins, "Estimation of QRS complex power spectra for design of a QRS filter," IEEE Trans. Biomed. Eng., vol. BME-31, pp. 702-706, 1984.
COMPUTATION OF THE EJECTION RATIO OF THE VENTRICLE FROM ECHOCARDIOGRAPHIC IMAGE SEQUENCES Andreas Teuner Fraunhofer Institute of Microelectronic Circuits and Systems, Finkenstr. 61, 47057 Duisburg, Germany Olaf Pichler Department of Electrical Engineering, University of Duisburg, 47057 Duisburg, Germany Bedrich J. Hosticka Fraunhofer Institute of Microelectronic Circuits and Systems, Finkenstr. 61, 47057 Duisburg, Germany
Abstract
Our contribution deals with the automatic computation of the ratio between the ventricle volumes of the human heart at the end-systolic and the end-diastolic phases, based on echocardiographic image sequences. The proposed method includes the use of an application-specific median filter for noise reduction in echocardiographic images, the detection of the end-systolic and end-diastolic phases based on the analysis of motion vector fields, and the segmentation of the ventricle using a region growing method. The presented simulation results have been compared with manually obtained measurements of the ejection ratio.
1. Introduction

The computation of the ejection ratio of the human heart from an echocardiographic image sequence is a frequently employed procedure in medical diagnosis that enables a preliminary statement about a valvular defect. Currently, this task is usually performed by a diagnostician who manually examines the recorded frames that contain these phases in an echocardiographic image sequence of the ventricle. The criterion used to identify the frames corresponding to the end-systolic and the end-diastolic phases is the position of the mitral valve. Having identified both relevant frames, the contours of the ventricle are traced and its main axes are determined to compute the volumes and, from them, the ejection ratio. For that purpose, the ventricle is assumed to be an ellipsoidal body. In contrast to this established procedure, which is subjective and time consuming, we propose a new technique based on several different image processing algorithms, including image motion analysis. The primary steps of this novel method are summarized in Fig. 1. They embrace the reduction of noise in the ultrasound images using an application-specific 1-D median filter and nonlinear cluster merge filtering, the identification of the end-systolic and end-diastolic phases of the mitral valve based on the analysis of motion vector fields obtained from a template matching algorithm, and the segmentation of the ventricle using edge detection and region growing. Finally, the ratio between the volumes of the ventricle at these two detected states is computed. The performance of our approach will be demonstrated by execution on a real echocardiographic image sequence and will be compared with results manually traced by a diagnostician. In the following sections each step of the proposed method will be explained and experimental results will be presented.
Fig. 1. Basic steps to determine the volumes of the end-diastolic and end-systolic phases.
2. Analysis of Echocardiographic Image Sequences

A. Radial Median Filtering

Due to the recording principle of echocardiographic images [1], these are usually very noisy and appear smeared. It was shown in [2] that 2-D median filters represent an excellent tool to eliminate the background noise, which is caused mainly by beam deflections at frontier bounds of the pericardium. Because this noise is correlated in the tangential orientation, we have applied a 1-D median filter of size 2N+1 to each frame, instead of a 2-D filter, to sharpen the contours in the images. As shown in Fig. 2a, the direction of the filter is locally adapted to the position of the pixel to be analyzed, in a radial orientation. Fig. 2d illustrates that excellent noise suppression in the radial direction has been obtained without the filter result being affected by the correlated noise signal. The background noise has been reduced successfully without destroying the ventricular contours. The order of the median filter applied to the images has been chosen as N = 2. Our simulations have shown that increasing the filter order does not enhance the quality of the images for postprocessing.
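The orientation-adaptive filtering can be sketched as follows. This is a deliberately simple, unoptimised rendering of our own: we assume the transducer apex (the origin of the radial rays) is known, and use nearest-neighbour sampling along each ray:

```python
import numpy as np

def radial_median(img, origin, N=2):
    """1-D median of size 2N+1 taken along the ray from `origin` (the
    transducer apex) through each pixel, i.e. in the radial direction."""
    H, W = img.shape
    out = img.astype(float).copy()
    oy, ox = origin
    for y in range(H):
        for x in range(W):
            dy, dx = y - oy, x - ox
            norm = np.hypot(dy, dx)
            if norm == 0:
                continue  # undefined direction at the apex itself
            dy, dx = dy / norm, dx / norm
            samples = []
            for t in range(-N, N + 1):
                yy = int(round(y + t * dy))
                xx = int(round(x + t * dx))
                if 0 <= yy < H and 0 <= xx < W:
                    samples.append(img[yy, xx])
            out[y, x] = np.median(samples)
    return out
```

A production version would vectorise the sampling, but the per-pixel loop makes the locally adapted filter direction explicit.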
B. Application of a Cluster Merge Filter

To emphasize the ventricular contours, especially the contours of the mitral valve, we further applied a cluster merging filter to the images, as described in [3]. Compared with other 2-D non-linear filter methods, this algorithm, which was introduced to smooth noise in x-ray images, shows a better preservation of thin lines. Fig. 2e illustrates the filter result applied to Fig. 2d. We have chosen a window size of 7 × 7 pixels. The preset cluster number was PCN = 3.
Fig. 2. a) Orientation adaptive 1-D median filter with N = 2. b) Frame indicating the end-diastolic phase of the original echocardiographic sequence in inverted representation. c) Detail of the frame before filtering. d) Detail of the frame after median filtering. e) Detail of the frame after cluster merge filtering of d).
C. Computation of the Motion Vector Field

The computation of the motion vector field is the most critical and time consuming step in our approach. From the preprocessed sequence, motion vector fields have been computed between two consecutive binarized frames using the template matching algorithm proposed by Kirchner [4]. Compared with alternative methods for the computation of motion vector fields, e.g. standard algorithms based on the optical flow [5], template matching also yields good results in cases where an object moves by more than one pixel per sampling interval. Due to the abrupt motion of the mitral valve with respect to the sampling rate of 25 Hz, object tracking with subpixel accuracy cannot be guaranteed. Thus, template matching offers a robust determination of motion vectors. Moreover, given the known sampling rate and the roughly known velocity of the mitral valve, the sizes of both the operator window and the search window necessary for the template matching algorithm can be well estimated.
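The displacement of one block can be estimated with an exhaustive sum-of-absolute-differences (SAD) search over the search window. This brute-force sketch is only illustrative of template matching in general; Kirchner's algorithm [4], per its title, uses a column- and row-oriented optimisation rather than a full search:

```python
import numpy as np

def match_block(prev, curr, top, left, bsize, search):
    """Motion vector (dy, dx) of the block at (top, left) in `prev`,
    found by minimising the SAD over a (2*search+1)^2 window in `curr`."""
    block = prev[top:top + bsize, left:left + bsize]
    best, vec = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y0, x0 = top + dy, left + dx
            if (y0 < 0 or x0 < 0 or
                    y0 + bsize > curr.shape[0] or x0 + bsize > curr.shape[1]):
                continue  # candidate block falls outside the frame
            sad = np.abs(curr[y0:y0 + bsize, x0:x0 + bsize] - block).sum()
            if sad < best:
                best, vec = sad, (dy, dx)
    return vec
```

The magnitudes and orientations of these vectors, accumulated over the frame, give the 1-D feature series used in the next subsection.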
D. Detection of the End-systolic and End-diastolic Frames

The determination of the frames which represent the end-diastolic and end-systolic phases has been carried out by analyzing the magnitudes of the motion vectors obtained from the template matching algorithm. Both the magnitudes of the motion vectors with orientation between 0° and 180° and those with orientation between 180° and 360° have been accumulated to obtain 1-D data series which serve as features for the ensuing identification process. Fig. 3 displays an example of a feature vector that can be used for the detection of the end-systolic and end-diastolic frames. Due to the periodicity of the ventricle motion, feature vectors to identify both phases can be extracted from these time series. In our investigations, we defined two feature vectors containing 4 components and identified the phases which involve the increase, resp. the decrease, of the sum of the motion vector magnitudes with orientation 180°-360°. However, it should be noted that this pattern recognition task is sensitive to parameter variations and must be further enhanced by carrying out extensive test series and applying modern learning algorithms, e.g. neural networks.
Fig. 3. Plot of the sum of magnitudes of the motion vectors obtained from the template matching algorithm with orientation between 180° and 360°. The frame pairs 1-2 and 7-8 indicate the end-systolic phases, the frame pairs 5-6 and 11-12 the end-diastolic phases.
E. Segmentation of the Cardiac Valve

The last step of our approach is the segmentation of the ventricle and the determination of its main axes. The segmentation has been carried out using an edge detection algorithm based on exponential filtering, as implemented in [6], and simple region growing of the area indicating the ventricle. Since the position of the ventricle in an echocardiographic image is roughly known, an a priori defined centre pixel serves as the seed of a binary region growing algorithm that computes the area covered by the ventricle. Now, the volume V and the ejection ratio r of the ventricle can be computed from the longitudinal axis l and the area A, which is proportional to the number of labeled pixels, as follows:

V = \frac{8 A^2}{3 \pi l}, \qquad r = \frac{V_{end\text{-}diast.} - V_{end\text{-}syst.}}{V_{end\text{-}diast.}}
For simplicity, the longitudinal axis has been defined by the largest vertical line of the segmented area.
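Both formulas translate directly into code; a small sketch (function names are ours) of the area-length ellipsoid volume and the ejection ratio:

```python
import math

def ventricle_volume(area, length):
    """Single-plane area-length (ellipsoid) volume: V = 8 A^2 / (3 pi l)."""
    return 8.0 * area ** 2 / (3.0 * math.pi * length)

def ejection_ratio(v_diast, v_syst):
    """r = (V_end-diastolic - V_end-systolic) / V_end-diastolic."""
    return (v_diast - v_syst) / v_diast
```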
3. Simulation Results
The echocardiographic image sequence used in our simulations, as shown in Fig. 2e, was recorded using a Hewlett-Packard echograph HP SONOS 1000 and contains 14 frames from 2 cycles. The images have been digitized to 8 bit accuracy and have a size of 636 × 477 pixels. To reduce the computational effort, the orientation markers displayed in the original frames (see Figs. 2a and b) were used to mask out the background of the ultrasound images. Fig. 4 shows the identified frames after ventricle segmentation. Both images are identical with the frames which were manually identified by a diagnostician. The volumes of the ventricle have been computed as 48.7 cm³ for the end-systolic phase and as 101.3 cm³ for the end-diastolic phase, which results in an ejection ratio of 51%. The manually determined volumes were 63.3 cm³ and 138 cm³, respectively, which yields an ejection ratio of 54%. The systematic deviation between the automatically and the manually obtained volumes is caused by the simplification which assumes that the longitudinal axis of the ellipsoidal body is oriented in the vertical direction. Nevertheless, the automatically computed ejection ratio compares closely with the manually obtained ejection ratio.
Fig. 4. Segmented areas of the identified a) end-systolic and b) end-diastolic frames.

4. Conclusion
In our contribution we have presented a system which is capable of detecting the end-systolic and end-diastolic frames of an echocardiographic image sequence and of computing the ejection ratio of the ventricle from these two phases. It has been shown that the inclusion of motion analysis algorithms to support the analysis of echocardiographic image sequences is a promising idea and can be highly useful for computer-aided medical diagnosis.
References
[1] Z.-H. Cho, J. J. Jones, and M. Singh: Foundations of Medical Imaging. New York: John Wiley & Sons, 1993.
[2] C. Lamberti and F. Sgallari: "A workstation-based system for 2-D echocardiography visualization and image processing," IEEE Trans. Biomedical Engineering, Vol. 37, No. 8, Aug. 1990, pp. 796-801.
[3] B. J. Lui: "A cluster merging filter for noise smoothing," IEEE Engineering in Medicine and Biology, Vol. 13, No. 2, April/May 1994, pp. 259-262.
[4] H. Kirchner: "Blockmatching with column and row oriented optimization," Proc. 3rd Int. Workshop on Time Varying Image Processing and Moving Object Recognition, Florence (Italy), May 1989, pp. 29-31.
[5] B. Jähne: Digital Image Processing. Berlin: Springer-Verlag, 1993.
[6] J. R. Rasure and K. Konstantinides: "The Khoros software development environment for image and signal processing," IEEE Trans. Image Processing, Vol. 3, No. 3, May 1994, pp. 243-252.
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
Contour Detection of the Left Ventricle on Echocardiographic Images
SIMONE GARÇON - SECOM/UFPA (BRAZIL), FLAVIO BORTOLOZZI - DAINF/CEFETPR (BRAZIL) AND JACQUES FACON - CPGEI/CEFETPR (BRAZIL)
Abstract -- Cardiological diagnosis has been improved in precision by several methods of automatic edge detection. Left ventricle functional performance must be estimated through quantitative analyses, which provide essential parameters in the echocardiographic diagnosis. This procedure typically requires endocardium detection. Thus, this work proposes a simplified automatic left ventricle extraction methodology for echocardiographic images based on mathematical morphology. In order to validate this methodology, the systolic and diastolic areas obtained are compared with the ranges evaluated by experts.

I. INTRODUCTION
Since 1954, echocardiography has contributed a large amount of relevant information to the diagnosis of cardiac performance, such as volume, ejection fraction, measurements of myocardial function, etc. Researchers have been attempting to find solutions as effective as possible in the digital image processing field, searching for better reproducibility of the wall contour, so that a considerable number of approaches have been investigated lately. This research is based on the need for better accuracy of data quantification in parameter analyses for cardiac diagnosis. The automatic detection of the cardiac walls certainly makes a relevant contribution in terms of speed and objectivity, as one can calculate the systolic and diastolic areas from different cutting planes of the endocardium contour and, consequently, several parameters essential to the echocardiographic diagnosis. Bidimensional echocardiography is a very common technique within clinical cardiology, being a real-time, non-invasive method, and therefore painless for the patient. Nevertheless, due to image artifacts, among other factors, the poor quality of the generated images related to the ultrasonic beam leads to a subjective evaluation, depending on the expert's experience. In this context, this work proposes a methodology that simplifies the methods of [1], [2], [4], [5], [10], [11], [16], which generally require a post-processing step after the application of an edge detection operator, whose algorithms demand considerable computational effort. Segmentation of echocardiographic images by the application of morphological operators seems to be the most suitable approach, as such operators act directly on the geometric structure of the objects in the image, besides keeping their spatial integrity.
II. METHODOLOGY
As a solution, the automatic detection of the left ventricle serves these basic purposes: speeding up and simplifying the reproduction of outlines on echocardiographic images, and using these outlines to derive a range of functional and structural cardiac performance parameters. The manual process in use certainly does not define an accurate standard for these parameters. Thus, it would be desirable to make the process systematic, through an automatic methodology which would reproduce the measurements of area, volume, ejection fraction, etc. Accordingly, this paper aims to contribute accuracy and efficiency to the analysis requirements of the echocardiographic diagnosis parameters. The main goal is to define a simplified method for left ventricle endocardium detection whose main points are:
• a completely closed and isolated contour, one pixel in thickness;
• elimination of inner artefacts of the left ventricle cavity;
• a more accurate visual representation in terms of the anatomic data.
In order to achieve this goal, the Khoros software [8] is used as a modelling and prototyping research resource, as well as the morphological machine MMach [3]. Mathematical morphology [15] achieves this goal through its efficiency in the treatment of artefacts and contour gaps, and even in computational effort.
A. Image Acquisition
Depending on the conditions under which the images were acquired, the acquisition process can seriously impair all the rest of the processing, as mentioned in [14]. This is particularly true in the case of echocardiographic images, where several factors directly affect image quality: non-uniform reflection of the ultrasonic beam, reflections from other organs or muscles, the predominance of speckle noise, and distortions related to the ultrasound system. Besides these factors, which lead to the noise distinctive of this kind of image, there are those related to the patient's condition (obesity, cartilage calcification, etc.), which make the image poorer and more confused. The noise introduced when transferring images from the echocardiographic system to the computational system also contributes to poor image quality. The ideal acquisition would be one in which the computer and the echocardiographic system were directly connected. Even so, it is necessary to pay attention to signal interference and to the echocardiographic system adjustments for good visualisation of the endocardium lateral walls, mainly on the 2-chamber, 4-chamber and transversal views, verifying contrast improvements in these regions. For this work, the image acquisition was performed with the following configuration:
• Echographic system SIM7000 CFM Challenge - Real Time Ultrasonic Image Flow Analysis System; the images generated by the internal digital converter are 512x512 with 256 grey levels;
• Targa Plus Truevision board;
• IBM-PC microcomputer.
The selected views were the most common ones used in the definition of diagnostic parameters (2-chamber, 4-chamber and transversal). The motions detected in the frames were at end-systole and end-diastole in a young patient of average, ideal weight. A freeze is performed on the visualised image to acquire the frame with the desired motion. When this operation is performed, the last 15 frames are saved and the desired ones can be selected.
B. Methodology Description
After input, the images were transferred to a SUN Sparc Station and then processed. Basically, the processing applied to these images comprises three steps: pre-processing, contour definition and artefact elimination.
B.1 Pre-processing
The first step attempts to eliminate noise through image smoothing, using a median filter with a 3x3 mask. Histogram equalisation is then applied to the smoothed image for more accurate contrast. The median/equalisation algorithm was selected instead of the top-hat technique used by [9] and [12] mainly because the results obtained were more satisfactory in terms of contrast and computational effort. Fig. 2 shows the resulting image.
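A minimal sketch of this pre-processing chain (3x3 median smoothing followed by histogram equalisation) in plain NumPy; the helper names are ours and stand in for the paper's Khoros/MMach implementation.

```python
import numpy as np

def median_filter_3x3(img):
    # 3x3 median smoothing via edge-padded sliding windows.
    padded = np.pad(img, 1, mode='edge')
    windows = [padded[i:i + img.shape[0], j:j + img.shape[1]]
               for i in range(3) for j in range(3)]
    return np.median(np.stack(windows), axis=0).astype(img.dtype)

def equalize_histogram(img, levels=256):
    # Map grey levels through the normalised cumulative histogram.
    hist = np.bincount(img.ravel(), minlength=levels)
    cdf = np.cumsum(hist).astype(np.float64)
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min())
    return (cdf[img] * (levels - 1)).astype(np.uint8)

noisy = np.full((8, 8), 100, dtype=np.uint8)
noisy[4, 4] = 255                     # isolated speckle
smoothed = median_filter_3x3(noisy)   # speckle removed by the median
enhanced = equalize_histogram(smoothed)
```

The median filter suppresses isolated speckle without blurring edges as strongly as linear smoothing would, which is why it suits this kind of ultrasound noise.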
B.2 Contour Definition
After pre-processing, the image is thresholded on the basis of the grey level ranges shown in the histogram. The threshold can be obtained automatically or by manual adjustment, without necessarily abandoning the automatic nature of the approach. In general, for transversal images the proper threshold varies between 100 and 170, and for two- and four-chamber images it is fixed at 200. To eliminate gaps in the cavity contour, one would theoretically think of using a closing process, which joins objects, bridges gaps and smooths the contour from the outside. However, the shape produced by this process is poorer in detail than the original and, depending on the number of structuring element iterations applied, this can be greatly aggravated. In order to obtain the benefits of a closing operation without impairing contour accuracy, a "dismemberment" of the closing into two iterations is applied, using the elementary operator property (dilation and erosion) in such a way that two different types of structuring element are interleaved: cross and rhombus. The cross structuring element has a basic form and allows more flexibility; the rhombus structuring element, on the other hand, removes the angular effect caused by the cross. Thus, the image is transformed by two dilations (each with a different structuring element) and soon afterwards by two erosions (also with different structuring elements). The result can be observed in fig. 3. From this segmented image (fig. 3), the contour representation is then obtained by eroding it and subtracting the result of this erosion from it (fig. 4).
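The alternating closing and the erosion-based contour step can be sketched as follows. The 3x3 cross and the 5x5 rhombus sizes are assumptions (the paper does not give element sizes), and these pure-NumPy operators stand in for the MMach primitives.

```python
import numpy as np

CROSS = np.array([[0, 1, 0],
                  [1, 1, 1],
                  [0, 1, 0]], dtype=bool)
# Hypothetical 5x5 diamond-shaped "rhombus" element.
RHOMBUS = np.array([[0, 0, 1, 0, 0],
                    [0, 1, 1, 1, 0],
                    [1, 1, 1, 1, 1],
                    [0, 1, 1, 1, 0],
                    [0, 0, 1, 0, 0]], dtype=bool)

def dilate(img, se):
    r = se.shape[0] // 2
    padded = np.pad(img, r, mode='constant')
    out = np.zeros_like(img, dtype=bool)
    for i, j in zip(*np.nonzero(se)):          # OR over shifted copies
        out |= padded[i:i + img.shape[0], j:j + img.shape[1]]
    return out

def erode(img, se):
    r = se.shape[0] // 2
    padded = np.pad(img, r, mode='constant')
    out = np.ones_like(img, dtype=bool)
    for i, j in zip(*np.nonzero(se)):          # AND over shifted copies
        out &= padded[i:i + img.shape[0], j:j + img.shape[1]]
    return out

def mixed_closing(img):
    # Two dilations with different elements, then two erosions,
    # approximating the paper's "dismembered" two-iteration closing.
    d = dilate(dilate(img, CROSS), RHOMBUS)
    return erode(erode(d, RHOMBUS), CROSS)

def contour(img):
    # One-pixel contour: the image minus its own erosion.
    return img & ~erode(img, CROSS)

cavity = np.zeros((20, 20), dtype=bool)
cavity[5:15, 5:15] = True
cavity[10, 10] = False        # a small gap inside the region
closed = mixed_closing(cavity)
edge = contour(closed)
```

The closing fills the one-pixel gap while leaving the region's outline intact, after which the erosion-subtraction yields the one-pixel-thick contour.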
B.3 Artifact Elimination
Also from the image of fig. 3, a location mask (fig. 5) is built for the removal of artifacts in the inner ventricular cavity (papillary muscle): objects are dilated, opened, eroded and dilated again using a square structuring element, so that all the elements in the black region of the mask are preserved and those in the white region are eliminated. These artifacts are then removed by adding the location mask to the inverted contour image ("bouchage de trous" [13]). In fig. 6 the artifacts of fig. 4 have disappeared and the left ventricle contour image is completely closed.

III. RESULTS
The stability of the methodology in reproducing the left ventricle contour can be observed through the convergence of the results shown in the processed images, which are considered satisfactory in terms of performance and reliability.
In order to compare the automatic method with the manual process, the left ventricle cavity area (region of interest) is calculated automatically in each of the images by an image analysis tool developed by [7]. Area figures are given in pixels and subsequently multiplied by a conversion factor to obtain the areas in square centimetres, according to the image resolution (300 dpi) and the conversion scale used by the ultrasound system. The similarity between the manual and automatic quantitative results is presented in Table I, where the inclusion of the automatically obtained figures in the ranges established by experts is clearly noticed. For these figures it is necessary to take into account, among other factors, the patient's age and sex.

TABLE I: LEFT VENTRICLE AREA MEASUREMENTS

VIEW/SECTION    AUTOMATIC AREA (pixels)    AUTOMATIC AREA (cm2)    MANUAL AREA (cm2)
cross/sist      6598                       7.49                    4 - 11.6
cross/diast     12586                      10.34                   9.5 - 22.3
apic4/sist      10247                      9.33                    7.2 - 20.1
apic4/diast     18462                      12.52                   10.5 - 31
apic2/sist      12266                      10.21                   7.5 - 22.5
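The pixel-to-area conversion described above amounts to multiplying the pixel count by the physical area of one pixel at 300 dpi and by the square of the ultrasound display scale; the scale value below is back-computed from the first table row, purely as an illustration (the paper does not state it).

```python
DPI = 300.0
CM_PER_INCH = 2.54
PIXEL_AREA_CM2 = (CM_PER_INCH / DPI) ** 2   # physical area of one 300 dpi pixel

def area_cm2(n_pixels, scale):
    # scale: assumed linear magnification of the ultrasound display.
    return n_pixels * PIXEL_AREA_CM2 * scale ** 2

# Back-compute the scale that maps 6598 px to 7.49 cm2 (first row of Table I):
scale = (7.49 / (6598 * PIXEL_AREA_CM2)) ** 0.5
```

Since the zoom of the ultrasound display may differ per view, the conversion factor would in practice be taken per acquisition rather than fixed once.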
IV. CONCLUSION
Judging from the presented results, it is concluded that transformations by morphological operations showed much greater efficiency, even for images of very poor quality such as those used in the assessment tests, after application of the median/equalisation step. This is the major advantage of the methodology described above, mainly when the medical examination for obtaining a patient's diagnosis is considered technically difficult. In relation to other methods with similar purposes, the proposed methodology showed:
• advantages in terms of efficiency in capturing the cavity form and size;
• elimination of the post-processing step related to contour interpolation through algorithms of high computational effort;
• a higher level of detail of the left ventricle form and size in relation to the manual contour; often it is hard to draw accurate manual contours for obtaining measurements, even for an experienced expert.
Taking into account the good computational performance shown by the elementary binary transformations applied in the methodology, parallel processing could be explored as a way of extending the implementation to real time, which could allow direct use of the software on the echocardiographic system, reducing the noise introduced during image acquisition.

ACKNOWLEDGMENTS
The authors wish to thank Dr. Bianca Ávila (São Lucas Hospital - PR) for her help with the ultrasonic images and Ms. Laura Vieira for the text review. This work was supported in part by CAPES (Brazil).

REFERENCES
[1] D. Adam, O. Hareuveni, S. Samuel, "Semiautomated Border Tracking of Cine Echocardiographic Ventricular Images", IEEE Transactions on Medical Imaging, vol. 6, no. 3, pp. 266-271, Sep. 1987.
[2] N. Ayache, I. L. Herlin, "A New Methodology to Analyze Time Sequences of Ultrasound Images", Rapports de Recherche - Robotique, Image et Vision, INRIA, no. 1390, France, Jan. 1991.
[3] G. J. F. Banon, J. Barrera, "Bases da Morfologia Matemática para Análise de Imagens Binárias", IX Escola de Computação, Recife, Brazil, 1994.
[4] S. M. Collins et al., "Computer-Assisted Edge Detection in Two-Dimensional Echocardiography: Comparison with Anatomic Data", American Journal of Cardiology, vol. 53, pp. 1380-1387, May 1994.
[5] C. H. Chu, E. J. Delp, A. J. Buda, "Detecting Left Ventricular Endocardial and Epicardial Boundaries by Digital Two-Dimensional Echocardiography", IEEE Transactions on Medical Imaging, vol. 7, no. 2, pp. 181-194, Mar. 1988.
[6] H. Feigenbaum, "Echocardiography", Lea & Febiger, 4th ed., Philadelphia, 1988.
[7] S. B. Filho, "Um Quantificador da Ploidia Tumoral Através da Citofotometria", Master's Dissertation, CEFET, Curitiba, Paraná, Brazil, 1994.
[8] Khoros Group, "Khoros Manual - User's Manual", University of New Mexico, vol. I, 1991.
Source of the reference ranges in Table I: Echocardiographic Laboratory of Stanford University (1988).
[9] J. W. Klinger, C. L. Vaughan, T. D. Fraker, L. T. Andrews, "Segmentation of Echocardiographic Images Using Mathematical Morphology", IEEE Transactions on Biomedical Engineering, vol. 35, no. 11, pp. 925-934, Nov. 1988.
[10] C. Lamberti, F. Sgallari, "A Workstation-Based System for 2-D Echocardiography Visualization and Image Processing", IEEE Transactions on Biomedical Engineering, vol. 37, no. 8, pp. 796-801, Aug. 1990.
[11] L. Maes, B. Bijnens, P. Suetens, F. v. de Werf, "Automated Contour Detection of the Ventricle in Short Axis View in 2D Echocardiograms", Machine Vision and Applications, no. 6, pp. 1-9, 1993.
[12] G. K. Matsopoulos, S. Marshall, "Use of Morphological Image Processing Techniques for the Measurement of a Fetal Head from Ultrasound Images", Pattern Recognition, vol. 27, no. 10, pp. 1317-1324, 1994.
[13] M. Roussel, "Analyse et Interprétation d'Images Appliquées aux Algues Microscopiques", Doctoral Thesis, Compiègne, France, 1993.
[14] S. G. Santos, "Metodologia para Detecção de Contornos do Ventrículo Esquerdo em Imagens Ecocardiográficas", Master's Dissertation, CEFET, Curitiba, Paraná, Brazil, 1995.
[15] J. Serra, "Image Analysis and Mathematical Morphology", Academic Press, 1982.
[16] L. Zhang, E. A. Geiser, "An Effective Algorithm for Extracting Serial Endocardial Borders From 2-Dimensional Echocardiograms", IEEE Transactions on Biomedical Engineering, vol. 31, no. 6, pp. 441-447, Jun. 1984.
fig. 1 - Original Image
fig. 2 - Median/Equalization Result
fig. 3 - Closing Contour Image
fig. 4 - Contour Image
fig. 5 - Location Mask
fig. 6 - Contour Image without Artefacts
Identification of a Stochastic System Involving Neuroelectric Signals
A. G. Rigas
Department of Electrical and Computer Engineering, Democritus University of Thrace, 67100 Xanthi, Greece.
Abstract
In this work we use a Volterra-type stochastic model to identify a neurophysiological system (the muscle spindle) involving two neuroelectric signals as inputs and one as output. The inputs consist of series of nerve pulses produced by an alpha and a gamma motoneuron, applied to the muscle spindle; they modify its response, which is transferred to the spinal cord by the axons of sensory neurons. The parameters of the proposed model are estimated by using spectral analysis techniques for stationary point processes. It is shown that the effect of the gamma motoneuron on the muscle spindle is restricted to low frequencies (0-20 Hz), whereas the effect of the alpha motoneuron occurs at middle and higher frequencies (20-100 Hz).

Introduction
The muscle spindle is an element of the neuromuscular system and is thought to play a critical role in the initiation of movement and in the maintenance of posture. The function of the muscle spindle is regulated by two kinds of motoneurons lying in the spinal cord: an alpha motoneuron, whose long axon makes synaptic contact with the external fibers of the «parent» muscle, of which the muscle spindle is part, and so affects the muscle spindle indirectly; and a gamma motoneuron, whose long axon affects the muscle spindle directly by making synaptic contact with its internal fibers. The axons of the sensory neurons, which form spirals around the fibers of the muscle spindle, transfer the information to the spinal cord and from there to higher levels in the central nervous system. More details about the muscle spindle and its functional role are given in [1].
In this work we intend to examine the simultaneous effect of a gamma and an alpha motoneuron on the muscle spindle. The method used for the analysis of the available data, recorded from the axons of the two motoneurons and from the sensory axon which carries the response of the muscle spindle, is based on the spectral analysis of stationary point processes. Estimates of the coherence function provide measures of the linear relationship, in the frequency domain, between the response of the muscle spindle and the two inputs. The parameters of the frequency domain can also be used for the identification of the muscle spindle, which is adequately described by the stochastic model

E{dM3(t) | M1, M2} = { a0 + ∫ a1(t-u) dM1(u) + ∫ a2(t-v) dM2(v) } dt ,   (1)

where the quantity E{dM3(t) | M1, M2} can be interpreted as

Prob{event of type 3 in (t, t+h] | event of type 1 at t-u and event of type 2 at t-v},

for small h. The stochastic model (1) is an extension of a model discussed in [2] and [3].
Spectral analysis of stationary point processes
We assume that M(t) = {M1(t), M2(t), M3(t)} is a vector-valued point process defined on the interval [0, T]. Formally, a point process is a collection of usually interrelated random variables, each labelled by a point t on the positive line, such that Ma(t) - Ma(s) = Ma(s, t] is the number of events of type a which occur in the interval (s, t] (0 ≤ s < t ≤ T). The periodogram of the point process, computed on each of L disjoint segments of length R, is

I_ab^(R)(λ, j) = (1 / 2πR) d_a^(R)(λ, j) conj{d_b^(R)(λ, j)},   j = 1, ..., L   (a, b = 1, 2, 3),   (2)

where d_a^(R)(λ) = ∫_0^R e^{-iλt} dMa(t) is the Fourier-Stieltjes transform of the increments dMa(t) = Ma(t, t+dt]. An estimate of the spectral density function is given by

f̂_ab^(LR)(λ_k) = (1 / (2p+1)) Σ_{r=-p}^{p} f_ab^(LR)(λ_{k+r}),   (3)

where

f_ab^(LR)(λ_k) = (1/L) Σ_{j=1}^{L} I_ab^(R)(λ_k, j),   λ_k = 2πk / R,   k = 1, 2, ..., (R-1)/2.

We can now obtain an estimate of the coherence function between the components of the point process from the expression

|R̂_ab^(LR)(λ)|² = |f̂_ab^(LR)(λ)|² / ( f̂_aa^(LR)(λ) f̂_bb^(LR)(λ) ),   (a ≠ b = 1, 2, 3),   (4)

where f̂_aa^(LR)(λ) and f̂_bb^(LR)(λ) are the estimates of the power spectra of the components Ma(t) and Mb(t), respectively. A 95% point of |R̂_ab(λ)|² can be computed from the relation

z = 1 - α^(1/(s-1)),   (5)

where α = 0.05 and s = (2p+1)L.
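A rough sketch of the coherence estimate (2)-(4): segment Fourier-Stieltjes transforms are formed on L disjoint segments, their periodograms averaged, and the coherence taken as the normalised cross-spectrum. The additional frequency smoothing over 2p+1 neighbours in (3) is omitted for brevity, and all function names are ours.

```python
import numpy as np

def segment_fts(times, L, R, freqs):
    # d^(R)(lambda, j) = sum over events in segment j of exp(-i*lambda*t).
    d = np.zeros((L, len(freqs)), dtype=complex)
    for j in range(L):
        lo, hi = j * R, (j + 1) * R
        seg = times[(times >= lo) & (times < hi)] - lo
        if seg.size:
            d[j] = np.exp(-1j * np.outer(seg, freqs)).sum(axis=0)
    return d

def coherence(t_a, t_b, T, L, freqs):
    R = T / L
    da = segment_fts(t_a, L, R, freqs)
    db = segment_fts(t_b, L, R, freqs)
    # Cross- and auto-spectral estimates: periodograms averaged over segments.
    f_ab = (da * np.conj(db)).mean(axis=0) / (2 * np.pi * R)
    f_aa = (np.abs(da) ** 2).mean(axis=0) / (2 * np.pi * R)
    f_bb = (np.abs(db) ** 2).mean(axis=0) / (2 * np.pi * R)
    return np.abs(f_ab) ** 2 / (f_aa * f_bb)

rng = np.random.default_rng(0)
base = np.sort(rng.uniform(0, 1000, 400))
# Two strongly dependent trains: the second is the first delayed by 0.5.
coh = coherence(base, base + 0.5, 1000, 10, np.linspace(0.1, 2.0, 8))
```

For a process compared with a shifted copy of itself the coherence is close to 1 at all frequencies; for independent trains it fluctuates near zero, below the threshold of (5).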
Fig. 1 presents the estimates of the coherence functions. The broken lines correspond to the 95% point of |R̂_ab(λ)|² calculated from (5) for L = 5 and p = 9. Values of the estimates below this point indicate that the components of the point process are uncorrelated. Fig. 1A shows that the effect of the gamma motoneuron on the muscle spindle occurs at low frequencies (0-20 Hz), whereas Fig. 1B shows that the effect of the alpha motoneuron occurs mainly at middle and higher frequencies (20-100 Hz).

Identification of the muscle spindle
In this section we discuss how we can identify the muscle spindle by solving equation (1) with respect to a0, a1(u) and a2(v). The parameters a1(u) and a2(v) are called the impulse response functions, whereas a0 is a constant representing the mean value of M3 when M1 and M2 are inactive. The impulse response function is useful in predicting
Fig. 1 Estimates of the coherence function.
Fig. 2 Estimates of the impulse response function.
whether there will be an output event when an input event has already occurred. In order to solve equation (1) we define the one-sided Fourier transforms of s_m(u) (m = 1, 2) as follows:

S_m(λ) = ∫_0^∞ s_m(u) exp{-iλu} du,   -∞ < λ < +∞   (m = 1, 2).   (6)
By using the arguments of [2] we find

p3 = a0 + S1(0) p1 + S2(0) p2,   (7)

f31(λ) = S1(λ) f11(λ)   and   f32(λ) = S2(λ) f22(λ),   (8)
since the two components M1 and M2 are taken to be independent. By p_a we denote the mean intensity of M_a (a = 1, 2, 3). Estimates of the parameters S_m(λ) and s_m(u) are obtained using (8) and the inverse transform of (6), that is,

Ŝ_m(λ) = f̂_3m^(LR)(λ) / f̂_mm^(LR)(λ)   (m = 1, 2),   (9)

ŝ_m(u) = Q^(-1) Σ_{j=-(Q-1)/2}^{(Q-1)/2} K_R(λ_j) Ŝ_m(λ_j) exp(iλ_j u),   (10)

where K_R(λ) is a convergence factor (see [5]). The use of the convergence factor improves the properties of ŝ_m(u). It can be proved that the distribution of ŝ_m(u) is asymptotically Normal with mean s_m(u) and variance given by

Var[ŝ_m(u)] ≈ [(2p+1) L Q]^(-1) ∫ f_33(λ) f_mm(λ)^(-1) |K_R(λ)|² dλ.   (11)
Fig. 2 presents the estimates of the impulse response functions. The broken lines in the middle correspond to the hypothesis that the inputs and the output are independent, whereas the solid lines are the 95% confidence limits. Values of the estimates outside the confidence limits indicate that the components are correlated. Fig. 2A shows that the presence of the gamma motoneuron makes the system respond after 15 ms. Fig. 2B shows that the presence of the alpha motoneuron blocks the system for about 20 ms, after which it responds again.
References
[1] Matthews, P. B. C. (1981). Review lecture: Evolving views on the internal operation and functional role of the muscle spindle. J. Physiol., 320, 1-30.
[2] Brillinger, D. R. (1974). Fourier analysis of stationary processes. Proc. IEEE, 62, 1628-43.
[3] Rigas, A. G. (1996). Stochastic modelling of a complex physiological system. In Differential Equations and Applications to Biology and to Industry, Eds. M. Martelli et al., pp. 409-416. Singapore: World Scientific.
[4] Thompson, W. A., Jr. (1988). Point Process Models with Applications to Safety and Reliability. New York: Chapman and Hall.
[5] Candy, J. V. (1988). Signal Processing: The Modern Approach. New York: McGraw-Hill.
Invited Session Q: SIGNAL PROCESSING THEORY AND APPLICATIONS
DESIGN OF M-BAND LINEAR PHASE FIR FILTER BANKS WITH HIGH ATTENUATION IN STOP BANDS
Takuro KIDA and Yuichi KIDA
Department of Information Processing, Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, 4259 Nagatsuta, Midori-ku, Yokohama-shi, 227 JAPAN. e-mail: [email protected]
Information Engineering Course, Kogakuin University, 1-24-2 Nishi-Shinjuku, Shinjuku-ku, Tokyo, 163-91 JAPAN

Abstract
In the literature [2], [3], a systematic treatment is presented of the optimum interpolatory approximation of multi-dimensional signals. In particular, one of the authors discusses an interesting reciprocal property of the approximation in [2]. However, most of the discussion is limited to theoretical treatment, and the design of higher order linear phase FIR filter banks, for example, is not discussed [3]. In this paper, we give a concise explanation of the theoretical basis of the proposed approximation. Further, we present a linear phase M-channel FIR filter bank with a high attenuation characteristic in each stop band, designed by cosine-sine modulation and by iterative approximation based on the reciprocal property.
1. GENERALIZED INTERPOLATION

Let L²_F be the set of functions F(ω) satisfying ∫ |F(ω)|² dω < +∞ and F(ω) = conj{F(-ω)}, where conj{z} means the complex conjugate of a complex number z. Further, let L²_f be the set of signals f(t) satisfying f(t) = (1/2π) ∫ F(ω) e^{jωt} dω, F(ω) ∈ L²_F. Note that f(t) has real values. For simplicity we express by f(t) ↔ F(ω) a pair of functions f(t) and F(ω) satisfying the above relations. Suppose that W(ω) is an arbitrary function in L²_F. Then we denote by Θ_A the set of F(ω) in L²_F satisfying the following inequality with respect to a given positive number A:

∫_{-∞}^{∞} |F(ω)|² / |W(ω)|² dω ≤ A.   (1)
Moreover, let Γ be the set of signals f(t) in L²_f expressed as the inverse Fourier transform of F(ω) in Θ_A. Consider that we input f(t) into M linear time-invariant filters having the transfer functions H_m(ω) (m = 0, 1, ..., M-1). We denote by f_m(t) (m = 0, 1, ..., M-1) the corresponding outputs of the filters H_m(ω). Let F_m(ω) = H_m(ω) F(ω) be the Fourier spectrum of f_m(t), where m = 0, 1, ..., M-1. Further, we assume

∫_{-∞}^{∞} |W(ω)|² |H_m(ω)|² dω < +∞.

From the Schwarz inequalities for ∫ W(ω) {F(ω)/W(ω)} dω and ∫ {W(ω)H_m(ω)} {F(ω)/W(ω)} dω, we can easily prove that F(ω) and F_m(ω) are absolutely integrable. Hence, f(t) and f_m(t) are continuous. Consider the equally spaced sample points S = {nT} (T > 0) (n = 0, ±1, ±2, ...). Further, let f_m(nT) be the sample values of f_m(t) at the sample points nT (n = 0, ±1, ±2, ...). Now, we consider the problem of approximating f(t - d) from the above sample values, where d is a given delay time. Let

g(t) = Σ_{m=0}^{M-1} Σ_{n=-∞}^{∞} f_m(nT) φ_mn(t)   (2)

be the corresponding approximation of f(t - d). The functions φ_mn(t) are prescribed bounded real functions called the generalized interpolation functions or, simply, interpolation functions. We assume that these interpolation functions satisfy

|φ_mn(t)| = 0 (t < nT),   |φ_mn(t)| = 0 (t > nT + Δ),   (3)
where Δ > 0 (m = 0, 1, ..., M-1; n = 0, ±1, ±2, ...). The approximation error between f(t - d) and g(t) is defined by e(t) = |f(t - d) - g(t)|. Further, let E_max(t) be the upper limit of e(t) obtained by fixing all the interpolation functions φ_mn(t) and letting f(t) range over all the signals f(t) in Γ:

E_max(t) = sup_{f(t) ∈ Γ} {e(t)}.   (4)

Then we obtain the following theorem [2].

Theorem 1.

E_max(t) = √A [ ∫_{-∞}^{∞} |W(ω)|² | e^{j(t-d)ω} - Σ_{m=0}^{M-1} Σ_{n=-∞}^{∞} φ_mn(t) H_m(ω) e^{jnTω} |² dω ]^{1/2}.   (5)

The proof is omitted. Let φ_mn(t) (m = 0, 1, ..., M-1; n = 0, ±1, ±2, ...) be the optimum interpolation functions which minimize E_max(t). Then it can be proved that there exists a set of functions φ_m(t) (m = 0, 1, ..., M-1) satisfying

φ_mn(t) = φ_m(t - nT)   (m = 0, 1, ..., M-1; n = 0, ±1, ±2, ...).   (6)

Further, E_max(t) = E_max(t + T) holds for the E_max(t) which uses these optimum interpolation functions. Hence, g(t) is expressed by

g(t) = Σ_{m=0}^{M-1} Σ_{n=-∞}^{∞} f_m(nT) φ_m(t - nT).   (7)

Then, Eq. (3) can be expressed equivalently by |φ_m(t)| = 0 (t < 0, t > Δ) (m = 0, 1, ..., M-1). Now, we assume that the above quantity Δ satisfies

Δ = NT + τ   (N = a non-negative integer; 0 ≤ τ < T).   (8)
Let t be an arbitrary time and let R be the integer satisfying RT ≤ t < (R+1)T. Precisely, R should be written R_t, but in this paper we use R instead of R_t. Then we consider the following two intervals:

I_1 = {t | RT ≤ t < RT + τ},   I_2 = {t | RT + τ ≤ t < (R+1)T}.   (9)

Further, we define the integers Q and L as follows:
(i) If t ∈ I_1: Q = N, L = M(N+1) = M(Q+1).
(ii) If t ∈ I_2: Q = N - 1, L = MN = M(Q+1).
In the following, we consider the one-to-one correspondence between an integer p (= 1, 2, ..., L) and a pair of integers (m, n) (m = 0, 1, ..., M-1; n = R - Q, R - (Q-1), R - (Q-2), ..., R):

p = ξ_1(m, n)   (t ∈ I_1),   p = ξ_2(m, n)   (t ∈ I_2),   (10)

where the number of pairs (m, n) is L = M(Q+1) and is identical with the number of values of p. For simplicity, we sometimes write p = ξ(m, n) instead of p = ξ_1(m, n) or p = ξ_2(m, n).
As a direct consequence of Eq. (3), we obtain φ_mn(t) = φ_m(t - nT) = 0 (n < R - Q or n > R) for any t and m satisfying t ∈ I_k (k = 1, 2) and 0 ≤ m ≤ M-1. Hence, for t ∈ I_k (k = 1, 2) we have

E_max(t) = √A [ ∫_{-∞}^{∞} |W(ω)|² | e^{j(t-d)ω} - Σ_{m=0}^{M-1} Σ_{n=R-Q}^{R} φ_m(t - nT) H_m(ω) e^{jnTω} |² dω ]^{1/2},   (11), (12)

where (11) and (12) refer to t ∈ I_1 and t ∈ I_2, respectively. Let Ω_t be the set of pairs (m, n) composed of m and n satisfying m = 0, 1, ..., M-1 and n = R - Q, R - (Q-1), R - (Q-2), ..., R, where Q is related to t implicitly. As is shown in [2], minimizing Eq. (5) is straightforward. Firstly, we expand E_max(t)² with respect to the φ_mn(t) under consideration, then differentiate E_max(t)² with respect to the complex conjugates of the interpolation functions φ_mn(t) which actually contribute to the approximation at the prescribed t, and set the resultant formulas to zero, that is, ∂E_max(t)²/∂φ_mn(t) = 0, where m = 0, 1, ..., M-1 and n = R - Q, R - (Q-1), R - (Q-2), ..., R. The resultant set of equations can be expressed in the matrix form [φ(t)] = [C]^(-1) [V(t)], where [φ(t)] is the vector whose elements are the interpolation functions actually contributing to the approximation at the prescribed t, [C] is a constant matrix, and [V(t)] is a vector depending on the variable t. This matrix and vector are determined by Γ, W(ω), H_m(ω) and Ω_t.
If we perform the above operation for t satisfying 0 ≤ t < T, we can obtain all the functional forms of the interpolation functions using the relation φ_mn(t) = φ_m(t - nT). As pointed out earlier, although it is assumed initially that the interpolation functions have different functional forms, the resultant optimum interpolation functions are expressed as parallel shifts of a finite number of functions. In [3], the set of signals is slightly generalized, and the theory includes round-off type linear quantization of the sample values f_m(nT). However, it should be noted that the functional forms of the optimum interpolation functions are identical with those obtained in the present discussion. Further, consider another measure of error defined by E_E = ∫ E_max(t)² dt. Then, as is shown in [2], E_E is expressed by

E_E = Σ_{n=-∞}^{∞} ∫_{-∞}^{∞} |T_n(ω)|² dω,   (13)

T_0(ω) = |W(ω)| e^{-jdω} - (1/T) Σ_{m=0}^{M-1} H_m(ω) Ψ_m(ω),   T_n(ω) = (1/T) Σ_{m=0}^{M-1} |W(ω)| H_m(ω) Ψ_m(ω - 2πn/T)   (n ≠ 0),   (14)

where Ψ_m(ω) = |W(ω)| Φ_m(ω) and Φ_m(ω) is the Fourier transform of φ_m(t).
Moreover, E E remains invariant if the replacements between ~I,,,(w) --, Bin(w) and Hm(w) "-* ~m(W), or their converse, are performed, where B,,(w) = 1 W ( w ) I H m ( w ) and ~m(W) =l W(w) Ir
BE = ~
I z . ( . , ) I ~ d~ oo
z0(w) =l w(,o) i e- ~ a ~ -
'~.,(w)Bm(w)
(15)
n---oo
, Z.(w)- ~
m--0
IW(w) l,~.,(~,)B.,
- ~
(t6)
m--0
In this discussion, we assume that the inverse Fourier transforms of |W(w)|Φ_m(w) and |W(w)|H_m(w) are time-limited. When W(w) is band-limited and equal to 1 in a given domain −w_0 ≤ w ≤ w_0 (w_0 > 0), for example, we have B_m(w) = H_m(w) and Ψ_m(w) = Φ_m(w). Therefore, we soon notice that Z_0(w)F(w) = { 1 − (1/T) Σ_m Φ_m(w) H_m(w) } F(w) is proportional to the distortion of the filter bank obtained by ignoring the aliasing components in the frequency domain, while Z_n(w)F(w − 2πn/T) = (1/T) Σ_m Φ_m(w) H_m(w − 2πn/T) F(w − 2πn/T) corresponds to the aliasing components.
If W(w) is not constant, a similar reciprocity is valid for the following two systems: (i) a system obtained by connecting W(w) in tandem at the input port of a filter bank using H_m(w) and Φ_m(w) as the analysis and the synthesis filters, respectively; (ii) another system obtained by connecting W(w) in tandem at the output port of a filter bank using Ψ_m(w) and H_m(w) as the analysis and the synthesis filters, respectively.
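For the simplest case W(w) = 1, the interpretation of Z_0 as the distortion and Z_n as the aliasing can be checked numerically. The sketch below is an assumption-laden illustration: it uses a hypothetical 2-channel Haar analysis/synthesis pair (not a bank from the paper) with T = 2, and verifies that the distortion term is a pure delay while the aliasing term vanishes.

```python
import numpy as np

# Hypothetical 2-channel Haar analysis/synthesis pair (illustration only, not a
# bank from the paper), used to check the interpretation of Z_0 and Z_n above
# for W(w) = 1 and T = 2.
h0 = np.array([1.0, 1.0]) / np.sqrt(2.0)   # analysis lowpass
h1 = np.array([1.0, -1.0]) / np.sqrt(2.0)  # analysis highpass
f0 = np.array([1.0, 1.0]) / np.sqrt(2.0)   # synthesis lowpass
f1 = np.array([-1.0, 1.0]) / np.sqrt(2.0)  # synthesis highpass

def freq(h, x):
    """Frequency response H(w) = sum_n h[n] e^{-jwn} evaluated at w = x."""
    return np.sum(h * np.exp(-1j * x * np.arange(len(h))))

w = np.linspace(-np.pi, np.pi, 257)
# distortion term (1/T) sum_m F_m(w) H_m(w): should be a pure delay e^{-jw}
T0 = np.array([0.5 * (freq(f0, x) * freq(h0, x) + freq(f1, x) * freq(h1, x)) for x in w])
# aliasing term (1/T) sum_m F_m(w) H_m(w - pi): should vanish
T1 = np.array([0.5 * (freq(f0, x) * freq(h0, x - np.pi) + freq(f1, x) * freq(h1, x - np.pi)) for x in w])

assert np.allclose(np.abs(T0), 1.0)      # no amplitude distortion
assert np.allclose(T0, np.exp(-1j * w))  # pure one-sample delay (d = 1)
assert np.allclose(T1, 0.0, atol=1e-9)   # no aliasing
```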
2 ITERATIVE DESIGN OF LINEAR PHASE FILTER BANKS
In this section, we present a linear phase M-channel FIR filter bank having N + 1 tap coefficients with high attenuation in each stop band. The filter bank is designed by cosine-sine modulation and by iterative approximation based on the above reciprocity. For simplicity, we treat only the example with M = 32 and N + 1 = 512, but the method obviously extends to the more general case. The proposed algorithm can be expressed as follows.

(a) Make an initial prototype lowpass filter P^0(z) = Σ_{n=0}^{256} p_n^0 z^{−n} of size 257, having the coefficients

p_n^0 = [ sin{ π f_c (n − 128) / 256 } / { π (n − 128) } ] · [ a + (1 − a) cos{ π (n − 128) / 128 } ]^η
(n = 0, 1, ..., 256; 0 < a < 1, f_c = 32ζ with ζ > 1, η > 1).

(b) Derive the second prototype lowpass filter P'(z) = Σ_{n=0}^{512} p'_n z^{−n} by

p'_n = Σ_{k=0}^{256} p^0_{n−k} p^0_k   (n ≥ k; n = 0, 1, ..., 512)      (17)
(c) Make the new prototype lowpass filter P*(z) = Σ_{n=0}^{511} p*_n z^{−n} by

p*_n = ( p'_n + p'_{n+1} ) / 2 ;  n = 0, 1, 2, ..., 511      (19)
(d) Construct the analysis filter bank H_m(z) = Σ_{n=0}^{511} h_{m,n} z^{−n} using the cosine modulation. The coefficients of the analysis filters are given by h_{m,n} = 2.0 · p*_n · cos{ (π/32)(m + 0.5)(n − 511/2.0) + (−1)^m π/4 } (m = 0, 1, 2, ..., M − 1).

(e) Obtain the synthesis filter bank by applying the presented optimization. If the attenuation of the synthesis filters is not sufficient, make q_n = q_n^0 ( b + (1 − b) cos{ π(n − 255.5)/(256 r_g) } ) (n = 0, 1, 2, ..., 511) for appropriate 0 < b < 1 and r_g > 1,
where q_n^0 and q_n are the coefficients of the initial and the derived synthesis filter, respectively.

(f) Optimize the analysis and the synthesis filter banks iteratively by the reciprocal relation. That is, the replacements (Φ_m(z))_{i−1} → (H_m(z))_i and (H_m(z))_{i−1} → (Φ_m(z))_i are performed iteratively, where (·)_i (i = 1, 2, 3, ...) denotes the i-th stage of the iteration for a pair of H_m(z) and Φ_m(z). Let H̄_m(z) = Σ_{n=0}^{511} h̄_{m,n} z^{−n} be the resultant analysis filters.

(g) Derive the coefficients of the new prototype filter from the relation p*_n = h̄_{0,n} / [ 2.0 cos{ (π/32)(0.5)(n − 511/2.0) + π/4.0 } ] (n = 0, 1, 2, ..., N), where h̄_{0,n} (n = 0, 1, 2, ..., 511) are the coefficients of the optimized lowpass analysis filter H̄_0(z).
(h) Make the linear phase analysis filter bank H^b_m(z) (m = 0, 1, 2, ..., M − 1) having the coefficients h^b_{m,n} = 2.0 · p*_n · cos{ (π/32)(m + 0.5)(n − 511/2.0) } (m = 0, 2, 4, ..., 30) and h^b_{m,n} = 2.0 · p*_n · sin{ (π/32)(m + 0.5)(n − 511/2.0) } (m = 1, 3, 5, ..., 31), where n = 0, 1, 2, ..., 511.

(i) Derive the synthesis filter bank by the presented optimization. Then, as a direct consequence of the symmetrical arrangement of the coefficients of the analysis filters, we can easily prove that all the resultant synthesis filters are linear phase. Let Φ_m(z) = Σ_{n=0}^{N} φ_{m,n} z^{−n} (m = 0, 1, 2, ..., 31) be the transfer functions of the resultant synthesis filters.
(j) Make new analysis filters defined by H_m(z) = α_m · H^b_m(z) + (1.0 − α_m) · Φ_m(z) (m = 0, 1, 2, ..., 31; n = 0, 1, 2, ..., 511), where α_m (m = 0, 1, 2, ..., 31) are appropriate scaling factors satisfying 0 < α_m < 1. These analysis filters are linear phase as well.

(k) Derive the synthesis filter bank by the presented optimization. Then, by the symmetrical arrangement of the coefficients of these analysis filters, it is shown that all the resultant synthesis filters are also linear phase.

(l) Optimize the analysis and the synthesis filter banks iteratively by the reciprocal relation.

We can obtain an example of a linear phase filter bank with M = 32 channels and size N + 1 = 512. Before we derive this linear phase filter bank, we perform 10 iterations. In this example, d = 256 is used. Although the effect of the parameters f_c, η and so on is critical, in this example we use approximately f_c = 47, η = 1.0, a = 0.52, r_g = 1.4 and b = 0.5. The stopband attenuation of each analysis and synthesis filter is 99 or 100 dB.
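The pipeline of steps (a)-(d) can be sketched numerically. The window of step (a) is only partly legible in the source, so the sketch substitutes an ordinary raised-cosine-windowed sinc with illustrative parameter values (a = 0.52, a hypothetical normalized cutoff fc); only the sequence prototype, self-convolution (17), adjacent averaging (19), and cosine modulation follows the text.

```python
import numpy as np

# Sketch of steps (a)-(d). The step-(a) window is substituted (assumption):
# a raised-cosine-windowed sinc with illustrative parameters.
M = 32                       # number of channels
a, fc = 0.52, 1.0 / 64.0     # window mix and hypothetical normalized cutoff

# (a) initial prototype of size 257, centered at n = 128
n = np.arange(257) - 128
p0 = np.sinc(2 * fc * n) * (a + (1 - a) * np.cos(np.pi * n / 128))

# (b) second prototype: self-convolution (eq. (17)), size 513
p1 = np.convolve(p0, p0)

# (c) new prototype of size 512: average of adjacent coefficients (eq. (19))
ps = 0.5 * (p1[:-1] + p1[1:])

# (d) cosine-modulated analysis bank:
#     h_{m,n} = 2 p*_n cos{(pi/32)(m+0.5)(n - 511/2) + (-1)^m pi/4}
nn = np.arange(512)
H = np.array([2.0 * ps * np.cos((np.pi / M) * (m + 0.5) * (nn - 511 / 2.0)
                                + ((-1) ** m) * np.pi / 4.0)
              for m in range(M)])

assert H.shape == (32, 512)
assert np.allclose(ps, ps[::-1])  # the averaged prototype is symmetric (linear phase)
```

The symmetry check on ps illustrates why the averaging step (c) yields a linear phase prototype of even length.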
3 CONCLUSION
Although the details are omitted, it should be noted that the proposed generalized interpolatory approximation has the minimum measure of error, in a certain sense, among all the linear and nonlinear approximations using the same sample values of the signal. The presented design gives a simple way to obtain the optimum analysis/synthesis filter banks. Finally, we would like to express our sincere thanks to Professor B. G. Mertzios, Democritus University, Greece.
References
[1] P. P. Vaidyanathan: Multirate Systems and Filter Banks, PTR Prentice-Hall, Englewood Cliffs, New Jersey, 1993.
[2] T. Kida, L. P. Yoshioka, S. Takahashi and H. Kaneda: Theory on Extended Form of Interpolatory Approximation of Multi-dimensional Waves, Elec. and Comm. Japan, Part 3, Vol. 75, No. 4, pp. 26-34, 1992. Also, Trans. IEICE, Japan, Vol. 74-A, No. 6, pp. 829-839, 1991 (in Japanese).
[3] Takuro Kida: The Optimum Approximation of Multi-dimensional Signals Based on the Quantized Sample Values of Transformed Signals, submitted to IEICE Trans., Vol. E77-A, 1994.
Robustness of Multirate Filter Banks
F. N. Koumboulis 1, M. G. Skarpetis 2, and B. G. Mertzios 3

1 University of Thessaly, School of Technological Sciences, Dept. of Mechanical & Industrial Engineering, Volos, Greece. Mailing address: 53 Aftokratoros Irakliou St., 15122 Athens, Greece. Tel. +30-1-8023050, e-mail: [email protected]
2 National Technical University of Athens, Dept. of Electrical and Computer Engineering, Division of Electroscience, Greece. E-mail: [email protected]
3 Democritus University of Thrace, Dept. of Electrical and Computer Engineering, 67100 Xanthi, Greece. Fax: +30-541-26947 or 26473, e-mail: [email protected]
Abstract

The problem of designing nonmaximally decimated multirate filter banks is studied for the case of uncertain dynamic channels. The necessary and sufficient condition for the existence of an appropriate FIR and anticausal synthesis filter bank yielding perfect signal reconstruction, in spite of the channel's uncertainties, is established. The condition depends entirely upon the polyphase analysis matrix and the z-transform of the dynamic description of the uncertain channel. The general analytic expression of all polyphase synthesis matrices solving the problem is derived.

1. Introduction

The problem of designing multirate filter banks is an important signal processing design problem from both the theoretical and the practical point of view [1],[2]. The problem has attracted considerable attention and has been studied for different types of analysis and synthesis banks (see e.g. [1]-[5]), using maximally decimated [6],[7] as well as nonmaximally decimated filter banks [8]. Here, we are interested in one of the main objectives of the problem, namely that of perfectly reconstructing the input signal [6],[7]. Motivated by the many practical cases where the channel's behavior is not ideal, the case where the channel is described as a dynamic uncertain system is studied. The filter bank is considered to be nonmaximally decimated (number of channels greater than the decimation ratio). In particular, the necessary and sufficient condition for the existence of an appropriate FIR and anticausal synthesis filter bank achieving perfect reconstruction of the input signal, in spite of the channel's uncertainties, is established. The condition depends entirely upon the polyphase analysis matrix and the z-transform of the dynamic description of the uncertain channel. The general class of all polyphase synthesis matrices solving the problem, independent of the channel's uncertainties, is derived.

2. Problem Formulation

Consider the nonmaximally decimated filter bank presented in Figure 1.
Fig. 1: Nonmaximally decimated filter bank

In Fig. 1, p > M, where M is the decimation ratio and p is the number of channels. H_j(z) (j = 0, ..., p − 1) are the analysis filters and F_j(z) (j = 0, ..., p − 1) are the synthesis filters. The signal x(n) is the input signal while x̂(n) is the output signal. The design objective is to find an appropriate synthesis bank, namely appropriate filters F_0(z), F_1(z), ..., F_{p−1}(z), such that x̂(n) = x(n) (perfect reconstruction). Using the polyphase representation (Fig. 2) of the filter bank, the design objective is translated as follows: find an appropriate polyphase matrix R(z) of the synthesis bank such that R(z)E(z) = I_M, where E(z) is the polyphase matrix of the analysis bank. R(z) and E(z) are of dimensions M × p and p × M, respectively. Even though the most standard type of filter bank is the maximally decimated one, i.e. p = M, nonmaximally decimated filter banks appear to have many applications, especially in convolutional codes [8]. Here, nonmaximally decimated filter banks are used in order to compensate the errors appearing in the filter bank output x̂(n) due to uncertainties of the transmission channels. The perfect transmission of a signal via a channel is an ideal situation which facilitates the solution of the respective filtering problem. The behavior of a channel is determined by the characteristics of the medium, the properties of the signal, as well as external events. Similarly to any other physical system, a channel can be considered to have dynamic behavior. For example, consider a wire high-frequency transmission line. If the length of the transmission line is much less than the wavelength of the signal, the transmission line is described as a static system [9]. If the length of the line is about equal to or k times greater (with k small) than the wavelength of
Fig. 2: Polyphase representation of a nonmaximally decimated filter bank

the signal, the channel behaves as a dynamical system [9]. For a sufficiently long transmission line, the channel behaves as a distributed-parameter system [9]. The values of the parameters of the three types of systems described above, i.e. the values of the parameters of the dynamics of a channel, depend upon other physical parameters, e.g. temperature and magnetization. In many cases these physical parameters are not known with full accuracy. So they can be considered as uncertainties, and consequently the channel is described by a dynamic uncertain model. In general, the parameters of the dynamic model are nonlinear functions of the uncertainties (e.g. the dependence of resistance upon temperature). In this paper, the problem of designing multirate filter banks is studied for the case where the channel has dynamic uncertain behavior. To preserve generality, the channel is assumed to be affected by l uncertainties q_1, ..., q_l, while the dynamics of the i-th channel are assumed to be described by the transfer function d_i(z, q) (with q = [q_1, ..., q_l] ∈ Q, the uncertainty domain). In particular, the problem is oriented by the nonmaximally decimated filter bank with uncertain dynamic channels, presented in Fig. 3,
Fig. 3: Nonmaximally decimated filter bank with uncertain dynamic channels

or equivalently by the polyphase representation given in Fig. 4:
Fig. 4: Polyphase representation of a nonmaximally decimated filter bank with uncertain dynamic channels

The design objective is to find a polyphase synthesis matrix R(z) which will eliminate not only the influence of the polyphase analysis matrix E(z) on the output signal x̂(n) but also the influence of the dynamics of the uncertain channel. Hence, the problem consists in finding an R(z) such that

R(z) diagonal_{j=0,...,p−1}{ d_j(z, q) } E(z) = I_M      (2.2)
The dynamics of different channels are considered to be, in general, different. This can easily be understood after recalling that different signals travel through the channels (linearized dynamics), as well as that in many practical cases (e.g. encryption [10]) different media with different characteristics are often used.
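Condition (2.2) can be illustrated on a deliberately tiny static instance. The setup below is an assumption for illustration only (M = 1, p = 3 channels with gains d_j(q) = 1 + q·c_j and a single scalar uncertainty q; these numbers are not from the paper): because p > M, the synthesis vector has enough degrees of freedom to cancel the uncertainty for every q.

```python
import numpy as np

# Toy static instance of condition (2.2), assumed for illustration (not the
# paper's model): M = 1, p = 3 channels with gains d_j(q) = 1 + q*c_j and a
# single scalar uncertainty q. Nonmaximal decimation (p > M) leaves enough
# freedom in the synthesis vector r to cancel the uncertainty.
e = np.array([1.0, 2.0, 1.0])    # polyphase analysis "matrix" E (here 3 x 1)
c = np.array([1.0, -1.0, 2.0])   # uncertainty structure of the channels

# R diag(d_j(q)) E = 1 for all q  <=>  sum_j r_j e_j = 1 and sum_j r_j e_j c_j = 0
A = np.vstack([e, e * c])        # 2 equations, 3 unknowns
r, *_ = np.linalg.lstsq(A, np.array([1.0, 0.0]), rcond=None)

for q in (-0.7, 0.0, 0.3, 1.5):  # perfect reconstruction for every uncertainty value
    assert np.isclose(np.sum(r * (1.0 + q * c) * e), 1.0)
```

The two linear equations here are the toy counterpart of the q-independent algebraic system (3.7) derived in the next section.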
With regard to E(z), or equivalently with regard to the analysis filters H_j(z), j = 0, ..., p − 1, no limitations are imposed except that of causality. The polyphase matrix R(z) is considered to be anticausal and FIR ([6],[7]), thus corresponding to anticausal FIR filters F_j(z), j = 0, ..., p − 1.

3. Solution of the Problem

Define

B(z, q) = diagonal_{j=0,...,p−1}{ d_j(z, q) } E(z)      (3.1)

Based upon the above definition, equation (2.2) takes on the form

R(z) B(z, q) = I_M      (3.2)

As already mentioned, E(z) is causal. The channel (a deterministic system) is considered to be causal. So the rational matrix B(z, q) (rational with respect to z) is causal, and thus it can be expressed in polynomial ratio form as follows:

B(z, q) = [ B_n(q) z^n + B_{n−1}(q) z^{n−1} + ... + B_0(q) z^0 ] / [ z^n + b_{n−1}(q) z^{n−1} + ... + b_0(q) z^0 ]      (3.3)

where B_j(q) ∈ [ℱ(q)]^{p×M} and b_j(q) ∈ ℱ(q) are nonlinear functions of the uncertainty vector q (with ℱ(q) the set of real functions of q). The integer n represents an upper bound of the realization degree of B(z, q). As already mentioned, the polyphase synthesis matrix is considered to be FIR and anticausal, i.e. to be of the form

R(z) = R_0 z^0 + R_1 z^1 + ... + R_m z^m      (3.4)

where m is the maximum number of advances. Substitution of (3.3) and (3.4) into (3.2) yields

[ R_0 z^0 + R_1 z^1 + ... + R_m z^m ][ B_n(q) z^n + B_{n−1}(q) z^{n−1} + ... + B_0(q) z^0 ] = [ z^n + b_{n−1}(q) z^{n−1} + ... + b_0(q) z^0 ] I_M      (3.5)

Equating like powers of z on both sides of equation (3.5), defining B_j(q) = 0 for j < 0, and defining the block Toeplitz matrix

B_E(q) =
[ B_n(q)   B_{n−1}(q)  ...  B_0(q)   0        ...  0      ]
[ 0        B_n(q)      ...  B_1(q)   B_0(q)   ...  0      ]
[ ...                  ...                    ...         ]
[ 0        0           ...  B_n(q)   ...      ...  B_0(q) ]      (3.6a)

B_R(q) = [ 0  0  ...  0  I_M  b_{n−1}(q) I_M  ...  b_0(q) I_M ]      (3.6b)

R_E = [ R_m  R_{m−1}  ...  R_1  R_0 ]      (3.6c)

equation (3.5) can be expressed more compactly as the following algebraic equation:

R_E B_E(q) = B_R(q)      (3.7)

Equation (3.7) is linear, with known matrices depending upon the uncertainties and an unknown R_E which does not depend upon the uncertainties. According to the Appendix (see relations (A.6)-(A.8)), equation (3.7) is solvable if and only if
rank_ℜ [ B_E(q)
         B_R(q) ] = rank_ℜ [ B_E(q) ]      (3.8)
If the condition (3.8) is satisfied, then according to (A.8) the general solution of equation (3.7) is

R_E = T [ B_E(q) ]_ℜ^⊥ + ( B_R(q) \ B_E(q) )_ℜ      (3.9)

where T is an arbitrary matrix. From (3.8) and (3.9) the following theorems are derived.

Theorem 3.1: For the multirate filter bank of Fig. 4, there exists an anticausal and FIR polyphase synthesis matrix R(z) of order m, yielding perfect reconstruction of x(n), i.e. x̂(n) = x(n), in spite of the channel's uncertain dynamics, if and only if the condition (3.8) is satisfied.

Theorem 3.2: For the multirate filter bank of Fig. 4, the general form of the anticausal FIR polyphase synthesis matrix R(z) of order m, yielding perfect reconstruction of x(n) in spite of the channel's uncertain dynamics, is

R(z) = T R_a(z) + R_b(z)      (3.11)

where R_b(z) = R_{b0} z^0 + ... + R_{bm} z^m and R_a(z) = R_{a0} z^0 + ... + R_{am} z^m, and where [ R_{bm} ... R_{b0} ] = ( B_R(q) \ B_E(q) )_ℜ and [ R_{am} ... R_{a0} ] = [ B_E(q) ]_ℜ^⊥. The matrix T is an arbitrary matrix.

Based upon Theorem 3.2 and the relation between the polyphase synthesis matrix and the respective synthesis filter bank [1],[2], the general form of the synthesis filters F_j(z) (j = 0, ..., p − 1) can easily be derived.

4. Conclusions

The problem of designing nonmaximally decimated multirate filter banks has been studied for the case of uncertain dynamic channels. The necessary and sufficient condition for the existence of an appropriate FIR and anticausal synthesis filter bank yielding perfect signal reconstruction, in spite of the channel's uncertainties, has been established (Theorem 3.1). The general analytic expression of all fixed-order polyphase synthesis matrices solving the problem has been derived (Theorem 3.2). Many aspects of the problem remain to be solved, e.g. the minimal order of the polyphase synthesis matrix solving the problem. The application of the present results to the case of high-frequency transmission lines, yielding an RLGC channel model, is currently under completion.

Appendix
Here, some useful mathematical definitions and properties, introduced in [11], are presented. Consider the row vector set {w_1(q), ..., w_v(q)}, where w_i(q) ∈ [ℱ(q)]^{1×p} (i = 1, ..., v) is a nonlinear vector map Q → [ℱ(q)]^{1×p}. The vectors w_i(q) (i = 1, ..., v) are said to be linearly dependent among themselves over ℜ if there exist x_i ∈ ℜ (i = 1, ..., v) with (x_1, ..., x_v) ≠ 0 such that x_1 w_1(q) + ... + x_v w_v(q) = 0, ∀q ∈ Q. If the vectors w_i(q) are not dependent over ℜ, they are called independent over ℜ. Consider the subset N(q) ⊆ [ℱ(q)]^{1×p}, where N(q) = { w(q) ∈ [ℱ(q)]^{1×p} :
w(q) = x_1 w_1(q) + ... + x_v w_v(q), ∀q ∈ Q, x_i ∈ ℜ (i = 1, ..., v) }. It can readily be shown that N(q) is a finite-dimensional vector space over the field of real numbers ℜ. Consider the matrix W(q) = [ [w_1(q)]^T, ..., [w_v(q)]^T ]^T. The image of W(q) over ℜ is defined to be Im_ℜ{W(q)} = N(q). Let w_{f_1}(q), ..., w_{f_μ}(q) be the linearly independent (over ℜ) vectors of {w_1(q), ..., w_v(q)}. The rest of the vectors, say w_{o_1}(q), ..., w_{o_{v−μ}}(q), are linearly dependent (over ℜ) upon the vectors {w_{f_1}(q), ..., w_{f_μ}(q)}. Thus, {w_{f_1}(q), ..., w_{f_μ}(q)} is a base of Im_ℜ{W(q)}, and the dimension dim{Im_ℜ{W(q)}} of the space Im_ℜ{W(q)} is equal to μ. The rank (over the field of real numbers) of W(q) is defined as follows:

rank_ℜ{W(q)} = dim{N(q)} = dim{Im_ℜ{W(q)}}      (A.1)

Consider the following subset of ℜ^v: Ξ = { z = [z_1, ..., z_v] ∈ ℜ^v : z_1 w_1(q) + ... + z_v w_v(q) = 0, ∀q ∈ Q }. This subset is a subspace of ℜ^v. The kernel of W(q) over ℜ is defined to be Ker_ℜ{W(q)} = Ξ. Note that

dim{Ker_ℜ{W(q)}} + dim{Im_ℜ{W(q)}} = v      (A.2)

To derive the q-independent matrix corresponding to Ker_ℜ{W(q)}, define {z_1^w, ..., z_{v−μ}^w} to be a base of Ker_ℜ{W(q)} (z_i^w ∈ ℜ^v). Then, the matrix corresponding to Ker_ℜ{W(q)} is
[ W(q) ]_ℜ^⊥ = [ (z_1^w)^T, ..., (z_{v−μ}^w)^T ]^T      (A.3)
Let e(q) ∈ Im_ℜ{W(q)}. Thus, e(q) ∈ Im_ℜ{W_1(q)}, where W_1(q) = [ [w_{f_1}(q)]^T, ..., [w_{f_μ}(q)]^T ]^T. Since the rows of W_1(q) are linearly independent over ℜ, there exists a unique vector x^+ ∈ ℜ^{1×μ} such that e(q) = x^+ W_1(q). The elements of x^+ are the components of e(q) in Im_ℜ{W(q)} with respect to the base {w_{f_1}(q), ..., w_{f_μ}(q)}. Augment x^+ with zero elements placed at positions corresponding to the linearly dependent (over ℜ) rows of W(q). Based on this augmentation, the following generalization of the components of e(q) in Im_ℜ{W(q)} is derived:

( e(q) \ W(q) )_ℜ = [ χ_1, ..., χ_v ]      (A.4)

where χ_k = x_r^+ if k = f_r ∈ {f_1, ..., f_μ} and χ_k = 0 if k = o_r ∈ {o_1, ..., o_{v−μ}} (k = 1, ..., v), and where x_r^+ is the r-th element of x^+. If the vectors w_{f_1}(q), ..., w_{f_μ}(q) are selected by searching the vectors w_1(q), ..., w_v(q) from the first to the last, then the matrix W_1(q) and the vector ( e(q) \ W(q) )_ℜ
are uniquely determined. Let E(q) = [ [e_1(q)]^T, ..., [e_{v*}(q)]^T ]^T be a v* × p matrix with e_i(q) ∈ Im_ℜ{W(q)}, i = 1, ..., v*. Then (A.4) can be generalized as follows:

( E(q) \ W(q) )_ℜ = [ ( e_1(q) \ W(q) )_ℜ^T, ..., ( e_{v*}(q) \ W(q) )_ℜ^T ]^T      (A.5)
Numerical algorithms for the computation of all the above quantities can be found in [12]. In what follows, the solution of a linear nonhomogeneous algebraic matrix equation with data in ℱ(q) and unknowns in ℜ, derived in [11], is presented. Consider the equation

X W(q) = E(q),  X ∈ ℜ^{v*×v}      (A.6)

The matrices E(q) and W(q) are known nonlinear maps of q. The problem consists in finding X such that (A.6) is satisfied. Clearly, the problem is solvable if and only if the rows of E(q) belong to Im_ℜ{W(q)}, or equivalently (from (A.1)) if and only if
rank_ℜ [ W(q)
         E(q) ] = rank_ℜ [ W(q) ]      (A.7)

If condition (A.7) is satisfied, then (according to (A.3) and (A.4)) the general solution of (A.6) is

X = T [ W(q) ]_ℜ^⊥ + ( E(q) \ W(q) )_ℜ      (A.8)

where T is an arbitrary matrix. Note that T, ( E(q) \ W(q) )_ℜ and [ W(q) ]_ℜ^⊥ are independent of q.
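Equation (A.6) can be approached numerically in a simple way that is not the paper's algorithm but illustrates the idea: sample q over the uncertainty domain, stack the sampled systems side by side, and solve one least-squares problem for the constant X; consistency of the stacked system plays the role of the rank condition (A.7). The maps W and E below are toy assumptions.

```python
import numpy as np

# Numerical sketch of X W(q) = E(q) with a constant unknown X (toy maps, not
# from the paper): sample q, stack the systems, solve by least squares, and
# check consistency (the numerical stand-in for rank condition (A.7)).
def solve_constant_X(W_of_q, E_of_q, q_samples):
    Ws = np.hstack([W_of_q(q) for q in q_samples])   # v  x (p * #samples)
    Es = np.hstack([E_of_q(q) for q in q_samples])   # v* x (p * #samples)
    Xt, *_ = np.linalg.lstsq(Ws.T, Es.T, rcond=None)
    X = Xt.T
    return X, bool(np.allclose(X @ Ws, Es))          # (X, solvable?)

W = lambda q: np.array([[1.0, q], [q, 1.0]])         # v = 2 rows, p = 2 columns
E = lambda q: np.array([[1.0 + q, 1.0 + q]])         # equals [1, 1] @ W(q) for every q

X, ok = solve_constant_X(W, E, q_samples=np.linspace(-1.0, 1.0, 7))
assert ok and np.allclose(X, [[1.0, 1.0]])
```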
References
[1] Vaidyanathan, P. P., 1993, Multirate Systems and Filter Banks, Englewood Cliffs, NJ: Prentice-Hall.
[2] Crochiere, R. E. and Rabiner, L. R., 1983, Multirate Digital Signal Processing, Englewood Cliffs, NJ: Prentice-Hall.
[3] Vaidyanathan, P. P., 1990, Multirate digital filters, filter banks, polyphase networks, and applications: A tutorial, Proc. IEEE, vol. 78, pp. 56-93.
[4] Vetterli, M., 1987, A theory of multirate filter banks, IEEE Trans. Acoust. Speech Signal Processing, vol. 35, pp. 356-372.
[5] Smith, M. J. T. and Barnwell, T. P., III, 1987, A new filter-bank theory for time-frequency representation, IEEE Trans. Acoust. Speech Signal Processing, vol. 35, pp. 314-327.
[6] Vaidyanathan, P. P. and Chen, T., 1995, Role of anticausal inverses in multirate filter banks, Part I: System-theoretic fundamentals, IEEE Trans. Signal Processing, vol. 43, pp. 1090-1102.
[7] Vaidyanathan, P. P. and Chen, T., 1995, Role of anticausal inverses in multirate filter banks, Part II: The FIR case, factorizations, and biorthogonal lapped transforms, IEEE Trans. Signal Processing, vol. 43, pp. 1103-1115.
[8] Forney, G. D., Jr., 1970, Convolutional codes I: Algebraic structure, IEEE Trans. Info. Theory, vol. 16, pp. 720-738.
[9] Combes, P. F., 1990, Microwave Transmission for Telecommunications, New York: Wiley.
[10] Schneier, B., 1994, Applied Cryptography, New York: Wiley.
[11] Koumboulis, F. N. and Skarpetis, M. G., Input-output decoupling for systems with nonlinear uncertain structure, J. Franklin Inst., in press.
[12] Koumboulis, F. N. and Skarpetis, M. G., Robust triangular decoupling with application to 4WS cars, submitted.
Designing and Learning Algorithm of Neural Networks for Pattern Recognition

Hiroki TAKAHASHI, Masayuki NAKAJIMA
Graduate School of Information Science & Engineering, Tokyo Institute of Technology
2-12-1 Ookayama, Meguro-ku, Tokyo 152, Japan

Abstract

In the case of pattern recognition using neural networks, it is very difficult for researchers or users to design them. In this paper, a method of learning and designing feedforward neural networks is discussed. In the proposed method, a neural network is regarded as one individual, and neural networks with the same structure are regarded as one species. These networks are evaluated by their grade of training, and they evolve according to an evolution rule proposed in this paper. The designing and training of neural networks which perform handwritten KATAKANA recognition are described, and the efficiency of the proposed method is discussed.
1 Introduction

There are many studies on neural network models which have the function of learning. However, it is not clear how the signal processing is performed in neural networks, because the nonlinear units operate in parallel. In the case of designing neural networks which perform pattern recognition, researchers design the network structures and learning parameters, such as the learning rate and the coefficient of the momentum term, by trial and error based on their knowledge and experience. Especially in the case of character recognition with neural networks, it is very difficult for researchers to design by trial and error, because the network is large and it takes a long time to confirm its performance. There are many studies on designing neural networks [1][2][3][4][5]. These studies are classified into two kinds of approaches. One is the direct encoding method [3] and the other is the grammatical encoding method [1][2]. The direct encoding method has some restrictions on neural network structures, because the network connectivities are encoded into a matrix directly. The grammatical encoding method is more flexible than the direct method; however, it is difficult to obtain an optimal network structure. Moreover, a structural evolution method is proposed in [4]. The method makes it possible to generate any kind of neural network structure, but the connections have only three kinds of connection weights. Therefore, it is difficult to generate networks for complex pattern recognition. The authors proposed a method of designing optimal neural network structures using GA (Genetic Algorithms) [6][7]. Moreover, we also designed and trained neural network structures which classified some simple patterns [8]. In this paper, a method of learning and designing feedforward neural network structures based on an evolutional method is discussed. In the proposed method, a neural network is regarded as one individual, and neural networks with the same structure are regarded as one species. The decision of network structures and the training of the neural networks are estimated based on the fitness values of individuals and species. The designing and training of neural networks which perform handwritten character recognition are described, and the efficiency of the proposed method is discussed.
2 Genotype coding

Table 1 shows the genotype codings employed in the proposed method.

Table 1: Genotype codings of neural network.
  genotype1: N                     number of neural network layers
  genotype2: n = (n_1, ..., n_N)   unit numbers of each layer
  genotype3: η                     learning rate
  genotype4: w = (w_1, ..., w_l)   connection weights
Genotype1 and 2 present the structure of the neural network. The length of genotype4 is based on the number of connections l, which is restricted by the network structure given in genotype1 and 2. The length of genotype4 is given by the following formula:

l = Σ_{k=1}^{N−1} (n_k + 1) n_{k+1}      (1)

where l gives the number of connection weights (the +1 accounting for the bias input of each unit).
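Formula (1) can be written directly as a one-line function (hypothetical helper name, for illustration):

```python
def genotype4_length(units):
    """Number of connection weights for a feedforward net with layer sizes
    units = (n_1, ..., n_N); the (+1) term accounts for each unit's bias input."""
    return sum((units[k] + 1) * units[k + 1] for k in range(len(units) - 1))

# e.g. the (2, 2, 1) structure of Fig. 1: (2+1)*2 + (2+1)*1 = 9 weights
assert genotype4_length((2, 2, 1)) == 9
```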
3 Definition of species

The training of neural networks is a minimum search problem in the error-weights space. If the structures of the neural networks are different, the shapes of the error spaces become different; therefore, it is difficult to compare the search positions in neural networks of different structures. In this paper, the individuals whose neural network structures are the same are defined as the same species. That is, individuals with the same phenotypes, represented by genotype1 and 2 shown in Table 1, are regarded as the same species.
4 Definition of evolution rule of individuals and evaluation

In this section, operations between the individuals in the same species are described. The operation described here is performed every 10 epochs; in the other epochs, the weights represented by genotype4 are changed according to the direction of gradient descent in weight space.

1) Evaluation
The fitness value f(I_i) of individual I_i is defined by the M.S.E. (mean square error) calculated from the output values of the network and its target values. Therefore, the smaller the fitness value is, the more superior the individual is.

f(I_i) = Σ_{p=1}^{N_p} Σ_{k=1}^{n_N} ( o_{pk}^N − t_{pk} )²      (2)

N: number of layers;  N_p: number of patterns;  n_N: unit number of the N-th layer;  o_{pk}^N: output value of the k-th unit of the N-th layer for pattern p;  t_{pk}: target value of the k-th unit for pattern p.
2) Selection
In the same species, individuals with large fitness values are removed and new individuals are created according to a selection ratio P_s.

3) Crossover and Multiplication
In this method, two kinds of generating operations are defined. One is the crossover operation and the other is the multiplication operation. The crossover operation generates a new individual from two different individuals; the new individual inherits the features of its parents. Therefore, we employ a one-point crossover operation at crossover ratio P_c. The multiplication operation generates a new individual from one individual. In the multiplication operation, new individuals are multiplied according to the following formulas, in order to create them distributed near the superior individual I_s, at multiplication ratio P_i.
w_new = w_s + 0.4 × r(w_s)      (3)

η_new = η_s + 0.4 × r(η_s)      (4)

w_s: connection weights of individual I_s;  η_s: learning parameter of individual I_s;  w_new: connection weights of individual I_new;  η_new: learning parameter of individual I_new. The function r(a) produces random numbers x in the range −|a| < x ≤ |a|. (1 − P_c − P_i)N_s individuals are created when the number of individuals in the same species is N_s, where P_c + P_i ≤ 1.
4) Mutation
There are two kinds of mutation operations, as follows:
a) A genotype is initialized at the mutation ratio P_m/2.
b) A genotype is evolved according to formula (3) or formula (4) at the mutation ratio P_m/2.
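One generation of steps 1)-4) can be sketched as follows. The individual encoding (a flat weight list plus a learning rate) and the concrete ratio values are assumptions for illustration; only the operations, ranking by fitness, selection, one-point crossover, multiplication near the best individual via formulas (3)-(4), and mutation, follow the text.

```python
import random

# Minimal sketch of one generation within a single species (steps 1-4 above).
# Individual = (weight_list, learning_rate); ratios Ps, Pc, Pi, Pm hypothetical.
def evolve_species(population, fitness, Ps=0.2, Pc=0.3, Pi=0.3, Pm=0.1):
    # 1) evaluation + 2) selection: smaller fitness is better; drop the worst Ps
    ranked = sorted(population, key=fitness)
    survivors = ranked[:max(2, int(len(ranked) * (1 - Ps)))]
    best_w, best_eta = survivors[0]
    children = []
    # 3) crossover: one-point crossover between two survivors
    for _ in range(int(len(population) * Pc)):
        (wa, ea), (wb, eb) = random.sample(survivors, 2)
        cut = random.randrange(1, len(wa))
        children.append((wa[:cut] + wb[cut:], random.choice([ea, eb])))
    # 3) multiplication: offspring near the best individual, formulas (3)-(4)
    for _ in range(int(len(population) * Pi)):
        children.append(([w + 0.4 * random.uniform(-abs(w), abs(w)) for w in best_w],
                         best_eta + 0.4 * random.uniform(-abs(best_eta), abs(best_eta))))
    out = survivors + children
    # 4) mutation: with probability Pm/2 reinitialize, with Pm/2 perturb
    for i in range(len(out)):
        u = random.random()
        if u < Pm / 2:
            out[i] = ([random.uniform(-1, 1) for _ in out[i][0]], random.uniform(0.01, 1.0))
        elif u < Pm:
            w, eta = out[i]
            out[i] = ([x + 0.4 * random.uniform(-abs(x), abs(x)) for x in w],
                      eta + 0.4 * random.uniform(-abs(eta), abs(eta)))
    return out

random.seed(1)
pop = [([random.uniform(-1, 1) for _ in range(4)], 0.5) for _ in range(10)]
sse = lambda ind: sum(x * x for x in ind[0])   # toy fitness standing in for eq. (2)
nxt = evolve_species(pop, sse)
assert len(nxt) == 8 + 3 + 3 and all(len(ind[0]) == 4 for ind in nxt)
```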
5 Definition of evolution rule of species and evaluation

In this section, we make use of the characteristics of feedforward neural network structures to define an evolution rule of species.

5.1 Investigation of feedforward neural network structures

An N-layered feedforward neural network is represented by the unit number of each layer as n = (n_1, n_2, ..., n_N). We discuss a network which has 2 units in the input layer and 1 unit in the output layer. Fig. 1 shows structures of neural networks classified by the number of layers and the unit number of each layer. It follows from Fig. 1 that the following two kinds of rules are enough to produce any kind of neural network structure.
1. The number of layers increases by 1 and one unit is added just before the output layer. For example, this corresponds to changes such as (a) → (b) → (f) or (c) → (g).
2. The number of units just before the output layer increases by 1. For example, this corresponds to changes such as (b) → (c) → (d) → (e) or (g) → (k) → (n).
5.2 Evolution rule of species

The features of feedforward neural network structures mentioned in the previous subsection are employed in our evolution rule of species. The change rule is shown in Fig. 2. The change rule produces at most three kinds of neural networks from one kind of structure. It is given by the following formula (5):

(n_1, ..., n_{N−1}, n_N) →
  (n_1, ..., n_{N−1} + 1, n_N)
  (n_1, ..., n_{N−1}, 1, n_N)
  (n_1, ..., n_{N−1} − 1, n_N)  if n_{N−1} ≥ 2
  (n_1, ..., n_{N−2}, n_N)      if n_{N−1} = 1      (5)
Fig. 1: Structures of neural networks, classified by the number of layers and the unit numbers of each layer (e.g. (2,1); (2,1,1), (2,2,1), (2,3,1), (2,4,1); (2,1,1,1), (2,2,1,1), (2,1,2,1), ...).

Fig. 2: Changes of neural network structures.

5.3 Evaluation and multiplication rule of species
1) Evaluation
In the evaluation of species, there are two kinds of criteria:
a) The fitness value of the most superior individual in the species. The species containing the most superior individual among all individuals is regarded as the superior one.
b) The change of the fitness value of the most superior individual in the species. The evolution of a species is assumed to be delayed when there are few changes in the fitness value of its superior individual over some generations.

2) Multiplication
a) Among the species ranked by evaluation a), the top 25% of species increase their number of individuals. When the number of individuals in a species increases up to a certain number, the evolution of the species occurs according to formula (5). On the other hand, the bottom 25% of species decrease their number of individuals.
b) By evaluation b), when the evolution of a species is delayed, its number of individuals decreases; when the evolution proceeds, its number of individuals increases.
c) When a species is superior but its evolution is delayed, its number of individuals does not decrease below a certain number. On the other hand, when a species is inferior, it becomes extinct. In that case, the evolution of species occurs according to formula (5).
6 Experimental results
The neural network structures recognize the handwritten characters illustrated in Fig.3. These images are 17 x 17 pixels. We use 10 kinds of training patterns and 10 kinds of recognition patterns, respectively. Initially, ten kinds of species are generated randomly, each containing ten kinds of individuals. The state of evolution of the species and the changes of the fitness values of the individuals are shown in Fig.4.
Fig.3 Experimental Handwritten characters.
Fig.4 Experimental results.

7 Conclusions
In this paper, an evolution rule for feedforward neural network structures has been proposed. Moreover, we exploit the characteristics of the optimal search problem to train neural networks using an evolutionary method.
References
[1] KITANO, H.: "Designing Neural Networks using Genetic Algorithms with Graph Generation System", Complex Systems, 4, 4, pp.461-476 (1991).
[2] KOZA, J. R.: Genetic Programming, The MIT Press (1992).
[3] MILLER, G. and TODD, P.: "Designing neural networks using Genetic Algorithms", Proc. 3rd International Conference on Genetic Algorithms, pp.379-384 (1989).
[4] NAGAO, T., AGUI, T. and NAGAHASHI, H.: "Structural Evolution of Neural Networks Having Arbitrary Connections by a Genetic Method", IEICE Trans., E76-D, 6, pp.689-697 (1993).
[5] NAGAO, T., AGUI, T. and NAGAHASHI, H.: "An Automatic GA-Based Construction of Neural Networks for Motion Control of Virtual Life", IEICE Trans., J78-D-II, 7, pp.1150-1152 (1995).
[6] TAKAHASHI, H., AGUI, T. and NAGAHASHI, H.: "Designing adaptive neural network architectures and their learning parameters using genetic algorithms", SPIE Aerospace and Remote Sensing '93 (1993).
[7] TAKAHASHI, H., AGUI, T. and NAGAHASHI, H.: "Designing neural networks using several kinds of activation functions by genetic algorithms", 3rd International Conference on Computer Graphics and Image Processing GKPO'94, pp.393-402 (1994).
[8] TAKAHASHI, H. and NAKAJIMA, M.: "Designing and Learning Feedforward Neural Networks using Structure Evolution Rule", Digital Image Computing: Techniques and Applications, pp.625-630 (1995).
STATISTICAL COMPARISON OF MINIMUM CROSS ENTROPY SPECTRAL ESTIMATORS

R C Papademetriou, Member, IEEE
Department of Electrical and Electronic Engineering, University of Portsmouth, Portsmouth PO1 3DJ, England

ABSTRACT

The performance of two computationally intensive non-linear power spectral estimation methods is studied from the viewpoint of resolution. The methods, which explicitly include a prior estimate of the unknown true power spectrum, are the cross-entropy (CE) method and the spectral cross-entropy (SCE) method. The data model used comprises two sinusoidal signals of equal amplitudes immersed in 1/f noise. CE appears to have the overall edge in this study.
1. INTRODUCTION
In minimum cross (or relative) entropy spectral analysis (MCESA), two methods are in use: the cross-entropy (CE) method [1] and the spectral cross-entropy (SCE) method [2, 3, 4]. Their common characteristic is the explicit inclusion of a prior estimate of the true power spectrum. Their conceptual difference is that in the CE method the underlying random variables are the coefficients of a Fourier series model and the spectral powers are expected values, whereas in the SCE method the spectral power, properly normalized, is used as a probability density function (pdf) and the underlying random variable is the frequency. A few efforts have been made to study and compare the properties of CE and SCE [5, 6, 7]. This paper attempts a comparative statistical study of the two MCE spectrum estimators through numerical examples. Performance is judged on the basis of such factors as peak location, minimum peak separation for resolution, and high resolution probability.

2. COMPARISON OF POWER SPECTRA
The data model used here comprises two equal-strength sinusoids in background 1/f noise. More specifically, for the two equal-amplitude, different-frequency sinusoids, N samples (here N = 200) were generated from the series

y_j = 2 sin(2π f₁ t_j + Φ₁) + 2 sin(2π f₂ t_j + Φ₂)   (1)

where t_j = 0, 1, 2, ..., N-1 and Φ₁, Φ₂ are random initial phases (the factor of two accounts for negative frequencies). Realizations of the coloured (1/f) noise process were easily generated by passing Gaussian white noise (GWN) through a finite impulse response (FIR) filter with a frequency response of the form 1/√f over the normalized frequency range 0 < f ≤ 0.5 [8]. From the time-domain samples
z_j = y_j + n_j   (2)

with n_j realization samples of the 1/f noise, the autocorrelations R_r, r = 0, 1, ..., 5, were computed by means of the biased estimate

R_r = (1/N) Σ_{j=1}^{N-r} z_j z_{j+r}   (3)

(which guarantees positive-definiteness), and then the CE and SCE spectra were estimated from these autocorrelations, always assuming the background noise as the prior power spectrum.
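The data model of Eqs. (1)-(3) can be simulated as follows. This is a minimal sketch: the FFT-based 1/f noise generator stands in for the FIR filtering of Gaussian white noise used in the paper, and no SNR scaling is applied.

```python
import numpy as np

rng = np.random.default_rng(0)
N, f1, f2 = 200, 0.165, 0.315
t = np.arange(N)
phi1, phi2 = rng.uniform(0, 2 * np.pi, 2)
# Eq. (1): two equal-amplitude sinusoids with random phases.
y = 2 * np.sin(2 * np.pi * f1 * t + phi1) + 2 * np.sin(2 * np.pi * f2 * t + phi2)

# 1/f noise: shape white Gaussian noise by 1/sqrt(f) in the frequency domain.
w = rng.standard_normal(N)
W = np.fft.rfft(w)
f = np.fft.rfftfreq(N)
f[0] = f[1]                      # avoid division by zero at DC
n = np.fft.irfft(W / np.sqrt(f), N)

z = y + n                        # Eq. (2)

def biased_autocorr(z, max_lag):
    """Biased estimate R_r = (1/N) sum_j z_j z_{j+r}, Eq. (3)."""
    Nz = len(z)
    return np.array([np.dot(z[:Nz - r], z[r:]) / Nz for r in range(max_lag + 1)])

R = biased_autocorr(z, 5)        # R_0 ... R_5, as used in the paper
```

The CE and SCE estimators themselves (not shown) would then be fitted to these six autocorrelation values with the noise spectrum as prior.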
Figure 1 - CE vs SCE estimates for different SNRs
The number of realizations used here was fifty. Figure 1 gives the (average) CE and SCE spectra for different SNR values. The two sinusoids were arbitrarily located at f₁ = 0.165 and f₂ = 0.315. The CE method can resolve the two components even at low SNRs (e.g. -8 dB), while the SCE method is comparable only at higher SNRs.
3. RESOLUTION CAPABILITY
In order to compare the resolution capability of the two estimators, the resolution boundary concept [9] is used here. For each fixed SNR, this may be defined as the minimum frequency separation, Δf, necessary for resolution. In other words, the resolution boundary is a curve dividing the Δf-SNR plane into two regions: above the boundary two sinusoids may be resolved; below the boundary they will not be resolved. Values of Δf for different SNRs were found (using the same model as in Sec. 2) by bringing the second sinusoid closer to the first, which was kept at a fixed frequency (arbitrarily chosen to be 0.165), until resolution was lost. Figure 2 shows that the minimum distance Δf for the peaks to be resolved in the CE and SCE estimates decreases with increasing S/N ratio, with the CE method consistently outperforming the SCE method (i.e., the CE resolution boundary lies below the SCE one). It must be pointed out, however, that, because the resolution boundary is peak-location dependent for 1/f noise, the resolution measure Δf should not be considered an absolute measure but only an indicator of the relative performance.

Figure 2 - Resolution Boundary Curves
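The resolution-boundary search at a fixed SNR can be sketched generically as below. The peak-dip criterion, the coarse Δf grid, and the plain-periodogram estimator used for testing are assumptions introduced for illustration; the paper's CE or SCE estimator would be supplied as `estimate_spectrum`.

```python
import numpy as np

def two_peaks_resolved(spectrum, idx1, idx2):
    """Two peaks count as resolved if the spectrum dips between them."""
    lo, hi = sorted((idx1, idx2))
    valley = spectrum[lo:hi + 1].min()
    return valley < 0.9 * min(spectrum[lo], spectrum[hi])

def min_separation(estimate_spectrum, f_fixed=0.165, df_grid=None):
    """Smallest df (from a coarse grid) at which both peaks stay resolved.

    estimate_spectrum(fa, fb) must return (freqs, spectrum) for one
    realization containing sinusoids at fa and fb.
    """
    if df_grid is None:
        df_grid = np.arange(0.15, 0.005, -0.005)   # shrink the separation
    resolved_df = None
    for df in df_grid:
        freqs, spec = estimate_spectrum(f_fixed, f_fixed + df)
        i1 = np.argmin(np.abs(freqs - f_fixed))
        i2 = np.argmin(np.abs(freqs - (f_fixed + df)))
        if two_peaks_resolved(spec, i1, i2):
            resolved_df = df
        else:
            break                 # resolution lost; boundary reached
    return resolved_df
```

Repeating this for each SNR traces out one resolution-boundary curve of Figure 2.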
4. ACCURACY OF PEAK LOCATION
For the comparison of the location of spectral peak (LOSP) error in the two methods, the two sinusoids were fixed at the arbitrarily chosen frequencies f₁ = 0.165 and f₂ = 0.315. For each SNR value the error was calculated (in both methods) by the following quantitative measure:

LOSP error = |f'₁ - f₁| + |f'₂ - f₂|   (4)

where f'₁ and f'₂ represent the estimated frequency locations. The variation of the LOSP error (frequency bias) with the SNR for the two methods is shown in Figure 3. The error increases when the SNR is decreased, but always remains smaller for the CE method, which shows evident superiority. The rate of change of the bias appears to be higher for SNRs smaller than a certain value (in this case -5 dB) for both methods.
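The measure of Eq. (4) is straightforward to compute once the peak locations have been estimated; the estimated values below are hypothetical, chosen only to exercise the formula.

```python
import numpy as np

def losp_error(f_est, f_true):
    """LOSP error of Eq. (4): sum of absolute peak-location errors.

    f_est, f_true : sequences of estimated / true peak frequencies,
    matched in order (f'_1 with f_1, f'_2 with f_2).
    """
    return float(np.sum(np.abs(np.asarray(f_est) - np.asarray(f_true))))

# Hypothetical estimated locations for the true pair (0.165, 0.315):
err = losp_error([0.162, 0.320], [0.165, 0.315])   # ~0.008
```

Averaging this quantity over realizations at each SNR produces one point of the curves in Figure 3.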
Figure 3 - LOSP Error Curves

5. RESOLUTION PROBABILITY
The comparison of the two estimators is completed by considering the variation with SNR of another measure of resolvability: the high resolution probability (HRPr), defined [10] as follows. Given two frequencies f₁ and f₂, let Δ₀ = |f₂ - f₁|/2. The two frequencies are separable with probability a (i.e. the HRPr) by an estimation method if a = min(a₁, a₂), where

P{ |f'ᵢ - fᵢ| < Δ₀ } = aᵢ,   i = 1, 2   (5)

and f'₁, f'₂ are the frequency estimates obtained by the estimation method. The HRPr curves for CE and SCE, shown in Figure 4, were derived by averaging over 100 realizations for the same frequency locations as in Section 4. They reveal that the CE method, having higher probability values than the SCE method for all SNRs below zero, is the better high resolution estimation method. This conclusion is drawn from many simulation experiments with different frequency locations f₁ and f₂. The only expected difference is that, as these frequencies move towards the higher-power part of the 1/f spectrum, the HRPr curves converge to one at larger SNRs.

Figure 4 - High Resolution Probability Curves
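A Monte-Carlo evaluation of the HRPr of Eq. (5) can be sketched as follows. The estimator interface and trial count are illustrative assumptions; the paper's CE or SCE estimator would be supplied as `estimate_peaks`.

```python
import numpy as np

def high_resolution_probability(estimate_peaks, f1, f2, n_trials=100, seed=0):
    """Monte-Carlo estimate of the HRPr of Eq. (5).

    estimate_peaks(rng) must return a pair of estimated frequencies
    (f'_1, f'_2) for one noisy realization; it stands in for the
    CE or SCE estimator of the paper.
    """
    rng = np.random.default_rng(seed)
    delta0 = abs(f2 - f1) / 2
    hits1 = hits2 = 0
    for _ in range(n_trials):
        fe1, fe2 = estimate_peaks(rng)
        hits1 += abs(fe1 - f1) < delta0
        hits2 += abs(fe2 - f2) < delta0
    # a = min(a1, a2): both peaks must individually stay within delta0.
    return min(hits1, hits2) / n_trials
```

Sweeping the SNR of the underlying data model and repeating this estimate yields one HRPr curve of Figure 4.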
6. CONCLUSIONS
This paper has been concerned with studying the properties of CE and SCE estimators through numerical examples. The simulation experiments performed were for data sets consisting of sinusoids in 1/f noise. In brief, the above comparative analysis leads to the conclusion that the CE method shows evident superiority.
REFERENCES

[1] J E Shore, "Minimum Cross-Entropy Spectral Analysis", IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-29, No. 2, pp. 230-237, Apr. 1981.
[2] L Vergara-Dominguez and A R Figueiras-Vidal, "A Minimum Cross Flatness Spectral Estimator and Some Related Problems", Proc. Portugal Workshop on Signal Proc. and Its Applications, pp. A2/2/1-9, Sept. 30-Oct. 1, 1982.
[3] R W Johnson and J E Shore, "Which is the Better Entropy Expression for Speech Processing: -S log S or log S?", IEEE Trans. Acoust., Speech, Signal Proc., ASSP-32, No. 1, pp. 129-137, Feb. 1984.
[4] M A Tzannes, D Politis and N S Tzannes, "A General Method of Minimum Cross Entropy Spectral Estimation", IEEE Trans. Acoust., Speech, Signal Proc., ASSP-33, No. 3, pp. 748-752, June 1985.
[5] N A Katsakos-Mavromichalis, M A Tzannes and N S Tzannes, "Frequency Resolution: A Comparative Study of Four Entropy Methods", Kybernetes, vol. 15, pp. 25-32, 1986.
[6] R C Papademetriou and A G Constantinides, "Minimum Relative Entropy Spectral Estimation with Uncertainty in Constraints and Prior Knowledge", in Digital Signal Processing '87, V Cappellini and A G Constantinides (Eds), Elsevier Science Publishers B.V. (North-Holland), pp. 84-88, 1987.
[7] R C Papademetriou, "On the Robustness of Minimum Cross Entropy Spectral Estimators", Proc. IEEE Int. Conf. Signal Proc., Circuits and Systems (IEEE SICSPCS '95), pp. 157-162, Singapore, 3-7 July 1995.
[8] J H McClellan, T W Parks, and L R Rabiner, "FIR Linear Phase Filter Design Program", in Programs for Digital Signal Processing, IEEE Press, 1979.
[9] M Quirk and B Liu, "On the Resolution of Autoregressive Spectral Estimation", Proc. IEEE Int. Conf. Acoust., Speech, Signal Proc. (ICASSP '83), pp. 1095-1098, Boston, 1983.
[10] S G Oh and R L Kashyap, "A Robust Approach for High Resolution Frequency Estimation", IEEE Trans. Signal Proc., vol. 39, No. 3, pp. 627-643, March 1991.
GENERALIZED OPTIMUM APPROXIMATION MINIMIZING VARIOUS MEASURES OF ERROR AT THE SAME TIME

Takuro KIDA
Department of Information Processing, Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, 4259 Nagatsuta, Midori-ku, Yokohama-shi, 227 JAPAN. e-mail: [email protected]

Abstract
There are many contributions to the interpolation of signals relating to the problem of suppressing the immanent redundancy contained in a signal without vitiating the quality of the resultant approximation. Here, minimization of the approximation error becomes important. In this paper, we present the optimum time-limited interpolation functions which simultaneously minimize a wide variety of measures of approximation error. It is assumed that the Fourier spectrum of the signal has a weighted L² norm smaller than a given positive number. The sample values are selected from the output signals of the given analysis filter bank. It should be noted that the proposed approximation is superior, in some sense, to all the linear and nonlinear approximations using the same set of signals, measure of error and sample values.
1 GENERALIZED INTERPOLATION
Let L̂² be the set of functions F(w) satisfying ∫_{-∞}^{+∞} |F(w)|² dw < +∞ and F(w) = F̄(-w), where z̄ denotes the complex conjugate of a complex number z. Further, let L² be the set of signals f(t) satisfying f(t) = (1/2π) ∫_{-∞}^{+∞} F(w) e^{jwt} dw, F(w) ∈ L̂². Note that f(t) has real values. For simplicity we write f(t) ←→ F(w) for a pair of functions f(t) and F(w) satisfying the above relations. Suppose that W(w) is an arbitrary function in L̂². Then, we denote by Θ_A the set of F(w) in L̂² satisfying the following inequality with respect to a given positive number A:

(1/2π) ∫_{-∞}^{+∞} |F(w)|² / |W(w)|² dw ≤ A   (1)
Moreover, let Γ be the set of signals f(t) in L² expressed as the inverse Fourier transform of F(w) in Θ_A. Consider inputting f(t) into M linear time-invariant filters with transfer functions H_m(w) (m = 0, 1, ..., M-1). We denote by f_m(t) the corresponding output of the filter H_m(w), and let F_m(w) = H_m(w)F(w) be the Fourier spectrum of f_m(t), where m = 0, 1, ..., M-1. Further, we assume ∫_{-∞}^{+∞} |W(w)|² |H_m(w)|² dw < +∞. From the Schwarz inequalities for ∫_{-∞}^{+∞} W(w){F(w)/W(w)} dw and ∫_{-∞}^{+∞} {W(w)H_m(w)}{F(w)/W(w)} dw, we can easily prove that F(w) and F_m(w) are absolutely integrable. Hence f(t) and f_m(t) are continuous. Consider the equally spaced sample points S = {nT} (T > 0) (n = 0, ±1, ±2, ...). Further, let y_m(nT) be the sample values of f_m(t) at the sample points nT. Now, we consider the problem of approximating f(t - d) from the above sample values, where d is a given delay time. Let

g(t) = Σ_{m=0}^{M-1} Σ_{n=-∞}^{+∞} y_m(nT) φ_{m,n}(t)   (2)
be the corresponding approximation of f(t - d). The functions φ_{m,n}(t) are prescribed bounded real functions called the generalized interpolation functions or, simply, interpolation functions. We assume that these interpolation functions satisfy

|φ_{m,n}(t)| = 0 (t < nT),   |φ_{m,n}(t)| = 0 (t > nT + A)   (3)

where A > 0 (m = 0, 1, ..., M-1; n = 0, ±1, ±2, ...). Now, we assume that the above quantity A satisfies

A = NT + r   (N a non-negative integer, 0 ≤ r < T)   (4)
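The synthesis formula (2) with a kernel time-limited as in (3) can be illustrated numerically. This is only a sketch: it uses a single channel (M = 1) and a triangular kernel, which is an arbitrary illustrative choice, not the optimum interpolation function derived in the paper.

```python
import numpy as np

T = 1.0
A = 2 * T                      # support length: N = 2, r = 0 in Eq. (4)

def phi(t):
    """Triangular kernel supported on [0, A], as required by Eq. (3)."""
    return np.where((t >= 0) & (t <= A), 1 - np.abs(t - T) / T, 0.0)

def synthesize(samples, t):
    """g(t) = sum_n y(nT) * phi(t - nT): a one-channel case of Eq. (2)."""
    return sum(y * phi(t - n * T) for n, y in enumerate(samples))

# With this kernel, g(t) linearly interpolates the samples with delay d = T:
g = synthesize([0.0, 1.0, 2.0], np.array([1.5]))   # midway between samples 0 and 1
```

Because the hat is centred at T, each sample influences g only on [nT, nT + A], i.e. the approximation is causal with delay d = T, matching the time-limited structure the paper imposes.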
For a while, we consider that t is a parameter and τ is the ordinary time variable extending from -∞ to +∞. For convenience, we restate the constraints imposed on the interpolation functions using the variable τ:

|φ_{m,n}(τ)| = 0 (τ < nT),   |φ_{m,n}(τ)| = 0 (τ > nT + A)   (5)

where A > 0 (m = 0, 1, ..., M-1; n = 0, ±1, ±2, ...). Further, let R_t be the integer satisfying R_t T ≤ t < (R_t + 1)T. Then, we consider the following two intervals:

I¹_t = { t | R_t T ≤ t < R_t T + r },   I²_t = { t | R_t T + r ≤ t < (R_t + 1)T }   (6)
Moreover, we define the integers Q_t and L_t as follows: (i) if t ∈ I¹_t, Q_t = N and L_t = M(Q_t + 1) = M(N + 1); (ii) if t ∈ I²_t, Q_t = N - 1 and L_t = M(Q_t + 1) = MN. In the following, we consider the one-to-one correspondence between an integer p (= 1, 2, ..., L_t) and a pair of integers (m, n) (m = 0, 1, ..., M-1; n = R_t - Q_t, R_t - (Q_t - 1), R_t - (Q_t - 2), ..., R_t):

p = ξ¹(m, n) (t ∈ I¹_t),   p = ξ²(m, n) (t ∈ I²_t)   (7)

where the number of pairs (m, n) is L_t = M(Q_t + 1) and is identical with the number of values of p. For simplicity, we sometimes write p = ξ(m, n) instead of p = ξ¹(m, n) or p = ξ²(m, n). The approximation error between f(τ - d) and g(τ) is defined by e(τ) = f(τ - d) - g(τ). Further, let E_max(τ) be the upper limit of |e(τ)| obtained by fixing all the interpolation functions φ_{m,n}(τ) and letting f(τ) range over all the signals in Γ:

E_max(τ) = sup_{f(τ)∈Γ} { |e(τ)| }   (8)

In the following, we frequently consider a time or frequency function h(τ, t) ←→ H(u, t) with a parameter t, where ←→ denotes a Fourier transform pair with respect to τ and u. Further, we denote by γ the operator defined by h(t, t) = γh(τ, t). Now, we consider new functions φ_{m,n}(τ, t) (m = 0, 1, ..., M-1; n = 0, ±1, ±2, ...) with a variable τ and a parameter t. We assume that

φ_{m,n}(τ, t) = 0   (n < R_t - Q_t or n > R_t)   (9)

Moreover, we assume that the interpolation functions satisfy φ_{m,n}(t) = γφ_{m,n}(τ, t). From Eq.(9), when n does not satisfy R_t - Q_t ≤ n ≤ R_t, the interpolation functions φ_{m,n}(t) = φ_{m,n}(t, t) vanish. Hence, it is necessary to confirm that this condition does not contradict the constraint shown in Eq.(5). Now, recall that R_t T ≤ t < (R_t + 1)T. Further, we consider the range of t satisfying R_t - Q_t ≤ n ≤ R_t for a given integer n. If t = nT, then R_t = n holds, and this gives the minimum value of t. If nT + NT ≤ t < nT + NT + r, then R_t = n + N = n + Q_t holds and t ∈ I¹_t = {t | R_t T ≤ t < R_t T + r} for this R_t. In this case, n = R_t - Q_t holds, which gives the supremum value t = nT + NT + r. When t is in the range nT + (N - 1)T + r ≤ t < nT + NT, R_t = n + (N - 1) = n + Q_t holds and t ∈ I²_t = {t | R_t T + r ≤ t < (R_t + 1)T} for this R_t. In this case, n = R_t - Q_t also holds. But the supremum value of t, that is, (R_t + 1)T = (n + N)T, is not larger than the previous supremum value t = nT + NT + r. In conclusion, the interpolation functions φ_{m,n}(t) = φ_{m,n}(t, t) have meaningful values in nT ≤ t ≤ nT + NT + r, which does not contradict the constraint of Eq.(5). As shown in [2], for a given t, we have
E_max(t) = √A [ (1/2π) ∫_{-∞}^{+∞} |W(w)|² | e^{jw(t-d)} - Σ_{m=0}^{M-1} Σ_{n=R_t-Q_t}^{R_t} φ_{m,n}(t) H_m(w) e^{jwnT} |² dw ]^{1/2}   (10)

(t ∈ I^k_t)   (k = 1 or 2)   (11)
Let Ω_t be the set of pairs (m, n) composed of m and n satisfying m = 0, 1, ..., M-1 and n = R_t - Q_t, R_t - (Q_t - 1), R_t - (Q_t - 2), ..., R_t. Minimizing Eq.(11) is straightforward, as shown in [2]. Firstly, we expand E_max(t)² with respect to the φ_{m,n}(t) under consideration, differentiate E_max(t)² with respect to the complex conjugates of the interpolation functions φ_{m,n}(t) which actually contribute to the approximation at the prescribed t, and set the resulting formulas to zero, that is, ∂E_max(t)²/∂φ̄_{m,n}(t) = 0, where m = 0, 1, ..., M-1 and n = R_t - Q_t, R_t - (Q_t - 1), R_t - (Q_t - 2), ..., R_t. Further, let φ_{m,n}(t) (m = 0, 1, ..., M-1; n = 0, ±1, ±2, ...) be the optimum interpolation functions which minimize E_max(t). Then it is proved that there exists a set of functions φ_m(t) (m = 0, 1, ..., M-1) satisfying φ_{m,n}(t) = φ_m(t - nT) (m = 0, 1, ..., M-1; n = 0, ±1, ±2, ...). Moreover, E_max(t) = E_max(t + T) holds for the E_max(t) which uses these optimum interpolation functions. Hence, g(t) is expressed by g(t) = Σ_{m=0}^{M-1} Σ_{n=-∞}^{+∞} f_m(nT) φ_m(t - nT). Then, Eq.(5) can be expressed equivalently by |φ_m(t)| = 0 (t < 0, t > A) (m = 0, 1, ..., M-1). Using these relations, if we perform the above operation for t satisfying 0 ≤ t < T, we can obtain all the functional forms of the interpolation functions.
2 THE OPTIMUM APPROXIMATION
Let τ and u be a pair of time and frequency variables. Now, we extend the above discussion. Let ĝ(τ) = v[{f_m(nT)}; τ] be a linear/nonlinear approximation of f(τ). We assume that ĝ(τ) uses the sample values f_m(nT) (m = 0, 1, ..., M-1; n = R_t - Q_t, R_t - (Q_t - 1), R_t - (Q_t - 2), ..., R_t) when τ is equal to t. We assume that v[{f_m(nT)}; τ] vanishes when τ = t holds and all the f_m(nT) (m = 0, 1, ..., M-1; n = R_t - Q_t, R_t - (Q_t - 1), ..., R_t) are zero. For arbitrary f(τ) in Γ, we assume that there exist f(τ, t) and g(τ, t) satisfying f(t) = γf(τ, t) and g(t) = γg(τ, t). Since the error ê(τ) = f(τ - d) - ĝ(τ) depends on the signal f(τ, t), we express the error as a functional of f(τ, t). We denote by d[ê(τ)] a function/a functional/an operator of ê(τ). We assume that d[ê(τ)] has a non-negative value. Moreover, let Θ be a subset of the set of signals Γ. Then, consider the following measure of error E_Θ(τ) for a signal f(τ) in Θ.
E_Θ(τ) = sup_{f(τ)∈Θ} { d[ê(τ)] }

With respect to E_Θ(τ), we naturally assume that E_{Θ₁}(τ) ≤ E_Θ(τ) holds for every set of signals Θ₁ satisfying Θ₁ ⊆ Θ. Further, let E(τ) = E_Γ(τ) be the objective measure of error to be minimized. We consider a new inner product and norm such as

(B(u), C(u)) = (1/2π) ∫_{-∞}^{+∞} |W(u)|^{-2} B(u) C̄(u) du,   ||B(u)|| = [ (1/2π) ∫_{-∞}^{+∞} |W(u)|^{-2} |B(u)|² du ]^{1/2}

respectively, where B(u) and C(u) are arbitrary functions satisfying ||B(u)|| < +∞ and ||C(u)|| < +∞. Further, we assume that all the functions H_m(u)e^{junT} ((m, n) ∈ Ω_t) are independent of each other. Then, applying the Schmidt orthogonalization algorithm to the set of functions H_m(u)e^{junT} ((m, n) ∈ Ω_t), we can derive a set of orthonormal bases {v_p(u, t)} (p = 1, 2, ..., L_t). Now, we consider that
v_p(u, t) = Σ_{q=1}^{L_t} a^t_{p,q} H_i(u) e^{jukT},   H_m(u) e^{junT} = Σ_{q=1}^{L_t} b^t_{p,q} v_q(u, t)

are the corresponding orthonormal bases and expansions, where a^t_{p,q} and b^t_{p,q} are complex coefficients with the parameter t, p = ξ(m, n) (p = 1, 2, ..., L_t; (m, n) ∈ Ω_t) and q = ξ(i, k) (q = 1, 2, ..., L_t; (i, k) ∈ Ω_t). Further, let us temporarily consider the functions Ψ_{m,n}(τ, t) constructed from these bases. Moreover, let (m, n) and (r, s) ∈ Ω_t, and let q = ξ(m, n) and l = ξ(r, s). Then the following relations hold:

Σ_{q=1}^{L_t} b^t_{l,q} a^t_{q,p} = 1 (l = p, that is, (r, s) = (m, n)),   = 0 (l ≠ p, that is, (r, s) ≠ (m, n))

Now, consider the following function:

z(τ, t) = Σ_{p=1}^{L_t} f_m(nT) Ψ_{m,n}(τ, t),   p = ξ(m, n)
Let Z(u, t) be the Fourier spectrum of z(τ, t) with respect to τ and u. Then, it is proved that Z(u, t) is contained in the set Θ_A defined initially. Now, we define e(τ, t) = f(τ - d) - z(τ, t) and E(u, t) = F(u) - Z(u, t). Obviously, E(u, t) is the Fourier spectrum of e(τ, t) with respect to τ and u. Then, we obtain the following two theorems.
Theorem 1. For any f(τ) in Γ, suppose that f(τ) ←→ F(u). Then, for any (m, n) ∈ Ω_t, we have

f_m(nT) = (1/2π) ∫_{-∞}^{+∞} H_m(u) F(u) e^{junT} du = (1/2π) ∫_{-∞}^{+∞} H_m(u) Z(u, t) e^{junT} du,

that is, (1/2π) ∫_{-∞}^{+∞} H_m(u) E(u, t) e^{junT} du = 0, and

(1/2π) ∫_{-∞}^{+∞} |W(u)|^{-2} |F(u)|² du = (1/2π) ∫_{-∞}^{+∞} |W(u)|^{-2} |Z(u, t)|² du + (1/2π) ∫_{-∞}^{+∞} |W(u)|^{-2} |E(u, t)|² du   (18)

Proof: The first equation is obvious. Further, expanding H_m(u)e^{junT} in the orthonormal bases {v_q(u, t)} and using the fact that {v_p(u, t)} is a set of orthonormal functions for the previous inner product, we can derive

(1/2π) ∫_{-∞}^{+∞} H_m(u) Z(u, t) e^{junT} du = (1/2π) ∫_{-∞}^{+∞} F(v) H_m(v) e^{jvnT} dv

and, similarly,

(1/2π) ∫_{-∞}^{+∞} |W(u)|^{-2} Z(u, t) F̄(u) du = (1/2π) ∫_{-∞}^{+∞} |W(u)|^{-2} Z(u, t) Z̄(u, t) du.
Hence, we can easily derive Eq.(18). (QED). We define Γ and Γ₁ as the set of f(τ, t) for all values of t (-∞ < t < ∞) and the set of f(τ, t) with respect to which all the f_m(nT, t) ((m, n) ∈ Ω_t) are zero for every t (-∞ < t < ∞), respectively. We assume f(τ, t) ∈ Γ for each t. Recall that f(τ) is not necessarily band-limited. Hence, Γ₁ is not empty in general. Further, let Γ₀ be the set of the functions e₀(τ, t) = f(τ - d) - z(τ, t). We adopt z(t) = z(t, t) = γz(τ, t) as the presented approximation. Then, the approximation error is e₀(t) = f(t - d) - z(t) = f(t - d) - z(t, t) = γe₀(τ, t). We denote by E₀(t) the corresponding E_Γ(t) for e₀(t). Then, from the above theorems, if we fix t and consider e₀(τ, t) as the input signal, we can easily recognize that the following three conditions hold: (g) Γ₀ ⊆ Γ₁ ⊆ Γ, (h) e₀(τ, t) = γ[e₀(τ, t)], (i) z(τ, t) = 0 if all the f_m(nT, t) ((m, n) ∈ Ω_t) are zero. Hence, we have
E(t) = E_Γ(t) = sup_{f(τ,t)∈Γ} { d[ê(t)] } = sup_{f(τ,t)∈Γ} { d[γê(τ, t)] } ≥ sup_{f(τ,t)∈Γ₁} { d[γê(τ, t)] } = sup_{f(τ,t)∈Γ₁} { d[γ{f(τ - d, t - d) - ĝ(τ, t)}] } = sup_{f(τ,t)∈Γ₁} { d[f(t - d)] }

E₀(t) = sup_{f(τ,t)∈Γ} { d[e₀(t)] } = sup_{f(τ,t)∈Γ} { d[γe₀(τ, t)] } = sup_{e₀(τ,t)∈Γ₀} { d[γe₀(τ, t)] } = sup_{e₀(τ,t)∈Γ₀} { d[γ[e₀(τ, t)]] } ≤ sup_{f(τ,t)∈Γ₁} { d[f(t - d)] }   (19)
Hence, z(t) gives the minimum E₀(t) among all the E(t) under consideration. This analysis includes the discussion for the measure of error e_max(t). Hence, the concrete functional derivation of the interpolation functions is the same as in the previous discussion, and the functional forms of the interpolation functions are also the same as the previous ones.
3 CONCLUSION
It should be noted that the proposed generalized interpolatory approximation has the minimum measure of error, in a certain sense, among all the linear/nonlinear approximations using the same sample values of the signal. The presented design gives a simple way to obtain the optimum analysis/synthesis filter banks. Finally, we would like to express our sincere thanks to Professor B. G. Mertzios, Democritus University of Thrace, Greece.
References
[1] P. P. Vaidyanathan: Multirate Systems and Filter Banks, PTR Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1993.
[2] T. Kida, L. P. Yoshioka, S. Takahashi and H. Kaneda: "Theory on Extended Form of Interpolatory Approximation of Multi-dimensional Waves", Elec. and Comm. Japan, Part 3, 75, No. 4, pp. 26-34, 1992. Also, Trans. IEICE, Japan, Vol. 74-A, No. 6, pp. 829-839, 1991 (in Japanese).
[3] Takuro Kida: "The Optimum Approximation of Multi-dimensional Signals Based on the Quantized Sample Values of Transformed Signals", submitted to IEICE Trans. E77-A, 1994.
DETERMINATION OF OPTIMAL COEFFICIENTS OF HIGH ORDER ERROR FEEDBACK UPON CHEBYSHEV CRITERIA

A. Djebbari(a), Al. Djebbari(b), M. F. Belbachir(c), and J. M. Rouvaen(a)
(a) IEMN-Dept. OAE UMR 9929 CNRS, 59304 Valenciennes, France
(b) Signal and Systems Lab., Univ. of Sidi-Bel-Abbes, 22000, Algeria
(c) Signal and Systems Lab., USTO BP1505 Oran, Algeria

Abstract

An efficient design method is proposed for error feedback digital filters, to reduce quantization noise in direct-form IIR filter realisations. The method is based on minimising a weighted Chebyshev error between the noise-free desired filter and the designed one using a Remez loop. The noise power of the designed filter is lower than the initial one.

1 Introduction

Error feedback is a general method used to reduce the parasitic effects due to finite word length in internal digital filter computations. This technique has been applied with success to infinite impulse response (IIR) filters using fixed-point arithmetic, particularly for implementing low-pass filters with poles near the unit circle [1]. Error feedback is performed by extracting the error signal directly after each product term and re-injecting it through a simple finite impulse response (FIR) filter [2]. This process does not modify the filter specifications, the transfer function being unchanged: it acts only on the noise component of the output signal. In a recent work, Laakso proposed optimal and suboptimal error feedback filters designed by minimizing the mean squared error with LMS-type algorithms, applying this process to reduce the quantization noise in high-order direct-form (type I) IIR filters [2]. In this paper, we propose a new noise reduction method based on the determination of optimal error feedback filter coefficients via a Chebyshev criterion. We then give the main results for the noise power reduction obtained on a particular example.

2 Optimal error feedback

Let us consider an IIR filter of order N with some kind of non-linear behaviour (rounding, truncation of the absolute value, truncation of two's complement values after each addition, ...), whose transfer function G(z) is stable.
2-1 Formulation of the problem

An error feedback of order K is applied as shown in figure 1. The output signal y(n) is given by:

y(n) = G(z).x(n) + B(z).G(z).e(n)   (1)

where

B(z) = 1 + β₁ z⁻¹ + β₂ z⁻² + ... + β_K z⁻ᴷ   (2)

B(z) and e(n) being, respectively, the error feedback filter transfer function and the quantization error. The FIR filter B(ω) may exhibit a symmetric or antisymmetric impulse response, and the number K of its coefficients may be even or odd, which leads to the four classical cases [3] for FIR filters considered in the following. The error feedback filter gain is given by:

|B(ω)| = Q(ω).P(ω)   (3)

P(ω) = Σ_{n=0}^{J-1} α_n cos(nω)   (4)

and the coefficients β in equation (2) are bound to those α in equation (4) by the following relations:

• symmetric filter and K = 2M+1 odd (case 1): Q(ω) = 1, J = M+1; β_M = α₀, 2β_{M-k} = α_k ; k = 1, ..., M   (5)
Fig. 1 Structure for error feedback.

• symmetric filter and K = 2L even (case 2): Q(ω) = cos(ω/2), J = L; 2β_{L-1} = α₀ + α₁/2, 2β_{L-k} = (α_{k-1} + α_k)/2 ; k = 2, ..., L-1, 2β₀ = α_{L-1}/2   (6)

• antisymmetric filter and K odd (case 3): Q(ω) = sin(ω), J = M; 2β_{M-1} = α₀ - α₂/2, 2β_{M-k} = (α_{k-1} - α_{k+1})/2 ; k = 2, ..., M-2, 2β₁ = α_{M-2}/2, 2β₀ = α_{M-1}/2   (7)

• antisymmetric filter and K even (case 4): Q(ω) = sin(ω/2), J = L; 2β_{L-1} = α₀ - α₁/2, 2β_{L-k} = (α_{k-1} - α_k)/2 ; k = 2, ..., L-1, 2β₀ = α_{L-1}/2   (8)

Equation (2) gives the supplementary condition β₀ = 1. Our goal is to compensate for the error e(n) by reducing the modulus of the error transfer function B(z).G(z) in equation (1) to unity, the corresponding residual error being written as 1 - |G(ω)|.|B(ω)|. Such a problem may be solved by a simple modification of the classical Parks-McClellan program, based on Remez's algorithm, which is classically used for optimal FIR filter design. For this purpose, a weighted Chebyshev error is used, given by:

E(ω) = W(ω).[ D(ω) - P(ω) ]   (9)

with D(ω) = 1 / [ Q(ω).|G(ω)| ] the desired function and W(ω) the weighting function.
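The desired function D(ω) = 1/[Q(ω).|G(ω)|] of Eq. (9) can be evaluated on a frequency grid as sketched below, here for case 1 where Q(ω) = 1. The first-order IIR filter used is only a toy example, not the paper's filter; `freq_response` is an assumed helper.

```python
import numpy as np

def freq_response(b, a, w):
    """G(e^{jw}) = B(e^{jw}) / A(e^{jw}) for coefficient vectors b, a."""
    z = np.exp(-1j * np.outer(w, np.arange(max(len(b), len(a)))))
    num = z[:, :len(b)] @ np.asarray(b)
    den = z[:, :len(a)] @ np.asarray(a)
    return num / den

w = np.linspace(1e-3, np.pi, 512)     # frequency grid over (0, pi]
b, a = [1.0], [1.0, -0.9]             # toy low-pass IIR: pole at z = 0.9
G = freq_response(b, a, w)
Q = np.ones_like(w)                   # case 1 (symmetric, K odd): Q(w) = 1
D = 1.0 / (Q * np.abs(G))             # desired function of Eq. (9)
```

For a low-pass G, D grows towards the stop-band, so the Remez loop naturally pushes the error-feedback gain |B(ω)| to be large exactly where |G(ω)| is small.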
2-2 Algorithm description

Optimal coefficients for B(z) are obtained using the following Remez-type algorithm:
• Read the IIR filter coefficients, the type of error feedback FIR filter and its order.
• Define the desired and weighting functions.
• Select extrema over the interval [0, π]: Ω = { ω₁, ω₂, ..., ω_J }.
• Solve the system of equations: E(ω_j) = W(ω_j).[ D(ω_j) - P(ω_j) ] = (-1)ʲ.δ with δ = max_{ω∈Ω} |E(ω)| and j = 1, ..., J.
• Search over [0, π] for the J local extrema of E(ω) with greatest absolute values, with the condition that these maxima must alternate. Save the abscissas of these extrema into Ω' = { ω'₁, ω'₂, ..., ω'_J }.
• If |ω_j - ω'_j| < ε for all j = 1, 2, ..., J, proceed to the next step; else set Ω = Ω' and return to the fourth step.
• Compute the coefficients β_j using the relations given in equations (5) to (8).
3 Numerical results.
Let us consider as an example the sharp low-pass filter H1 with transfer function:
H1(z) = [1.0000 + 0.7409 z^-1 + 2.1045 z^-2 + 1.5635 z^-3 + 2.1045 z^-4 + 0.7409 z^-5 + 1.0000 z^-6] / [1.0000 − 4.1139 z^-1 + 8.1026 z^-2 + 9.4512 z^-3 + 6.8370 z^-4 + 2.9064 z^-5 + 0.5739 z^-6]   (10)
This filter has a noise power of 43.06 dB and its power spectrum is shown in figure 2 by the curve marked with the symbol 0. The coefficients of the optimal error feedback FIR filters with orders from 2 to 10 have been determined using the algorithm presented above (and for the four types of FIR filters). We have also computed the noise power, defined as:
σ² = (1/2π) · ∫ |H1(ω)|² · |B(ω)|² dω   (11)
Our results show a significant reduction in the noise power. For filter H1, we get figures of 5.24 dB and 5.29 dB for error feedback filters that are, respectively, symmetric of the 10th order:
β_0 ... β_5 = 1; −4.5652; 8.5235; −6.5957; −1.7839; 6.8569, with β_{10−j} = β_j ; j = 0, ..., 4
and antisymmetric of the 8th order:
β_0 ... β_4 = 1; −4.5881; 8.9127; −8.0078; 0, with β_{8−j} = −β_j ; j = 0, ..., 3
The error signal power spectrum is given in figures 2 and 3, where the reduction of the noise in the pass-band of filter H1, obtained by increasing the error feedback filter order, can be observed.
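The noise power integral of equation (11) lends itself to direct numerical evaluation. The sketch below, with hypothetical helper names `freq_response` and `noise_power`, approximates the integral over one period by a rectangle rule; the test checks a simple first-order filter whose noise power is known in closed form (4/3), rather than the paper's H1.

```python
import cmath
import math

def freq_response(b, a, w):
    """H(e^{jw}) for H(z) = (sum_k b_k z^-k) / (sum_k a_k z^-k)."""
    z = cmath.exp(-1j * w)
    num = sum(bk * z**k for k, bk in enumerate(b))
    den = sum(ak * z**k for k, ak in enumerate(a))
    return num / den

def noise_power(b, a, beta, npts=4096):
    """sigma^2 = (1/2pi) * integral of |H(w)|^2 |B(w)|^2 over one period,
    where B is the FIR error feedback filter with coefficients beta.
    Evaluated by the rectangle rule on npts equispaced frequencies."""
    acc = 0.0
    for i in range(npts):
        w = 2 * math.pi * i / npts
        H = freq_response(b, a, w)
        B = freq_response(beta, [1.0], w)
        acc += abs(H) ** 2 * abs(B) ** 2
    return acc / npts
```

For H(z) = 1/(1 − 0.5 z^-1) and B(z) = 1, Parseval's relation gives σ² = Σ 0.25^n = 4/3, which the rectangle rule reproduces.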
Fig. 2 Noise spectrum of filter H1 for symmetric error feedback filters with orders 2, 5, 6. f: reduced numeric frequency; 0 => no error feedback filter.
Fig. 3 Noise spectrum of filter H1 for antisymmetric error feedback filters with orders 3, 5, 8. f: reduced numeric frequency; 0 => no error feedback filter.
The filter H1 is only one of several examples (low-pass, band-pass, high-pass) that have been studied using our algorithm. In all cases, we have been able to design efficient error feedback filters and to attain very significant reductions in the noise power.
Comparison with the LMS optimization method. The method based on LMS optimization [2] gives error feedback filters with coefficients very similar to those (of the same order) given by our method, and therefore nearly identical performance in noise power reduction. This has been verified on all the examples considered in ref. [2].
However, in the LMS method the order K of the error feedback filter is limited by the order N of the IIR filter to be corrected. In our case, this limit may be overcome, with the consequence of a further (and sometimes significant) improvement in noise power reduction. The price to be paid is naturally an increase in the error feedback filter complexity in terms of the number of multipliers and delay cells (so the overall filter delay and the physical size increase too).
Conclusion. A new error feedback filter design method has been proposed, with the goal of reducing the quantization noise in high order recursive filters. This method is based on a Chebyshev criterion. A sharp low-pass filter has been studied as an example. The noise power attained after correction clearly shows the effectiveness of the method.
References
[1] L.B. Jackson, "Round-off analysis for fixed-point digital filters realised in cascade or parallel form", I.E.E.E. Trans. Audio Electro-Acoust., Vol. AU-18, pp. 107-122, June 1970.
[2] T. Laakso and I. Hartimo, "Noise reduction in recursive digital filters using high order error feedback", I.E.E.E. Trans. Signal Proc., Vol. 40, pp. 1096-1107, May 1992.
[3] L.R. Rabiner and B. Gold, "Theory and application of digital signal processing", Prentice Hall, New York, 1975.
Invited Session R: VLSI DSP ARCHITECTURES
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
Dynamic Codewidth Reduction for VLIW Instruction Set Architectures in Digital Signal Processors
Matthias H. Weiss and Gerhard P. Fettweis
Mobile Communications Systems, Dresden University of Technology, 01062 Dresden, Germany
{weissm,fettweis}@ifn.et.tu-dresden.de
Abstract - The design of an instruction set architecture (ISA) plays an important role for both exploiting processor resources and providing a common software interface. Three main classes of ISAs can be distinguished: CISC (Complex Instruction Set Computer), RISC (Reduced Instruction Set Computer), and VLIW (Very Long Instruction Word). They differ mainly in assembler and compiler support, pipeline control, hardware requirements, and code density. Comparing these architectures for use in a DSP, the VLIW architecture proves to be very advantageous. However, its main disadvantage in many applications is code size explosion. To reduce code size, a method called tagged VLIW (TVLIW) is presented. Dividing the instruction set into control/move and arithmetic instructions, a different usage of functional units can be observed. The first set only requires the parallel execution of a limited number of functional units. The second set, though requiring several functional units in parallel, is mostly used inside loops. With our proposed method, the instruction word is dynamically assembled using low-complexity, highly regular decoding hardware. Inside loops, the full VLIW functionality is supported by cache methods.
1. INTRODUCTION
Three main classes of ISAs are applied in microprocessors: CISC, RISC, and VLIW. They mainly differ in assembler and compiler support, pipeline control, hardware requirements, and code density. Comparing these architectures for use in a DSP, the VLIW architecture proves to be very advantageous in terms of processing performance. However, DSPs often contain a CISC ISA, mainly for code density and compatibility reasons. Recently, RISC ISAs have also been applied, e.g. in the Hx24 by Hitachi or the Lode by TCSI. This paper presents a first step towards an efficient use of the VLIW ISA. By employing a tagged VLIW ISA, the advantages in processing power are retained while code explosion is avoided. Besides code compactness, CISC ISAs provide the assembly programmer with a wide variety of instructions. Since programming is done at instruction level (see explanation in Fig. 1), the hardware architecture does not have to be known in full detail at implementation time, allowing different hardware implementations to be object code compatible. However, for the same reason hardware resources cannot be fully exploited. Furthermore, CISC ISAs only support a decoding pipeline but no deep execution pipeline, since the instructions are too heterogeneous. The RISC ISA, on the other hand, consists of homogeneous instructions in terms of pipeline properties. This can be achieved by splitting complex instructions into several small instructions. Therefore, the hardware decoding complexity can be reduced to achieve a cycle-per-instruction (CPI) count close to 1 [HePa90]. Superscalar architectures lead to an increase in decoding hardware again, e.g. for hazard resolution or scoreboarding for out-of-order execution. Hence, code execution speedup is carried out mostly by hardware support and not by compiler optimizations.
Fig. 1: Instruction vs. cycle level in pipelined architectures
To support compiler optimizations for superscalar architectures, a horizontal or VLIW ISA can be applied. Due to time-stationary pipelining (i.e. pipeline control at cycle level [Kogg81]), both the programmer and the compiler are given full control over the pipeline, at the cost of a code size increase, e.g. a 128 bit code width in the VIPER architecture [Gray93]. This prohibits this type of ISA in the main application field of fixed-point DSPs. The tagged VLIW scheme proposed in this paper reduces the code size requirements by assembling the VLIW dynamically. This method is based on the distinction between in-line and in-loop code. While the former requires only limited parallelism, the latter can be supported by a simple cache. Thus, the properties of VLIW can be exploited without code size explosion. This paper is organized as follows. In Section 2, the properties of the VLIW architecture in DSPs are explained in more detail. In Section 3, the dynamic instruction word coding using the TVLIW scheme is described. In Section 4, to demonstrate its applicability, the scheme is applied to the AT&T DSP16 architecture.
2. APPLYING VLIW TO DSP ARCHITECTURES
A VLIW architecture consists of several independent functional units (FU) controlled by one instruction word (IW) and connected by a fairly complex bus system (Fig. 2). In a DSP architecture these FUs are the Program Control Unit (PCU), Address Generation Units (AGU), Datapath Units (DPU), I/O-Units (IOU) etc. [Vanh92]. In some floating-point DSPs, VLIW architectures are already applied [Madi95]. These DSPs are usually used for high-performance applications, which require a high degree of flexibility. On the other hand, in high volume and low power products, e.g. in mobile communications, fixed-point DSPs are employed.
Since they are typically programmed in assembly languages, and code density is an important issue, they often contain a CISC ISA with either data-stationary (e.g. TI C54x, Motorola DSP5630x) or time-stationary (e.g. AT&T DSP16, NEC 7701x) pipeline control. However, with the advance of processing power requirements, concurrent processing must be supported. Most DSP algorithms have inherent parallelism [Kung88], which can be exploited by replicating arithmetic units. This is not necessarily restricted by limited memory bandwidth. Due to the locality of the algorithms, duplicating arithmetic units does not necessarily lead to multiport memory architectures and thus can be applied in fixed-point DSPs as well, as shown in [Fett96]. Furthermore, given the demand for stronger compiler support [Zivo95], an ISA combining flexibility (offered by a VLIW ISA) and code density (offered by a CISC ISA) is required. The main drawback of employing a VLIW ISA is, besides complicated assembly coding, the code size increase. The main reason for the code size increase is the independent control of concurrent FUs. For maximum flexibility, a VLIW ISA must support all permutations of all FUs.
Fig. 2: Example of a VLIW DSP-Architecture
Fig. 3: Tagged VLIW Instruction Decoder
However, not all instructions can exploit the VLIW ISA's full functionality. Program control and move instructions, for instance, require only a limited number of FUs at one time. These instructions are mainly applied in in-line code. Thus, for in-line code the full VLIW functionality is not required. In-loop code, on the other hand, mainly consists of arithmetic and logic instructions, which typically require several FUs concurrently and thus VLIW's full functionality. On the DSP architecture's side, loops are already supported, e.g. by zero-cycle hardware loop counters and cache mechanisms. Furthermore, compilers support loops as well by applying techniques such as loop unrolling, software pipelining, and trace scheduling, especially developed for VLIW architectures [HePa90]. Hence, the VLIW ISA's full functionality must be enabled within loops. This can be achieved by employing our TVLIW scheme.
3. DYNAMIC INSTRUCTION WORD CODING BY THE TVLIW SCHEME
TVLIW supports the different requirements of in-line and in-loop instructions by assembling the VLIW dynamically. As shown in Fig. 2, the very long instruction word (VLIW) consists of a number of functional unit instruction words (FIW). Each FIW controls the associated FU independently from the remaining FIWs. Thus, the whole VLIW can control several FUs concurrently. The idea of the TVLIW scheme is to assemble the actual VLIW out of a limited number of FIWs (Fig. 3). If the full functionality of the VLIW is required, this assembly may require several cycles. However, the instructions that require the full parallelism of the VLIW mainly occur within loops. With the help of a loop cache, this overhead is only necessary during the first iteration. The TVLIW scheme is based on two assumptions:
• Equal FIW width: All FUs require a common instruction word width. This can be achieved by designing the FIW for a given TVLIW width, since FIWs can be fully decoded if necessary.
• Limited parallelism in in-line code: If parallelism could be fully exploited all the time, this scheme would not be applicable. However, as shown below, this is usually not the case for in-line instructions. In-loop instructions, on the other hand, are supported by cache mechanisms.
While the first assumption is verified by the case study in section 4, the second assumption is checked in more detail by the following examination of a DSP's instruction set.
A. Classification of the Instruction Set
While the in-loop code mainly consists of arithmetical/logical (AL) instructions, including memory accesses, the in-line code mainly consists of move instructions, including register-register and register-memory transfers, and program control instructions, including jumps, branches, calls etc. Program control/move instructions do not require all FUs at the same time. To show this in more detail, the FU usage of some program control and move instructions is given in table 1. It can be seen that at instruction level these instructions typically use only one or two FUs. Note that besides FUs, immediate fields also need to be considered. Due to time-stationary pipeline control, the usage of FUs at cycle level must be considered. As an example, a pipelined machine with a one-cycle memory latency is assumed. In table 2, the first four instructions of table 1 are assumed to appear in sequential order. Thus, each column represents the actual usage of FUs at cycle level. As on instruction level, on cycle level only one or two FUs are used at one time. By applying the same method to AL instructions (table 3 and table 4), several FUs are used on both instruction and cycle level. Taking this behavior into account, our current implementation of TVLIW discussed below supports the independent control of two FUs at one time without assembly loss.
B. Overview
The block diagram of the TVLIW decoder is shown in Fig. 3. The TVLIW consists of a class field (IWC), mainly indicating how many TVLIWs the actual VLIW has to be assembled from, and two tag fields (F#), indicating which FU should be controlled by the associated FIW. The output is the actual VLIW, which consists of the coded FIWs and nops otherwise. For instructions requiring the full VLIW's processing power, multiple TVLIW instructions are collected to build one VLIW. During the first iteration of a loop, the actual VLIW is stored in a wide cache, from which the instructions can be read during the next iterations.
C. Functional Description
During the programming process, a VLIW is assumed. The resulting intermediate object code consists of a set of VLIWs, each containing a number of independent FIWs. The main task of the following assembler pass is to reduce each VLIW to one or more TVLIWs by using the different instruction classes supported by the TVLIW: single IW, multiple IW, insert IW, and end IW. The single IW class indicates that the current VLIW uses two FUs at the most, except for the following case. If the current VLIW contains the same FIWs as the following one, the current VLIW is a subset of the following. Thus, the current VLIW will be executed and also stored to be used by the next VLIW. This is indicated by the insert IW class. If the current VLIW uses more than two FUs, the VLIW must be assembled sequentially. Therefore, the current VLIW is divided into a set of TVLIWs, each containing two FIWs at the most. If the preceding TVLIW was an insert IW, this TVLIW is removed from the set. All remaining TVLIWs except for the last are indicated by the multiple IW class, while
Table 1: FU usage by program control and move instructions (a. examples are written in a C-like notation)

  Instruction type           Example (a)           Usage of FUs
  program flow               return, icall, nop    PCU
  argumented program flow    call, branch, loop    PCU, IM
  memory <- register         *Xr1++ = Reg1         DP1, AG1
  register <- memory         Reg1 = *Yr1++         AG2, DP1
  register <- register       Acc = Reg1            DP1
  register <- constant       Reg1 = 7              DP1, IM

Table 2: FU usage over time for the program control and move instructions of table 1

Table 3: FU usage by arithmetic/logic (AL) instructions (a. PMA: Parallel Memory Access)

  Instruction type                            Example                     Usage of FUs
  AL instruction with PMA (a)                 Acc += Reg1 * *Xr1++        AG1, DP1
  AL instruction with 2 PMAs                  Acc += *Xr1++ * *Yr1++      AG1, AG2, DP1
  AL instruction with a constant and 1 PMA    Acc = Const * *Xr1++        AG1, IM, DP1
  parallel AL instructions with 2 PMAs        Acc1 += *Xr1++ * *Yr1++     AG1, AG2,
                                              || Reg2 = *Xr1++            DP1, DP2
                                              || Acc2 += *Xr1++ * Reg2

Table 4: FU usage for the arithmetic/logic instructions of table 3
the last is indicated by the end IW class. The end IW class is necessary to clear all previously stored FIWs. The insert IW class is introduced especially to support the coding of unrolled loops. In unrolled loops, previous instructions are often expanded by one or two further FIWs to yield the current instruction. In this case, previous instructions can be reused by the following ones.
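The core of this assembler pass can be sketched as follows. This is a minimal illustration under our own representation (a VLIW as a mapping from FU tag to FIW opcode); the insert IW optimization for unrolled loops is omitted for brevity.

```python
def split_vliw(vliw, max_fiws=2):
    """Split one VLIW (dict: FU tag -> FIW opcode) into a sequence of
    (class, fiws) TVLIWs, each carrying at most max_fiws FIWs.
    A VLIW using at most max_fiws FUs becomes a single IW; a wider
    VLIW becomes a run of multiple IWs closed by an end IW, which
    releases (and then clears) the assembled VLIW in the decoder."""
    items = sorted(vliw.items())
    if len(items) <= max_fiws:
        return [("single", dict(items))]
    chunks = [items[i:i + max_fiws] for i in range(0, len(items), max_fiws)]
    out = [("multiple", dict(c)) for c in chunks[:-1]]
    out.append(("end", dict(chunks[-1])))
    return out
```

For example, a VLIW controlling five FUs is emitted as two multiple IWs followed by one end IW, i.e. it costs three cycles on the first encounter (and one cycle afterwards, once held in the loop cache).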
D. Hardware Description The hardware requirements for a TVLIW decoder are inexpensive and highly regular. As shown in Fig. 4, the hardware structure can be divided into three parts. The IW itself consists of a two bit wide class field, an m-bit wide tag field for determining one out of 2^m FUs, and two n-bit wide FIWs controlling two independent FUs. In the first step, the control signals QCX, QAX, and QBX are generated from the decoded class and both tag fields, respectively. The tag signals control the crossbar unit, in which the n-bit wide FA- and FB-fields are routed to the appropriate intermediate busses F'x, x ∈ {0, ..., 2^m−1}. A nop is switched to the remaining 2^m − 2 intermediate busses, if the particular FU is not selected by QAX or QBX, respectively. In parallel, the class signals are used to determine the way of assembling the VLIW. The first multiplexer controls an n-bit wide register, storing the intermediate F'x in the multiple or insert IW case or being cleared in the end IW case. The final multiplexer switches either the intermediate F'x, the content of the register, or a nop to the actual Fx. The complete set of all Fx represents the actual VLIW. Thus, the hardware expenses are only one 2:4 decoder, two m:2^m decoders, 2^m n-bit wide registers, and 2^m n-bit wide 3:1 and 2:1 multiplexers. However, the crossbar unit may require some (though highly regular) wiring.
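A behavioral sketch of the decoder's assembly rules (not of the gate-level structure in Fig. 4) might look like the following; the nop encoding and the stream representation are our own assumptions.

```python
NOP = 0x00  # assumed nop encoding

def tvliw_decode(stream, num_fus=4):
    """Simulate the TVLIW decoder. stream is a list of (class, fiws)
    words, fiws being a dict FU index -> FIW opcode. 'multiple' words
    accumulate FIWs in the decoder registers; 'single' and 'end' words
    complete and emit one full VLIW (nops on unselected FUs), after
    which the stored FIWs are cleared; 'insert' words execute the
    current subset and keep it stored for the next VLIW."""
    regs = [NOP] * num_fus
    vliws = []
    for cls, fiws in stream:
        for fu, fiw in fiws.items():
            regs[fu] = fiw
        if cls in ("single", "end"):
            vliws.append(list(regs))
            regs = [NOP] * num_fus  # end IW clears the stored FIWs
        elif cls == "insert":
            vliws.append(list(regs))  # executes and stays stored
    return vliws
```

Note how `insert` matches the text above: the emitted VLIW remains in the registers, so the next word only has to supply the one or two FIWs by which it extends the previous instruction.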
E. Further Remarks To reduce both the IW width and the hardware cost of the TVLIW decoder, particularly of the crossbar switch, the combinations of FUs within one TVLIW can be limited. For instance, the same FU cannot be used twice in one TVLIW, and of the two permutations F1:F2 and F2:F1 only one has to be supported. Thus, the combinations Fn:Fm with n ≥ m do not have to be supported. If the separate tag fields are combined into one field, the sum Σ_{i=1}^{2^m} i of these combinations can be removed at the expense of a slightly more complex tag decoder. At the same time, restrictions of this kind can be used to simplify the crossbar switch. In the event of an interrupt, the contents of the decoder registers must be saved. This is necessary for restoring the current state if the interrupt occurs during a multiple or insert instruction. Thus, the interrupt service routine can use the whole VLIW as well. 4. CASE STUDY: APPLYING TVLIW TO AT&T's DSP16 To demonstrate the TVLIW scheme on a real-world example, we chose the DSP16 of AT&T. This DSP contains a well structured register set, a small 16 bit wide instruction set, a simple bus architecture with only one read/write bus and, above all, a time-stationary pipeline organization. By orthogonalizing the instruction set into separate FIWs, the dynamic instruction coding scheme can be applied.
A. Overview As can be seen from table 5, a FIW width of n = 8 bit is sufficient to support the DSP16's functionality.

Table 5: Permutations for all FUs of the AT&T DSP16

  Functional Unit    Permutations
  PCU-, X-Unit       90 + 3 < 2^8
  Load/Store         208 < 2^8
  Y-Unit             208 + 48 < 2^8
  DPU                4 + 224 < 2^8

Fig. 4: Hardware Structure of TVLIW Decoder

Additionally, m = 5 distinct FUs are required: a program control unit (PCU) including instructions for the X address generation unit (XAU),
a YAU, a load/store unit (LSU), a data path unit (DPU), and one short immediate field, used by all FUs. The load/store unit also requires a long immediate word. This is supported by dividing the long immediate word into two short words, one for the low and one for the high byte. To support the same functionality as the original instruction set, the TVLIW word should contain two FIWs. Thus, the TVLIW consists of two 8 bit wide FIWs, two 3 bit wide tag fields and a 2 bit wide class field, resulting in a width of 24 bit.
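The 24-bit budget (2 + 3 + 8 + 3 + 8 bits) can be checked with a small pack/unpack sketch. The field order used here, class | tag A | FIW A | tag B | FIW B from the most significant bit down, is our own assumption; the paper does not fix a bit ordering.

```python
def pack_tvliw(cls, tag_a, fiw_a, tag_b, fiw_b):
    """Pack the 24-bit TVLIW of the DSP16 case study:
    2-bit class | 3-bit tag A | 8-bit FIW A | 3-bit tag B | 8-bit FIW B."""
    assert 0 <= cls < 4 and 0 <= tag_a < 8 and 0 <= tag_b < 8
    assert 0 <= fiw_a < 256 and 0 <= fiw_b < 256
    return (cls << 22) | (tag_a << 19) | (fiw_a << 11) | (tag_b << 8) | fiw_b

def unpack_tvliw(w):
    """Inverse of pack_tvliw: recover (class, tag A, FIW A, tag B, FIW B)."""
    return (w >> 22, (w >> 19) & 0x7, (w >> 11) & 0xFF, (w >> 8) & 0x7, w & 0xFF)
```

Any field combination round-trips through the 24-bit word, confirming that the widths add up without overlap.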
B. Results This case study shows that the assumption of a common FIW width can be met. The resulting TVLIW ISA requires 8 additional bits to provide the same functionality as the original 16 bit ISA. However, by employing our scheme several FUs can be used independently, which is necessary for architectural enhancements. Using the TVLIW scheme, the architecture can be expanded, for instance, by a further bus requiring a further AGU, or even by another DPU, e.g. for supporting Galois Field arithmetic [Dres96], without modifying the TVLIW ISA. 5. CONCLUSIONS AND FUTURE WORK We presented a tagged VLIW (TVLIW) scheme, which dynamically assembles the full parallel VLIW. Thus, the advantages of VLIW architectures can be gained, while the main drawback, the code size explosion, can be avoided. We showed that the hardware cost of a TVLIW decoder is low. Hence, it is a good candidate for high-end fixed-point DSPs. By dividing the code into program control/move and arithmetic instructions, it was shown that only the latter instructions require the full parallel functionality of VLIW. Thus, TVLIW supports instructions which require only a limited number of functional units, to be executed within one cycle. Arithmetic instructions, on the other hand, are mainly used inside loops, where the TVLIW scheme can be supported by a simple VLIW cache. In future work, we will be analyzing more complex algorithms found in digital signal processing to gain detailed insight into the trade-off optimization between processing power and code size. Our current research concentrates on the impact of time-stationary coding on both the compiler and the hardware architecture, in particular on cache architectures which efficiently support TVLIW. 6. REFERENCES
[AT&T89] AT&T Inc., DSP16 and DSP16A User's Manual, 1989.
[Dres96] W. Drescher, "VLSI Architecture for Multiplication in GF(2^m) for Application Tailored Signal Processors", 1996 IEEE Workshop on VLSI Digital Signal Processing, 1996.
[Fett96] G. Fettweis et al., "Strategies in a Cost-Effective Implementation of the PDC Half-Rate Codec for Wireless Communications", VTC '96, vol. 1, pp. 203-207.
[Gray93] J. Gray, A. Naylor, A. Abnous and N. Bagherzadeh, "VIPER: a VLIW integer microprocessor", IEEE Journal of Solid-State Circuits, vol. 28, pp. 967-979, 1993.
[HePa90] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers Inc., 1990.
[Kung88] S.Y. Kung, VLSI Array Processors, Prentice Hall, 1988.
[Kogg81] P.M. Kogge, The Architecture of Pipelined Computers, 1981.
[Madi95] V.L. Madisetti, Digital Signal Processors, Butterworth-Heinemann, 1995.
[Vanh92] J. Vanhoof et al., High-Level Synthesis for Real-Time Digital Signal Processing: the Cathedral-II Silicon Compiler, Kluwer Academic Publishers, 1992.
[Zivo95] V. Zivojnovic, "Compilers for digital signal processors", DSP and Multimedia Technology, vol. 4, no. 5, pp. 27-45, July/August 1995.
Implementation Aspects of FIR filtering in a Wavelet Compression Scheme
G. Lafruit, B. Vanhoof, J. Bormans, M. Engels and I. Bolsens
IMEC, Kapeldreef 75, 3001 Leuven, Belgium
1. Abstract This paper analyzes the implications of some FIR filter implementation choices on the VLSI cost of the Wavelet transform. Because the number of multiplications involved in the Wavelet decomposition represents a serious bottleneck, we compare a number of techniques for reducing this number of multiplications. Traversing the search space along the minimal implementation cost path leads to the use of Sweldens' Lifting Scheme, applied to Wavelet filter banks with rational coefficients having integer numerators and power-of-two denominators.
2. Introduction The 2D Fast Wavelet Transform of an image represents the original image by a hierarchy of Wavelet Images (Detail Images and Average Images), corresponding to different quality or resolution levels. The image pyramid structure is generated by repeatedly filtering and subsampling the preceding image level, starting from the input image. The 2D filtering for each level is performed by combining a 1D lowpass filter L(n) of length M0+1 and a highpass filter H(n) of length M1+1, first horizontally and then vertically, as shown in fig. 1.
[Figure 1 diagram: each level applies Stage 1 (horizontal filtering, with subsampling on rows: keep one sample out of 2 in each row) followed by Stage 2 (vertical filtering, with subsampling on columns: keep one sample out of 2 in each column), producing the Average Image and the Detail Images of levels 1, 2, ...]
Figure 1: Data flow graph of the 2D-DWT with separable filters. The optimal implementation style for these 2D filtering modules is determined by the overall algorithmic specifications (e.g. the quantization model, fixed or variable filter coefficients). We therefore define a number of guidelines which are particularly useful in image compression systems using the Wavelet Transform. To satisfy the Area/Performance/Power constraints in ASIC design, one must take special care to avoid the use of area consuming hardware building blocks. Typically, a 16x16 bit multiplier and a 16 bit delay element represent respectively 2k and 160 equivalent NAND gates, while a 16 bit adder only represents a VLSI area cost of 120 equivalent NAND gates. RAM modules also carry a high area cost. For instance, a 32x112 bit single RAM module already represents 0.51 mm2 in a 0.5 µm 3-metal-layer MIETEC CMOS technology. For a dual-port RAM this grows to 2.56 mm2. Obviously, memory, register cells and hardwired variable multipliers should be avoided whenever possible.
3. Reduction of the register cost Two main styles exist for the implementation of algorithms with successive FIR filtering. In the folded implementation style, the calculation of all wavelet levels is mapped onto one processor, while in a digit-serial architecture each level has its dedicated processor with its own digit size, adapted to the multi-rate characteristics of the Wavelet Transform. The digit-serial implementations overcome the drawback of the less than 100 % hardware utilization of folded implementations. However, because of the decreasing digit sizes in the successive levels of digit-serial implementations, an inter-level data converter must transform the α-byte output format of each output sample into an α/2-byte input format for the next level, leading to a high number of additional area-consuming register cells [1]. The smaller number of registers required in a folded implementation largely compensates for the lack of 100% hardware utilization. As a consequence, a folded implementation is clearly preferred.
4. Filtering by subconvolutions Any vertical filtering involved in the Wavelet decomposition can be started only when max(M0, M1) lines from the horizontal filtering stage are available. For large images and/or Wavelet filter sizes, the VLSI area and power consumption cost of these delay line memories is high. In order to reduce this memory cost, the full image should be subdivided into smaller entities, which are processed separately. The filtering of a large image is thus subdivided into subconvolutions over smaller subimages of width L, by means of the overlap-save or overlap-add method [2], as shown in fig. 2. The classical mask-moving convolution technique, applied to the full image, is thus replaced by a block-based convolution technique. Within each block (subimage), the mask-moving convolution technique can still be used. The delay line memory cost for these subimages, processed in successive time slots, is clearly reduced.
Figure 2: The Overlap-Save method (a) uses memory in the input space of the convolution, while the Overlap-Add method (b) uses memory in its output space.
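As a concrete 1D illustration of such block-based convolution, the following sketch implements the overlap-save method for a FIR filter and checks it against a direct convolution; the subsampling and the 2D extension of the wavelet case are omitted, and the function names are our own.

```python
def direct_conv(x, h):
    """Full linear convolution, used as a reference."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def overlap_save(x, h, L):
    """Overlap-save block convolution: each output block of length L is
    computed from the L new input samples plus the len(h)-1 previously
    'saved' samples, so blocks share input memory but never output memory."""
    M = len(h) - 1
    xp = [0.0] * M + list(x)           # initial zero history
    y = []
    for s in range(0, len(x), L):
        block = xp[s:s + L + M]        # L new samples + M saved samples
        for n in range(min(L, len(x) - s)):
            y.append(sum(h[j] * block[M + n - j] for j in range(M + 1)))
    return y
```

Each subimage (here, block) only needs M saved samples of overlap, which is exactly the delay-line memory saving the section describes.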
5. The quantization model and the convolution implementation Some quantization models (e.g. Shapiro's Zero Tree Coding [3]) used in Wavelet image compression algorithms exploit the correlations between the different levels of the Wavelet pyramid (there exists a high probability that an edge in a detail image corresponds to an edge in the corresponding detail images of the next higher levels), leading to a possibly larger compression ratio gain. These quantization models introduce two additional constraints:
• The filters must be symmetric (linear phase).
• The calculation of the Wavelet pyramid should be performed "vertically" with reducing subimage sizes (see fig. 3) and not level by level with fixed subimage sizes (horizontal calculation scheme), in order to avoid multiple memory accesses for the quantization coding.
Figure 3: The vertical and horizontal calculation schemes, related to the quantization coding. For the Shapiro Zero Tree coding, vertical interrelations are exploited and the vertical calculation scheme must be adopted. These additional constraints have an impact on the choice of the block-based convolution method to be used, with regard to the number of multiplications. Indeed, for non-symmetric filters, the overlap-save and overlap-add methods exhibit the same performance, if we take care to avoid the "dummy" multiplications introduced by the zero-padding of the data blocks (subimages) in the overlap-add method of fig. 2(b). For symmetric filters, the number of multiplications can be reduced, but not by the same amount for both methods. In the overlap-save method, a reduction by a factor 2 is obtained. However, in the overlap-add method, the symmetry cannot be fully exploited, as a consequence of the zero-padding. The reduction R of the number of multiplications in the overlap-add method between the non-symmetric and symmetric filters is always smaller than 2 and depends on the ratio of the filter length M and subimage width L:
R = 2 / (1 + M/(2L))   (1)
The number of multiplications of both methods applied to symmetric filters is similar for L >> M/2. In the "vertical" calculation scheme, the subimage width necessarily decreases from one level to the next higher level of the pyramid, so that the constraint L >> M/2 is possibly not satisfied for all levels. In this case, the overlap-save method is preferred.
6. Reduction of the multiplication cost In [4] it is shown that to minimize the round-off noise of a FIR filtering of an 8-bit/sample input image, using 16-bit filter coefficients with an absolute value smaller than one, a 16-bit internal fixed-point representation is sufficient. If the filter coefficients are represented by rational values with an 8-bit integer numerator and a power-of-two denominator, a 12-bit internal representation, as proposed in [5], is sufficient, reducing the VLSI area cost by a factor 1.3. To further reduce the VLSI area cost, special techniques can be applied to lower the number of variable multiplications. Several methods have been compared. For moderate filter lengths up to 15 taps, we have found that the number of multiplications can be reduced by a factor ranging between 1.3 and 1.5 with Mou's diagonalization technique [6] and by around a factor 1.8 using the Chinese Remainder Theorem [7] or Sweldens' Lifting scheme [8, 9]. Also note that the number of delay line elements using Mou's technique increases dramatically, by a factor 2 to 3, and counteracts the marginal gain in the number of multiplications with regard to the VLSI area cost. Mou's technique has, however, the advantage that the clock rate can be reduced, enabling low power implementations. Finally, Sweldens' Lifting scheme has a more regular structure than that obtained with the Chinese Remainder Theorem, so that, in general, the former requires a smaller number of register cells. When using fixed filter coefficients, the above techniques can still be applied, together with the expansion of the constant multiplications into shift-adds.
However, since the reduction of the number of algorithmic multiplications, using for instance the Chinese Remainder Theorem, always comes at the expense of an increasing number of additions, the overall number of adder modules, as well as the corresponding VLSI area cost, is not necessarily reduced for fixed filter coefficients having a small number of non-zero bits in their binary or Canonical-Signed-Digit (CSD) [10] representation. Simulations show that for fixed filter coefficients, the highest VLSI area gains are obtained by heavily optimized hardware sharing between the Lowpass and Highpass filter operations [11, 12]. A similar scheme is naturally obtained in Sweldens' Lifting technique. Clearly, Sweldens' Lifting scheme is often an adequate choice. It is worth noting that hardware sharing between different sets of High- and Lowpass filters can introduce a substantial multiplexing overhead. Experience shows that one should choose between the implementation of up to 2 or 3 fixed Wavelet filter banks on the one hand, and an implementation with programmable filter coefficients using variable multipliers on the other.
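To make the shift-add expansion concrete: a constant multiplication costs one shift-add (or shift-subtract) per non-zero digit in the coefficient's CSD representation [10]. The sketch below (our own illustration; function names are ours) computes the CSD digits of an integer coefficient and uses them as a shift-add network:

```python
def csd_digits(n):
    """Canonical-Signed-Digit digits of a positive integer, LSB first.
    Each digit is -1, 0 or +1, and no two non-zero digits are adjacent."""
    digits = []
    while n != 0:
        if n % 2:
            d = 2 - (n % 4)      # pick +1 or -1 so the next bit clears
            n -= d
        else:
            d = 0
        digits.append(d)
        n //= 2
    return digits

def const_mult(x, c):
    """Multiply x by the constant c using one shift-add (or shift-subtract)
    per non-zero CSD digit of c."""
    acc = 0
    for k, d in enumerate(csd_digits(c)):
        if d:
            acc += d * (x << k)   # one adder module per non-zero digit
    return acc
```

For example, 7 = 8 - 1 has two non-zero CSD digits instead of three binary ones, so multiplication by 7 takes one shift-subtract rather than two shift-adds.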
7. Area estimations
For calculating the Wavelet decomposition of a 1024x1024 8-bit pixel image within 1 s, a pixel rate of 1 MHz is required. If f is the clock frequency of the chip, expressed in MHz, then the number of cycles available per input pixel is f. An upper bound for f is determined by the RAM access time. For a RAM with an access time of 100 ns, f cannot be larger than 10 MHz. Each input pixel must then be processed within 10 cycles. Table 1 gives some estimations of the cycle budget and VLSI area for symmetric Wavelet filter banks in different configurations:
• Different numbers of taps for the Lowpass (M0) and Highpass (M1) filters
• Overlap-save (S) or overlap-add (A) convolution method
• Direct form implementation (D) or with Sweldens' Lifting Scheme (L)
• Varying number N of hardwired multipliers in a folded implementation
The VLSI area estimations are performed using the 0.5 μm 3-metal-layer MIETEC CMOS technology and include the routing overhead. The area is only provided for those configurations in which each input pixel can be processed within 10 cycles.
Table 1: Estimated cycle budget and VLSI area for different implementation styles of symmetric Wavelet filter banks
8. Conclusion
We have shown that the VLSI implementation of a FIR filter in a Wavelet decomposition scheme used for image compression should use a folded architecture based on Sweldens' Lifting Scheme, applied to Wavelet filter banks with rational coefficients having integer numerators and power-of-two denominators. Simulation results suggest that the VLSI area cost can be reduced by approximately a factor of 3 when implementing fixed, instead of programmable, filter coefficients. For large images and/or large filters, successive subimage convolutions should be applied using the overlap-add or overlap-save method, with some preference for the latter.
9. Acknowledgment
This research was supported by the SCADES-3 program of the ESA (European Space Agency), by a grant of the Flemish institute for the promotion of scientific-technological research in the industry (IWT) to J. Bormans, and by a grant to M. Engels as Senior Research Assistant of the Belgian National Fund for Scientific Research (NFWO). The authors would like to thank Lode Nachtergaele (IMEC), Martin Janssen (IMEC) and Peter Schelkens (Vrije Universiteit Brussel - Free University of Brussels - ETRO) for their contributions to this work.
10. References
[1] T. Denk, K. Parhi, "Architectures for lattice structure based orthonormal Discrete Wavelet Transform," Proceedings of the International Conference on Application Specific Array Processors, San Francisco, CA, pp. 259-271, August 1994.
[2] R.E. Blahut, "Fast Algorithms for Digital Signal Processing," Addison-Wesley Publishing Company Inc., New York, April 1985.
[3] J.M. Shapiro, "Embedded Image Coding using zerotrees of wavelet coefficients," IEEE Transactions on Signal Processing, Vol. 41, No. 12, pp. 3445-3462, December 1993.
[4] J.I. Artigas, L.A. Barragan, J.R. Beltran, E. Laloya, J.C. Moreno, D. Navarro, A. Roy, "Word length considerations on the hardware implementation of two-dimensional Mallat's wavelet transform," Optical Engineering, Vol. 35, No. 4, pp. 1198-1212, April 1996.
[5] A.Y. Wu, K.J.R. Liu, Z. Zhang, K. Nakajima, A. Raghupathy, S-C. Liu, "Algorithm-based low-power DSP system design: methodology and verification," VLSI Signal Processing VIII, edited by T. Nishitani and K.K. Parhi, IEEE Signal Processing Society, pp. 277-286, 1995.
[6] Z-J. Mou, P. Duhamel, "Short-Length FIR filters and their use in Fast Nonrecursive Filtering," IEEE Transactions on Signal Processing, Vol. 39, No. 6, pp. 1322-1332, June 1991.
[7] R.C. Agarwal, J.W. Cooley, "New Algorithms for Digital Convolution," IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 25, No. 5, pp. 392-410, October 1977.
[8] W. Sweldens, "The Lifting Scheme: A custom-design construction of biorthogonal wavelets," technical report from ftp.math.sc.edu/pub/imi_94.
[9] W. Sweldens, "The Lifting Scheme: A New Philosophy in Biorthogonal Wavelet Constructions," Proceedings of the SPIE conference, Vol. 2569, pp. 68-79, 1995.
[10] G.W. Reitwiesner, "Binary Arithmetic," Advances in Computers, Academic, Vol. 1, pp. 231-308, 1966.
[11] M. Janssen, F. Catthoor, H. De Man, "A Specification Invariant Technique for Operation Cost Minimisation in Flow-graphs," Proceedings of the 7th International Symposium on High-Level Synthesis, pp. 146-151, Niagara-on-the-Lake, Ontario, Canada, May 1994.
[12] M. Janssen, F. Catthoor, H. De Man, "A Specification Invariant Technique for Regularity Improvement between Flow-Graph Clusters," Proceedings of the European Design and Test Conference, pp. 138-143, Paris, France, March 1996.
Recursive approximate realization of image transforms with orthonormal rotations
Gerben J. Hekstra and Ed F. Deprettere, Department of Electrical Engineering, Delft University of Technology, Delft, The Netherlands, email: [email protected], [email protected]
Monica Monari, Department of Electrical Engineering, Bologna University, Bologna, Italy
Richard Heusdens, Digital Signal Processing Group, Philips Research Labs, Eindhoven, The Netherlands
Abstract
Image transforms, such as the LOT and its various modifications, and also the DCT, which are all commonly used in transform coding for data compression, can be recursively decomposed, yielding a sequence of orthogonal matrices of decreasing order. The basis functions on which the transform is built can be approximated to any order of accuracy by realizing the set of orthogonal matrices in its decomposition by means of so-called fast rotations, which are orthonormal within the range of the required accuracy. For the approximation to be optimal, all orthogonal matrices in the decomposition must be simultaneously expressed in terms of fast rotations. This paper presents a procedure to compute the optimal solutions: either the solution of minimum cost for a given lower bound on the accuracy, or the solution with the highest accuracy for a given upper bound on the cost.
1. Introduction
Data compression of images (such as X-ray image sequences) for storage purposes is heavily constrained by the requirement that the reconstructed images should not reveal coding artefacts. Compression techniques using discrete cosine transforms (DCT) [1] or conventional lapped orthogonal transforms (LOT) [2] fail to meet these requirements at high compression ratios. The modified lapped transform (MLT) overcomes some of these problems, but it is not orthogonal, which is a disadvantage from the point of view of implementation. One of the authors [3] has designed a new LOT which is orthogonal and does not introduce any blocking artefacts when applied to medical image compression. This new LOT was designed taking the following constraints into account.
• From the viewpoint of coding complexity: critical sampling (minimum amount of data), perfect reconstruction, good frequency-discriminating properties of the analysis filters.
• From the viewpoint of coding efficiency: the analysis filters have zeros at z = 1, except for the low-pass filter.
• From the viewpoint of perception: linear-phase overall transfer functions, linear-phase synthesis filters (symmetric sensitivity), synthesis impulse responses that decay smoothly to zero (no blocking artefacts), short synthesis filters (no ringing effect), a sufficiently large number of filters (simple noise shaping).
• From the viewpoint of implementation cost: orthonormality (minimal error blow-up), para-unitarity (analysis and synthesis operators have the same structure), critical sampling (minimum sample rate).
The new LOT obeys all the above criteria, but does not allow a realization in terms of DCT or DST operations only. The question arises, of course, whether this is a drawback. The answer is no, and this has to do with the fact that the arguments that usually plead in favor of DCT-based operations are questionable. Indeed, the N log N argument stems from a computational complexity measure in terms of the number of operations. This number being low does not imply that the implementation is fast and small: large wordlengths are needed to preserve the orthonormality of the basis functions when implemented with traditional multiply-add operations. We have shown in [4] that exploiting the orthonormality and known structure of the basis functions can bring down the complexity of the realization of the transforms. The approach leans on the important property that if an isometry is decomposed into orthogonal operations, then the sensitivity with respect to perturbations in the orthogonal operations' arguments is low. As a result, these arguments can be so perturbed that the orthogonal operations can be implemented at very low cost without deteriorating the global isometry significantly.
More specifically, we have shown the following.
• The LOT, and in fact many other commonly used orthogonal transforms including the DCT, can be recursively decomposed, yielding a set of square orthogonal matrices of decreasing order.
• There exists a set of matrices R_i = [c_i, -s_i; s_i, c_i] which are orthogonal within the range of the required accuracy, and which form a complete set in the sense that any orthogonal matrix U of any reasonable size can be factorized into a sequence of these planar rotations R_i, where the approximation is again within the range of the required accuracy.
• The matrices R_i, called orthonormal μ-rotations, or fast rotations, can be implemented with only a few shift and add operations.
The recursive decomposition of a transform and its VLSI implementation are elaborated upon in [4]. The present paper deals with the approach and procedure to find the optimum approximations of the orthogonal matrices which characterize the recursively decomposed transforms. The basic problem of finding a cost-effective solution is a P-parameter optimization problem, in which the P orthogonal matrices originating from the transform's recursive decomposition have to be approximated simultaneously. The optimization program employs a 2^P-tree branch-and-bound search which exploits the empirically verified (close-to) monotonic behavior of the cost and accuracy functions, and is capable of finding either the solution with the best accuracy for a prescribed maximum cost or the minimum-cost solution for a prescribed lower bound on the accuracy.
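To make the "few shift and add operations" concrete, here is one standard construction (a sketch under our own assumptions, not necessarily the exact factorization of [4, 6]): a planar rotation can be factored into three lifting (shear) steps, and when the two step coefficients are quantized to a few fractional bits, i.e. short sums of signed powers of two, each step reduces to a short shift-add chain while the map remains exactly invertible:

```python
import math

def fast_rotation(x, y, theta, frac_bits=8):
    """Planar rotation by theta realized as three lifting (shear) steps,
    x' = x cos(theta) - y sin(theta), y' = x sin(theta) + y cos(theta),
    with coefficients t = -tan(theta/2) and s = sin(theta) quantized to
    frac_bits fractional bits (so each step is a short shift-add chain)."""
    scale = 1 << frac_bits
    t = round(-math.tan(theta / 2) * scale) / scale
    s = round(math.sin(theta) * scale) / scale
    x = x + t * y      # first shear
    y = y + s * x      # second shear
    x = x + t * y      # third shear
    return x, y
```

With exact (unquantized) coefficients the three shears compose to the exact rotation; quantization perturbs the angle and norm only slightly, which is precisely the low-sensitivity property exploited above.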
2. The LOT and its recursive decomposition
The compression and encoding of images is carried out in a transformed domain using a transform operator A(A), which is an upper triangular block-bounded Toeplitz matrix with as block entries the N x 2N matrix

A = P_A [ A11   A11 J ]
        [ A21  -A21 J ]     (1)
where P_A is a permutation matrix and A21 = B_A A11, with B_A orthogonal. The matrix B_A is the first orthogonal matrix in the sequence of orthogonal matrices which emerge from a recursive decomposition of A. The decomposition goes as follows. Put Ae = (1/2)(A11 + A11 J) and Ao = (1/2)(A11 - A11 J), and let U0 = [Ae^T Ao^T]^T. The matrix A can then be written as

A = P_A [ I   0  ] [ Ae  Ao ] [ I   I ]
        [ 0  B_A ] [ Ao  Ae ] [ I  -I ]     (2)

Now, since U0 = [Ae^T Ao^T]^T with Ae and Ao even- and odd-symmetric, respectively, this matrix has a similar decomposition to A. It turns out that, in fact, the entire decomposition of U0 can be written out recursively as

U_k = [ I   0     ] [ E_k+1  O_k+1 ] [ I   I ]
      [ 0   B_Uk  ] [ O_k+1  E_k+1 ] [ I  -I ],    with U_k+1 = [ E_k+1^T  O_k+1^T ]^T     (3)

where the B_Uk are orthogonal and decreasing in size: N x N, N/2 x N/2, ..., 1 x 1. This recursion is a remarkable property and will in general not exist, that is, the decomposition will in general terminate after the decomposition of U0 [5]. However, most transforms do have this property, and those that do not can still be approximated using the procedure to be described. We have tacitly assumed that similar decomposition and approximation results apply when considering the inverse transform (after decoding) S(S). In fact, for most of the transforms S = A, and the decomposition and approximation of A then also yields the approximation of S. It is important to note that the approximations Â(Â) and Ŝ(Ŝ) are exact inverses of each other.
3. The approximation concept and approach
The approximation is based on the following principle. Let y = Ax, with A an m x n matrix, n > m. If A A^T ≤ I, then there exists an orthogonal matrix Q of size (m+n) x (m+n) such that y = [I 0] Q [0 I]^T x. The matrix Q is an orthogonal embedding of the matrix G = [A_c A], where A_c satisfies A_c A_c^T + A A^T = I. Now Q, and hence A, is approximated by an orthogonal matrix Q̂ = Q_T which is a product of essentially T 2x2 fast rotations, see [4, 6]. The approximation must be such that ŷ = [I 0] Q̂ [0 I]^T x is equal to y within the range of the required accuracy. Moreover, T must be sufficiently small to allow a (single chip) VLSI implementation that is cost-effective and fast enough for real-time compression and coding of, say, image sequences of size 1024 x 1024 at a rate of up to 50 images per second. The factorization of Q is done by choosing the optimal fast rotation q_t according to a greedy criterion at each step t, such that Q_T = q_T ... q_2 q_1 converges rapidly to Q, see [4] for details. The rotation q_t, determined by an index pair (i, j)_t and a rotation angle α_t, is the embedding of a 2x2 fast rotation into the i-th and j-th rows and columns of an (n+m) x (n+m) identity matrix. Each q_t has a certain cost associated with it, which is the number of shift-add-pair operations needed to implement the rotation over the angle α_t, see [6] for details. The approach used is quite general. It is also applicable in cases where A is itself an isometry (as, for example, the LOT) or even an orthogonal matrix (as, for example, the DCT).
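In outline, the greedy factorization can be sketched as follows. This is a toy version under our own assumptions: candidate rotations are drawn from a quantized angle grid rather than from the true fast-rotation set of [6], the residual is measured in the Frobenius norm, and the function name is ours:

```python
import numpy as np

def greedy_factorization(Q, T, angle_bits=6):
    """Approximate the orthogonal matrix Q by a product of T embedded
    2x2 rotations, each chosen greedily to minimize ||Q - Q_t||_F."""
    n = Q.shape[0]
    grid = [np.pi * k / 2**angle_bits for k in range(-2**angle_bits, 2**angle_bits)]
    Qhat = np.eye(n)
    for _ in range(T):
        best_err, best_G = None, None
        for i in range(n):
            for j in range(i + 1, n):       # index pair (i, j)_t
                for a in grid:              # candidate rotation angle
                    G = np.eye(n)
                    c, s = np.cos(a), np.sin(a)
                    G[i, i] = G[j, j] = c
                    G[i, j], G[j, i] = -s, s
                    err = np.linalg.norm(Q - G @ Qhat)
                    if best_err is None or err < best_err:
                        best_err, best_G = err, G
    
        Qhat = best_G @ Qhat                # Q_t = q_t Q_{t-1}
    return Qhat
```

In the real procedure the candidate set would instead contain fast-rotation angles whose tangents are short sums of powers of two, each carrying its own shift-add cost.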
4. Recursive approximate realization of the LOT
A property of the matrices B_A, B_U0, B_U1, ... in the recursive network of the LOT is that all are orthogonal and already mainly diagonal. This property is essential for an even faster convergence of the approximations of these matrices. For the full recursion of the proposed 16 x 32 LOT, we approximate each of the matrices with the matrices B̂_A, B̂_U0, B̂_U1, B̂_U2, using the technique described in the previous section, with respectively T_A, T_U0, T_U1, T_U2 steps in the approximations. The approximation Â follows from the reconstruction using these approximates. For completeness, we have to mention that the recursive decomposition ends with the matrices B_U3 and U4, and that these too are used in the reconstruction. Since both are trivial 1 x 1 matrices (B_U3 = [1]) with an exact realization of no real cost, they need not be approximated. The number of steps
used in the respective approximations of the matrices form an index i = (T_A, T_U0, T_U1, T_U2) to a P-parameter approximation (in this case, P = 4). We call the corresponding approximation Â = fmat(i) the solution belonging to this index, where fmat(i) is the reconstruction function in terms of the index i. Clearly, this function depends on a given factorization of the matrices. We also define the functions fcost(i) and facc(i), both in terms of the index i, as the overall cost function and the overall accuracy of the solution, respectively. We measure the overall cost as the total number of shift-add operations in the resulting network, which is a weighted sum of the cost of the realization of the submatrices and of the introduced butterfly operations. We measure the accuracy of the solution as the norm of the difference between original and approximation, for which we take facc(i) = -log2(||A - Â||), where the approximation Â is given by Â = fmat(i). Analysis of the cost and accuracy functions reveals that the cost function fcost(i) is monotonic throughout the solution space. Hence we can write, for any index i and any increment δ ≥ 0:

fcost(i + δ) ≥ fcost(i),    ∀i, ∀δ ≥ 0     (4)
The accuracy function facc(i) is close to monotonic; that is, monotonicity holds for most indices i and increments δ > 0, but not for all. The disturbance of the monotonicity is small and very local in nature, so that we can set an empirically determined error bound ε > 0 such that we can write:

facc(i + δ) + ε ≥ facc(i),    ∀i, ∀δ ≥ 0     (5)
Further analysis reveals that, when choosing an increment δ in only one dimension, the accuracy function exhibits saturation. This means that an increase of the accuracy in only one of the matrices is cost-effective only up to a certain level, at which the combined accuracies of the other matrices start to play a role and saturation sets in. This is a clear indication that, in order to find cost-effective solutions, a simultaneous approximation of the matrices must be made.
5. The search for optimal solutions
For the search for a cost-effective solution to the approximation problem, let us first define the target accuracy a_target and target cost c_target for the search, and state that any solution i must satisfy both facc(i) ≥ a_target and fcost(i) ≤ c_target. Furthermore, we define a cell C(p, q) in the solution space as the collection of points lying between the bounding indices p and q, with p ≤ q: p ≤ i ≤ q. Hence we can formulate the discriminating property of a cell C(p, q): it contains no cost-effective solutions if facc(q) + ε < a_target or fcost(p) > c_target. We have implemented a heuristic 2^P-tree branch-and-bound search algorithm that is capable of finding cost-effective solutions, either by finding the best solution s for a given target accuracy a_target, such that facc(s) ≥ a_target and fcost(s) is minimal, or by finding the best solution s for a given target cost c_target, such that fcost(s) ≤ c_target and facc(s) is maximal. To explain its operation, we take the first case, with a given target accuracy a_target and an initial target cost of c_target = ∞, and search the entire solution space as follows. First, we factorize each of the matrices B_A, B_U0, ... in the recursive decomposition of the LOT independently, until they reach a sufficient level of (maximum) accuracy, thus setting the bounds of the index to the solution as the number of steps required to reach this maximum accuracy. For the 16 x 32 LOT, the upper bounds are (128, 128, 27, 5), leading to a solution space of size 2.2 x 10^6. For the 32 x 64 LOT, this results in upper bounds of (511, 511, 128, 27, 5) and a solution space of size 4.3 x 10^9. Next, a given cell (the root of the search is the entire solution space) is split into at most 2^P subcells, and each of these is tested for whether it could contain any solutions. If a cell may contain solutions, it is split and checked recursively. If not, the corresponding branch of the search tree is cut off.
If a solution is found during the search, it is used to set the new target cost c_target dynamically, so that fewer cells need to be examined. The result of the search is a solution that satisfies the constraints and has guaranteed minimum cost. We have made the interface between the search program and the objective functions facc(i), fcost(i) such that it can be used for other transforms. We have used it successfully for approximated networks for the MLT, DCT, and wavelet transforms.
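The cell test and recursive splitting can be sketched like this (a toy model, not the authors' program; the monotone cost/accuracy functions passed in are hypothetical stand-ins for fcost and facc):

```python
from itertools import product

def search(p, q, fcost, facc, a_target, eps, best):
    """Branch-and-bound over the cell C(p, q) = {i : p <= i <= q}.
    best = [best_cost_so_far, best_index]; fcost is monotone and facc
    is close-to-monotone up to eps, as in eqs. (4) and (5)."""
    # cell test: no feasible solution, or none cheaper than the best so far
    if facc(q) + eps < a_target or fcost(p) >= best[0]:
        return
    if p == q:
        if facc(p) >= a_target:
            best[0], best[1] = fcost(p), p   # dynamically tightens c_target
        return
    # split into at most 2^P subcells and recurse
    mid = tuple((a + b) // 2 for a, b in zip(p, q))
    for corner in product((0, 1), repeat=len(p)):
        lo = tuple(pk if c == 0 else mk + 1 for pk, mk, c in zip(p, mid, corner))
        hi = tuple(mk if c == 0 else qk for mk, qk, c in zip(mid, q, corner))
        if all(l <= h for l, h in zip(lo, hi)):
            search(lo, hi, fcost, facc, a_target, eps, best)
```

With both objective functions taken as the plain sum of the index (monotone, so ε = 0), the search returns the guaranteed minimum-cost feasible index, and the dynamic cost bound prunes most of the 2^P-tree.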
6. Results and Conclusions
Table 1 shows the results for approximate realizations of the 16 x 32 LOT, of increasing accuracy. The accuracy is shown here as ||A - Â||. Our method shows a rapid convergence: solution number 16 in the table, with a cost of only 776 shift-add-pair operations, is already visually indistinguishable from the original, both in the smoothness of the basis functions and in the frequency responses. As a comparison, a direct implementation would require 512 high-accuracy multiply-add operations (roughly 10,000 additions), without the desirable properties that orthogonal implementations like ours have. Of course, a multiplier implementation following the DCT decomposition is cheaper, but it fails for many transforms, such as our new LOT.

solution   index             target acc.   actual acc.   cost
4          (13 8 3 1)        0.4210        0.4163        314
8          (25 18 5 2)       0.1606        0.1578        494
12         (36 32 7 2)       0.0613        0.0602        644
16         (45 44 11 3)      0.0234        0.0231        776
20         (64 56 11 3)      0.0089        0.0088        900
24         (82 77 13 3)      0.0034        0.0033        1046
28         (90 85 18 4)      0.0013        0.0013        1122
32         (103 96 22 5)     0.0005        0.0005        1194
36         (116 115 24 5)    0.0002        0.0002        1266
40         (127 124 26 5)    0.0001        0.0001        1314

Table 1: Optimal solutions for the 16 x 32 LOT with full recursion depth and varying target accuracy.

We have also tested our search program on partial-depth recursive decompositions of the LOT. Table 2 shows the results from full-depth (level = 5) down to direct (level = 0) implementations. The solutions are targeted at the accuracy of solution 16 of Table 1.

recursion depth   approximated matrices                  index             actual acc.   cost
5                 B_A, B_U0, B_U1, B_U2, (B_U3, U4)      (45 44 11 3)      0.0231        776
4                 B_A, B_U0, B_U1, B_U2, U3              (45 45 11 3 3)    0.0230        780
3                 B_A, B_U0, B_U1, U2                    (49 49 9 17)      0.0231        948
2                 B_A, B_U0, U1                          (49 45 66)        0.0211        1428
1                 B_A, U0                                (56 260)          0.0230        1988
0                 A                                      (691)             0.0226        3808

Table 2: Optimal solutions for the 16 x 32 LOT with different levels of partial recursion and a fixed target accuracy of 0.0233.

The results clearly show that the full-depth recursive decomposition of the LOT, with simultaneous approximation of the submatrices, leads to the best results.
Acknowledgements This research was supported in part by the Dutch National Technology Foundation STW under contract DEL55.3621, and also in part by a grant from the EU in the Erasmus Program.
References
[1] R. Veldhuis and M. Breeuwer. An Introduction to Source Coding. Prentice Hall, New York, 1993.
[2] H.S. Malvar and D.H. Staelin. The LOT: Transform coding without blocking effects. IEEE Trans. on ASSP, 37:553-559, 1989.
[3] Richard Heusdens. Design of lapped orthogonal transforms. IEEE Trans. on Image Processing, to appear.
[4] Gerben J. Hekstra, Ed F. Deprettere, Richard Heusdens, and Zhiqiang Zeng. Efficient orthogonal realization of image transforms. In 1996 IEEE Workshop on VLSI Signal Processing, San Francisco, November 1996.
[5] Richard Heusdens. Overlapped Transform Coding of Images: Theory, Application, and Realization. PhD thesis, Delft University of Technology, 1996.
[6] J. Götze and G. Hekstra. An algorithm and architecture based on orthonormal μ-rotations for computing the symmetric EVD. Integration, the VLSI Journal, 20:21-39, 1995.
Radix Distributed Arithmetic: Algorithms and Architectures
Mohammad K. Ibrahim, De Montfort University, Leicester, UK
Invited Paper
Abstract
In this paper, the concept of radix distributed arithmetic is presented for the first time. The radix approach can be used to describe the arithmetic functionality of the Algebraic Mapping Networks (AlMa-Net), which is a fine grain soft description of Signal Processing (SP) systems architecture. The advantage of using the radix approach is that it results in a wide range of architectures with different trade-offs, one for each radix. Conventional distributed arithmetic is seen as an end point of this spectrum, for radix = 2.
1 Introduction
With advances in VLSI, application designers have a wide range of SP implementation styles with different design trade-offs, such as RISC, programmable DSPs, cores, FPGAs, and ASICs. Furthermore, due to advances in signal processing algorithms, arithmetic, and architectures, system designers have a range of possible algorithms and architectures to choose from for each signal processing function, each with a different trade-off. This has made the design of SP systems much more complex. It involves the evaluation of alternative algorithms as well as hardware and software solutions. In the design of SP systems, Large Grain Data Flow programming languages (LG-DF) are currently used as an interface between system designers and implementers for the following reasons [1]. LG-DF languages are a convenient simulation environment because they are equivalent to using mathematical equations and block diagrams, which is the natural way of describing SP algorithms [1]. For implementers, LG-DF graphs are popular since they do not specify the implementation style, and hence any implementation can be used so long as it maintains the integrity of the dataflow graph [2]. Also, LG-DF graphs show the interdependency between the data streams and hence can be used to exploit large grain parallelism, scheduling and partitioning [1]. Synthesis and design automation tools for SP systems have received a great deal of research attention in recent years, with the aim of translating the LG-DF specification into the final hardware and/or software implementation. This generally involves specifying the target execution platform first, in order that the nodes of the LG-DF description can be written or compiled into the semantics of the host. The host semantics could be C or its variants (for programmable processors), a hardware description language (for dedicated chips), etc.
It is worth noting that in signal processing, dataflow graphs are not used for fine grain specification of the LG nodes, since these SP systems are implemented on machines of a control-flow type. It is also very important to note that LG-DF graphs as a programming language have nothing to do with fine grain data flow architectures or machines [2]. Despite advances in the synthesis tools, however, it is becoming clear that the initial specification of the system at the algorithm level, which is usually given in terms of "algebraic operators", has a considerable influence on the choice of the final implementation. Furthermore, in the majority of cases, those who develop functions or systems at the algorithm level are not fully aware of the implications that their chosen algorithms will have on the final fine grain implementation. The algorithms are selected primarily for their performance with respect to accuracy. There is a great need for a fine grain description of signal processing systems which can be manipulated by those who are involved in the development of functions and systems at the algorithm level, as well as by system implementers. Since modern SP systems use both custom hardware and software running on programmable
CPUs, this description must be applicable to all implementation styles without being specific to a targeted execution platform. This fine grain description will have several advantages, as explained in the next section.
2 Algebraic-Mapping Network (AlMa-Net)
Algebraic-Mapping Network (AlMa-Net) is a generic fine grain description of SP systems currently being developed by the author with colleagues in the DSP Systems and Applications Group at De Montfort University and colleagues at other organizations. The generic "algebraic" nature of AlMa-Net will enable a quick and systematic manipulation and exploration of the different architectural styles that are available, without the need to use their corresponding fine grain semantics. This has several advantages, including:
• SP system designers at the algorithm level can take an active role in the design and implementation of their systems, and can develop a greater understanding of the implications of algorithm selection and implementation style on the final implementation.
• SP system implementers will be able to evaluate different styles of implementation or execution platforms using a generic fine grain description, which avoids the need to first acquire the hardware and software development tools of each execution platform for evaluation.
The AlMa-Net consists of functional nodes, data storage nodes, and edges for communications. These are all described using algebraic expressions. More details are given in section 4. One aspect of AlMa-Net is that, for specific implementation styles, algorithms need to be specified using Radix Algebraic Processors (RaAP). RaAP has been developed in the last few years as a fine grain algebraic description of signal processing functions and systems at the sub-word level [3-6]. The radix methodology has been used extensively in the design of digit-serial architectures [3-6]. In the following sections, its application to the design of distributed-arithmetic algorithms and architectures is reported for the first time.
3 Radix Distributed Arithmetic
Distributed arithmetic is generally used to find the inner vector product with one of the vectors being a constant. Given two vectors with M elements, U = [u1 ... uM] and V = [v1 ... vM], the inner vector product W is given by:

W = Σ_{i=1}^{M} u(i) v(i)     (1)
We can write the elements u(i), i = 1, ..., M, in terms of radix-2^n arithmetic as follows:

u(i) = Σ_{j=0}^{N} u_j(i) 2^{jn}     (2)
where u_j(i) is the j-th digit of u(i), and n is the digit size in terms of the number of bits. Substituting for u(i) in the first equation, and after manipulation, we have

W_j = W_{j-1} + 2^{jn} P_j,   for j = 0, ..., N,   where W = W_N, and     (3)
P_j = Σ_{i=1}^{M} u_j(i) v(i)     (4)
The above two equations completely describe the radix distributed arithmetic algorithm. For a constant vector V, equation 4 describes what is implemented using memory: it is the inner vector product between the vector V and the digit-vector U_j = [u_j(1), ..., u_j(M)]^T. Equation 3 describes how the inner digit-vector products P_j, j = 0, ..., N, are added together. It is interesting to note that when the radix = 2 (n = 1), the resulting algorithm is the conventional distributed arithmetic, which means that the vector V is multiplied by the bit-vector of the input data. In the conventional implementation, the computation of equation 3 is performed in a bit-serial fashion, where the bits of the elements u(i), i = 1, ..., M, are fed serially to the memory to calculate P_j, starting with the LSB. Clearly, the speed of this implementation is limited by the bit-serial computation of the inner vector product. Radix distributed arithmetic generalizes the basic distributed arithmetic concept in that equations 3 and 4 are generic for all radices and do not represent a specific realization. For each radix, however, the equations result in a different structure. As a result, a wide range of distributed arithmetic structures and, hence, trade-offs become available, one for each radix. In the next section, the radix distributed arithmetic architecture is briefly described using AlMa-Net.
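A behavioral model of equations (3) and (4) can check the arithmetic (a sketch for non-negative integer data, not the hardware architecture of the next section; names are ours):

```python
from itertools import product

def radix_da(u, v, n, N):
    """Inner product W = sum_i u(i) v(i) computed digit-serially in
    radix 2^n: P_j is read from a look-up table addressed by the
    digit-vector U_j (eq. 4) and accumulated as in eq. 3."""
    M, radix = len(u), 1 << n
    # precompute the memory contents: one entry per possible digit-vector
    lut = {U: sum(U[i] * v[i] for i in range(M))
           for U in product(range(radix), repeat=M)}
    W = 0
    for j in range(N + 1):                       # least significant digit first
        Uj = tuple((u[i] >> (j * n)) & (radix - 1) for i in range(M))
        W += lut[Uj] << (j * n)                  # W_j = W_{j-1} + 2^(jn) P_j
    return W
```

Setting n = 1 gives conventional bit-serial distributed arithmetic; larger n trades a bigger memory (radix^M entries) for fewer serial steps, which is exactly the radix-dependent trade-off described above.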
4 Radix Distributed Arithmetic Architecture using AlMa-Net
The architecture of the radix distributed arithmetic using AlMa-Net is shown in figure 1. The most significant advantage of the radix and AlMa-Net approach is that the architecture in figure 1 is generic and results in a different structure for each radix; in other words, the architecture in figure 1 is effectively a "soft" architecture. In an AlMa-Net, the functional nodes are denoted as squares. For example, the functional unit in figure 1 is an adder used to perform the addition in equation 3. In an AlMa-Net, the edges are represented using a transmission matrix, which can be considered a space-switching matrix. Also, in AlMa-Net, all data storage nodes (denoted as circles) are described by F(D, N_D, A, RW, E), where the variables can be scalars, vectors or matrices (they should all be of the same type) and each "element" corresponds to a port: D is the data, N_D is the wordlength of the data, A is the address for the ports, RW is the read/write indicator, and E is the enable for the ports. The interesting property of the AlMa-Net function F(D, N_D, A, RW, E) is that it is a programmable mathematical expression which (i) is only activated when the enable parameter(s) is triggered, and (ii) has variables that can be inputs or outputs depending on the corresponding read/write indicators. It can also represent parallelism through the use of vector and matrix parameters. These properties can be extended to nodes other than those that correspond to memory.
where: V = F(Dv, Nv, Av, RWv, Ev), L = F(Dl, Nl, Al, RWl, El), T = F(Dt, Nt, At, RWt, Et), S = F(Ds, Ns, As, RWs, Es), R = F(Dr, Nr, Ar, RWr, Er)

Figure 1: AlMa-Net for the Radix Distributed Arithmetic Architecture
For example, L denotes a latch which has two ports, one input port and one output port. Therefore, Dl = (Wj, Wj-1), Al = (j mod 1, j mod 1), RWl = (W, R), El = (j mod 1, j mod 1). Another example is S, a sampler latch with, again, one input port and one output port; in this case, Ds = (Wj, W), As = (j mod 1, j mod 1), RWs = (W, R), Es = (δ(j mod (M+1)), j mod 1), where δ(k) = 1 for k = 0 and zero otherwise. In the case of R, which is a RAM (possibly a ROM) with one port, Dr = (Pj), Ar = (Uj), RWr = (R), Er = (j mod 1). Similar expressions can be written for V and T, but they are more involved since they are multi-port memory elements. Note that in all cases, the wordlength can also be specified for each port. In figure 1, the transmission matrices of all the edges are equal to the identity matrix. Obviously, as in the conventional case, when the memory size required in R becomes large, address repartitioning methods can be used in the calculation of Pj, e.g.
Pj = Σ_{i=1}^{M/2} uj(i) v(i) + Σ_{i=M/2+1}^{M} uj(i) v(i)
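To make the repartitioning concrete, here is a small Python sketch (our own illustration, not from the paper) of a radix-2 digit-vector inner product computed with two half-size lookup tables instead of one table of size 2^M; the function names and bit ordering are illustrative assumptions.

```python
def build_lut(v):
    """Precompute the inner product of v with every possible bit-vector:
    a table of 2**len(v) partial sums, indexed by the bits as an address."""
    m = len(v)
    return [sum(v[i] for i in range(m) if (addr >> i) & 1)
            for addr in range(2 ** m)]

def inner_product_partitioned(v, u_bits):
    """Pj = sum_i uj(i) v(i), computed as two half-address LUT lookups.

    Two tables of size 2**(M/2) replace one table of size 2**M,
    at the cost of one extra adder."""
    half = len(v) // 2
    lut_lo, lut_hi = build_lut(v[:half]), build_lut(v[half:])
    addr_lo = sum(b << i for i, b in enumerate(u_bits[:half]))
    addr_hi = sum(b << i for i, b in enumerate(u_bits[half:]))
    return lut_lo[addr_lo] + lut_hi[addr_hi]
```

For M = 16 coefficients this replaces a 65536-entry memory with two 256-entry memories plus one adder, which is exactly the memory-versus-logic tradeoff that address repartitioning targets.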
Furthermore, the operation in the dashed box of figure 1 can be implemented using a variety of ways.
5
Summary:
In conclusion, radix distributed algorithms and the AlMa-Net description of the corresponding architecture are described in this paper. The main advantages of the radix approach and the AlMa-Net are that (i) they are generic, fine-grain descriptions, and (ii) they represent soft architectures, because each describes a family of structures, one for each specific value of the parameters.
Acknowledgment: The author would like to thank the following colleagues for their valuable discussions on the radix approach and the AlMa-Net: Dr Amar Aggoun, Dr Akil Bashagha, Christine Hillyar, and Dilip Chauhan of De Montfort University, Dr Kamran Kordi of GEC Hirst Research, and Ahmed Ashur, Mujahed Mekallalati, Dr Leon Harrision of Nottingham University. The author would also like to acknowledge the funding support of De Montfort University through the Vice-Chancellor's Research Initiative.
References:
[1] E. A. Lee and D. G. Messerschmitt, "Static scheduling of synchronous data flow programs for digital signal processing," IEEE Trans. on Computers, vol. 36, no. 1, pp. 24-35, 1987.
[2] E. A. Lee and T. M. Parks, "Dataflow process networks," Proc. IEEE, vol. 83, no. 5, pp. 773-799, 1995.
[3] A. Aggoun, A. Ashur and M. K. Ibrahim, "A Novel Cell Architecture for High Performance Digit-Serial Computation," Electronics Letters, vol. 29, pp. 938-940, 1993.
[4] M. K. Ibrahim, "Radix Multiplier Structures: A Structured Design Methodology," IEE Proceedings Part E, vol. 140, pp. 185-190, 1993.
[5] A. E. Bashagha and M. K. Ibrahim, "High radix digit-serial division," accepted for publication in IEE Proceedings on Circuits, Systems, and Devices.
[6] A. Aggoun, M. K. Ibrahim and A. Ashur, "Bit-level pipelined digit-serial processors," accepted for publication in IEEE Transactions on Circuits and Systems.
Order-Configurable Programmable Power-Efficient FIR Filters*
Chong Xu, Ching-Yi Wang and Keshab K. Parhi
Department of Electrical Engineering, University of Minnesota, Minneapolis, MN 55455
Abstract
We present a novel VLSI implementation of an order-configurable, coefficient-programmable, and power-efficient FIR filter architecture. This single-chip architecture contains 4 multiply-add functional units, and each functional unit can have up to 8 multiply-add operations time-multiplexed (or folded) onto it. Thus one chip can be used to realize FIR filters with lengths ranging from 1 to 32, and multiple chips can be cascaded for higher order filters. To achieve power-efficiency, an on-chip phase locked loop (PLL) is used to automatically generate the minimum voltage level needed to achieve the required sample rate. Within the PLL, a novel programmable divider and a voltage level shifter are used in conjunction with the clock rate to control the internal supply voltage. Simulations show that this chip can be operated at a maximum clock rate of 100 MHz (folding factor of 1 or filter length of 4). When operated at 10 MHz, this chip consumes only 27.45 mW using an automatically set internal supply voltage of 2V. For comparison, when the chip is operated at 10 MHz and 5V, it consumes 109.24 mW. At 100 MHz, the chip consumes 891 mW with a 4.5V supply that is automatically generated by the PLL. This design has been implemented using Mentor Graphics tools for an 8-bit word-length and 1.2 µm CMOS technology.
1
Introduction
With the recent explosion of portable and wireless real-time digital signal processing applications, the demand for low-power circuits has increased tremendously [1]-[3]. This demand has been satisfied by utilizing application specific integrated circuits, or ASICs; however, ASICs allow for very little reconfigurability. Another new trend is the need to minimize the design cycle time. Therefore many programmable logic devices (PLDs) (e.g., field-programmable gate arrays) are being utilized for prototyping and even production designs [4]. The main disadvantage of these PLDs is that they suffer from slow performance because their architectures have been optimized for random logic and not for digital signal processing implementations. In this paper, a solution for the implementation of high-speed, low-power, and order-configurable finite impulse response (FIR) filters is presented. This architecture was designed by applying the folding and retiming transformations, and the filter order can vary from 1 to 31 using one chip. Multiple chips can be cascaded to achieve higher order FIR filters. This new architecture consists of two parts: a configurable processor array (CPA) [5] and a phase locked loop (PLL). The CPA contains the multiply-add functional units, and the PLL is designed to automatically vary the internal voltage to match the desired throughput rate and minimize the peak power dissipated by the CPA. We utilize a novel programmable divider and a voltage level shifter in conjunction with the clock to control the internal supply voltage. The CPA portion contains folded multiply-add (FMA) units which operate in two phases: the configuration phase, where the processor array is programmed for a specific sample rate and filter order, and the execution phase, where the processor array performs the desired filtering operation. We also implemented novel programmable subcircuits that provide the order configurability of the architecture.
This design has been implemented using Mentor Graphics tools and 1.2 µm CMOS technology. In section 2, we briefly describe how the CPA is derived and the design parameters. In section 3, the design of the CPA components is described in more detail, and section 4 describes the PLL components. Simulation results are provided in section 5 to demonstrate the effectiveness of the design and the power savings.
2
Background
Consider the transpose-form architecture of a 6-tap FIR filter that realizes the function y(n) = a0 x(n) + a1 x(n-1) + a2 x(n-2) + a3 x(n-3) + a4 x(n-4) + a5 x(n-5). If we implement this 6-tap filter using 2 multiply-add functional units, which corresponds to using a folding factor of 3 [6] (i.e., 3 multiply-add
*This research was supported by the Advanced Research Projects Agency and the Solid State Electronics Directorate, Wright-Patterson AFB, under contract number AF/F33615-93-C-1309.
operations are folded onto the same functional unit), we will have the folded architecture shown in Fig. 1. This architecture consists of folded multiply-add (FMA) units. The inputs and outputs (x(n) and y(n)) to each FMA will hold the same sample data for three clock cycles before changing to the next sample.

Figure 1: The folded architecture of the 6-tap FIR filter (folding factor = 3).

To completely pipeline the folded architecture, additional delays are introduced at the input (x(n)) by using the retiming transformation [7] along with pipelining. This modified structure is now periodic with a period of three clock cycles (or 3-periodic). This technique can be applied to any N-tap FIR filter for any folding factor p. To achieve programmability and the CPA architecture, we convert the fixed number of registers in Fig. 1 into programmable delays that are constrained by a maximum folding factor Pmax, as shown in Fig. 2. To implement an N-tap filter using this architecture, a total of M (where M = ⌈N/p⌉) FMA modules are required. This CPA architecture is a periodic system with period p; therefore it is designed to produce filter outputs from module FMA0 in clock cycles (t mod p) = 0 (where t = time in clock cycles) and hold them for p cycles. Note that mux4 in Fig. 2 is only required for module FMA0, to hold the filter output data for p clock cycles, and is redundant in the other FMAj modules (j ≠ 0). These other multiplexers can be replaced by a single delay along with sharing of the (p - 1) registers in the feedback accumulation path. The switching times of all of the programmable multiplexers are summarized in Table 1.
Figure 2: A configurable processor array (CPA) for N-tap FIR filters which is p-periodic.

mux#   mux definition
1      a_i in clock cycle ((p - 1)(j + 1) + i) mod p
2      I in clock cycle ((p - 1)(j + 1) - 1) mod p
3      I in clock cycle ((p - 1)(j + 1) - 1) mod p
4      I in clock cycle ((p - 1)(j + 1)) mod p

Table 1: Multiplexer definitions
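The folding schedule can be checked functionally in software. The sketch below is our own model, not the paper's hardware: it ignores pipelining, retiming, and the multiplexer timing of Table 1, and only verifies the arithmetic of folding an N-tap FIR onto M = ⌈N/p⌉ multiply-add units, each performing p operations per output sample.

```python
import math

def folded_fir(x, coeffs, p):
    """Model an N-tap FIR folded onto M = ceil(N/p) multiply-add units.

    Unit j handles taps j*p .. j*p + p - 1, i.e. p folded operations
    per output sample."""
    n_taps = len(coeffs)
    m_units = math.ceil(n_taps / p)
    y = []
    for n in range(len(x)):
        acc = 0
        for j in range(m_units):        # one pass per FMA module
            for k in range(p):          # p time-multiplexed ops per module
                tap = j * p + k
                if tap < n_taps and n - tap >= 0:
                    acc += coeffs[tap] * x[n - tap]
        y.append(acc)
    return y
```

With six coefficients and p = 3 this uses M = 2 units, matching the 6-tap example of Fig. 1; any p gives the same output sequence, differing only in how the work is scheduled.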
Before implementing this general structure, we had to set values for Nmax and Pmax. We chose to set Nmax (the maximum number of taps) to 32 because an FIR filter will provide good performance for filter lengths around 32. We set Pmax (the maximum folding factor) to 8 because we wanted Pmax to be a power of 2 and desired greater flexibility with minimal control overhead. With Nmax = 32 and Pmax = 8, a total of 4 FMA modules needed to be integrated onto a single chip.
3
Configurable Processor Array
The 8-bit parallel multiplier is a key part of the CPA module because it determines the critical path of the system. We chose to utilize the Baugh-Wooley algorithm for the multiplier because the control overhead is smaller than other algorithms (e.g., Booth recoding) and the full-adders are not wasted on sign extensions. This algorithm generates a matrix of partial product bits and a fast multi-operand adder [8] was employed
to accumulate these partial products. To minimize the critical path in the accumulation path, we used the Wallace tree approach [9]. In the CPA design of Fig. 2, we see that the feedback accumulation path requires p - 1 synchronization registers. Because p is a programmable parameter, p - 1 can range from 0 to 7 (Pmax - 1), so we implemented these registers as a programmable delay line, as shown in Fig. 3. Each delay line contains seven 8-bit registers, seven 8-bit multiplexers, and one control unit. The control unit is a simple decoder that converts p into seven control bits; each control bit directs the data through or around a delay.
Figure 3: p - 1 programmable delay line.

The multiplexers mux2, mux3 and mux4 shown in Fig. 2 are 2-to-1 p-periodic multiplexers. Their function is to select input I in one of p clock cycles. These multiplexers use a 3-bit (⌈log2(Pmax)⌉-bit) binary counter with asynchronous reset and synchronous parallel load. In addition, two 3-bit registers and a comparator are used in the control circuitry of each multiplexer. One register holds p and the second holds a programmed clock cycle value ranging from 0 to p - 1. When the counter output equals the held clock cycle value, the controller allows the data on I to pass to the output. The final multiplexer in Fig. 2, mux1, is a programmable p-to-1 p-periodic multiplexer which consists of one 8-bit 8-to-1 multiplexer and one control unit. At each counter state one of p control lines will be high to activate the p-to-1 multiplexer.
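Behaviorally, a 2-to-1 p-periodic multiplexer passes input I in exactly the one clock cycle per period where the mod-p counter matches the programmed value, and passes the other input otherwise. A short Python model of that behavior (the function and parameter names are our own):

```python
def p_periodic_mux(p, select_cycle, stream_i, stream_other):
    """2-to-1 p-periodic multiplexer: output I when the mod-p counter
    equals the programmed select_cycle (in 0 .. p-1), else the other input."""
    return [a if t % p == select_cycle else b
            for t, (a, b) in enumerate(zip(stream_i, stream_other))]
```

With p = 3 and select_cycle = 1, input I appears on the output only in cycles 1, 4, 7, ..., matching the one-in-p selection described above.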
4
Phase Locked Loop
Reducing the supply voltage of VLSI chips is commonly used to save power; however, it also slows down the critical path of the circuit. If the supply voltage is reduced too much, the critical path will become too slow to assure correct functionality of the design. Therefore we designed a phase locked loop (PLL) circuit that automatically controls the internal supply voltage to provide the lowest voltage allowable while still achieving the throughput required for the application [10]. The PLL consists of a phase detector, a charge pump with a loop filter, a voltage controlled oscillator (VCO), a programmable divider, and a voltage level shifter. All of these components form a feedback circuit that automatically adjusts the voltage level as required by the programmed parameters and the clock speed. The schematic of the programmable divider used in the PLL is shown in Fig. 4. To achieve a 50% duty cycle, we had to accommodate three possible cases of p. If p is 1, the input clock simply passes through the divider without any change. For even p, the divider toggles its output every p/2 input clock cycles by using a programmable counter. When p is odd (p > 1), the divider must alter the output every (p - 1)/2 + 1/2 input clock cycles. This means the output may toggle at either the rising edge or the falling edge of the input clock. To detect the edge where the divider should toggle its output, we utilize two programmable counters: one to detect rising edges, and the other to detect falling edges. These counters generate a series of pulses representing edges, and an OR gate combines them into a single pulse. Finally, the Toggle component alters the output according to the pulses generated by the OR gate. The two multiplexers in Fig. 4 select the appropriate clock output from the three cases depending on the value of p.
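The three cases (p = 1, even p, odd p) collapse into a single rule if time is counted in half-cycles of the input clock: the divider output toggles every p half-cycles, which is p/2 full cycles for even p and (p - 1)/2 + 1/2 cycles for odd p. The following is a behavioral sketch of that rule (our own model, not the gate-level circuit of Fig. 4):

```python
def divide_clock(p, n_input_cycles):
    """Divide-by-p with 50% duty cycle, modeled at half-cycle resolution.

    The output toggles on every p-th input half-cycle, which covers even
    and odd p uniformly (and p = 1, where toggling every half-cycle
    reproduces the input clock)."""
    out, level = [], 0
    for half in range(2 * n_input_cycles):
        if half > 0 and half % p == 0:
            level ^= 1                  # toggle on every p-th half-cycle
        out.append(level)
    return out                          # two samples per input clock cycle
```

For odd p the toggles alternate between rising-edge and falling-edge positions of the input clock, which is exactly why the hardware needs both a rising-edge and a falling-edge counter.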
Figure 4: Programmable divider

The function of the voltage level shifter (VLS) is to raise the output voltage of the loop filter to a level usable in the CPA. By sizing transistors in the VLS, we can adjust the amount of voltage that will be shifted (known as the voltage shift level). However, the power consumption of the voltage level shifter will increase with an increase in the voltage shift level. So there is a tradeoff between power consumption and the voltage
shift level. Our experiments have shown that a shift of 0.6V provided enough internal voltage to safely operate the CPA within the design specifications while minimizing the power consumption.
5
Simulation
Using Mentor Graphics tools, simulations determined the critical path of the design to be 7 ns at the schematic level, which means that it is safe to operate the architecture at up to 100 MHz. The CPA was designed to be operated with sample rates in the range of 10 MHz to 100 MHz, which corresponds to an internal clock rate of 1.125 MHz (with p = 8) to 100 MHz (with p = 1). This range of frequencies corresponds to an internal power supply range of 4.5V to 2.0V. Efficient power consumption is one of the important features of our design, and Table 2 shows the power consumption in mW for each CPA component at different frequencies and power supplies.

Component     5V, 100MHz   4.5V, 100MHz   5V, 10MHz   2.0V, 10MHz
multiplier    140.6        112.5          14.23       1.98
pmux(p-1)     5.17         3.85           1.14        0.050
adder         18.8         16.52          2.18        0.28
pldelay       60           43.2           6.03        0.77
pmux(2-1)     11.6         9.5            0.65        0.063
ldelay        8            5.63           0.9         0.099
FIR(digital)  1101.48      863.32         109.24      13.87

Table 2: Power consumption for digital parts of FIR filter in mW

From Table 2, we can see that at 100 MHz, the CPA without the PLL and using a 5V supply voltage will consume 1101.48 mW. By utilizing the PLL-generated supply voltage for 100 MHz (4.5V), the power consumption can be reduced to 863.32 mW. At 10 MHz, we can save 95.37 mW by using the PLL supply voltage automatically generated for 10 MHz versus a 5V supply. Of course the PLL will consume some power of its own, and the results of power consumption simulations for the various components of the PLL are listed in Table 3. From Table 3, we can see that even if we include the power consumption of the PLL, we will still save 210.06 mW at 100 MHz, and 81.79 mW at 10 MHz.
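These savings roughly follow the first-order CMOS dynamic-power model P ∝ V²·f. The helper below is our own illustration of that scaling; it slightly overestimates the simulated figures (e.g. it predicts about 892 mW at 4.5V/100MHz versus the simulated 863.32 mW), since the simple model ignores second-order effects.

```python
def dynamic_power_scale(p_ref_mw, v_ref, f_ref, v_new, f_new):
    """First-order CMOS dynamic power estimate: P = C * V^2 * f, so a
    reference measurement scales by (V_new/V_ref)^2 * (f_new/f_ref)."""
    return p_ref_mw * (v_new / v_ref) ** 2 * (f_new / f_ref)
```

For example, scaling the 5V/10MHz figure of 109.24 mW down to the PLL-chosen 2.0V gives an estimate of about 17.5 mW, in the same range as the simulated 13.87 mW.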
[Table values: power of the phase detector, charge pump, loop filter, VCO, level shifter, and divider, and the total, at 100 MHz and at 10 MHz]

Table 3: Power consumption for PLL parts in mW
References
[1] A. P. Chandrakasan and R. W. Brodersen, "Minimizing power consumption in digital CMOS circuits," Proceedings of the IEEE, vol. 83, pp. 498-523, April 1995.
[2] D. Singh, J. M. Rabaey, M. Pedram, F. Catthoor, S. Rajgopal, N. Sehgal, and T. J. Mozdzen, "Power conscious CAD tools and methodologies: A perspective," Proceedings of the IEEE, vol. 83, pp. 570-, April 1995.
[3] A. P. Chandrakasan and R. W. Brodersen, "Design of portable systems," in IEEE Custom Integrated Circuits Conference, (San Diego, CA), pp. 259-266, May 1994.
[4] S. D. Brown, "An overview of technology, architecture and CAD tools for programmable logic devices," in IEEE Custom Integrated Circuits Conference, (San Diego, CA), pp. 69-76, May 1994.
[5] V. Visvanathan and S. Ramanathan, "Synthesis of Energy-Efficient Configurable Processor Arrays," in International Workshop on Parallel Processing, 1994.
[6] K. Parhi, C. Wang, and A. P. Brown, "Synthesis of control circuits in folded pipelined architectures," IEEE J. Solid-State Circuits, vol. 27, pp. 29-43, Jan. 1992.
[7] C. E. Leiserson and J. Saxe, "Optimizing synchronous systems," in VLSI and Computer Systems, pp. 41-67, 1983.
[8] I. Koren, Computer Arithmetic Algorithms. Prentice-Hall, 1993.
[9] C. S. Wallace, "A suggestion for a fast multiplier," Computer Arithmetic, vol. 1, pp. 114-117, 1990.
[10] N. H. E. Weste and K. Eshraghian, Principles of CMOS VLSI Design: A Systems Perspective. Addison-Wesley, 2nd ed., 1993.
Session S: VIDEO CODING III: MULTIMEDIA
ON SPEECH COMPRESSION STANDARDS IN MULTIMEDIA VIDEOCONFERENCING: IMPLEMENTATION ASPECTS
Milan Marković¹, Zoran Bojković²
¹Institute of Applied Mathematics and Electronics, Kneza Miloša 37, 11000 Belgrade, Yugoslavia, e-mail: [email protected]
²Faculty of Transport and Traffic Engineering, Vojvode Stepe 305, 11000 Belgrade, Yugoslavia

Abstract: In this paper, standard algorithms for coding of narrowband 3.2 kHz and wideband 7 kHz speech for N-ISDN multimedia videoconferencing (a part of the overall ITU-T H.320 family of standards), as well as for very low bit rate multimedia communications (a part of the overall ITU-T H.324 family of standards), are considered. The possibilities of real-time implementations of the considered algorithms on a hardware module with a single digital signal processor are considered too.

1. Introduction
Speech compression has advanced rapidly in recent years, spurred on by cost-effective digital technology and diverse commercial applications. The surprising growth of activity in the relatively old subject of speech compression is driven by the insatiable demand for voice communication, by the new generation of technology for cost-effective implementation of digital signal processing algorithms, by the need to conserve bandwidth in both wired and wireless telecommunication networks, and by the need to conserve disk space in voice storage systems. Most of this effort is focused on the usual telephone bandwidth of roughly 3.2 kHz (200 Hz to 3.4 kHz). Interest in wideband (7 kHz) speech for audio in videoconferencing has also increased in recent years. Within the wired network the requirements on speech compression are rather tight, with strong restrictions on quality, delay, and complexity. Since standards are essential for compatibility of terminals in voice communication systems, standardization of speech compression algorithms has lately become of central importance to industry and government.
In this paper, standard algorithms for compression of narrowband 3.2 kHz and wideband 7 kHz speech for "network" applications, such as multimedia videoconferencing for basic (2B+D) or primary (30B+D) N-ISDN access, are considered. In these applications, speech compression is used in connection with the ITU-T H.261 p×64 kb/s (p = 1,...,30) video compression standard [1]. As the most popular speech compression algorithms in these applications, ITU-T G.728 for narrowband and G.722 for wideband speech (as a part of the overall ITU-T H.320 family of standards) are used [2]. Also, the ITU-T G.723 dual rate speech coding standard algorithm [3] (as a part of the overall ITU-T H.324 family of standards) for very low bit rate multimedia communication over wireless and PSTN systems is considered too. The possibilities of real-time implementations of the considered standard algorithms on a hardware module with a single digital signal processor are elaborated.
2. Speech compression standards for multimedia videoconferencing
After the adoption of the earlier speech compression standards, ITU-T G.711 PCM 64 kb/s [4] and G.721 ADPCM 32 kb/s [5], it can be concluded that wired telephone network speech quality is achievable by using the ITU-T G.728 LD-CELP 16 kb/s compression standard with less than 2 ms coding delay [6]. In other words, after the establishment of the G.728 standard, there is relatively little remaining interest in these applications and this bit rate [7]. Namely, for many applications, especially when echo cancellation is involved, the time delay introduced by speech coding into the communications link is a critical factor in overall system performance. Typical delays of 60 to 100 ms, and occasionally even higher, are common in speech coders. Also, algorithms which include error correcting codes and bit interleaving to combat high channel error rates can incur a substantial additional delay. In 1988, the ITU-T established a maximum delay requirement of 5 ms, with a desired objective of only 2 ms, for a 16 kb/s standard algorithm. This culminated in the adoption of the LD-CELP G.728 algorithm in 1992. The G.728 speech compression algorithm, shown in Fig. 1, achieves a one-way coding delay of less than 2 ms by making both the LPC predictor and the excitation gain backward adaptive, and by using a small excitation vector size of five samples. The pitch predictor is not used, due to its sensitivity to channel errors, and the resulting performance loss is compensated for by increasing the LPC predictor order from 10 to 50. The excitation gain is updated by a 10th-order adaptive linear predictor based on the logarithmic gains of previously quantized and scaled excitation vectors. The LPC predictor and the gain predictor are updated by performing LPC analysis on previously coded speech and the previous log-gain sequence, respectively, with the autocorrelation coefficients calculated by a novel hybrid windowing method.
The excitation codebook is closed-loop optimized, and its index is Gray-coded for better robustness to channel errors. An adaptive postfilter is used at the decoder to improve coder performance. The official ITU-T laboratory tests revealed that the speech quality of this 16 kb/s LD-CELP coder is either equivalent to or better than that of the ITU-T G.721 ADPCM 32 kb/s standard coder for almost all conditions tested. Recently, the ITU-T has been conducting a standardization study of medium-delay coders where the delay requirements allow a total codec delay of at most 32 ms [8,9].
Figure 1: Block diagram of the ITU-T G.728 LD-CELP 16 kb/s narrowband speech compression algorithm: (a) encoder; (b) decoder.

The ITU-T G.722 ADPCM 64, 56, or 48 kb/s standard algorithm for 7 kHz wideband speech coding is based on a two-band subband coder, with ADPCM coding of each subband, as shown in Fig. 2 [10,11,12]. The transmit part of the G.722 coder converts a digital signal, coded using 14-bit uniform PCM with 16 kHz sampling, into a 64 kb/s bit stream by using the subband ADPCM technique. The decoder performs the reverse operation, noting that the effective bit rate at the input of the decoder can be 64, 56, or 48 kb/s depending on the mode of operation. Namely, the use of an embedded coding technique for ADPCM permits operation of the low-frequency subband at three quantizing rates (i.e., 6, 5, or 4 bits per sample), with corresponding bit rates of 64 (mode 1), 56 (mode 2), or 48 kb/s (mode 3). Modes 2 and 3 enable an auxiliary data channel, with capacities of 8 and 16 kb/s respectively, for simultaneous data transmission (such as text, fax, telewriting) over the 64 kb/s (B channel) basic rate channel [13]. The filter banks that are used for analysis and synthesis produce a communication delay of about 3 ms. Recently, there has been considerable interest in wideband speech coding for ISDN and videoconferencing applications. Effective coding schemes, often based on CELP, that achieve at 32 kb/s the same quality as the ITU-T G.722 algorithm at 64 kb/s have been developed [14,15,16].
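The mode arithmetic described above is easy to tabulate. In the sketch below, the dictionary layout is our own, but the numbers come from the text: each subband is sampled at 8 kHz, so the lower subband's rate is its bits-per-sample times 8 kHz, the upper subband is fixed at 2 bits per sample (16 kb/s), and speech bits plus auxiliary data always fill the 64 kb/s B channel.

```python
# G.722 operating modes: lower-subband bits/sample, speech bit rate,
# and auxiliary data capacity (values as given in the text above).
G722_MODES = {
    1: {"low_band_bits": 6, "speech_kbps": 64, "aux_kbps": 0},
    2: {"low_band_bits": 5, "speech_kbps": 56, "aux_kbps": 8},
    3: {"low_band_bits": 4, "speech_kbps": 48, "aux_kbps": 16},
}

def channel_total_kbps(mode):
    """Speech plus auxiliary data: always the full 64 kb/s B channel."""
    m = G722_MODES[mode]
    return m["speech_kbps"] + m["aux_kbps"]
```

The consistency check low_band_bits × 8 kHz + 16 kb/s = speech rate holds for all three modes, which is exactly what "embedded coding" of the lower subband means here: dropping the least significant 1 or 2 bits per sample frees 8 or 16 kb/s for data.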
Figure 2: Block diagram of the ITU-T G.722 ADPCM (64, 56, 48 kb/s) 7 kHz wideband speech compression algorithm

The ITU-T G.723 standard [3] specifies a coded representation that can be used for compressing the speech or other audio signal component of multimedia services at a very low bit rate (for wireless and PSTN multimedia communications). This coder has two bit rates associated with it, 5.3 and 6.3 kb/s. The higher bit rate has greater
quality. The lower bit rate gives good quality and provides system designers with additional flexibility. Both rates are a mandatory part of the encoder and decoder, and it is possible to switch between the two rates at any 30 ms frame boundary. An option for variable rate operation using discontinuous transmission and noise fill during non-speech intervals is also possible. The coder is based on the principles of linear prediction analysis-by-synthesis coding and attempts to minimize a perceptually weighted error signal. The excitation signal for the high rate coder is Multipulse Maximum Likelihood Quantization (MP-MLQ), and for the low rate coder it is Algebraic Code-Excited Linear Prediction (ACELP). The frame size is 30 ms and there is an additional look-ahead of 7.5 ms, resulting in a total algorithmic delay of 37.5 ms. All additional delays in this coder are due to processing delays of the implementation, transmission delays in the communication link, and buffering delays of the multiplexing protocol. The block diagram of the encoder is shown in Figure 3.
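The delay and payload figures quoted above follow from simple arithmetic, sketched here (the helper names are ours): the algorithmic delay is the 30 ms frame plus the 7.5 ms look-ahead, and each frame carries rate × 30 ms bits.

```python
def g723_algorithmic_delay_ms(frame_ms=30.0, lookahead_ms=7.5):
    """Total algorithmic delay = frame buffering + look-ahead (37.5 ms)."""
    return frame_ms + lookahead_ms

def g723_frame_bits(rate_kbps, frame_ms=30.0):
    """Bits produced per frame: kb/s times ms gives bits directly."""
    return rate_kbps * frame_ms
```

At 6.3 kb/s each 30 ms frame carries 189 bits; at 5.3 kb/s, 159 bits. Rate switching at frame boundaries therefore changes the frame payload but not the 37.5 ms algorithmic delay.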
Figure 3: Block diagram of the ITU-T G.723 speech coder
3. Implementation aspects
A key requirement for most speech coders is that their computational requirements fall well within the range of modern digital signal processing (DSP) chips, so that the coders can be implemented both inexpensively and efficiently. Lists of computational requirements, measured in millions of instructions per second (MIPS), for telephone bandwidth coders, wideband speech coders, and audio coders are presented in [17,18]. As for the multimedia N-ISDN videoconferencing applications, the complexities of the ITU-T G.728 and G.722 standard algorithms are about 19 MIPS and 9 MIPS, respectively. Based on the computational capacities of modern signal processors, as presented in Table 1, it can be concluded that the realization of the mentioned standard algorithms on a single digital signal processor is possible.

Table 1: Basic characteristics of modern DSPs

DSP        Frequency (MHz)  Word length (bits)  Point     MIPS (max)
TMS320C25  50               16                  fixed     12.5
TMS320C30  33/40            32                  floating  20
TMS320C31  27/33/40         32                  floating  20
TMS320C40  40/50            40                  floating  25
DSP32C     50               32                  floating  12.5
ADSP2100   40               16                  fixed     10
ADSP2101   12.5             16                  fixed     12.5
ADSP21020  20/25            40                  floating  20
DSP56001   20/27/33         24                  fixed     16.5
DSP56156   40/60            16                  fixed     30
DSP96002   33/40            32                  floating  20
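Pairing the complexities quoted above (about 19 MIPS for G.728 and 9 MIPS for G.722) with the MIPS column of Table 1 gives a crude feasibility check. The helper below is our own simplification: it compares raw MIPS budgets only, ignoring memory, I/O, and instruction-set fit.

```python
def fits_on_dsp(algorithm_mips, dsp_mips_max, full_duplex=False):
    """Does the DSP's peak MIPS budget cover the algorithm?  A
    full-duplex codec (encoder + decoder) roughly doubles the need."""
    needed = algorithm_mips * (2 if full_duplex else 1)
    return needed <= dsp_mips_max
```

By this rough measure, full-duplex G.722 (2 × 9 = 18 MIPS) fits on a DSP56156 (30 MIPS), consistent with the single-chip implementation cited below, while G.728 (about 19 MIPS one way) exceeds a single DSP32C (12.5 MIPS), consistent with the two-chip implementation in [6].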
There are examples of real-time implementations of the considered standard compression algorithms. A full-duplex real-time implementation of the G.722 algorithm on a single Motorola DSP56156 is shown in [19]. On the other
side, as shown in [6], the complete G.728 algorithm is implemented on two AT&T DSP32C signal processors (one for the coder and another for the decoder), while in [20], a full-duplex implementation of the G.728 algorithm, except for the adaptive postfilter in the decoder, is realized on a single TI TMS320C31 signal processor. Besides, there are examples of software implementations of the full-duplex G.728 standard algorithm on a single ADSP21020 signal processor [21]. As for the very low bit rate multimedia applications, the ITU-T G.723 coder was optimized to represent speech with high quality at the above rates using a limited amount of complexity. Fixed and floating point C code realizations of this coder are specified in [3] (the floating point version is specified in Annex B to Rec. G.723) and are available from the ITU-T. Having in mind that the G.723 coder belongs to the class of analysis-by-synthesis coding algorithms, its computational requirements are approximately similar to, or less than, the requirements of the G.728 standard algorithm.

4. Conclusion
This paper is dedicated to the consideration of the speech compression standard algorithms for use in multimedia videoconferencing for N-ISDN, as well as for PSTN and wireless multimedia communication systems (as parts of the H.320 and H.324 families of standards). The basic characteristics of the ITU-T G.728, G.722, and G.723 speech coding standards, and the possibilities of real-time implementations of these algorithms on a hardware module with a single modern digital signal processor, are elaborated in the paper. Based on the computational capacities of modern signal processors and some examples of real-time implementations, it can be concluded that the full-duplex realization of the mentioned standard algorithms on a single digital signal processor is possible.

Acknowledgements: This paper was supported by the Ministry of Sciences and Technology of the Republic of Serbia, projects No. 04M02 and Telecommunications No.
448, through Institute of Mathematics SANU Belgrade amd Faculty of Electrical Engineering Belgrade, respectively. References [11 M.Liou, 'Overview of the px64 kbit/s Video Coding Standard," Communicationsof the ACM, Vol. 34, No. 4, Apr. 1991. [21 J.Dampz, R.Klotsche, and M.Weiss, 'Multimedia Terminals: Advantages, Technology, Networking," Electronic Communications, Alcatel, 4th Quart. 1993, pp. 387-393. [31 ITU-T, 'Draft Recommendation G.723 - Dual Rate Speech Coder for Multimedia Communications Transmitting at 5.3 & 6.3 kbit/s, October 17, 1995. [41 CcITr, 'Recommendation G.711 - Pulse Code Modulation (PCM) of Voice Frequencies," CC17T Red Book, vol. III, fasc. 111.3,VllIth Plenary Assembly, Malaga,-Torremolinos, Spain, Oct. 8-19, 1984. [51 CCITT, "Annex 1 to Report of Working Party XVIII/2 - Report of the Work of the Ad Hoc Group on 32-kb/s ADPCM," COMM, XVIII-R 28-E, 1984. [61 J-H.Chen, 1LV. Cox, Y.-C.Lin, N.Jayant, M.J.Melchner, '~k low delay CELP coder for the CCITr 16 kb/s SlXxx:hcoding standard," 1EEEJ. Sel. Areas Commun., vol. 10, pp. 830-849, June 1992. [71 A.Gersho, "Advances in Speech and Audio Compression," Proceed. of the 1EEE, No. 6, June 1994, pp.900-918. ISl S.Hayashi, M.Taka, 'Standardization activities on 8 kbit/s speech coding in CCITI' SGXV," in Proc. IEEE Int. Conf. on Wireless Communications, Vancouver, Canada, June 1992, pp. 188-191. [91 R.Salami, C.Laflamme, J-P.Adoul, A.Kataoka, S.Hayashi, C.Lamblin, D.Massaloux, S.Proust, P.Kroon, Y.Shoham, 'Description of the ITU-T 8 kb/s Speech Coding Algorithm," in Proc. 1EEE Workshop on Speech Coding for Telecomunications, Annapolis, USA, 1995. [101 CCITt, 'Recommendation G.722; 7 kHz audio coding within 64 kbit/s," SG XVIII Rep. R26 (C), Aug. 1986. [1 l lX.Maitre, '7 kHz Audio Coding Within 64 kbit/s," 1EEE Journal on Selected Areas in Commun., Vol. 6, No. 2, February 1988, pp. 283-298. 
[12] P. Mermelstein, "G.722, a new CCITT coding standard for digital transmission of wideband audio signals," IEEE Communications Magazine, Vol. 26, No. 1, pp. 8-15, Jan. 1988.
[13] M. Taka, S. Shimada, T. Aoyama, "Multimedia Multipoint Teleconference System Using the 7 kHz Audio Coding Standard at 64 kbit/s," IEEE Journal on Selected Areas in Commun., Vol. 6, No. 2, February 1988, pp. 299-306.
[14] A. Fuldseth, E. Harborg, F.T. Johansen, J.E. Knudsen, "Wideband speech coding at 16 kbit/s for a videophone application," Speech Commun., Vol. 11, No. 2-3, June 1992, pp. 139-148.
[15] C. Laflamme, J.-P. Adoul, R. Salami, S. Morissette, P. Mabileau, "16 kbps wideband speech coding technique based on algebraic CELP," in Proc. IEEE ICASSP '91, Toronto, Ontario, Canada, 1991, pp. 13-16.
[16] E. Ordentlich, Y. Shoham, "Low-delay code-excited linear predictive coding of wideband speech at 32 kbps," in Proc. IEEE ICASSP '91, Toronto, Canada, 1991, pp. 9-12.
[17] L.R. Rabiner, "Applications of Voice Processing to Telecommunications," tutorial at ISRC of University of Brisbane, October 1994.
[18] A. Spanias, "Speech Coding: A Tutorial Review," Proceedings of the IEEE, Vol. 82, No. 10, October 1994, pp. 1539-1580.
[19] P. Atherton, "G.722 Audio Processing on the DSP56100 Microprocessor Family," Application Note APR404/D, Motorola Ltd., 1992.
[20] J. Hongbing, W. Yue, T. Kun, F. Chongxi, "Hardware implementation of CCITT G.728," in Proc. of ICCT '94, Shanghai, China.
[21] IEEE Signal Processing Magazine, October 1994.
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K.
B.G. Mertzios and P. Liatsis (Editors)
© 1996 Elsevier Science B.V. All rights reserved.
Multimedia Communication Graphical User Interface Design Principles for Teleeducation

J. Turan, K. Fazekas*, L. Ovsenik, M. Kovesi
Department of Radioelectronics, Technical University of Kosice, Park Komenskeho 13, 04021 Kosice, Slovakia
Tel./Fax: +42 95 6335692, E-mail: [email protected]
*Department of Microwave Telecommunications, Technical University of Budapest, Goldmann Ter 3, 1111 Budapest, Hungary
Tel./Fax: +36 12 043289, E-mail: [email protected]
Abstract
In this paper a systematic approach to multimedia communication graphical user interface design is evaluated. The proposed application is in the field of radioengineering teleeducation. The design starts with a standard task analysis using task knowledge structures as the basis of the task model. Then a description of the media resources available to the system and their access is evaluated. Finally, the task information is elaborated by attaching dialogue acts to specify the desired communicative effects for each task step. This method could serve as a framework for developing a software tool.
1. Introduction
Multimedia graphical user interfaces are currently created by intuition. They are usually designed and developed without exact analysis of multimedia information presentation. Most present-day multimedia applications showcase the possibilities of the technical means but do not respect a user-centred approach. Multimedia interfaces developed in this way cannot achieve the maximum effect [1,2,3]. In this paper a systematic approach to multimedia communication graphical user interface design is evaluated. The proposed application is in the field of radioengineering teleeducation. The design starts with a standard task analysis using task knowledge structures as the basis of the task model. Then a description of the media resources available to the system and their access is evaluated. Finally, the task information is elaborated by attaching dialogue acts to specify the desired communicative effects for each task step. This method could serve as a framework for developing a software tool.
2. Definition of multimedia and basic ideas of multimedia interface design
Exact definitions of human-computer interface terminology and user-centred design are very important for understanding multimedia design problems. We must view multimedia interfaces from an appropriate perspective. Multimedia is one of the most innovative ways of using a telecommunications network to achieve effective communication between people and for access to information. As pointed out in [3,4,5,6], the multimedia approach can be viewed either from a technological perspective or from a user-centred perspective. The technological perspective is defined through lists of technical characteristics of systems claiming to be multimedia systems, such as multidimensional presentation techniques, multimodal interaction or hypermedia techniques. The user-centred perspective focuses on the possibilities offered by the technology. A user-centred definition characterises multimedia systems as systems enabling the usage of multiple sensory modalities and multiple channels of the same or different modality, enabling the user to perform several tasks at the same time [3,7]. Understanding these two definitions is important for addressing the key question of multimedia interface design: when to use which media, and in what combination, to achieve the maximum effect.
3. Multimedia characteristics from technological and user perspectives
Multimedia communications can be defined from the technological perspective as a combination of four key ingredients:
a./ two or more of the five media of communication (audio, data, fax, image and video)
b./ interactive capabilities between the communicating parties
c./ communications with human users
d./ synchronisation
The types of interaction in multimedia are:
a./ search and browse
b./ interactive buttons
c./ windowing
d./ fast forward, rewind, pause and skip
e./ conversational
Operating systems such as Windows or OS/2 offer these types of interaction and enable their use for each medium.
There are three types of interactive communications, corresponding to the CCITT groups of interactive communications:
a./ conversational
b./ retrieval
c./ messaging
To be successful, different media must be combined and coordinated in a natural way, otherwise there will be a risk of information overload. In human-computer interaction we use three sense channels (visual, auditory and tactile), often in combination, and there is only a limited amount of research available to guide us in media combination. For example, it is known that there is some interaction between voice output and concurrent visual tracking, and that voice response places a greater load on working memory than manual responses. Multimedia is relevant to a wide spectrum of applications, ranging from computer-integrated telephony and texts with voice annotation to cooperative teleworking on documents (including video). In the discussion here, multimedia applications will be confined to broadband multimedia communication with the inclusion of at least one of the following information types:
- high-speed data
- still images and documents with high resolution (or browsing capabilities)
- moving pictures (video, animated graphics).
4. Requirements at the communications and network level
There are several alternatives for classifying broadband multimedia communications: according to information types, communication types (dialogue, messaging, retrieval, distribution), organisation (tree-structured media, hypermedia), functions (e.g., interactive, link navigation, cooperative teleworking), and other criteria. One suitable alternative classifies the utilisation of information types with respect to the user or terminal. This is particularly advantageous for the definition of telecom services to effectively support multimedia communication. The targets of future multimedia communication are: to efficiently support manifold applications; to provide worldwide communication capabilities to as large a user community as possible; unproblematic and cost-effective operation and utilisation on the basis of standardised components with high production volumes; and simple interworking [1,4,5]. There is, at present, no all-encompassing model for distributed applications and the handling of multimedia information within a heterogeneous broadband network environment. Multimedia application and communication models are under discussion at a number of organisations such as BERKOM (Berlin Communication System), RACE (Research and Development in Advanced Communications Technologies in Europe), and at international standardisation bodies for communication and information technology (CCITT, ISO, etc.). This approach aims at fully supporting the increasing variety of applications by conceiving suitable, reusable technical building blocks.
Fig.1. Co-operative modern teleeducation
From the network's point of view, the most important requirements of multimedia communications are:
- high speed and changing bit rates
- several virtual connections over the same access (including flexible/dynamic channel allocation)
- synchronisation of different information types
- suitable standardised services and supplementary services supporting multimedia applications (comprising connection-oriented and connectionless bearer services and teleservices)
- various service qualities
- adaptability to evolving multimedia needs and progress in information processing, storage, and presentation.
A modern teleeducation system must be thought of in terms of a networked organisation (Fig.1). The objective of cooperative teleworking among students and teachers (with simultaneous use of databases by a learner) is the provision of some degree of "telepresence" for geographically distributed persons and teaching materials, in a quality comparable to that of a real-world lecture (conference). Cooperative teleworking enables a group of distant participants to jointly view, discuss, and edit multimedia documents while at the same time using communication and computing resources. This can be considered an extension of conventional audio/video conferencing, with shared access and collaborative work assistance. A desktop multimedia workstation allows the student to create, retrieve, and manipulate multimedia documents and to activate a "hotline" to a teacher (central specialist). Cooperative teleworking represents a case of complex and dynamic communication which encompasses a number of participants, connections, information types, systems, and functions [8,9].
[Fig.2 diagram: recoverable labels include Domain Description, Users, Task Analysis, Task Model, Available Media, Task-Information Analysis, Task-Information Sequence, Media Selection, Presentation Script, Final Presentation Script, Implementation, Source, Destination]
Fig.2. Systematic method for GUI design

The agenda of issues which a method must address was, first, the creation of a task model incorporating specification of information requirements and presentational effects, accompanied by a resources model describing the information media available to the designer. The method should advise on selecting appropriate media for the information needs and on scripting a coherent presentation for a task context. The design must assist in directing the user's attention to extracting the required information from a given presentation and focusing on the correct level of detail. In addition, the design method should guide the designer through the cognitive issues underlying a multimedia presentation, such as selective attention, persistence of information, concurrency, and limited cognitive resources such as working memory. Fig.2 gives an overview of a systematic method for teleeducation graphical user interface design based on the method's task components. It is based on the following components: model, information flow, process, source, destination [10]. An example of a GUI specification is given in Tab. 1.
Tab. 1. Example of GUI Specification (Teleeducation, Computer Aided Learning).

Student interface:
1. Program out
   - login
   - password
   - general program information
2. Courses selection
   - courses menu
   - alphabetical ordering
3. Courses structure
   - theory
   - examples
   - tests
   - consultation
4. Course flow control (examples and tests control)
   - panel control window
   - manual and interactive (scroll, animation and video control, etc.)
   - automatic (programmed control of multimedia documents)
5. Video and audio conferences
   - interactive consultation with teacher
   - time negotiation
   - conference mode selection (audio, video, whiteboarding)
   - E-mail messages
6. Other features
   - multimedia documents utilisation statistics (date and time used, control questions and test results, etc.)
   - courses total time limitation
   - courses price calculation
   - courses level selection (beginners, experts, etc.)

Teacher interface:
1. Program out
   - login
   - password
   - general program information
2. Courses selection
   - courses menu
   - alphabetical ordering
3. Courses structure
   - theory
   - examples
   - tests
   - consultations
4. Multimedia documents creation
   - courses structure specification
   - timeschedule specification
5. Video and audio conferences
   - interactive consultation
   - time reservation confirmation
   - multimedia documents presentation from courses database with video or audio commentary
   - conference mode selection (audio, video, whiteboarding)
   - E-mail messages
6. Courses statistics and economics
5. Conclusions
This method proved useful as a means of exploring the issues involved in multimedia teleeducation graphical user interface design (Tab.1). The diagram (Fig.2) provides a tool for thinking about presentation issues concerning what information is required and when. The proposed method was implemented on a "big" Pentium-based PC multimedia platform for an interactive multimedia course about the rapid transform and its applications in pattern recognition and DSP [10].
References
[1] Alty, J.L., Bergan, M.: The design of multimedia interfaces for process control. Proc. 5th IFIP/IFAC/IFORS/IEA Conf. on Man-Machine Systems, 1992, 249-255.
[2] Alty, J.L., McCartney, C.D.: Design of a Multi-Media Presentation System for a Process Control Environment. In: Proc. Eurographics Workshop, Stockholm, 1991.
[3] Wright, D.J.: Broadband: Business Services, Technologies, and Strategic Impact. Boston/London: Artech House, 1993.
[4] Barbosa, L.O., Georganas, N.D.: Multimedia Services and Applications. Europ. Trans. Telecommun., Vol. 2, No. 1, 1991, 5-19.
[5] Armbruster, H., Wimmer, K.: Broadband Multimedia Applications Using ATM Networks: High-Performance Computing, High-Capacity Storage, and High-Speed Communication. IEEE JSAC, Vol. 10, No. 9, 1992, 1382-1396.
[6] Rosenberg, J. et al.: Multimedia Communications for Users. IEEE Commun. Mag., May 1992, 20-36.
[7] Maybury, M.: Planning Multimedia Explanations Using Communicative Acts. In: Proc. 9th National Conf. on Artificial Intelligence, 1991, 61-66.
[8] Faraday, P.M., Sutcliffe, A.G.: A Method for Multimedia Interface Design. In: J.L. Alty, D. Diaper (eds), People & Computers VIII, Cambridge Univ. Press, 1993.
[9] Sutcliffe, A., Faraday, P.: Systematic Design for Task Related Multimedia Interfaces. Information and Software Technology, Vol. 36, No. 4, 1994, 225-234.
[10] Turan, J., Kovesi, L., Kovesi, M.: CAD System for Pattern Recognition and DSP with Use of Fast Transformation Invariant Transforms. Journal on Communications, Vol. XLV, 1994, 85-89.
IMAGE AND VIDEO COMPRESSION FOR MULTIMEDIA APPLICATIONS

D.G. SAMPSON, Democritus University of Thrace, GREECE
E.A.B. da SILVA, Federal University of Rio de Janeiro, BRAZIL
M. GHANBARI, University of Essex, ENGLAND
Abstract
In this paper we discuss a coding method that is suitable for multimedia applications. This method is based on the efficient coding of image wavelet coefficients using zerotree multi-stage lattice vector quantization. We refer to this method as Successive Approximation Wavelet Vector Quantization (SA-W-VQ). The basic idea in SA-W-VQ is that the original blocks of wavelet coefficients are successively refined using vectors of progressively decreasing magnitude and a finite set of prototype orientations. Block zero-tree prediction and adaptive arithmetic coding are incorporated to improve the efficiency of the codec. It is shown that this coding scheme achieves high compression ratios with good picture quality while maintaining a very simple implementation. Simulation results are provided to evaluate the coding performance of the described coding scheme for still image and low bit rate video coding. Comparison with both the standard JPEG coder and the RM8 implementation of the standard H.261 video codec shows that the presented codec provides improvements in both peak signal-to-noise ratio and picture quality.

1. Introduction
Multimedia communications is the field referring to the representation, storage, retrieval, dissemination of, and collaborative work with, electronic documents composed of multiple "media" such as text, voice, graphics, images, audio and video [1]. Multimedia applications involve vast amounts of visual data in the form of libraries of video sequences and catalogues of high quality still images. This points to the need for efficient compression techniques which reduce the amount of data stored or transmitted. However, apart from the requirement for efficient compression, there are other important features which an image/video compression algorithm needs to incorporate in order to meet the design considerations of multimedia applications [2]. Such are:
- browsing capabilities
- progressive transmission
- efficient decoding at various levels of quality
The wavelet transform has recently emerged as a promising tool for representing images and video [3]. The multiresolution/multifrequency feature of the wavelet transform is important in representing images/video at progressively increasing spatial and/or temporal resolutions. Also, wavelet based image decomposition offers the advantage of excellent energy compaction characteristics (ideal for compression) [4]. We have developed a novel compression scheme based on wavelet transform decomposition and embedded lattice vector quantization, referred to as successive approximation wavelet lattice vector quantization (SA-W-LVQ) [5-7]. The basic idea in SA-W-LVQ is that the original blocks of wavelet coefficients are successively refined using vectors of progressively decreasing magnitude and a finite set of prototype orientations. Block zero-tree prediction and adaptive arithmetic coding are incorporated to improve the efficiency of the codec.
The main characteristics of this scheme are:
- a single compressed bitstream which can provide video/images at multiple resolutions, each available at a range of bitrates
- information is packed in the bitstream in a way which allows the most important image data to always be coded with priority.
2. Successive Approximation Wavelet Lattice Vector Quantization
The basic principles of the coding algorithm are summarised here; a more detailed description can be found in [5]. According to this algorithm, the mean value of the input image is extracted and an M-stage wavelet transform is employed. Each sub-image Bi, where B can be horizontally (H), vertically (V) or diagonally (D) oriented and i = 1, 2, ..., M, is then partitioned into m x n blocks of wavelet coefficients. In order to form the input vectors, a different scanning is used according to the orientation of the band. After scanning, the algorithm proceeds as follows. The initial value of the magnitude threshold is set to T1 = a * ||X||max, where a is selected according to the theta_max of the orientation lattice codebook and ||X||max denotes the maximum magnitude of the wavelet coefficient vectors [5]. All vectors in every band are scanned, and the ones with magnitude less than T1 are marked as zero (zero blocks). The rest (non-zero or coded blocks) are represented by their closest orientation codevector y scaled with T1, that is, Q1(x) = T1 * y. The location of the zero vectors is transmitted using 3 symbols: zero block (Z), zero-tree root (ZT) and coded block (C). Block zerotree roots exploit the similarities among the bands of the same orientation by producing a single symbol to indicate that a block of wavelet coefficients and all its corresponding ones in the higher bands of the same orientation are zero. The arithmetic coder with an adaptive model described in [8] is used to code the string generated by the three symbols (ZT, Z and C). The orientation codevectors for each coded (C) block are also encoded with an arithmetic coder. The magnitude threshold is then updated by multiplying it by a. The non-zero blocks are further refined by coding the residual error between the original and the reconstructed C blocks with their closest orientation codevector and the new threshold. The indices of the new orientation codevectors are encoded into the bitstream via the arithmetic coder. In the next pass, all zero blocks are scanned again and their magnitudes are compared against the new threshold. A new string of the three symbols is encoded in the bitstream to provide information about the location and the status of the blocks at this stage. As in the previous pass, the indices of the C vectors are coded and the entire process is repeated until a certain bit rate is achieved. The wavelet coefficient vectors are scanned according to their reconstructed values, the higher energies first, as in [4]. This guarantees that the most important information is always coded first, which is very desirable in video coding.
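The pass structure described above can be sketched in a few lines of Python. This is an illustrative toy, not the authors' implementation: the orientation codebook and the reduction factor a are placeholders, and the zero-tree (ZT) symbol is omitted for brevity, so each pass emits only Z and C symbols.

```python
import numpy as np

def sa_passes(blocks, codebook, a, n_passes):
    """Toy successive-approximation refinement of wavelet coefficient vectors.

    blocks   : (K, N) array, one N-dimensional coefficient vector per block
    codebook : (C, N) array of unit-norm orientation codevectors
    a        : threshold reduction factor, 0 < a < 1
    """
    residual = blocks.astype(float).copy()
    recon = np.zeros_like(residual)
    T = a * np.max(np.linalg.norm(residual, axis=1))   # T1 = a * ||X||max
    symbols = []                                       # per-pass Z/C strings
    for _ in range(n_passes):
        mags = np.linalg.norm(residual, axis=1)
        coded = mags >= T                              # C blocks; the rest are Z
        symbols.append(['C' if c else 'Z' for c in coded])
        # closest orientation codevector for each coded block, scaled by T
        idx = np.argmax(residual[coded] @ codebook.T, axis=1)
        update = T * codebook[idx]
        recon[coded] += update                         # refine reconstruction
        residual[coded] -= update                      # shrink residual error
        T *= a                                         # tighten the threshold
    return recon, symbols
```

In the real codec the Z/C (and ZT) strings and the codevector indices would be fed through the adaptive arithmetic coder, and the loop would stop when the bit budget is exhausted rather than after a fixed number of passes.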
Indeed, an important advantage of the proposed compression scheme for video coding applications is that a constant bit rate can be achieved by allocating a fixed number of bits to each frame. This eliminates the need for a buffer to smooth out bit rate variations. Coding the most important coefficients (in terms of energy) first guarantees that the bit rate budget will be efficiently used for coding those image data that would result in maximum distortion. Successive refinement of the wavelet coefficients offers control over the maximum level of quantization error made in each coefficient. This can be important for reducing unpleasant artefacts in picture quality, since the error introduced by a poorly quantized single wavelet coefficient can be spread over an area of the reconstructed image [5]. Furthermore, it provides the means to guarantee that an arbitrary level of average distortion for each band is met. This could be convenient for performing bit allocation among the wavelet coefficients taking into consideration the human visual system (HVS) sensitivity to each frequency band [7]. The advantages of this method over conventional techniques and the existing standards are:
- higher compression rates can be achieved
- better picture quality is obtained at very low bitrates
- browsing capabilities and progressive transmission are easily achieved
- fast and efficient decoding at various levels of quality.
3. Simulation results
The SA-W-LVQ coding algorithm has been applied to:
- compression of grey scale and colour still images [5]
- perceptually transparent coding of super high definition images [7]
- low bitrate coding of image sequences [6]
Table 1 summarises the Peak Signal-to-Noise Ratio (PSNR) results obtained by these methods for coding the test image LENA at resolutions 256x256x8 and 512x512x8, and a set of five ISO/CCITT test images at resolution 720x576x8, at a bit rate of 0.4 bit/pixel. In this table, the orientation codebooks are built from the first spherical shell of the regular lattices D4, E8 and L16. In general, although there are no substantial differences in the performance of the three codebooks, higher dimensional codebooks result in better PSNR values, and the best performance is always achieved by using the L16-based codebook. These simulation results also demonstrate that the proposed coding scheme achieves considerably better R-D performance as compared with the JPEG coder. The improvement in PSNR over the JPEG-coded images is consistently around 2.50 dB for L16 and 1.50 dB for the D4 codebook. Finally, comparisons with the EZW image codec [4], which is a very efficient scheme employing zerotree scalar quantization of wavelet coefficients, are in favour of the zerotree lattice vector quantiser.
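The PSNR figures quoted here and in the tables follow the usual definition for 8-bit imagery; as a reminder:

```python
import numpy as np

def psnr(original, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images."""
    err = original.astype(float) - reconstructed.astype(float)
    mse = np.mean(err ** 2)                 # mean squared error
    return 10.0 * np.log10(peak ** 2 / mse)
```

A uniform error of one grey level, for example, corresponds to about 48.1 dB.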
Test Image    D4      E8      L16     EZW     JPEG
BARBARA       29.36   30.60   30.90   29.03   27.27
BOATS         34.19   34.78   35.24   34.29   32.63
GIRL          35.27   35.91   36.12   35.14   33.98
GOLD          31.01   32.76   32.61   32.48   31.38
ZELDA         38.43   39.36   39.44   39.08   37.16
LENA 256      30.13   30.15   30.29   30.06   28.07
LENA 512      35.17   35.86   36.09   35.02   33.42

Table 1: PSNR performance of SA-W-VQ for several test images at 0.4 bpp compared with EZW and JPEG

Test Sequence   Average PSNR (dB)   Coding Scheme
Miss America    36.64               H.261-RM8
Miss America    39.59               OBM-SAWLVQ
Claire          31.90               H.261-RM8
Claire          34.05               OBM-SAWLVQ
Salesman        40.33               H.261-RM8
Salesman        41.98               OBM-SAWLVQ

Table 2: Average luminance PSNR performance comparison for the test image sequences, CIF 10 Hz, at 64 kbit/sec.

The performance of the SA-W-LVQ coding algorithm for low bitrate video coding applications in the range of px64 kbit/sec is evaluated and compared to the RM8 implementation of the standard H.261 video codec using a set of test CIF image sequences. In these experiments, overlapped block matching (OBM) motion compensation is employed [6]. Table 2 summarises the average PSNR obtained by the
OBM-SAWLVQ over the RM8 simulations. This improvement is also reflected in the picture quality of the reconstructed frames, which are free of the annoying blocking effects of the H.261 coded images. The efficiency of the OBM motion compensation for the SA-W-LVQ video codec is demonstrated in Figure 1.
Figure 1. Performance comparison between Overlapped and Conventional Block Matching: (a) MISS AMERICA, CIF 10 Hz, 64 kbit/sec; (b) CLAIRE, CIF 10 Hz, 64 kbit/sec

4. Conclusions
We have discussed an efficient method for image and video data compression suitable for multimedia applications. In this technique, referred to as Successive Approximation Wavelet Vector Quantization (SA-W-VQ), the most important vectors of wavelet coefficients are successively coded by a series of vectors of decreasing magnitudes. Moreover, the structural similarities among the bands of the same orientation are exploited by incorporating a block zero-tree structure. In image sequence compression, the overlapped block matching motion compensation (OBM-MC) significantly increased the efficiency of the wavelet transform coder by eliminating the blocking artefacts in the prediction error image introduced by conventional block matching. The OBM-SAWVQ video codec offers a constant bit rate, with no need for a buffer; yet, remarkably, the PSNR fluctuations from frame to frame are reasonably small for the image sequences that have been tested. This is due to the fact that the SAWVQ always codes the most important image data first. Moreover, there is small quantization error accumulation as the image sequence advances to higher order frames. Simulation results illustrate that OBM-SAWVQ achieves promising performance at 64 kbit/sec.

5. References
[1] B. Furht, S. Smoliar and H. Zhang, Video and Image Processing in Multimedia Systems, Kluwer Academic Publishers, 1995.
[2] J.D. Gibson, T. Berger, R. Baker and T. Lookabaugh, Multimedia Compression: Applications and Standards, Morgan Kaufmann Publishers, 1996.
[3] M. Vetterli and J. Kovacevic, Wavelets and Subband Coding, Prentice Hall, 1995.
[4] J. Shapiro, "Embedded image coding using zerotrees of wavelet coefficients," IEEE Transactions on Signal Processing, vol. 41, pp. 3445-3462, December 1993.
[5] E.A.B. da Silva, D.G. Sampson, and M.
Ghanbari, "A successive approximation vector quantizer for wavelet transform image coding," IEEE Transactions on Image Processing, Special Issue on Vector Quantization, vol. 5, no. 2, pp. 299-310, February 1996.
[6] D.G. Sampson, E.A.B. da Silva and M. Ghanbari,
A Multilayer Image Coding and Browsing System

Guoping Qiu
University of Derby, School of Computing & Mathematics, Derby DE22 1GB, United Kingdom
email: [email protected]
Abstract: In this paper, a multilayer image coding and browsing system is described. The implementation is based on the well-known Laplacian pyramid image data structure and the JPEG image compression standard. Different coding strategies are used at different levels (layers) to achieve efficient bandwidth reduction. A classified vector quantization scheme suitable for coding Laplacian residual image data is also described. The emphasis of this paper is on the implementation strategy rather than on the implementation details.
I Introduction
We describe a layer-structured image coding and browsing system which forms part of a universal image coding and browsing system that tries to meet the various demands for visual image storage, communication, and management. Some of the goals that the system tries to achieve include:
Flexibility: the system will be capable of achieving different compression ratios according to the requirements of the users, ranging from lossy to loss-less compression.
Efficiency: the system will use the most appropriate technologies to implement the coding operations according to the nature (statistics) of the images, in order to achieve the best compression performance.
Scalability: the system will be able to compress a given image to different dimensional sizes according to the requirements of the applications.
Progressive Transmission: the system will be able to perform progressive transmission, allowing image quality levels from coarse to fine to be built up progressively.
Fast and Flexible Browsing: the system will enable fast browsing of compressed images at different dimensional scales and quality levels.
In this paper, we will report the design and implementation of some aspects of the system. In particular, a three-layer subsystem as shown in Fig. 1 will be described.
[Fig. 1 diagram: the Original Input Image passes through REDUCE stages and ENCODE/DECODE blocks over a channel, producing First-layer, Second-layer and Third-layer Output Images]
Fig. 1 A three-layer image coding/browsing system
The system uses a pyramidal data structure. The first layer is designed to compress/browse the image at its original dimensional size. The quality level of this layer can be varied, depending on the coding strategy adopted in Q and Q^-1 and on the users' needs. In the second layer, the dimensional size of the image is reduced by a factor of 2 in both of its co-ordinates. Again, the compression/browsing quality levels of this layer depend on the coding and decoding strategies Q and Q^-1 implemented in this layer. In the third layer, the dimensional size of the image is further reduced by a factor of 2 in both of its co-ordinates. At this layer, baseline JPEG [1] is used for compression. This system is capable of compressing/browsing images at three different dimensional sizes, each of which can have varied quality levels, depending on the bit rate and quality level requirements. In the following, we shall briefly describe each block of the system.
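The factor-of-2 reduction between layers can be realised with a filter-and-subsample REDUCE and a matching EXPAND, as in Burt and Adelson's pyramid [2]. A minimal NumPy sketch follows; the 5-tap generating kernel is the standard one from [2], while the edge-replication border handling is an illustrative assumption:

```python
import numpy as np

W = np.array([1, 4, 6, 4, 1], dtype=float) / 16.0   # Burt-Adelson kernel

def _filt(img, kernel):
    """Separable low-pass filtering with edge replication at the borders."""
    pad = len(kernel) // 2
    out = np.pad(img.astype(float), pad, mode='edge')
    out = np.apply_along_axis(lambda r: np.convolve(r, kernel, 'same'), 1, out)
    out = np.apply_along_axis(lambda c: np.convolve(c, kernel, 'same'), 0, out)
    return out[pad:-pad, pad:-pad]

def reduce_(img):
    """REDUCE: low-pass filter, then subsample by 2 in each dimension."""
    return _filt(img, W)[::2, ::2]

def expand(img):
    """EXPAND: upsample by 2, then interpolate with the (rescaled) kernel."""
    up = np.zeros((2 * img.shape[0], 2 * img.shape[1]))
    up[::2, ::2] = img
    return _filt(up, 2.0 * W)   # kernel doubled per axis to restore the gain

# A Laplacian residual for one layer is then img - expand(reduce_(img)).
```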
II Implementation Strategies
A well known and efficient data structure suitable for image data compression is the Laplacian pyramid proposed by Burt and Adelson [2]. There are two fundamental functions in this structure, REDUCE and EXPAND. These two functions are used to generate the Gaussian and Laplacian pyramids and are described in detail by Burt and Adelson [2]. The JPEG standard provides a hierarchical framework for encoding and decoding images at different spatial resolutions. In hierarchical mode, the JPEG standard proposes a pyramidal structure to implement this option. However, the standard does not specify how the pyramid is constructed; Burt and Adelson's Laplacian pyramid structure is a good candidate for this application. The Laplacian images contain mainly sparse, high frequency data. The pixel values in the flat areas will be very small; only in the edge areas will the pixel values be relatively high. Baseline JPEG is not an efficient technique to compress these images. Alternative coding strategies, such as the one described below, are more suitable. To implement the coding strategies for Q and Q^-1 in Figure 1, different encoding and decoding algorithms can be used to compress the Laplacian images. For loss-less
compression, DPCM-based techniques can be used to implement the encoding and decoding operations. For lossy compression, a classified vector quantisation technique is implemented. In this scheme, the differential image is divided into blocks of N = m x n pixels, each forming an N-dimensional vector. Let X(k) = (x_1(k), x_2(k), ..., x_N(k)) be the vector from the k-th block. This vector is classified into one of several classes according to the following procedure. (1) Calculate the following quantities for all training images:
e^2(k) = (1/N) * sum_{i=1}^{N} x_i^2(k)

sigma^2(k) = (1/N) * sum_{i=1}^{N} (x_i(k) - m(k))^2, where m(k) = (1/N) * sum_{i=1}^{N} x_i(k)
(2) A K-means type algorithm is used to cluster the e^2(k) values into three classes, corresponding to small (S), medium (M) and large (L) values. The same process is performed on sigma^2(k). (3) Each image block is classified into one of the nine classes shown in Fig. 2.
e^2 \ sigma^2 |  S  |  M  |  L
      S       | SS  | SM  | SL
      M       | MS  | MM  | ML
      L       | LS  | LM  | LL
Fig. 2: Image block classes. For each class, a corresponding codebook is designed, and a different number of bits is used to code each class according to its nature. For example, class SS requires the fewest bits and class LL requires the most.
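The per-block statistics and the nine-class decision of Fig. 2 can be sketched as follows. The S/M/L boundaries are passed in as parameters here; in the scheme itself they come from the K-means clustering step.

```python
import numpy as np

def block_stats(x):
    # x = (x_1(k), ..., x_N(k)): the vector from the k-th m x n block
    N = x.size
    e2 = np.sum(x ** 2) / N           # mean energy e^2(k)
    m = np.sum(x) / N                 # block mean m(k)
    var = np.sum((x - m) ** 2) / N    # variance sigma^2(k)
    return e2, var

def _label(v, t_small, t_large):
    # S / M / L according to two class boundaries (assumed to be
    # obtained from the K-means clustering of the training data)
    return "S" if v < t_small else ("M" if v < t_large else "L")

def classify_block(x, e2_bounds, var_bounds):
    # map a block to one of the nine classes SS, SM, ..., LL of Fig. 2
    e2, var = block_stats(x)
    return _label(e2, *e2_bounds) + _label(var, *var_bounds)
```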
III. Summary
At this point, the complete implementation of the system has yet to be performed. To show the feasibility of the idea, Figure 3 shows the Lena image compressed by baseline JPEG (implemented using the Independent JPEG Group's JPEG software). Figure 4 shows the Lena image obtained by first applying REDUCE to the original image, compressing the reduced image with baseline JPEG, then decoding the compressed image and applying EXPAND to it. It is clearly seen that the visual quality of the latter is better.
IV. References
1. W. B. Pennebaker and J. L. Mitchell, JPEG Still Image Data Compression Standard, Van Nostrand Reinhold, New York, 1993.
2. P. J. Burt and E. H. Adelson, "The Laplacian pyramid as a compact image code", IEEE Trans. Comm., Vol. COM-31, pp. 532-540, 1983.
Figure 3, Baseline JPEG compression, file size = 4379 bytes
Figure 4, Layered compression, file size = 4091 bytes
Switched Segmented Image Coding - JPEG Schemes for Progressive Image Transmission C. A. Christopoulos 1, A. N. Skodras 2, W. Philips 3, J. Cornelis 4 1Ericsson Telecom AB, HF/ETX/MN, S-126 25 Stockholm, Sweden 2University of Patras, Electronics Lab., Patras 26110, Greece 3ELIS, University of Gent, Sint-Pietersnieuwstraat 41, 9000 Gent, Belgium 4Vrije Universiteit Brussel, VUB - ETRO (IRIS), Pleinlaan 2, 1050 Brussels, Belgium E-mail: [email protected]
Abstract -- This paper describes two schemes that combine JPEG and Segmented Image Coding (SIC) for progressive image transmission. Compared to JPEG-based progressive transmission schemes, the new schemes produce reconstructed images of better quality during all the stages of transmission and never transmit more bits. Also, the computational complexity of the new schemes is lower than that of the SIC-only approach.
1. Introduction
Progressive image transmission (PIT) involves the gradual improvement of the quality of an image as more information is transmitted. In recent years, PIT has been proposed as a means of providing the user as soon as possible with an interpretable image in the specific situation where he/she is interactively interrogating an image database over a low-capacity transmission channel, such as the Public Switched Telephone Network (PSTN). This allows the user to decide whether to wait for a more detailed reconstruction or to abort the transmission. Access to large image databases, such as those emerging in the medical world, will benefit from progressive coding. PIT can be achieved using JPEG [1]. However, even though the quality of the images in the last stages of the transmission (i.e., for low compression ratios) is very good, this is not the case in the first stages (i.e., for high compression ratios), because of the blocking artifacts which appear in these stages. Recently published research [2-4] indicates that Segmented Image Coding (SIC) is much better suited for PIT than JPEG in the initial stages of transmission, i.e., at high compression ratios [4]; however, in the last stages of PIT, the image quality of SIC is not better than that of JPEG. As JPEG has a much lower complexity than SIC, JPEG is therefore preferred in the last stages of transmission. We therefore propose two switching schemes that combine SIC and JPEG to achieve both a good image quality in all transmission stages and a lower computational complexity than SIC. In SIC, the image intensity f(x, y) of a region is approximated by a weighted sum of orthogonal base functions [3-6]. In this paper we use the weakly separable (WS) base functions described in [5,6], because these have much lower computational and memory requirements than the base functions traditionally used in SIC [3], while they produce images of comparable quality [5,6].
In SIC, the base functions do not have to be transmitted, because they are completely determined by the region's shape, which is known to the receiver; instead, the receiver computes the base functions from the region's shape using the same algorithm as the transmitter. As the set of base functions has to be recomputed for each region, the computational complexity of both the full-SIC coder and the full-SIC decoder is much higher than that of the JPEG coder/decoder. Note that, in contrast to JPEG, not all the coefficients have to be computed at once in SIC; instead, new base functions are calculated when the corresponding coefficients are required. In the SIC part of the new schemes, the number of coefficients N_c for a region with n_p pixels is determined as N_c = min(α * n_p, N_bmax), where 0 < α < 1 and N_bmax are user-specified parameters. This strategy assigns more coefficients to larger regions, because generally more degrees of freedom are required to represent large regions with the same accuracy as small regions. Furthermore, the strategy also limits the maximum number of base functions in a region, and therefore the computational and memory requirements.
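The coefficient-allocation rule can be written directly; this is an illustrative sketch (the function name is ours, not from the paper):

```python
def num_coefficients(n_p, alpha, nb_max):
    # N_c = min(alpha * n_p, N_bmax): proportionally more coefficients
    # for larger regions, capped to bound computation and memory
    return min(int(alpha * n_p), nb_max)
```

For example, with alpha = 0.2 a region of 100 pixels gets 20 coefficients, while a region of 1000 pixels is capped at N_bmax.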
2. A new switched SIC-JPEG scheme for PIT
This new scheme uses SIC in the first stages of transmission and then switches to JPEG. The SIC coding part consists of the following steps: (a) segment the image into a number of regions; code and transmit the contour image (and possibly the mean value of the pixels in each region); (b) calculate a few (more) base functions; (c) calculate the corresponding texture coefficients; (d) quantise, code and transmit the coefficients; (e) if extra information is required by the decoder, go to step (b); else stop the transmission.
If, in a particular stage of the transmission, the results achieved by SIC are not significantly better than those achieved by JPEG, then the switched scheme should switch from SIC mode to JPEG mode. However, detecting this would require running JPEG and SIC in parallel and would therefore be inefficient from a computational point of view. We therefore adopt the following more practical but suboptimal approach: we switch to JPEG after computing a fixed number of SIC coefficients (the precise number is determined by the parameter α, which is 0.2 in our experiments). Our experiments show that this suboptimal approach is a reasonable compromise. Finally, note that in the JPEG mode we do not code the original image with JPEG (such an approach would require more bits than a simple JPEG coder). Instead, the following approach is used: the difference between the original and the SIC reconstructed image at that stage is calculated. Then the value +128 is added to each pixel of the difference image, and the resulting values are clipped to the range [0,255]. The image is then compressed with JPEG at a compression factor such that the total number of bits (JPEG plus SIC) does not exceed the number of bits required by the JPEG-only PIT scheme at the same compression ratio. The receiver reconstructs the JPEG-compressed difference image, subtracts the value 128 from each pixel and adds the result to the SIC reconstructed image (which is of course available at the receiver).
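The level-shift-and-clip step that makes the signed difference image JPEG-compatible, and its inverse at the receiver, can be sketched as follows (the helper names are ours, and the JPEG codec itself is omitted):

```python
import numpy as np

def to_difference_image(original, sic_reconstruction):
    # shift the signed difference by +128 and clip to [0, 255] so that
    # it can be fed to a standard (unsigned 8-bit) JPEG coder
    d = original.astype(int) - sic_reconstruction.astype(int) + 128
    return np.clip(d, 0, 255).astype(np.uint8)

def add_difference_image(sic_reconstruction, decoded_difference):
    # receiver side: subtract 128 and add back to the SIC reconstruction
    d = decoded_difference.astype(int) - 128
    return np.clip(sic_reconstruction.astype(int) + d, 0, 255).astype(np.uint8)
```

Note that the clipping makes the round trip lossy only for differences outside [-128, 127], which are rare in practice.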
3. Hybrid SIC/DCT and JPEG switching for PIT
One disadvantage of using polynomials for the reconstruction of the texture in a region is that polynomials reconstruct the image very slowly, i.e., a significant number of base functions is needed to obtain a clear improvement in image quality. This is because large regions are preferred in SIC (to limit the number of bits assigned to contour coding) and because accurately reconstructing texture on a large region requires relatively many base functions. In order to eliminate this disadvantage, we propose a second new scheme which employs a hybrid SIC/DCT scheme instead of SIC in the first stages of transmission. The hybrid scheme divides the segmented image into rectangular blocks. Those blocks which are fully contained within a segment are encoded using DCT base functions. The remaining, non-rectangular parts of each segment are grouped and encoded using SIC as a single segment (see Fig. 2). Note that this way of splitting into rectangular blocks differs from the one in [7], which inevitably results in blocking artifacts at high compression ratios, as well as in edge destruction (because parts of the edge regions are coded independently). Also note that in the hybrid SIC scheme we used blocks of size 16x16 pixels instead of the 8x8 blocks of JPEG (this produces better results because the rectangular blocks in the hybrid scheme never contain edges). The advantages of the hybrid scheme are that (a) edge information is still usefully exploited and (b) in the rectangular blocks the Discrete Cosine Transform (DCT) may be used, which results in faster computation, especially when using one of the many efficient algorithms for computing the DCT [1].
The computational complexity and the memory requirements of the hybrid compression scheme are significantly lower than those of SIC, because the SIC base functions are computed on smaller regions and because there is no need to compute base functions in the rectangular parts of the regions. In practice, the computational complexity of the second switched scheme is about 30% lower than that of the full SIC scheme. Furthermore, the scheme leads to faster reconstruction of the texture inside large regions, because of the extra division into rectangles. Note that there is no need to transmit the contour points representing the rectangles in the regions, since they can be constructed by the decoder without any extra information.
4. Results and discussion
The results obtained with the proposed switched schemes on the "cameraman" image (see Fig. 1) are shown in Figures 2 to 8. Fig. 2 shows the segmented image (the rectangular blocks are used only in the hybrid SIC/DCT scheme) obtained with the segmentation algorithm presented in [2]. Fig. 3 is the SIC image after 38:1 compression (assuming that coding the contours requires 1.6 bits per contour pixel; this is a conservative estimate, as the method in [8] claims to require only 1.2-1.3 bits per contour pixel). Fig. 4 shows the JPEG reconstructed image at the same compression ratio. Figures 3 and 4 clearly demonstrate that SIC produces better images at high compression ratios (in this case 38:1). Figures 5 and 6 display images compressed by a factor of 30:1 using SIC and the hybrid SIC/DCT method, respectively. As expected, the image in Fig. 5 is better than the one in Fig. 3 because of the lower compression ratio. However, the hybrid SIC/DCT image in Fig. 6 is even better. Indeed, both methods reconstruct edge regions equally well, but the hybrid SIC/DCT method reconstructs texture within large regions better. Fig. 7 shows the result of the switched SIC-JPEG scheme after a compression factor of 10:1 (the difference image is compressed with JPEG by a factor of 15:1) and Fig. 8 shows the result of the switched hybrid SIC/DCT-JPEG scheme after a compression factor of 10:1. These figures show that all three methods result in images of similar quality at low
compression ratios (i.e., ratios at or below 10:1). Furthermore, our experiments have shown that this quality is similar to the quality of JPEG at 10:1.
Fig. 1. The original image
Fig. 2. The segmented image (the rectangular blocks are used only in the hybrid SIC/DCT scheme)
Fig. 3. SIC reconstructed after 38:1
Fig. 4. JPEG reconstructed after 38:1
Fig. 5. SIC reconstructed after 30:1
Fig. 6. Hybrid SIC/DCT reconstructed image after 30:1 compression
Fig. 7. Switched SIC-JPEG reconstructed after 10:1
Fig. 8. Switched hybrid SIC/DCT-JPEG reconstructed after 10:1
5. Conclusions Two schemes which combine Segmented Image Coding and JPEG for progressive image transmission were described. The results in the paper show that, compared to progressive JPEG, the new schemes offer images of better quality during all the stages of the transmission. Also, their computational complexity is lower than that of the SIC-only method. The total number of bits transmitted is not more than what is required by using only JPEG. The schemes are independent of the segmentation algorithm and the way the intensity of the region is approximated.
Acknowledgments
This work was financially supported by the Belgian National Fund for Scientific Research (NFWO) through a mandate of "postdoctoral research fellow" and through the projects 39.0051.93 and 31.5831.95, by the Flemish Institute for the Advancement of Scientific-Technological Research in Industry (IWT) through the projects Tele-Visie (IWT project 950202) and Samset (IWT 950204), by the EC ACTS project SCALAR (AC077) and by the HCM project ERBCHRXCT930382.
References
[1] W. B. Pennebaker, J. L. Mitchell, JPEG Still Image Data Compression Standard, Van Nostrand Reinhold, New York, 1993.
[2] C. A. Christopoulos, A. N. Skodras, W. Philips, J. Cornelis, and A. G. Constantinides, "Progressive very low bit rate image coding", Proceedings of the International Conference on Digital Signal Processing, Limassol, Cyprus, June 26-28, 1995, pp. 433-438.
[3] M. Gilge, T. Engelhardt, R. Mehlan, "Coding of arbitrarily shaped image segments based on a generalized orthogonal transform", Signal Processing: Image Communication, Vol. 1, No. 2, October 1989, pp. 153-180.
[4] M. Kunt, M. Benard, R. Leonardi, "Recent results in high-compression image coding", IEEE Trans. on Circuits and Systems, Vol. 34, November 1987, pp. 1306-1336.
[5] W. Philips, C. A. Christopoulos, "Fast segmented image coding using weakly separable bases", Proceedings of ICASSP 94, Adelaide, Australia, April 19-22, 1994, Vol. V, pp. 345-348.
[6] W. Philips, "Fast coding of arbitrarily shaped image segments using weakly separable bases", Optical Engineering, Vol. 35, pp. 177-186, January 1996. Special section on Visual Communications and Image Processing.
[7] T. Sikora and B. Makai, "Shape-adaptive DCT for generic coding of video", IEEE Trans. on Circuits and Systems for Video Technology, Vol. 5, No. 1, Feb. 1995, pp. 59-62.
[8] M. Eden and K. Kocher, "On the performance of a contour coding algorithm in the context of image coding. Part I: contour segment coding", Signal Processing, Vol. 8, No. 4, July 1985, pp. 331-386.
Low Bit Rate Coding of Image Sequences using Regions of Interest and Neural Networks Nikolaos Doulamis, Athanasios Tsiodras, Anastasios Doulamis and Stefanos Kollias Department of Electrical and Computer Engineering, National Technical University of Athens, Heroon Polytechneiou 9, Zographou, Greece Tel: +30 1 7722491 E-mail: [email protected]
Abstract
In this article we study the transmission of video signals at very low bitrates and we propose a new coding scheme that improves the quality of image sequences at bitrates below 64 Kbits/sec. Our technique relies on extracting the moving object or objects in a frame using a neural network and quantising these regions with a finer quantiser step size than the remaining part of the image. The inputs of the neural network are groups of motion vectors, which are calculated with a block-based estimation scheme. The output of the network defines the regions of interest in the image; this output is then extended in space and in time to compute the final masks. The results show a significant improvement in the quality of the frames at low bitrates.
1. Introduction
Several coding techniques have been proposed in the past in order to reduce the spatial-temporal redundancy of image sequences. International standardisation efforts related to different applications have been made. Examples include the FCC proposals for digital transmission of HDTV aiming at a bitrate of 20 Mbits/sec [1], MPEG-2 for broadcast television at bitrates of 10 Mbits/sec [2], MPEG-1 for digital video at a bitrate of 1.5 Mbits/sec and H.261 at p x 64 Kbits/sec for videophone and video applications. Current efforts are concentrated on MPEG-4 for transmitting video signals below 64 Kbits/sec. Coding at very low bitrates (below 64 Kbits/sec) supports many applications, such as video-phone, remote sensing, telemedicine, video-text, communication aids for deaf people and so on. Since the public switched telephone networks (TN) were designed for the transmission of speech, and the bandwidth of video signals exceeds the available capacity of these networks many times over, it is very difficult to achieve transmission of visual data through the TN with satisfying quality of the decoded sequence. For instance, the CIF (Common Intermediate Format) standard requires approximately 73 Mbits/sec for transmission without any compression. Therefore, if it is assumed that the channel capacity of the TN is extended to 16 Kbits/sec, the compression ratio of this video signal should be higher than 4500! However, in many applications, such as video-phone, the full resolution provided by CIF or broadcast TV is not necessary. Instead, the format used is QCIF (with a resolution of 144x176 for luminance and 72x90 for chrominance). The frame rate is also reduced to 10 Hz (or even 6 Hz in some applications), since the key is to convey the emotional dimension. However, the compression ratio still remains high. Many methods have been developed for transmitting video signals through the TN. In the model-based approach, the coding algorithm relies on a 3-D or 2-D model.
These techniques use some a priori knowledge about the kind of image sequence that is about to be transmitted (for example, in video-phone applications, a model of a human head and shoulders). Unfortunately, these approaches fail when other objects, which have never been modelled, appear in the scene. They also demand a large amount of computation that cannot be delivered by current VLSI architectures. Consequently, it is very difficult to use the above method in video-phone applications, which need to be implemented in real time. Other approaches try to segment the image into objects and then find the motion of these objects. However, the computational time of these methods is also high. Block-based techniques have been implemented in VLSI architectures and can be applied in real-time applications. Nevertheless, the quality of the block-based approach at very low bitrate coding is not acceptable. Therefore, a new mechanism is required that tries to optimise the block-based techniques according to the perception of the human eye. All video coding algorithms which have been proposed in the past process all the parts of a frame equivalently. However, it has been shown [3] by several psycho-physical experiments that the human eye does not perceive all parts of an image in the same way (e.g., in a video-phone application the human eye focuses more on the head and shoulders of the speaker than on the background). As a result, it is anticipated that an efficient video coding algorithm should compress the significant parts of an image (according to human perception) less and
the unimportant ones more. The regions which are significant to the perception of the human eye are named Regions of Interest (ROI). In this article, we propose a new method for low bitrate coding, where for each frame a mechanism extracts those parts of the image that should be compressed with high quality. In order to mark the important and unimportant parts of a frame, a mask is transmitted from the encoder to the decoder. Since the ROI of a frame differ from those of the following frame, a mask is required to accompany each frame. Nevertheless, these masks do not increase the total bitrate by much, owing to the binary information that they carry and to the fact that this information can be compressed efficiently using a combination of run-length and Huffman coding. An artificial neural network is used as the ROI extraction mechanism. The choice of a neural network has been made because it can be easily implemented in VLSI and because it gives better results (due to its non-linear operation) as far as the ROI are concerned.
2. ROI based encoder
Although classic block-based algorithms (MPEG) can be used in real-time applications, when they are applied in very low bitrate schemes they produce many problems in the quality of the coded image sequences. A technique without motion compensation (M-JPEG) is unable to achieve rates below 64 Kbits/sec without significant deterioration of video quality (block artefacts). On the other hand, if motion estimation is used, video quality is improved (through the reduction of temporal redundancy) at the same bitrate. However, a low bitrate can be obtained only if the motion estimation error is not transmitted to the decoder or, if it is transmitted, the quantisation factor is high. In the latter case, the reconstruction error accumulates from frame to frame, causing serious distortion of the image. To solve this problem, we propose a new scheme, illustrated in Fig. 1, which relies on the introduction of ROI in the image and improves the quality of the image while keeping the total bitrate constant. After the ROI are extracted, the coding scheme quantises the ROI and non-ROI with different quantisation factors QH and QL (higher for non-ROI and lower for ROI). If the ROI extraction mechanism works properly, this technique enhances the video quality and simultaneously achieves the desired low bitrate. Furthermore, an appropriate choice of ROI can also restrain the accumulation of error from frame to frame and minimises the need for intra (I) frames, which refresh the whole image sequence but increase the bitrate.
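The dual-quantiser step at the heart of the scheme can be sketched as follows. This is a simplified uniform quantiser; the factor values 8 and 32 are the ones used in the experiments of Section 5, while the function names are our own.

```python
import numpy as np

def quantise_block(dct_block, in_roi, q_roi=8, q_non_roi=32):
    # finer quantiser step for ROI blocks, coarser for the rest
    q = q_roi if in_roi else q_non_roi
    return np.round(dct_block / q).astype(int), q

def dequantise_block(levels, q):
    # inverse quantisation at the decoder
    return levels * q
```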
[Figure 1: block diagram of the ROI encoder. Legend: DCT (IDCT): calculates the DCT (IDCT) of an 8x8 block; Run Length: run-length coding; FD: frame delay; MC: motion compensation; QH/QL: quantisation of the DCT with a high/low quantisation factor; LVQ: learning vector quantisation; Ext Sp/T: extends the mask in space and time; Gr M: groups of motion vectors.]
Figure 1: The mechanism of the ROI encoder.
The appropriate choice of ROI significantly affects the efficiency of the video coding algorithm. After the motion vectors of the 8x8 or 16x16 blocks are computed, a motion compensation scheme transmits these vectors and the DCT of the reconstruction error. Generally, it is anticipated that when the motion vectors are small (or zero), the respective DCT reconstruction error will be small too. Therefore, if we quantise the AC coefficients of the reconstruction error with QH, the result will only slightly affect the quality of the image. Consequently, these regions can be characterised as non-interest regions (nonROI). Moreover, not all blocks with large motion vectors can be characterised as ROI, since their values do not always correspond to real moving objects on which the human eye concentrates. Fig. 3 shows the motion vectors of frame 2 of the Claire sequence (the motion vectors have been calculated for 8x8 blocks within a 16x16 search area). It is observed that the values of the motion vectors in the background are large, although there is no motion in this area (or this motion is not perceivable by the human eye). This phenomenon occurs due to changes in the image luminosity or to the presence of noise. Hence, a simple algorithm that depends on the value of a single motion vector to find the motion regions will fail, and a more sophisticated technique is needed to extract the moving objects of an image. Since the accumulation of reconstruction error is due to large motion vector values, and since the human eye concentrates mostly on moving objects, we propose a mechanism relying on neural networks which selects as ROI those areas that contain a moving object or objects and marks the other areas as nonROI.
3. Moving object extraction using neural networks.
In Fig. 3 it can be seen that blocks which belong to moving objects have two significant properties:
- The majority of the blocks adjacent to the current one have motion vectors of the same (or almost the same) value and orientation as the motion vector of the current block.
- The absolute values of both the x and y co-ordinates of the examined motion vectors are not very small, i.e., the values of the motion vectors of the current block, as well as of its adjacent ones, are not close to zero. Otherwise, the region should be classified as nonROI.
According to these properties, two 2-D parameters can be examined to decide whether a block with a large motion vector should be selected as ROI: the mean and the variance of the region centred at the current block together with its four or eight adjacent blocks (4- or 8-connectivity). An algorithm that uses the mean and the variance of a group of blocks to decide whether a block belongs to an ROI would have to compute these parameters and compare them with a threshold T, which cannot be constant for all kinds of images. This disadvantage can be removed if a neural network is used that is able to classify the blocks of the image correctly into ROI or nonROI using an appropriate training set. Another advantage of the network is that it can separate the space nonlinearly and therefore find the optimal partitioning surface.
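The mean/variance test on a block's neighbourhood, which the network replaces, can be sketched as follows (illustrative only; it assumes the motion vectors are stored in an array of shape (rows, cols, 2)):

```python
import numpy as np

def neighbourhood_stats(mv, i, j):
    # per-component mean and variance of the motion vectors in the
    # window centred on block (i, j), including its 8-connected
    # neighbours; mv has shape (rows, cols, 2)
    patch = mv[max(i - 1, 0):i + 2, max(j - 1, 0):j + 2].reshape(-1, 2)
    mean = patch.mean(axis=0)
    var = ((patch - mean) ** 2).mean(axis=0)
    return mean, var
```

A threshold-based classifier would compare these statistics against a fixed T, which is exactly the step that cannot be tuned once for all image types.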
Figure 2: The block diagram of Learning Vector Quantisation (LVQ)
Figure 3: Motion vectors of frame 2 of the Claire sequence.
The learning vector quantisation (LVQ) algorithm has been chosen for the neural network due to its simplicity and its efficiency in classifying the data quickly and correctly. A block diagram of the classifier is shown in Fig. 2 for the case of two-element input vectors classified into two target classes. A preliminary classification of the input vectors into subclasses (four in this example) constitutes the first layer of the LVQ. The second layer is a simple linear layer that joins the subclasses to form the final target classes. As a consequence, LVQ can classify input vectors into target classes which are not linearly separable in the input vector space. The neural network has been trained using as input vectors groups of 3x3 blocks, each block consisting of 8x8 pixels, from some frames of the Claire sequence. Fig. 4 presents the output of the LVQ network on frames with which the network has not been trained. Since the motion of an object in a film sequence differs from frame to frame, these results show that the network generalises properly. The output of the network is presented (in black) together with the motion vectors (white lines on the black blocks).
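A single LVQ1-style update, the core of the competitive first layer, can be sketched as follows. This is a simplified sketch of the standard LVQ1 rule, not the exact training procedure of the paper:

```python
import numpy as np

def lvq1_step(prototypes, proto_labels, x, y, lr=0.1):
    # one LVQ1 update: the winning (nearest) prototype is pulled towards
    # the input x when its class matches the target y, pushed away otherwise
    k = int(np.argmin(((prototypes - x) ** 2).sum(axis=1)))
    sign = 1.0 if proto_labels[k] == y else -1.0
    prototypes[k] += sign * lr * (x - prototypes[k])
    return k
```

Iterating this rule over a labelled training set moves the subclass prototypes so that their nearest-neighbour regions approximate the (possibly non-linear) class boundaries.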
4. Extending the network mask in space and time
Since the orientations and values of the motion vectors within a background region are not always random, it is necessary to train the neural network with a strict criterion to avoid taking as ROI areas that belong to the background. However, with this strict criterion, the network output does not take the boundaries of moving objects into account. This is because a motion vector belonging to the boundary of a moving object does not have the majority of its adjacent vectors following the same orientation and magnitude. To solve this problem, which affects the quality of the video sequence, we extend the mask (the output of the network) to those blocks that are adjacent to the masked ones and have motion vectors of significant value. The results of this process are shown in Fig. 4, where the extended blocks are illustrated in dark grey. The extension in space improves the quality of the image (see Fig. 5) without increasing the size of the mask by much; thus, the output rate remains low. The same extension is also necessary in the time domain. In video-phone applications there are many frames where motion is absent (more than 50%). This is not a disadvantage, for two main reasons:
- No motion means a small DCT error and hence no deterioration in image quality when coarse quantisation is used.
- The LVQ mask does not contain many blocks marked as ROI, and therefore few regions are quantised with high quality. Hence the total bitrate remains low while the quality does not deteriorate significantly.
Nevertheless, if high motion is found in a frame, the following frame may not contain any ROI at all (lack of motion), although some blocks belonging to the area of the previous mask (the moving object) remain. If these blocks are quantised with QH, the reconstruction error will keep accumulating over the following frames, causing degradation of the image quality.
On the other hand, if these blocks are quantised with QL the total bitrate is not
affected very much (these blocks are not many), while the image quality improves. The extension of the mask in the time domain is illustrated in Fig. 4 (light grey).
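The spatial extension of the mask can be sketched as follows (the motion-magnitude threshold is an assumption of ours; the paper does not give a numeric value):

```python
import numpy as np

def extend_mask_space(mask, mv, mv_threshold=1.0):
    # mark as ROI any non-ROI block that is adjacent to an ROI block
    # and has a motion vector of significant magnitude; mask is a
    # boolean (rows, cols) array, mv has shape (rows, cols, 2)
    out = mask.copy()
    rows, cols = mask.shape
    for i in range(rows):
        for j in range(cols):
            if mask[i, j]:
                continue
            neigh = mask[max(i - 1, 0):i + 2, max(j - 1, 0):j + 2]
            if neigh.any() and np.hypot(mv[i, j, 0], mv[i, j, 1]) >= mv_threshold:
                out[i, j] = True
    return out
```

The temporal extension works analogously, carrying the previous frame's mask forward for blocks that still show significant motion.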
5. Results
In this section we compare our video-encoding algorithm with the known MC-DCT approach, using a scheme similar to H.261. The first frame is coded without motion compensation (I frame), and all other frames are coded based on motion estimation from the previous decoded frame (P frames). Fig. 5a presents frame 21 of the Claire sequence (which consists of 150 frames) coded with the known MC-DCT algorithm using a quantisation factor of 32, and Fig. 5c illustrates a zoomed detail of this figure. Figures 5b and 5d show the same frame coded with the ROI-based MC-DCT scheme. It is observed, especially by comparing Figs. 5c and 5d, that the quality of the image in the ROI is better than with the plain MC-DCT algorithm. The quantisation factor of the ROI is 8 and that of the nonROI areas is 32. It should be mentioned that many frames do not contain ROI, as there is no motion in them. In this case the quantisation factor remains at the high level (32) and the quality of the whole image is therefore reduced. However, due to the small reconstruction error, this quantisation does not affect the total quality significantly.
Figure 4: The masks of the Claire sequence at frames 3, 15 and 16.
Figure 5: Frame 21 of the Claire sequence coded with MC-DCT and with the new proposed algorithm (together with zoomed details).
Conclusion
In this paper we describe a new efficient mechanism based on ROI for low bitrate coding. An LVQ neural network is used for extracting the moving objects within a frame. The whole scheme improves the quality of the image sequence in the regions on which human eyes concentrate, without increasing the total bitrate significantly. The ROI occupy a small part of the image (approximately 20% of each frame), and many frames (about 30-40%) in video-phone applications have no ROI, since there is no motion in them; thus the total proportion of ROI is extremely small. Furthermore, since the proposed method relies on block-based coding, it can be implemented in real time using current VLSI architectures (this is the main advantage of the ROI-based MC-DCT encoder). The method of extracting moving objects based on an LVQ neural network can be used in many other applications apart from low bitrate coding, and this constitutes a topic for further research.
References
[1] "Special Report: Federal Communication Commission Advanced Television System Recommendation," IEEE Trans. Broadcasting, vol. 39, no. 1, Mar. 1993.
[2] Special Issue on Video Coding for 10 Mb/s, Signal Processing: Image Communication, vol. 5, no. 1-2, Feb. 1993.
[3] M. Argyle and M. Cook, Gaze and Mutual Gaze, Cambridge Univ. Press, 1976.
[4] CCITT Recommendation H.261, "Video Codec for Audio-Visual Services at p×64 kbit/s," Geneva, 1990.
[5] T. Ebrahimi, E. Reusens and W. Li, "New Trends in Very Low Bitrate Coding," Proc. of the IEEE, vol. 83, no. 6, June 1995.
[6] K. Aizawa and T. Huang, "Model-Based Image Coding: Advanced Video-Coding Techniques for Very Low Bit-Rate Applications," Proc. of the IEEE, vol. 83, no. 2, pp. 259-271, Feb. 1995.
[7] D. Kalogeras, "Adaptive Techniques in Coding and Recognition of Scenes," Ph.D. Thesis, Dept. of ECE, NTUA, Mar. 1996.
[8] E. Nguyen and C. Labit, "Adaptive Region-Based Quantisation in Subband Coding Using A Priori Levels of Interest," Proc. of the Picture Coding Symposium PCS'94, September 1994.
Session T: IMAGE ANALYSIS I
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
ITERATED FUNCTION SYSTEMS FOR STILL IMAGE PROCESSING
J.-L. Dugelay, E. Polidori and S. Roche
Institut EURECOM, MultiMedia Communications Dept., 2229, route des Crêtes, B.P. 193, F-06904 Sophia Antipolis Cedex
E-mail: {dugelay,polidori,roche}@eurecom.fr
URL: http://www.eurecom.fr/~image
Abstract
Iterated function systems (IFS) have recently been applied to the field of image coding. This paper exploits the fractal properties of the coding and decoding schemes in order to add useful tools for image processing. The first contribution is an improvement of the classical fractal zoom which allows, with a single code, the image resolution to be increased without loss of sharpness. The second, in addition to compression, aims at enhanced security of still images through the protection of a few code parameter bits.
1 Introduction
The publication of Arnaud Jacquin's article on still image coding using Iterated Function Systems (IFS) [2] stimulated much research on IFS for image processing and coding [7]. Current work on this topic can be classified into four main categories, depending on the problem considered: basic theory, implementation, extension and functionality. The category "basic theory" mainly covers work linked to contractivity constraint notions, general formulation and basic aspects of the algorithm. Although some theoretical problems in using IFS for image compression remain [3], the majority of studies currently available in the literature deal with implementation aspects such as segmentation, domain block classification, reduction of computational complexity, and combination of IFS with other techniques such as the DCT. Several relevant papers have proposed improvements on Jacquin's algorithm [3]. By extending the basic algorithm, adapted for still grey-level images, some authors also work on a possible design of the method for video, colour, and multispectral images. After a brief review of Jacquin's algorithm in section 2, we focus on the last category of studies. In particular, this paper deals with a possible implementation of some tools such as zoom [4], described in section 3, and some security functionalities such as access control [6], described in section 4, as part of a coding scheme based on the IFS technique.
2 Brief review of image compression by IFS
Given an original image μ, the goal is to build a lossy representation of this image via a transform τ. The reconstructed image μ_a is obtained by the iterative process of recursively applying the transformation τ [1]:

μ_a = lim_{n→∞} μ_n, with μ_n = τ(μ_{n−1}),

from an arbitrary initial image μ_0. Note that μ_a is called the attractor of the transformation τ. In order to obtain μ_a ≈ μ, the encoding stage consists of selecting the τ that minimizes the collage error ε_c = d(μ, τ(μ)), where d is the distance measure. By the collage theorem [1], the reconstruction error ε_r = d(μ, μ_a) is upper bounded by:

ε_r ≤ ε_c / (1 − s)   (1)
where s is the contractivity factor of the transform τ. In order to reduce the coding complexity, the image μ is divided into N non-overlapping blocks [2], called the range blocks. Each range block R_i, for i ∈ {1,…,N}, is coded independently by matching it with a bigger block D_i in the image μ, called a domain block. This match defines a transformation τ_i, and the global fractal code is then given by the union τ = ∪_i τ_i of local transforms. Moreover, each local code τ_i is restricted to consist of a reduction, a discrete isometry and an affine transformation on the luminance. Hence, τ_i can be modeled by:
    ( x' )   ( a_i  b_i  0   ) ( x )   ( t_{i,1} )
    ( y' ) = ( c_i  d_i  0   ) ( y ) + ( t_{i,2} )   (2)
    ( z' )   ( 0    0    s_i ) ( z )   ( o_i     )
where a_i, b_i, c_i, d_i, t_{i,1}, t_{i,2} represent the geometric transform and s_i, o_i the grey-level transform; x, y are the pixel coordinates and z the corresponding luminance value.
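As an illustration, one local transform τ_i of Eq. (2) can be sketched as follows; the block values, the isometry indexing and the pixel-averaging reduction are illustrative assumptions, not the authors' implementation:

```python
# Sketch of one local transform tau_i from Eq. (2): spatial reduction of the
# domain block, one of the eight discrete isometries, then the affine
# luminance map z -> s_i*z + o_i. All concrete values are illustrative.
import numpy as np

def apply_local_transform(domain_block, isometry, s_i, o_i):
    # 2x2 averaging implements the spatial reduction (domain twice range size)
    d = domain_block
    reduced = 0.25 * (d[0::2, 0::2] + d[1::2, 0::2] + d[0::2, 1::2] + d[1::2, 1::2])
    # isometry index 0..7: four rotations, optionally preceded by a flip
    if isometry >= 4:
        reduced = np.fliplr(reduced)
    reduced = np.rot90(reduced, isometry % 4)
    return s_i * reduced + o_i   # grey-level (luminance) transform

D = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 domain block
R = apply_local_transform(D, isometry=0, s_i=0.75, o_i=10.0)
assert R.shape == (2, 2)   # range block is half the domain size
```

For contractivity of the luminance part, |s_i| < 1 is required, in line with Eq. (1).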
3 Zooming using IFS
By virtue of the iterative decoding process which uses a "fractal" transform τ, the corresponding attractor μ_a is a fractal object. Although the original image μ has a fixed size defined by its number of pixels, the code τ, built by taking advantage of image self-similarities, has no intrinsic size and is theoretically scale-independent. Thus, by applying the transformation τ to an initial image μ_0, we may obtain a reconstruction μ_a of the original image with the same resolution as μ_0 (see Fig. 1). Hence, thanks to this coding scheme, we eliminate the fixed-resolution aspect of digitized images.
The fractal zoom is mainly based on this remark. But the result of such a zoom is visually rather poor (see Fig. 2(c)), although there is no "pixelization" such as that due to pixel duplication (see Fig. 2(a)). Indeed, the fractal zoom causes an
important "blocking effect" due to the independent and lossy coding of the range blocks. To obtain a good visual quality, some improvements have been made, such as the use of overlapping range blocks [4]. In this case, the coding becomes redundant in overlapping regions of the image μ; we then average these parts in order to smooth the block effect and to reduce the collage error by choosing among several values for each considered pixel. This yields a zoomed image (see Fig. 2(d)) with a sharper quality than the classical linearly interpolated image (see Fig. 2(b)), obtained using a luminance continuity hypothesis. Unfortunately, this improvement currently comes at the price of degraded compression performance. However, this method allows the image to be displayed, from a single code, at different levels of resolution according to the application requirements.

4 Hierarchical access control using IFS
The advent of multimedia applications has brought new requirements, especially in the security field [5]. Here we propose hierarchical access control, a system that allows different levels of quality according to the access fee paid. All the receivers of a broadcast channel can display an image, but only at a low quality with no commercial value. This message, received through the public channel, remains partially readable in order to attract potential customers who would apply for the commercial service to get the higher-quality image. The IFS method is suitable for this because it offers the possibility to control the image quality during the iterated reconstruction process. More precisely, the contractivity parameter s (see Eq. (1)) is modified through the luminance scale parameters s_i (see Eq. (2)) according to the desired access level. By partially hiding the values of s_i through encryption, several access levels can be obtained. For instance, if s_i is quantized with 8 bits, all of these bits could be left readable (see Fig. 3(a)); in this case the classical reconstructed image with the highest quality is obtained (see Fig. 5(d)). At the other extreme, encrypting all 8 bits of s_i (see Fig. 3(g)) leads to the unreadable image of Fig. 5(a). Between these two configurations, encrypting s_i to intermediate degrees leads to intermediate levels of visualization quality (see Figs. 5(b), 5(c)).
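The progressive masking of Fig. 3 can be sketched as follows; for illustration the encrypted bits are simply blanked rather than actually encrypted, and the bit values are invented:

```python
# Sketch of the hierarchical masking of the 8-bit quantized s_i: only the
# k most significant bits are left readable, the rest being protected
# (modelled here, for illustration only, by zeroing instead of encrypting).
def mask_s(s_bits, readable_msbs):
    """Keep the top `readable_msbs` bits of an 8-bit value, blank the rest."""
    assert 0 <= readable_msbs <= 8
    mask = (0xFF << (8 - readable_msbs)) & 0xFF
    return s_bits & mask

s_i = 0b10011111
assert mask_s(s_i, 8) == s_i          # configuration (a): full quality
assert mask_s(s_i, 5) == 0b10011000   # an intermediate access level
assert mask_s(s_i, 0) == 0            # configuration (g): no access
```

Decoding with the masked s_i then yields the degraded reconstructions of Fig. 5(a)-(c).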
[Figure 3 shows the masks as rows of 8 bits (MSB to LSB), with the encrypted bits marked "x".]
Figure 3: s_i parameter masking, from (a) no encryption to (g) full encryption.
Nevertheless, we note that any domain block can be viewed as a set of range blocks. Each range block of the decoded image μ_a is thus highly dependent on the block mappings performed in a pyramidal fashion during the previous iterations (see Fig. 4). Hence the s_i values, associated with each of the blocks involved in these mappings, strongly affect the final decoded image. The s_i thus appear to be the key parameters for providing access control.
We therefore propose a hierarchical access control scheme which provides both compression and security functions within a single algorithm. Note that the multi-resolution access can be achieved without any degradation of compression performance. The security evaluation of this scheme is performed in [6].
5 Concluding remarks
Although IFS is not a fully understood technique, fractal image coding has been used successfully to encode still grey-level images. Until now, efforts have mainly focused on the compression aspect of IFS. Nevertheless, with the growth of multimedia applications and communications, some future coding schemes, such as MPEG-4, are in progress; these schemes consider functionalities and tools in addition to compression. It is accepted that IFS is a new and interesting technique for image coding. In addition, in this paper, we have tried to demonstrate that IFS may also be a very useful technique for simultaneously performing image coding and processing. Two methods have been described: the first provides a way to zoom a picture and the second a way to control access to it. These methods exploit particular properties inherent in fractal signal processing (scale independence) and more specifically in the IFS technique. In practice, zoom and security could be combined in a single framework to control the image resolution. One future direction for this work could consist in extending the use of these tools and functionalities from still images to video. Moreover, this study may also indirectly contribute to improving the general understanding of the use of IFS in the field of image coding.
6 Acknowledgments
This work is supported in part by AEROSPATIALE (service Télédétection & Traitement d'images, établissement de Cannes) and DGA/DRET (groupe Télécommunications & Détection).
References
[1] M. Barnsley & L. Hurd, Fractal Image Compression, AK Peters, Wellesley, 1993.
[2] A. E. Jacquin, "Image Coding Based on a Fractal Theory of Iterated Contractive Image Transformations," IEEE Trans. on Image Processing, Vol. 1, No. 1, pp. 18-30, Jan. 1992.
[3] Y. Fisher, Fractal Image Compression - Theory and Application, Springer-Verlag, New York, 1994.
[4] E. Polidori & J.-L. Dugelay, "Zooming using Iterated Function Systems," NATO ASI Conf. Fractal Image Encoding and Analysis, Trondheim, Norway, July 1995. To appear in a special issue of Fractals.
[5] B. Macq & J.-J. Quisquater, "Digital Images Multiresolution Encryption," IMA Intellectual Property Project Proceedings, Vol. 1, Jan. 1994.
[6] S. Roche, J.-L. Dugelay & R. Molva, "Multiresolution Access Control Algorithm based on Fractal Coding," IEEE ICIP'96, Lausanne, Switzerland, September 16-19, 1996.
[7] D. Saupe & R. Hamzaoui, "A Guided Tour of the Fractal Image Compression Literature," ACM SIGGRAPH'94 Course Notes, 1994.
Sensing Surface Discontinuities via Coloured Spots
Colin J. Davies* and Mark S. Nixon**
*Digi Media Vision Ltd., Gamma House, Enterprise Road, Chilworth, Hampshire, SO16 7NS, United Kingdom. Email: [email protected]
**ISIS Research Group, Department of Electronics and Computer Science, University of Southampton, SO17 1BJ, United Kingdom. Email: [email protected]
Abstract: Discontinuities in 3D surfaces are often the features of interest for object recognition, positioning and measurement. 3D sensing via a pattern of coloured spots has advantages over structured light systems based on other pattern elements in the presence of 3D step discontinuities. When such surface features bisect the projected pattern elements, incomplete shapes are imaged; for squares or stripes this can lead to ambiguity and erroneous measurements. For 3D sensing via coloured spots the incomplete imaged pattern elements can be reliably detected, facilitating the accurate location of important surface step discontinuities.
Introduction Structured light systems for three dimensional (3D) sensing that are based on a single projected pattern can capture the 3D data for dynamic scenes at video-rate. Previously, video-rate structured light systems have used projected patterns of stripes [2][8], grids [5][6] or square elements [7][9][10]. Using coloured spots retains the video-rate advantage but allows for measurement of more than depth, with inherent practical advantages [3][4]. In this paper, sensing via coloured spots is shown to have the potential to detect directly 3D surface step discontinuities. Blake and Zisserman [1] argued that marking such discontinuities in the visible surfaces is the main purpose of producing a depth map.
3D Sensing via Coloured Spots Figure 1 shows the system geometry for 3D sensing via coloured spots. A projector illuminates the 3D scene with a spatially-efficient hexagonally tessellated array of spots; a camera images the scene from a 3D location offset from the projector. The projected pattern of spots is spatially encoded using colour to enable the imaged spots to be matched with those projected. The imaged position of each spot's centre gives, via triangulation, a measure of the 3D location of the illuminated surface. If the illuminated surface is planar,
Figure 1: The system geometry for 3D sensing via coloured spots.
Figure 2: Example results for a single frame from a video-rate sequence of a face during speech.
each spot is imaged as an ellipse. From the parameters describing the imaged ellipse shape, a measure is made of the normal direction to the illuminated surface. Thus, a measure is made of the 3D location and orientation of the illuminated surface for each imaged spot. A specially developed formulation of the Hough Transform (HT) is used to extract each imaged spot; the imaged ellipse is described by only three implicit parameters instead of the conventional five. The parameters are the centre position along its epipolar line and two which describe the variation in shape. Constraints based on colour and brightness reduce the number of uncorroborated votes cast in the HT accumulator array and limit the chances of false peaks arising. Since several peaks can arise in the HT accumulator array formed to extract each spot, a decoding algorithm is used to select the correct spot parameters. Figure 2 shows a single frame from a video-rate sequence of a face during speech, the spots extracted via the 3D HT, and a profile view of the reconstructed 3D surface. Currently, partial spots are removed during decoding as they can lead to erroneous 3D measurements.
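The triangulation step can be illustrated with a minimal 2-D sketch; the ray geometry below is invented and ignores the colour decoding and ellipse fitting described above:

```python
# Illustrative 2-D triangulation: the imaged position of a spot centre fixes a
# camera ray, the known projected pattern fixes a projector ray; their
# intersection gives the illuminated surface point. Geometry is invented.
import numpy as np

def triangulate(cam_origin, cam_dir, proj_origin, proj_dir):
    # solve cam_origin + t*cam_dir = proj_origin + u*proj_dir for (t, u)
    A = np.column_stack([cam_dir, -proj_dir])
    t, _u = np.linalg.solve(A, proj_origin - cam_origin)
    return cam_origin + t * cam_dir

cam = np.array([0.0, 0.0])
proj = np.array([1.0, 0.0])          # projector offset from the camera
p = triangulate(cam, np.array([0.5, 1.0]), proj, np.array([-0.5, 1.0]))
assert np.allclose(p, [0.5, 1.0])    # rays meet at the illuminated point
```

In the real system the same principle is applied in 3-D, once the spatial colour code has matched each imaged spot with its projected counterpart.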
Sensing Surface Step Discontinuities
The use of spots offers distinct advantages over other pattern elements in the presence of 3D surface step discontinuities, with particular advantage arising from their smooth shape. Often 3D step discontinuities bisect the projected pattern elements in such a way that incomplete shapes are imaged. For pattern elements based upon stripes or square elements, the incomplete imaged shapes are not uniquely identifiable. A square projected onto an inclined surface can be imaged as a rectangle. When the projected square intersects a 3D surface step discontinuity the imaged shape may still be rectangular, but there is no evidence concerning the lost information. Accordingly, erroneous depth measurements may be made. In 3D sensing via coloured spots, the projected spots that illuminate smooth surfaces are imaged as ellipses, with smooth edge outlines. The projected spots that intersect a surface step discontinuity are imaged as incomplete ellipses, whose edge outlines contain corners. The corners, or discontinuities in the normal direction of the edge outline, can be detected by a corner operator. Reliable detection of the incomplete imaged spots facilitates the accurate location of surface step discontinuities. Figure 3a shows part of an image of an illuminated surface with a significant step discontinuity; the corresponding edge image is shown in Fig. 3b. The output of a simple corner operator, based on the rate of change of the edge normal direction, is shown in Fig. 3c; the value output for the sharp corners is much greater than for the outlines of the smooth ellipse shapes. The corners of the incomplete spots have been isolated in Fig. 3d by thresholding the output of the corner operator. Unwanted edges, that do not form the borders of the imaged spots, produce unwanted corner outputs. This suggests that the corner detection needs to be incorporated within a process that uses higher-order information.
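A minimal sketch of such a corner measure, assuming the edge outline has already been traced and its direction sampled (the contours below are synthetic):

```python
# Sketch of the simple corner measure described above: the rate of change of
# the edge direction along an outline, thresholded to isolate corners.
import numpy as np

def corner_response(angles):
    """Absolute change of edge direction between successive outline points,
    with angle wrap-around handled."""
    d = np.diff(angles)
    return np.abs((d + np.pi) % (2 * np.pi) - np.pi)

smooth = np.linspace(0, np.pi / 2, 50)              # slowly turning ellipse arc
sharp = np.array([0.0, 0.0, np.pi / 2, np.pi / 2])  # a 90-degree corner

assert corner_response(smooth).max() < 0.1   # smooth outline: low response
assert corner_response(sharp).max() > 1.0    # corner stands out; threshold here
```

Thresholding this response, as in Fig. 3d, isolates the corners of the incomplete spots.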
Figure 3: Part of a real 3D step image; (a) the input image, (b) the edge image, (c) the rate of change of the edge normal direction, (d) the previous image thresholded.
It could be possible for a surface step discontinuity to intersect the projected pattern of spots such that no incomplete spots were imaged. This is less likely to occur for the hexagonally tessellated pattern of spots than for a pattern based on a square tessellation, due to the greater percentage of surface area illuminated by the spots in the hexagonal tessellation. Were such a surface discontinuity to occur, it could be simply detected from the 3D surface location and normal direction measures calculated from the imaged spots adjacent to it. If the surface step discontinuity was known to pertain to a parametrically defined shape, such as a circular hole in the surface or a straight edge, then the parameters for the discontinuity could be captured directly by conventional feature extraction techniques applied to the output of the corner operator. For example, the surface discontinuity of Fig. 3 has a straight edge which can be extracted by applying a HT formulated for straight lines to the image of Fig. 3d.
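A straight-line HT of the kind suggested here can be sketched as follows; the accumulator resolution and the synthetic corner points are illustrative:

```python
# Sketch of a straight-line Hough transform applied to a binary corner map:
# each feature point votes for the (theta, rho) lines through it; the
# accumulator peak recovers the discontinuity edge. Data are synthetic.
import numpy as np

def hough_lines(points, n_theta=180, rho_max=64):
    thetas = np.linspace(0, np.pi, n_theta, endpoint=False)
    acc = np.zeros((n_theta, 2 * rho_max))
    for (x, y) in points:
        rho = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[np.arange(n_theta), rho + rho_max] += 1
    return thetas, acc

# corner responses lying on the vertical line x = 20
pts = [(20, y) for y in range(0, 30, 3)]
thetas, acc = hough_lines(pts)
t_idx, r_idx = np.unravel_index(np.argmax(acc), acc.shape)
assert abs(thetas[t_idx]) < 0.02 and (r_idx - 64) == 20  # x*cos(0) = 20
```

The peak at (θ ≈ 0, ρ = 20) recovers the vertical edge on which the corner responses lie.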
Conclusions Using a projected pattern of circular spots provides advantages over patterns based on other shapes. 3D sensing via coloured spots has the potential to detect directly 3D surface step discontinuities and to be immune to spatial error, due to the projection of a smooth shape. The direct detection of 3D surface step discontinuities via the coloured spots can benefit techniques for 3D object recognition, positioning and measurement by providing accurate data for these important 3D object features.
References
[1] A. Blake and A. Zisserman. Visual Reconstruction. MIT Press, 1987.
[2] K.L. Boyer and A.C. Kak. Color-encoded structured light for rapid active ranging. IEEE Trans. Pattern Analysis and Machine Intelligence, 9(1):14-28, 1987.
[3] C.J. Davies. Three dimensional sensing via coloured spots. PhD thesis, Department of Electronics and Computer Science, University of Southampton, 1996.
[4] C.J. Davies and M.S. Nixon. Feature extraction for video rate three dimensional imaging via coloured spots. In IEEE Canadian Conference on Electrical and Computer Engineering, volume 2, pages 916-919, 1995.
[5] S.M. Dunn, R.L. Keizer, and J. Yu. Measuring the area and volume of the human body with structured light. IEEE Trans. Systems, Man, and Cybernetics, 19(6):1350-1364, 1989.
[6] G. Hu and G. Stockman. 3D surface solution using structured light and constraint propagation. IEEE Trans. Pattern Analysis and Machine Intelligence, 11(4):390-402, 1989.
[7] M. Ito and A. Ishii. A three-level checkerboard pattern (TCP) projection method for curved surface measurement. Pattern Recognition, 28(1):27-40, 1995.
[8] T.P. Monks. Measuring the shape of time-varying objects. PhD thesis, Department of Electronics and Computer Science, University of Southampton, 1995.
[9] P. Vuylsteke and A. Oosterlinck. Range image acquisition with a single binary-encoded light pattern. IEEE Trans. Pattern Analysis and Machine Intelligence, 12(2):148-164, 1990.
[10] S.R. Yee and P.M. Griffin. 3-Dimensional imaging system. Optical Engineering, 33(6):2070-2075, 1994.
IMAGE ANALYSIS AND SYNTHESIS BY LEARNING FROM EXAMPLES
Sandra Brunetta¹ and Nicola Ancona
Robotics and Automation Laboratory - Tecnopolis CSATA Novus Ortus, Strada Prov.le per Casamassima km 3, 70010 Valenzano (BA), Italy. E-mail: [email protected] - [email protected]
Abstract
This paper describes a system for the analysis and synthesis of a sequence of images. The system relies on motion analysis and learning techniques and is able to extract from an image sequence some control parameters describing the motion contained in the sequence, and to reconstruct the original images of the sequence by using only such control parameters and a small set of the input images. The realized system has many practical applications in the following domains: computer graphics, data compression, teleoperation and object recognition systems, and virtual reality.
1. Introduction
This paper describes a system for realizing image analysis and synthesis of an image sequence representing a moving object in front of a camera. The aim is to extract from a set of time-varying 2D images some useful information regarding both the object shape and the motion parameters, and then to use such information to reconstruct the original image sequence. The realized system has many practical applications that range from image processing problems such as video compression (only the extracted information has to be transmitted, allowing greatly reduced transmission costs and times), to passive machine vision problems such as inspection tasks, to active perception applications in which a robot must explore its environment, for instance in order to identify the shape of a manipulation target and to determine the best strategy to reach and grasp it for teleoperation tasks. The information to be extracted is application dependent. For example, in the case of teleconferencing, the meaningful parameters can be the expression and pose of face images (e.g. degree of smiling, the in-plane and off-plane rotation of the head and so on), while for a manipulation task the extracted parameters can be the rotation and/or translation of a moving target with respect to a fixed coordinate system, together with a scale factor. Two main aspects have to be pointed out:
- the analysis problem, which involves the estimation of a vector of control parameters (pose parameters) describing the meaningful motion underlying the observed scene, from a sequence of images given as input, and
- the synthesis problem, which is the inverse problem and involves the generation of novel images corresponding to an appropriate set of control parameters, starting from a set of example images of a certain object.
The approach presented in this paper for solving these two problems relies on the explicit computation of the motion field between the images and on the assumption that there exists a mapping between the computed motion field and a vector of control parameters for the analysis problem, and the inverse mapping for the synthesis problem. These two non-linear relationships can be reconstructed through approximation techniques by using a small set of examples, i.e. pairs (motion field, control parameters) for the analysis problem and pairs (control parameters, motion field) for the synthesis problem. These example pairs are used to train two neural networks based on Radial Basis Functions that are able to generalize and reconstruct the mappings also for novel images and novel values of the control parameters. The two obtained networks are called the Analysis Network and the Synthesis Network, respectively. The main contribution of this paper is that the proposed approach directly refers to images that are used as examples for training a network. No preliminary assumption about the shape of the object has to be made and no three-dimensional or physical model is required, unlike in traditional approaches (see for example [6] for face animation). A second contribution is that the method is independent of the image contents, owing to a preliminary operation of motion field computation; the system can be applied to any type of 3D object. In the following sections the method for motion field computation together with the analysis and synthesis networks will be described. Experimental results will be shown in the last section.
¹ Supported by a grant from the Italian Space Agency.

2. Motion Field Computation
Given an image sequence representing an object rigidly moving in front of a camera, the computation of the motion field allows the detection of the overall motion underlying the images by estimating the optical flow field, that is, an
approximation of the perspective projection of the real 3D motion of the object on the image plane [9]. There exist several methods for estimating the motion field. They can be divided into methods that compute the motion only for sparse image features with a semantic meaning (such as object corners, edges or boundaries) [5, 7] and methods that compute the motion for pixel-level features related to the local grey-level structure of the images [1, 8, 11]. Methods based on motion computation of point or corner features are the simplest, but they produce poorly localised features. Methods based on edge detection are more advanced than isolated point detection, but give poor results for densely textured images. On the contrary, methods based on motion computation for pixel-level features allow correspondences to be established among all points in the images, but they generally require a small interframe motion. In this paper a method based on motion computation at pixel level has been chosen, since it gives an overall view of the motion present in the images. In this case, motion information is not limited to chosen features as in the former class of methods. Moreover, the requirement of small interframe motion can be overcome by an appropriate choice of the motion computation technique through the joint use of a multiresolution image representation. The problem of computing a dense (pixel-level) motion field between two images is a common problem in computer vision that can be solved by using standard techniques for optical flow measurement [1, 8, 11]. Optical flow computation permits features in two frames to be matched using just the local grey-level structure of the images; no other information regarding the semantics has to be supplied. The correspondence is found for any pixel, not only for a particular set of features with a semantic meaning, for example the corners of an object.
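A minimal window-based least-squares flow estimate of the kind developed below (a single window, no pyramid, no eigenvalue test) might look as follows; the synthetic pattern and window size are illustrative:

```python
# Minimal window-based least-squares optical flow for a single window S,
# solving sum_S (grad(E).v + E_t)^2 -> min. The translating synthetic
# pattern below is illustrative.
import numpy as np

def flow_in_window(I1, I2):
    Ey, Ex = np.gradient(I1)              # spatial derivatives (rows = y)
    Et = I2 - I1                          # temporal derivative
    A = np.column_stack([Ex.ravel(), Ey.ravel()])
    b = -Et.ravel()
    v, *_ = np.linalg.lstsq(A, b, rcond=None)
    return v                              # (v_x, v_y)

x, y = np.meshgrid(np.arange(16, dtype=float), np.arange(16, dtype=float))
I1 = np.sin(0.5 * x) + np.cos(0.5 * y)           # textured pattern
I2 = np.sin(0.5 * (x - 1.0)) + np.cos(0.5 * y)   # shifted right by 1 pixel

vx, vy = flow_in_window(I1, I2)
assert 0.5 < vx < 1.5 and abs(vy) < 0.3  # recovers the approximate translation
```

The full method adds an eigenvalue check of the normal matrix to reject unreliable windows and a coarse-to-fine pyramid to handle large displacements, as described next.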
In the work presented in this paper the optical flow computation is performed via a coarse-to-fine gradient-based technique. The starting point is the well-known optical flow constraint equation, or constant brightness equation (see [9]), that holds for each image point:

∇E · v + E_t = 0   (1)

where ∇E and E_t are respectively the spatial and temporal derivatives of the image brightness function E, and v = (v_x, v_y) is the optical flow vector. Equation (1) is obtained assuming that the changes in the image brightness are due only to translation of the local image intensity (i.e. are induced purely by motion). The methods for optical flow computation based on the solution of equation (1) are also known as differential or gradient techniques, since they involve the derivatives of image intensity over space and time. The particular technique chosen to compute the motion field allows the two limitations of the differential formulation of the optical flow problem to be overcome:
- the aperture problem: from equation (1) only the flow component in the direction of the brightness gradient can be determined;
- the pixel displacements between the images have to be small, due to the first-order approximation embodied in the truncated Taylor expansion.
To solve the aperture problem (for each image point there is one equation and two unknowns) a local smoothness constraint can be added, so regularizing the problem. In fact, making the additional assumption that a small neighborhood of the point p = (x, y) (a window S) has a uniform displacement, a system of equations can be obtained by imposing the minimization of the error term

Err = Σ_S (∇E · v + E_t)²

over the window S. Such a system, generally over-determined, can be solved by the least squares method. Unreliable estimates of optical flow are identified (through eigenvalue analysis) and discarded in order to obtain accurate and meaningful results. The joint use of a differential technique with a multiresolution image representation allows the requirement of small interframe motion to be overcome. Such a representation, also known as a coarse-to-fine strategy, starts by computing a Gaussian pyramid representation of the images [1, 4]. This consists of a set of progressively reduced resolution versions of the original images produced by repeated low-pass filtering and subsampling. At each step, this reduction in resolution also reduces the size of the frame-to-frame displacement. Consequently any displacement in the original images will be reduced, through repeated subsampling, to the range (1-1.5 pixels) within which the optical flow constraint equation can be used. More details on this particular implementation are reported in [1]. The described technique for optical flow computation allows the automatic computation of a dense motion field revealing the overall motion present in a scene, requiring just two views and with subpixel accuracy.

3. Analysis and Synthesis Networks
The idea underlying the Analysis Network is to reconstruct the mapping from an image to the appropriate vector of control parameters describing that image by using a set of examples (motion fields and associated parameters) and a Regularization Network for learning from them. The problem is to approximate the vector field F: X → Y² from a set of sparse data, the N example pairs (x_i, y_i) for which F(x_i) = y_i holds and which are used as known instances of the function. In this case F stands for the non-
² Where X is the space of the motion fields and Y is the space of the control parameters.
579 linear mapping we want to learn, x; are the motion fields and y;are the control parameters. Once the function has been reconstructed, it will be possible to obtain an estimate of the control parameters also for novel motion fields. In this way the analysis problem, that is the problem to estimate a set of pose parameters from a sequence of images, becomes the classical problem of interpolating of multivariate function. There are several methods to solve this problem. In this work a method based on neural networks has been chosen for several reasons. First of all neural networks are very suitable to describe non-linear functions, are more robust with respect to noise and have good generalization properties. Moreover, the chosen approximation scheme, based on Radial Basis Function [ 10], has the property of best approximation of continuous function unlike the classical multflayer networks used in backpropagation, and so are well suited for solving multivariate approximation problems. Moreover, it allows parallel implementations, simply handles n-dimensional problems and is characterized by algorithm simplicity. The Radial Basis Function method [ 10] consists in choosing the approximating function F' as linear combination of N functions depending on the examples x,.. Thus, the approximation is: N
F'(x) = Σ_{i=1}^{N} c_i G(‖x − x_i‖)
with G being the chosen Green function, which may be a Gaussian, a linear function or a spline, and ‖·‖ the Euclidean norm. The coefficients c_i of the linear combination can be 'learned' from the examples by solving the following linear system:
A C = Y

where (Y)_i = F(x_i) is the vector containing the control parameters, (C)_i is the vector of the coefficients and (A)_{ij} = G(‖x_i − x_j‖), i, j = 1, ..., N, is the matrix of the chosen Green function evaluated at the examples. The linear system is obtained by imposing the N conditions on the examples, i.e. F(x_i) = y_i, i = 1, ..., N.
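As a minimal sketch of this learning step (not the authors' implementation), the system A C = Y can be assembled and solved with a Gaussian Green function, one of the choices mentioned above; the example data and the width parameter sigma are illustrative assumptions:

```python
import numpy as np

def rbf_fit(X, Y, sigma=1.0):
    """Learn the coefficients C by solving A C = Y, where
    A[i, j] = G(||x_i - x_j||) with a Gaussian Green function G."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    A = np.exp(-(d / sigma) ** 2)                              # Green function matrix
    return np.linalg.solve(A, Y)                               # N conditions F(x_i) = y_i

def rbf_eval(Xtr, C, x, sigma=1.0):
    """Evaluate F'(x) = sum_i c_i G(||x - x_i||) at a new input x."""
    g = np.exp(-(np.linalg.norm(Xtr - x, axis=1) / sigma) ** 2)
    return g @ C

# Illustrative 1-D example: interpolate three (x_i, y_i) pairs.
X = np.array([[0.0], [1.0], [2.0]])
Y = np.array([0.0, 1.0, 4.0])
C = rbf_fit(X, Y)
```

By construction the approximant interpolates the examples, so evaluating F' at a training input returns the corresponding training output.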
Once the coefficients c_i have been computed, the approximation of the unknown function F has been achieved. It suffices to
evaluate the approximating function F' at a new input value (a motion field) to obtain the corresponding estimate of the output value. The idea underlying the Synthesis Network is to reconstruct the mapping from a vector of control parameters to the associated grey-level images by using a set of examples and a regularization network for learning from them. The synthesis network can be thought of as the dual of the analysis network described above. In this case, the function to reconstruct is F*: Y → X, F*(y_i) = x_i, which relates the vector of pose parameters to the motion field. In this case y_i is the vector of pose parameters and x_i is the optical flow field that, once applied to a reference image, allows a novel image to be generated. Such a function can be reconstructed by using an appropriate set of sparse data, the N example pairs (y_i, x_i), for which F*(y_i) = x_i holds. Also in this case a regularization network based on Radial Basis Functions has been used.

4. Experimental results

The analysis and synthesis networks have been tested on different kinds of images representing a single rigid object moving in front of a camera. In this section the experimental results of a particular practical application will be described. The simulated situation is one in which a camera mounted on the gripper of a robot end effector has to track an object moving on a plane in order to keep its attitude constant with respect to the position and orientation of the object. In this case the object motion consists of a translation on a plane parallel to the image plane and a rotation around an axis normal to the image plane and passing through the center of the image. This motion can be modeled by an affine transformation, and the needed affine parameters are (θ, t_x, t_y), where θ stands for the rotation angle of the object around the axis normal to the image plane and (t_x, t_y) represents the components of the translation vector. Experiments on human faces are reported in [3]. Figs. 1-2 show the results obtained by testing the analysis and synthesis networks on some test images representing a book translating and rotating on a table. Given the test image³ shown in Fig. 1, the analysis network is able to derive the corresponding affine parameters that describe the motion of the book on the table with respect to a given reference image. The real motion underlying the test image and the reference image is highlighted by showing the superimposed edges of the two images, produced by using an edge detector with the same parameter setting. Fig. 2 shows the results obtained by giving as input to the synthesis network⁴ the affine parameters estimated by the analysis network. The obtained synthesized image is shown in order to check for the absence of visible deformations. Moreover, to verify that no information is lost, the superimposition of the edges of the reconstructed image and the
³ More properly, the motion field computed between a reference image and the test images. ⁴ Trained on the same example set as the analysis network.
reference image is displayed. The two images coincide almost exactly, showing that all the motion has been recovered.
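The affine motion model used in this experiment — a rotation θ about the image center plus a translation (t_x, t_y) — yields a dense motion field that can be sketched as follows; this is an illustrative reconstruction of the model, not the authors' code:

```python
import numpy as np

def affine_motion_field(h, w, theta, tx, ty):
    """Dense motion field (u, v) for a rotation theta (radians) about the
    image center plus an in-plane translation (tx, ty)."""
    yc, xc = (h - 1) / 2.0, (w - 1) / 2.0
    y, x = np.mgrid[0:h, 0:w].astype(float)
    xr, yr = x - xc, y - yc                  # coordinates relative to the center
    c, s = np.cos(theta), np.sin(theta)
    u = (c * xr - s * yr) - xr + tx          # horizontal displacement
    v = (s * xr + c * yr) - yr + ty          # vertical displacement
    return u, v
```

With θ = 0 the field reduces to a pure translation, matching the role of (t_x, t_y) in the text.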
Fig. 1 The analysis network.
Fig. 2 The synthesis network.

The obtained results show that the reconstruction of the images is very good. This means that 1) the estimate of the affine parameters by the analysis network is good (the extracted parameters are meaningful and completely describe the motion content of the images) and that 2) the reconstruction of the images by the synthesis network does not cause deformations.

5. Conclusions

In this paper two networks for realizing image analysis and synthesis of a sequence of images have been presented. Both networks have several practical applications. The analysis network can be seen as a generalized interface that learns the motion from an image sequence and synthesizes it in a vector of control parameters, while the synthesis network generates novel images starting from a small set of example images, without any 3D model, just by assigning new values to the control parameters. The analysis and synthesis networks put together can be used as a system for data compression or low-bandwidth teleconferencing, with the transmitter equipped with an analysis network and the receiver provided with a synthesis network. Moreover, by training the synthesis network on different example pairs from those of the analysis network, the whole system can be used for object recognition (novel views of an object can be created by replicating the motion of another object) or as a method for creating cartoon animation driven by an actor (see [2] for other applications).

References
[1] J.R. Bergen and R. Hingorani. Hierarchical motion-based frame rate conversion. David Sarnoff Research Center, Princeton. Internal Report. April 1990.
[2] D. Beymer, A. Shashua and T. Poggio. Example based image analysis and synthesis. AI-MEMO 1431, Artificial Intelligence Laboratory, MIT. December 1993.
[3] S. Brunetta and N. Ancona. Apprendimento da Esempi: Analisi di Immagini [Learning from Examples: Image Analysis]. Proceedings of the 39th ANIPLA Conference (Automation 95). Bari (I). November 1995.
[4] P.J. Burt and E.H. Adelson. The Laplacian pyramid as a compact image code. IEEE Transactions on Communications, COM-31, N. 4:532-540. April 1983.
[5] R. Deriche and O.D. Faugeras. Tracking line segments. Proceedings of the 1st ECCV, Antibes (FR). Lecture Notes in Computer Science, Vol. 427. Springer-Verlag. 1990.
[6] I. Essa. Analysis, interpretation and synthesis of facial expressions. TR 303, MIT Media Lab. February 1995.
[7] O.D. Faugeras, F. Lustman and G. Toscani. Motion and structure from motion from point and line matches. Proceedings of the 1st ICCV, London (UK). Pages 25-34. IEEE Computer Society Press. 1987.
[8] B.K.P. Horn and B.G. Schunck. Determining optical flow. Artificial Intelligence, 17:185-203. 1981.
[9] B.K.P. Horn. Robot Vision. MIT Press, Cambridge. 1986.
[10] T. Poggio and F. Girosi. A theory of networks for approximation and learning. AI-MEMO 1140, Artificial Intelligence Laboratory, MIT. 1989.
[11] A. Shashua. Correspondence and affine shape from two orthographic views: Motion and recognition. AI-MEMO 1327, Artificial Intelligence Laboratory, MIT. December 1991.
A STABILIZED MULTISCALE ZERO-CROSSING IMAGE REPRESENTATION FOR IMAGE PROCESSING TASKS AT THE LEVEL OF THE EARLY VISION
Shinji WATANABE, Takashi KOMATSU and Takahiro SAITO
Department of Electrical Engineering, Kanagawa University, JAPAN
ABSTRACT
We present a new stabilized multiscale zero-crossing image representation at the level of the early vision, and develop an iterative algorithm for image reconstruction from the stabilized zero-crossing image representation. The algorithm provides a reconstructed image with subjectively high fidelity after some dozens of iterations. Moreover, we introduce a threshold operation based on edge intensity to reduce the amount of information in the stabilized zero-crossing image representation, and experimentally demonstrate that the threshold operation works well.

1. INTRODUCTION
Many previous studies on the early vision have shown that multiscale zero-crossing representations are well adapted for extracting locally important features such as edges from images [1]. The justification of multiscale zero-crossing representations originates in Logan's theorem, which proves that if a signal does not share any zero-crossings with its Hilbert transform, then it is uniquely characterized by its zero-crossings [2]. The multiscale zero-crossing representation is complete, but not stable, in the sense that a small perturbation of the representation may correspond to an arbitrarily large perturbation of the original signal. In order to stabilize the reconstruction of a signal from its zero-crossings, Mallat recently developed a stabilized waveform representation based on both the zero-crossings of a dyadic multiscale wavelet transform with the property of a local second-derivative operation and additional information such as the value of the wavelet transform integral between two zero-crossings, and he conjectured that the stabilized zero-crossing waveform representation might be complete and stable [3]. In addition, he formulated an algorithm for reconstructing signals from the stabilized zero-crossing representation.
Mallat's signal reconstruction algorithm, which is based on the POCS (projections onto convex sets) formulation, iterates a nonexpansive projection onto a convex set and an orthogonal projection onto a Hilbert subspace, and hence its convergence is guaranteed. Since then, many studies have described variants of the stabilized zero-crossing waveform representation. For instance, some studies have analyzed the variant which uses, as complementary information, the position and amplitude of each local extremum, defined as the maximum absolute value lying between two consecutive zero-crossings [4]. For almost all the variants presented so far, the signal reconstruction problem has been formulated as a non-linear optimization problem, and the resultant signal reconstruction algorithms are based on the POCS formulation, like Mallat's reconstruction algorithm. With the POCS-based alternating projection algorithms we can reconstruct a signal waveform from the stabilized zero-crossing representation with high fidelity. These types of zero-crossing representation are considered almost complete, but they have common drawbacks. The first drawback is that it is neither straightforward nor trivial to extend their signal reconstruction algorithms from one-dimensional signals to two-dimensional signals. The difficulty of the extension results from the non-linearity of the reconstruction problems. The second drawback is that for almost complete signal reconstruction the positions of zero-crossings and/or local extrema should be represented with fractional sampling-interval accuracy, which makes it difficult to apply the stabilized zero-crossing representations to various practical image processing problems at the level of the early vision. The third drawback is that the POCS-based alternating projection algorithms for signal reconstruction involve a large amount of computational effort. To cope with the above-mentioned drawbacks, in this paper we present a new stabilized zero-crossing representation in the wavelet transform domain. In the new stabilized zero-crossing representation, we represent the position of each zero-crossing as a certain sampling point, and to stabilize the representation we add complementary information defined as the inner product between the original signal and an integrated basis function of the dilated and shifted basic wavelet function at each zero-crossing point.
The new stabilized zero-crossing representation has the salient feature that the problem of how to reconstruct signals from it reduces to a typical minimum-norm optimization problem, the solution of which is formulated as a linear simultaneous equation [5]. In this case, however, the dimension of the resultant linear simultaneous equation is very large, and hence we employ an iterative relaxation method for solving the
simultaneous equation. The theory of linear operators guarantees the convergence property of the iterative signal reconstruction algorithm. Furthermore, in this paper, we extend the new stabilized zero-crossing representation and the iterative signal reconstruction algorithm to the two-dimensional case, that is to say, a multiscale image representation at the level of the early vision. Experimental simulations demonstrate that the two-dimensional iterative signal reconstruction algorithm, that is to say, the iterative image reconstruction algorithm, can reconstruct an original image from its stabilized multiscale zero-crossing image representation almost perfectly.

2. DYADIC MULTISCALE WAVELET TRANSFORM
The wavelet transform is performed by applying dilation and translation to a basic wavelet function ψ(x) and then evaluating the inner product between a given input signal f(x) and the dilated and translated wavelet function:
W_a(b) = (1/√a) ∫ ψ*((x − b)/a) f(x) dx    (1)
When we set a = 2^j, b = n, where j and n are integers, W_a(b) reduces to a discrete wavelet transform with the shift-invariance property. We refer to this type of wavelet transform as the dyadic multiscale wavelet transform. The dyadic multiscale wavelet transform is more redundant than the usual standard discrete wavelet transform, where a = 2^j, b = n × 2^j. The dyadic multiscale wavelet transform has the perfect reconstruction property that an original signal is perfectly reconstructed from the wavelet transforms at all the scales j = 1, 2, ..., but it is not feasible to use the wavelet transforms at all the scales. Instead, we limit the scale j within the range j = 1, 2, ..., M, and reconstruct an original signal from both the wavelet transforms at the scales j = 1, 2, ..., M and the smoothed signal at the coarsest scale j = M. In this paper we use a basic wavelet function ψ(x) which is defined as the second derivative of a short-length smoothing function and forms a bi-orthogonal basis with the properties of linear phase and local second-derivative operation. In this case, the detection of zero-crossing points corresponds to the extraction of edges. In order to reduce the detection of false zero-crossing points caused by ripples of the basic wavelet function, we employ a short-length smoothing function derived from the B-spline function [4]. The unit impulse responses h_L(n), h_H(n) of the band-splitting low-pass and high-pass filters used for the dyadic multiscale wavelet transform with the second-derivative operation are as follows:
(a) Low-pass analysis filter: { h_L(1) = 0.25, h_L(0) = 0.50, h_L(-1) = 0.25 }
(b) High-pass analysis filter: { h_H(1) = -0.25, h_H(0) = 0.50, h_H(-1) = -0.25 }
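A shift-invariant (undecimated) analysis pass with these 3-tap filters can be sketched as follows; spacing the taps 2^(j-1) samples apart at scale j ("à trous" filtering) and clamping at the borders are assumed implementation details, not specifics given in the paper:

```python
import numpy as np

def dyadic_wavelet(signal, M=3):
    """Undecimated dyadic wavelet analysis with the paper's 3-tap filters.
    Returns the detail signals at scales j = 1..M and the coarsest smooth."""
    hL = {-1: 0.25, 0: 0.50, 1: 0.25}    # low-pass analysis filter
    hH = {-1: -0.25, 0: 0.50, 1: -0.25}  # high-pass (second-derivative) filter
    s = np.asarray(signal, float)
    n = len(s)
    details, smooth = [], s
    for j in range(1, M + 1):
        step = 2 ** (j - 1)              # tap spacing doubles per scale
        W = np.zeros(n)
        S = np.zeros(n)
        for i in range(n):
            for k in (-1, 0, 1):
                idx = min(max(i + k * step, 0), n - 1)  # clamp at borders
                W[i] += hH[k] * smooth[idx]
                S[i] += hL[k] * smooth[idx]
        details.append(W)                # zero-crossings of W mark edges at scale j
        smooth = S
    return details, smooth
```

On a constant signal the high-pass taps cancel exactly (−0.25 + 0.50 − 0.25 = 0), so every detail signal is zero and the smooth signal is preserved, consistent with a second-derivative operator.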
3. STABILIZED ZERO-CROSSING REPRESENTATION
In the new stabilized zero-crossing representation, we represent the position Z_i of each zero-crossing point with integral sampling-interval accuracy, that is to say, for the true position of each zero-crossing point we substitute the sampling point which is nearest to the true position. In order to stabilize the reconstruction of a signal from its zero-crossings, we add complementary information defined as the inner product P_i between the original signal f(x) and an integrated function φ_j(x; Z_i) of the dilated and shifted basic wavelet function ψ((x − Z_i)/2^j) at the position Z_i of each zero-crossing point:

P_i = (1/N) ( f(x), φ_j(x; Z_i) )    (2)
where N is the normalization factor. When we use a basic wavelet function ψ(x) with the second-derivative property, the inner product defined by Eq. 2 corresponds to applying a first-derivative operation to the original signal f(x). Moreover, since the integrated basis function φ_j(x; Z_i) is obtained by applying the wavelet transformation to a unit step function U(x), the function φ_j(x; Z_i) serves as an edge model function and the value of the inner product P_i corresponds to the intensity of the edge model function included in the original signal f(x).

4. SIGNAL RECONSTRUCTION
The problem of how to reconstruct signals from the new stabilized zero-crossing representation reduces to a typical minimum-norm optimization problem, where the vector with minimum norm is selected as the optimal solution vector under the constraint that the inner products of the solution vector with the multiple basis vectors are given. The solution of the minimum-norm optimization problem is easily formulated as a linear simultaneous equation. In the case of the signal reconstruction problem, however, the dimension of the resultant linear simultaneous equation is equal to the number of detected zero-crossings, and is too large to solve the equation directly, non-iteratively. Instead, we employ an iterative relaxation method for solving the equation. The mathematical theory of linear operators guarantees that the iterative signal reconstruction algorithm has the convergence property. The iterative signal reconstruction algorithm is as follows.
[ Iterative Signal Reconstruction Algorithm ]
(1) Adopt a smoothed signal at the coarsest scale j = M as an initial function of a reconstruction signal
R(x). (2) Apply the procedures of steps (2-1) and (2-2) to all the zero-crossings at all the scales. (2-1) For each zero-crossing Z_i at the scale j, compute
an inner product Q_i between the present reconstruction signal R(x) and the integrated basis function φ_j(x; Z_i). (2-2) Update the reconstruction signal R(x) as follows:

R(x) ← R(x) + (P_i − Q_i) · φ_j(x; Z_i)    (4)

(3) Repeat the above operations of step (2) until convergence.

5. EXTENSION TO IMAGE REPRESENTATION
We extend the new stabilized multiscale zero-crossing representation and the iterative signal reconstruction algorithm to the two-dimensional case, that is to say, a multiscale image representation at the level of the early vision. For the extension, we employ the multiscale pyramidal wavelet transform as the two-dimensional multiscale wavelet transform. First we decompose an input signal into four multiscale dyadic wavelet transforms W_LL(x, y), W_LH(x, y), W_HL(x, y), W_HH(x, y) by performing the horizontal wavelet transform and the vertical wavelet transform sequentially, and then we compose the objective multiscale pyramidal wavelet transform W_H(x, y) by reconstructing it from only the three wavelet transforms W_LH(x, y), W_HL(x, y), W_HH(x, y). The multiscale pyramidal wavelet transform has the same data structure as the Laplacian pyramid image representation [6], and has the property of perfect reconstruction. For a given two-dimensional signal, the zero-crossings of its multiscale pyramidal wavelet transforms form zero-crossing lines. We represent the zero-crossing lines with integral sampling accuracy, that is to say, for the true zero-crossing line we substitute the sequence of sampling points which best approximates the true position. Moreover, as complementary information, at the position Z_i = (Z_i,x, Z_i,y)' of each approximate zero-crossing point on the zero-crossing line at the scale j, we employ the inner product S_i between the original two-dimensional signal f(x, y) and the basis function ρ_j(x, y; Z_i), which works as a first-derivative operator.
S_i = ( f(x, y), ρ_j(x, y; Z_i) )    (5)
As for the basis function ρ_j(x, y; Z_i), we prepare four different candidate functions in advance, and then we select the proper function from the four candidates according to the connection of the neighboring approximate zero-crossing points in the vicinity of the position Z_i on the zero-crossing line. We define the four different candidate functions γ_j(x, y; Z_i; n_1, n_2) with the two parameters n_1, n_2:
γ_j(x, y; Z_i; n_1, n_2) = (1/N_i,j) { η_j(x, y; Z_i,x, Z_i,y) n_1 + η_j(y, x; Z_i,y, Z_i,x) n_2 }    (6)

(n_1, n_2) = (1, 0), (0, 1), (1, 1), (1, −1)
N_i,j : normalization coefficient
η_j(u_1, u_2; Z_i,1, Z_i,2) = ∫_{−∞}^{u_1} ψ((v − Z_i,1)/2^j) dv · exp( −(u_2 − Z_i,2)² / 2^(2j−2) )
where we choose the proper set of values for the two parameters n_1, n_2 from the four possible sets according to the connection of the neighboring approximate zero-crossing points, as follows:
(a) Horizontal connection: (n_1, n_2) = (1, 0)
(b) Vertical connection: (n_1, n_2) = (0, 1)
(c) Diagonal connection with the upper-right direction: (n_1, n_2) = (1, 1)
(d) Diagonal connection with the lower-right direction: (n_1, n_2) = (1, -1)
(e) Vague connection: we use the two different basis functions defined by the two sets of parameter values (n_1, n_2) = (1, 0) and (n_1, n_2) = (0, 1), and compute the two different inner products with the chosen two basis functions.
The function η_j(u_1, u_2; Z_i,1, Z_i,2) appearing in the definition of the candidate function γ_j(·) is defined by Eq. 6. The function η_j(u_1, u_2; Z_i,1, Z_i,2) works as a first-derivative operator along the Z_i,1-axis, whereas it behaves as a Gaussian smoothing operator along the Z_i,2-axis. We introduce the Gaussian smoothing operator to suppress the noticeable blockwise artifacts in the reconstructed image. The two-dimensional iterative signal reconstruction algorithm, that is to say, the iterative image reconstruction algorithm, differs from the one-dimensional algorithm in that we choose the proper basis function for the inner product according to the connection of the neighboring approximate zero-crossing points on the zero-crossing line and, additionally, in that in the case of vague connection we use two different basis functions and their two corresponding inner products, but it does not differ in the general outline of the iterative procedure. The two-dimensional signal reconstruction algorithm, that is to say, the iterative image reconstruction algorithm, provides an almost perfectly reconstructed image.
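The general outline of the iterative procedure — repeatedly enforcing each inner-product constraint by adding the residual times the basis function, as in step (2-2) of Section 4 — amounts to a Kaczmarz-style relaxation sweep. A minimal 1-D sketch, with generic unit-norm basis vectors standing in for the integrated basis functions (the step size assumes unit-norm bases, an illustrative simplification):

```python
import numpy as np

def reconstruct(basis, P, R0, sweeps=50):
    """Relaxation sweeps enforcing <R, phi_i> = P_i via
    R <- R + (P_i - Q_i) * phi_i, with unit-norm phi_i."""
    R = R0.copy()
    for _ in range(sweeps):
        for phi, p in zip(basis, P):
            q = R @ phi              # current inner product Q_i
            R += (p - q) * phi       # pull R toward the i-th constraint
    return R
```

After convergence the reconstruction satisfies every given inner-product constraint; when the basis vectors are mutually orthogonal a single sweep already suffices.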
If we measure the fidelity of image reconstruction with the rms signal-to-noise ratio (SNR) computed with respect to the original image, the two-dimensional algorithm achieves image reconstruction with extremely high fidelity, typically 60 dB or more. Fig. 1 gives the value of the SNR of the reconstructed image after n iterations, where the number of scales M is set to 4. High-fidelity image reconstruction with an SNR
of 55 dB or more involves a large number of iterations, but the algorithm gives a reconstructed image with subjectively high picture quality even after some dozens of iterations, which is an especially preferable characteristic for its practical applications to various image processing tasks at the level of the early vision.

6. DATA COMPRESSION OF THE MULTISCALE ZERO-CROSSING IMAGE REPRESENTATION
To reduce the amount of the complementary information in the stabilized multiscale zero-crossing image representation, we introduce a threshold operation based on edge intensity, which is defined as the absolute value of the foregoing inner product S_i at each approximate zero-crossing point. We eliminate the complementary information for the approximate zero-crossing point whose edge intensity is below the threshold value:

if |S_i| < T_r · δ_j, then eliminate that zero-crossing point Z_i    (8)

where the threshold value depends on the scale j, and δ_j is the possible maximum of the absolute value of the inner product S_i at that scale. The threshold operation makes much of coarse edges with high edge intensity and preserves them, whereas it makes little of fine edges with weak edge intensity and eliminates them. Fig. 2 gives the value of the SNR of the image reconstructed from the compressed multiscale zero-crossing image representation after n iterations, where the number of scales M is set to 4 and the values of T_r and their corresponding data compression ratios C_r are as follows: T_r = 0.05 (C_r = 51.6 %), T_r = 0.20 (C_r = 21.7 %), T_r = 0.50 (C_r = 10.4 %), T_r = 0.80 (C_r = 6.0 %). As we increase the value of T_r, visible smear artifacts resulting from image reconstruction errors grow wider and clearer in the reconstructed image, and from the standpoint of picture quality the value of T_r should be set below 0.2.

7. CONCLUSIONS
This paper presents a new stabilized multiscale zero-crossing image representation at the level of the early vision and an iterative image reconstruction algorithm. With the iterative image reconstruction algorithm we can almost perfectly reconstruct an original image from the stabilized multiscale zero-crossing image representation, and after some dozens of iterations the algorithm provides a reconstructed image with subjectively high picture quality. Furthermore, we introduce a threshold operation based on edge intensity to reduce the amount of the complementary information in the stabilized multiscale zero-crossing image representation, and experimentally demonstrate that the threshold operation works well.

Figure 1 SNR versus the number n of iterations for image reconstruction of the test image "Lady".
Figure 2 SNR versus the number n of iterations for image reconstruction from the compressed multiscale zero-crossing representation of the test image "Lady".

REFERENCES
[1] D. Marr and E. Hildreth, "Theory of edge detection", Proc. Roy. Soc. Lond., vol. 207, pp. 187-217, 1980.
[2] B. Logan, "Information in the zero-crossings of bandpass signals", Bell Syst. Tech. J., vol. 56, pp. 487-510, 1977.
[3] S. Mallat, "Zero-crossings of a wavelet transform", IEEE Trans. Inform. Theory, vol. 37, pp. 1019-1033, 1991.
[4] C.K. Cheong, K. Aizawa, T. Saito, and M. Hatori, "Image reconstruction based on zero-crossing representations of wavelet transform", IEICE Trans., vol. J77-A, pp. 992-1005, 1994.
[5] D.G. Luenberger, "Optimization by Vector Space Methods", John Wiley & Sons, Inc., 1969.
[6] P.J. Burt and E.H. Adelson, "The Laplacian pyramid as a compact image code", IEEE Trans. Commun., vol. 31, pp. 532-540, 1983.
FINDING GEOMETRIC AND STRUCTURAL INFORMATION FROM 2D IMAGE FRAMES
R. Jaitly and D. A. Fraser
Vision & Robotics Laboratory, Dept. of Electronic & Electrical Engineering, King's College, Strand, London, WC2R 2LS, UK
e-mail: [email protected]

Abstract
This paper describes the process of obtaining geometric and structural information, in the form of relational tables, from two-dimensional intensity image frames. The process is divided into two subprocesses: the heuristic edge follower and the structure finding algorithm. The first subprocess searches an intensity image to locate and extract potential edges satisfying certain criteria and fits linear equations to them using linear regression analysis. The second subprocess finds valid intersections between these linear equations and hence obtains the structural relationships between these intersections. The outcome of the process is a relational table, containing all vertices found and the relationships between them, for the given intensity image. The process has been tested on a number of rigid, planar-faced three-dimensional objects, producing accurate relational tables from the given views of the objects. The process is described here with reference to one such object, and the performance in terms of processing time is reported.
I. Introduction
An important part of solving feature-based correspondence problems [1], [2], [3], [6] is obtaining accurate feature and structural information from the 2D image frames used in the correspondence process. Many methods for extracting features such as corners, edges and lines [4], [5], [7], [8], using a wide variety of processes, have been described in the literature, but very few deal explicitly with structural information. Despite the large amount of research in this area, effective extraction of linear boundaries and their corresponding structural connectivity has remained a difficult and error-prone problem in many image domains. This paper presents a novel approach to processing 2D images to find geometric and structural information using simple heuristics. The technique presented here was motivated by the need for an extraction method which can: find straight lines with a minimum specified length, find the intersections of these lines, and find the relationships between these lines and their intersections (structural information). The performance of the method, in terms of processing time, is given with respect to a test object (see figure 2).
II. The Heuristic Edge Follower
Edges are usually defined as local discontinuities or rapid changes in a particular image feature, such as texture. These changes are detected by a local operator, usually of small spatial extent with respect to the image, that measures the magnitude of the change and its orientation. Lines are commonly defined as collections of local edges that are contiguous in the image. Thus many algorithms rely on a two-step process for line extraction: detection of local edges that are then aggregated into more globally defined lines on the basis of various grouping criteria. One observation about many line extraction algorithms is that they relegate information about edge direction to a secondary role in the processing. In most edge and line extraction algorithms, the gradient magnitude of the intensity change is used in some manner as a measure of the importance of the local edge. While gradient direction information may be used to modulate the grouping process applied to the strong edges, the gradient magnitude usually has the central and dominating influence. The method described in this paper for finding edges uses a single step for line extraction. This process groups together edge pixels that satisfy criteria involving the gradient magnitude and direction data equally. Edge information is extracted using a Sobel-based heuristic technique. The technique involves calculating the Sobel gradient magnitude and gradient direction data for a given intensity image and using this information to raster-scan the image to find all possible edges. A transputer-based frame-grabber with a high-speed Plessey 2D convolver
unit, using a 3x3 Sobel mask, produces two gradient buffers: one for the x-gradient and one for the y-gradient of the given image. The gradient magnitude data is calculated by adding the positive values of these two gradient buffers, while a look-up table, using the two gradient buffers as indexes to the table, is used to calculate the direction data. Note: the images used are 512x512, 8-bit grey-scale images captured using a CCD camera. The edge follower raster-scans the gradient magnitude buffer, and flowchart 1 explains the process of finding and accepting edges.
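The gradient computation can be sketched in software as follows. This is a stand-in for the hardware convolver and the direction look-up table: `arctan2` replaces the table, and summing absolute values is our reading of "adding the positive values":

```python
import numpy as np

def sobel_gradients(img):
    """3x3 Sobel x- and y-gradient buffers, plus magnitude and direction."""
    img = np.asarray(img, float)
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]])  # x-gradient mask
    ky = kx.T                                            # y-gradient mask
    h, w = img.shape
    p = np.pad(img, 1, mode='edge')                      # replicate borders
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            win = p[dy:dy + h, dx:dx + w]
            gx += kx[dy, dx] * win
            gy += ky[dy, dx] * win
    mag = np.abs(gx) + np.abs(gy)            # magnitude buffer
    direc = np.degrees(np.arctan2(gy, gx))   # direction buffer, in degrees
    return mag, direc
```

For a vertical step edge the y-gradient cancels, giving a direction of 0° and a nonzero magnitude only near the step.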
Once an edge pixel of a potential edge has been found that is greater than a threshold T1 (A), a check is made to discover whether the edge pixel has already been processed. This is achieved by checking to see if the edge pixel is in the edge data buffer buf_chk. If the edge pixel has already been processed, the pixel is discarded and the process continues raster-scanning the image. If the edge pixel has not previously been processed, the edge pixel is added to buf_chk and a mask, used to find the next edge pixel, is chosen depending on the current edge pixel's direction (B). Flowchart 1 shows the positive and negative masks and their corresponding angle thresholds T_pos and T_neg. The edge pixel angle is stored so that all following edge pixels can be checked to see if they are out of range of T_ang. Therefore, the algorithm uses the mask to move along the edge, checking that the magnitude and direction of the edge remain within the set criteria (C). If an edge changes direction abruptly, the edge pixel and its direction are noted and stored in a deviation buffer buf_dev. If buf_dev becomes greater than a threshold T2, the algorithm assumes that the edge has changed direction and hence stores the first edge pixel to deviate from the original direction as a possible intersection of two edges (D). If the number of current edge pixels is above a threshold T3, the average of the gradient direction values of these edge pixels is used to tighten the constraint on deviation from T_ang (E). The next connected edge pixel is found using the aforementioned mask. From either mask in flowchart 1, if X represents the current edge pixel, the next edge pixel to be chosen as X (i.e. to follow the edge) is determined by comparing the A, B, and C positions in
the mask to find the next edge pixel with a high gradient magnitude and very similar gradient direction. If the gradient magnitude of the new edge pixel is above a threshold T4, the edge pixel is positioned at X in the mask (F), and the process loops back to check the criteria set in steps (A), (B), (C), and (D) for this new edge pixel. Once all edge pixels for a given edge have been found, a threshold T5 determines whether the number of edge pixels representing the edge (the edge pixel set) is valid or not (G). If the edge is valid, the edge pixels are sent to a linear regression analysis algorithm that fits a straight line to the edge pixel set (H). If the edge is invalid, it is stored as noise (I) in the noise buffer buf_noise. If the end of the image has not been reached, the algorithm continues to raster-scan the image to find other potential edges. The outcome of applying the heuristic edge follower to the image is a set of linear equations E_edge representing all the edges found. The noise buffer buf_noise is subtracted from the edge buffer buf_edge. The edge buffer is then thinned using a parallel implementation of a multipass thinning algorithm with a limit parameter. After the thinning process, the buffer is passed through a white speckle removing algorithm. After this last step, the linear equations E_edge can be checked against the edges they represent by superimposing each equation onto the image edges and searching for edge pixels using a five-pixel-wide window. Once this has been achieved, linear regression analysis is again used to form a new linear equation E_new, and this equation is compared to E_edge to find the difference. In practice, the linear equation set E_edge accurately represents the image edges, and hence superimposition and equation checking provide negligible improvement.
III. The Structure Finding Algorithm
There are a number of line drawing algorithms that are mainly based on the use of local features of images instead of placing emphasis on using a priori information of object models [9], [10]. This results in unsatisfactory extraction of line drawings which describe the geometrical and topological structures of objects. Most current methods using a priori information tend to search the whole image space, which increases processing time [11], [12]. In the subprocess described here, the search is concentrated on local areas centred around potential junctions, and thus search time is significantly reduced. The a priori information used in this subprocess is that each of the edge lines in a view of a rigid, planar-faced object is a straight line connecting two junctions, and that around each junction the directions of the edges change significantly. The subprocess is described below. The edge equations E_edge are compared to each other to find all possible intersections and hence all valid vertices. To speed up the checking process, the equations are distributed over the transputer network (four transputers), with each T805 transputer containing a copy of the edge buffer. The following steps are carried out to establish success or failure in finding a true intersection. If an intersection lies outside the permitted edge buffer space, the comparison is terminated. An angle validity check is made: if the angle between the two lines is less than 10 degrees or greater than 170 degrees, the comparison is terminated. A line following algorithm checks to see if both current lines converge to the given intersection. The rules for accepting lines converging to intersections, or being present near intersections, are described with reference to figure 1. The intersection point (i_xy) is located and the lines intersecting at this point are checked. Firstly, line 1 (L1) is checked.
A search window from the intersection point moves downwards (L1_down) a preset distance, and then moves from the intersection point upwards (L1_up), again a preset distance. If the window finds the line in either direction, the rules shown in figure 1 are applied. As shown in the rules in figure 1, if the line is present, or not present, in both directions from the intersection point, the comparison is terminated. Otherwise, the same procedure is carried out for line 2 (L2). In figure 1, L1_down and L2_down are true, so the intersection of this line pair is valid and hence placed into an intersection list containing the equation pair and the position (i_xy) of their intersection. Once all possible intersections have been processed, the intersections are checked and verified against those found using the edge follower, and then the relationship between vertices is found. Vertex relationships (connectedness) are found by processing the table containing the equations and intersections list and linking intersections sharing the same equations.
Figure 1. - The magnified corner of an object
IV. Results
The two aforementioned subprocesses have been tested on a variety of rigid, planar-faced 3D objects. Figure 2 shows one such object. Figure 3 shows the contents of the edge buffer after the edge following algorithm has been used. Figure 4 shows the edge buffer after the noise buffer has been subtracted, the buffer data thinned, and then white speckles removed. Figure 5 shows the geometric and structural information overlaid onto the original object shown in figure 2. Nine vertices V1 to V9 have been recovered from the object. The white lines represent the line equations fitted to the edge pixels. Table 1 shows that the processing time (in seconds) of the parallel implementation of the two subprocesses compared favourably to their sequential equivalents. Table 2 is the relational list formed from figure 5.
Figure 2. - Original Image
Figure 3. - Edge Buffer
Figure 4. - Thinned Image
Figure 5. - Structure of Object
Table 1. - Processing Time in seconds

Process             Sequential Time    Parallel Time
Edge Follower       15.571458          2.747836
Structure Finding   2.606776           0.436754

Table 2. - Relationships

Vertex    Relationship
V1        V2, V4, V9
V2        V1, V3
V3        V2, V4
V4        V1, V3, V8
V5        V6
V6        V5, V7
V7        V4, V6, V8
V8        V4, V7, V9
V9        V1, V8

V. Conclusions
A method for obtaining geometric and structural information, in the form of relational tables, from two-dimensional intensity image frames has been presented in this paper. The method is divided into two subprocesses: the heuristic edge follower and the structure finding algorithm. The first subprocess uses a set of heuristics to fit linear equations to extracted edges acquired from an intensity image. The second subprocess obtains all valid intersections and structural relationships between these linear equations. The outcome of the process is a relational table, containing all vertices found and the relationships between them, for the given intensity image. The process has been tested on a number of rigid, planar-faced three-dimensional objects, producing accurate relational tables from the given views of the objects. Table 1 shows that the parallel implementation of the subprocesses provides a fast, reliable method.
References
[1] R. Shapira, "A technique for the reconstruction of a straight-edge, wire-frame object from two or more central projections", Computer Graphics and Image Processing, Vol. 3, No. 4, pp. 318-326, December 1974.
[2] T.O. Binford, "Visual perception by computer", in Proc. IEEE Conf. on Systems and Control, Miami, December 1971.
[3] D. Marr and T. Poggio, "Cooperative computation of stereo disparity", Science, Vol. 194, pp. 283-287, 1976.
[4] J. Cooper, S. Venkatesh and L. Kitchen, "Early Jump-Out Corner Detectors", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 15, No. 8, pp. 823-828, August 1993.
[5] J. Matas and J. Kittler, "Junction detection using probabilistic relaxation", Image and Vision Computing, Vol. 11, No. 4, pp. 197-202, May 1993.
[6] R. Jaitly and D.A. Fraser, "Automated 3D Object Recognition and Library Entry System", Neural Network World, Vol. 6, No. 2, pp. 173-183, April 1996.
[7] L.S. Davis, "A Survey of Edge Detection Techniques", Computer Graphics and Image Processing, Vol. 4, No. 3, pp. 248-270, 1975.
[8] V. Torre and T.A. Poggio, "On Edge Detection", IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-8, No. 2, pp. 147-163, March 1986.
[9] A. Martelli, "Edge detection using heuristic search methods", Computer Graphics and Image Processing, Vol. 1, pp. 169-172, 1972.
[10] M. Nagao, S. Hashimoto and T. Sakai, "Automatic model generation and recognition of simple three-dimensional bodies", Computer Graphics and Image Processing, Vol. 2, pp. 272-277, 1973.
[11] A.K. Griffith, "Edge detection in simple scenes using a priori information", IEEE Transactions on Computing, Vol. C-22, pp. 371-375, April 1973.
[12] F. O'Gorman and M.B. Clowes, "Finding picture edges through collinearity of feature points", IEEE Transactions on Computing, Vol. C-25, pp. 449-455, April 1976.
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
Detection of small changes in intensity on images corrupted by signal-dependent noise by using the wavelet transform
Yasmina CHITTI and Paul GOGAN
CNRS, UPR 9041, Marseille, FRANCE
1 Introduction
Cell imagery is generally based on the use of molecular fluorescent probes [1]. These probes bind to the external membrane of the cell, and their optical properties (birefringence, intensity, spectra, ...) vary with one of the electro-chemical properties of the cell, such as calcium concentration or electrical membrane potential. By using a voltage-sensitive probe, we can image the spatial changes in the electrical potential on the membrane of an excited neuron [2]. Due to the low transduction power of such probes, the changes in potential induce very small changes in intensity. The purpose of this work is to detect such changes between two charge-coupled device (CCD) images corrupted by signal-dependent noise and with nonuniform illumination.
2 Noise analysis
A typical CCD image provided by our imaging system is shown in Fig. 1,A. CCD images can be significantly affected by noise arising from the random thermal motion of the electrons in resistive circuit elements. This thermal noise is modeled as an additive Gaussian process with zero mean [3]. To reduce this noise, the CCD chip is cooled with liquid nitrogen. Images can also be significantly affected by the photoelectronic noise resulting from the quantal nature of light and the photoelectronic conversion process. This noise is well approximated by a random Poisson process [4]. While the first noise is pixel independent, the second one is not. For each pixel, the variance produced by the photoelectronic noise depends on the number of incident photons and thus varies spatially, since the intensities are not uniform. This implies that the image processing has to be local, taking into account the properties of every pixel. In these conditions, an image where the studied phenomena induce changes in intensity is compared to a reference image of the same field where no change occurs. The image comparison is based on a computation of the pixel-to-pixel relative variation of intensity between the two images:
r(x, y) = (g(x, y) - f(x, y)) / f(x, y)    (1)

where
• (x, y) are the pixel coordinates,
• f(x, y) is the intensity of the reference image at the coordinates (x, y),
• g(x, y) is the intensity of the image where changes occur at the coordinates (x, y),
• r(x, y) is the intensity of the result of the relative variation between f and g at the coordinates (x, y).
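The pixel-to-pixel comparison of Eq. (1) is straightforward to sketch. This is an illustration only; images are represented as nested lists and the intensities of the reference image are assumed nonzero, since f divides.

```python
def relative_variation(f, g):
    """Pixel-to-pixel relative intensity variation r = (g - f) / f, Eq. (1).

    f: reference image, g: image where changes occur.
    Nonzero reference intensities are assumed.
    """
    return [[(g[y][x] - f[y][x]) / f[y][x] for x in range(len(f[0]))]
            for y in range(len(f))]
```

For example, a pixel moving from 100 to 101 counts gives a relative variation of +1%.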
Although this method has often been applied in biology to detect large changes in the electrical potential of neurons [5], it increases the noise and impairs the detection of small changes in the pixels of the image resulting from the comparison. The variance σr²(x, y) of the relative variation r(x, y) is expressed as:

σr²(x, y) = (1 / f²(x, y)) [σg²(x, y) + (g²(x, y) / f²(x, y)) σf²(x, y)]    (2)

where
• σf²(x, y) is the variance of f(x, y) at the coordinates (x, y),
• σg²(x, y) is the variance of g(x, y) at the coordinates (x, y).

We show a 2D map of the variance computed from Eq. (2) in Fig. 1,B. In our images, σr²(x, y) is inversely proportional to the intensity f(x, y). The consequence is that the noise of r(x, y) will be larger on the background than on the objects. Furthermore, the spatial variation of the variance is larger in the background than in the objects and impairs the detection of areas where intensity changes occurred. We show that it is possible to adapt a filtering algorithm to remove the noise and efficiently detect small changes in intensity. Due to the signal dependence of the noise, the filtering and the detection have to take into account the local properties of the pixels. In the following paragraphs, we propose an efficient method based on the wavelet transform to detect such changes in the noisy r(x, y) image, with removal of the spatial noise arising from the background pixels.
3 Filtering by wavelet transform
Assuming that the changes consist of hierarchical structures, we can use the wavelet transform [6] to filter the r(x, y) image. First, the image is transformed into several planes corresponding to different scales using the 'à trous' algorithm [7]. In this algorithm, every coefficient c(k, x, y) at the scale k (k = 1, ..., N) corresponds to the following scalar product:

c(k, x, y) = < r(n, m), φ((n - x) / 2^k, (m - y) / 2^k) >    (3)
where φ is a scaling function satisfying the dilation equation [8]. We get the recursive relation:
c(k, x, y) = Σ_{n,m} h(n, m) c(k - 1, x + 2^{k-1} n, y + 2^{k-1} m)    (4)
where h is the wavelet filter corresponding to the φ function. The scaling function is usually a cubic B-spline, but many other functions can be adapted. Then, the wavelet coefficients w(k, x, y) at scale k (k = 1, ..., N) correspond to the difference between two successive approximations c(k - 1, x, y) and c(k, x, y) of the image r(x, y):
w(k, x, y) = c(k - 1, x, y) - c(k, x, y),   k = 1, ..., N    (5)
where c(0, x, y) is equal to the image r(x, y). The image restoration is based on the analysis of the statistical significance of the w(k, x, y) coefficients [9]. Taking into account the distribution law of the coefficients w(k, x, y), non-significant coefficients are rejected. In the case of our images, it is not possible to obtain a single variance value
for each wavelet plane since the variance is spatially non-uniform in the image. We have modified the algorithm to take the heterogeneous variance into account. If the w(k, x, y) coefficients of the plane k are related to the coefficients c(0, x, y) of the plane 0 (i.e. to r(x, y) itself) by a filter g such that:
w(k, x, y) = Σ_{n,m} g(k, n, m) c(0, x + n, y + m)    (6)

then the variance σw²(k, x, y) of the w(k, x, y) coefficients can be easily obtained from the variance of r(x, y):

σw²(k, x, y) = Σ_{n,m} g²(k, n, m) σr²(x + n, y + m)    (7)
Every coefficient w(k, x, y) in each plane k is then tested against its standard deviation σw(k, x, y) obtained from Eq. (7). It is considered significant if it is larger than n·σw(k, x, y), where n depends on the chosen significance probability. The reconstruction of the restored image is obtained by adding together the planes of only the significant w(k, x, y) coefficients and the last smoothed plane, c(N, x, y).
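The decomposition of Eqs. (4) and (5) and the additive reconstruction can be sketched in one dimension (the 2D transform applies the same filter separably). This is a sketch under stated assumptions: the B3-spline filter is the customary choice for the 'à trous' algorithm, and mirror boundary handling is assumed since the paper does not specify borders.

```python
# B3-spline scaling filter commonly used with the 'a trous' algorithm.
H = [1 / 16, 1 / 4, 3 / 8, 1 / 4, 1 / 16]

def a_trous(signal, n_scales):
    """1D 'a trous' wavelet transform (the 2D version is separable).

    The holes between filter taps double at each scale, as in Eq. (4),
    and each wavelet plane is w(k) = c(k-1) - c(k), as in Eq. (5).
    """
    c = list(signal)
    n = len(c)
    planes = []
    for k in range(1, n_scales + 1):
        step = 2 ** (k - 1)
        smoothed = []
        for x in range(n):
            acc = 0.0
            for j, h in enumerate(H):
                idx = abs(x + (j - 2) * step)   # mirror at the left border
                if idx >= n:
                    idx = 2 * n - 2 - idx       # mirror at the right border
                acc += h * c[idx]
            smoothed.append(acc)
        planes.append([a - b for a, b in zip(c, smoothed)])  # w(k)
        c = smoothed
    return planes, c  # wavelet planes w(1..N), last smoothed plane c(N)

def reconstruct(planes, last):
    """Sum the (possibly thresholded) wavelet planes and c(N)."""
    out = list(last)
    for w in planes:
        out = [o + wi for o, wi in zip(out, w)]
    return out
```

Because the w(k) planes are simple differences of successive approximations, summing all planes and the last smoothed plane recovers the original signal exactly.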
4 Experimental results
This method was applied to detect local electric fields on the membrane of an excited neuron stained with a voltage-sensitive dye; the biological methods and results were published in detail elsewhere [2]. In Fig. 1, we can see examples of the detection obtained by using the wavelet transform, which removes the spatial noise generated by the non-biological background. When the neuron under study is excited (Fig. 1, C and E), we detect pixel clusters corresponding to the operation of groups of ionic channels in the membrane of the active neuron (Fig. 1, C). When the neuron under study is not stimulated (Fig. 1, D and F), the small number of clusters detected in the control images (Fig. 1, F) corresponds to spontaneous activities of the biological membranes in the field.
Figure 1: (A) Image of the microscopic field showing the fluorescent neurons. Only the neuron shown by the arrow is excited. (B) 2D map of the variance in image A. (C, D) Relative variation image before filtering. Image C corresponds to a relative variation image when the neuron is excited. Image D corresponds to a control relative variation image when the neuron is at rest. (E, F) Filtering of C and D, respectively, by using the wavelet transform. The top scale corresponds to B (variance): full scale is 3·10⁻³. The bottom scale corresponds to C, D, E and F (relative variation images): full scale is 3.2%.
5 Statistical significance
First, the use of the wavelet transform makes it possible to evaluate the significance, at the single pixel level, from the distribution law of the w(k, x, y) coefficients. Assuming that the w-law is well approximated by a normal law with zero mean, the significance of a detected pixel will depend on the n factor (see section 3). The thresholding of planes with n = 2 gives a confidence better than 95%, and with n = 3 a confidence better than 99%. Then, to evaluate the significance of the results and the resolution in intensity changes that our system provides, the lowest significant distance (i.e. the best resolution in intensity) between two successive samples of intensity must be computed. The first step is to classify the pixels of the filtered relative variation image into N samples with the same chosen intensity range. Then, we compute the mean and the variance in each sample of pixels from the raw pixel intensities and variances of the unfiltered relative variation image. Using the Bienaymé-Chebyshev theorem, we compute the limit of the probability that two successive samples have different means [10]. If this probability is greater than 80%, the two compared samples are taken as significantly different. Filtering in wavelet space detects changes in intensity with a resolution of 0.3% and a confidence greater than 95%.
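The per-pixel significance test described above (keep a coefficient only if it exceeds n times its own, spatially varying, standard deviation) reduces to a few lines. This sketch assumes the wavelet plane and its standard-deviation map are given as nested lists; n = 2 corresponds to >95% confidence and n = 3 to >99% under the zero-mean normal approximation of the text.

```python
def threshold_plane(w_plane, sigma_plane, n=3):
    """Keep only significant coefficients: |w(k, x, y)| > n * sigma_w(k, x, y).

    Non-significant coefficients are set to zero, so the restored image
    can be formed by summing thresholded planes and the smoothed plane.
    """
    return [[w if abs(w) > n * s else 0.0
             for w, s in zip(wrow, srow)]
            for wrow, srow in zip(w_plane, sigma_plane)]
```

Because sigma is a full map rather than a single number per plane, the heterogeneous variance of Eq. (7) is honoured pixel by pixel.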
6 Conclusion
This method is a complete image processing tool that gives significant results, but it is unfortunately limited by its computational cost, both in time and in computer storage space. However, the algorithm preserves the photometry and provides a calibration in size and intensity of active sites by extraction of significant structures.
References
[1] L.M. Loew, S. Scully, L. Simpson, and A.S. Waggoner. Evidence for a charge-shift electrochromic mechanism in a probe of membrane potential. Nature, 281:497-499, 1979.
[2] P. Gogan, I. Schmiedel-Jakob, Y. Chitti, and S. Tyč-Dumont. Fluorescence imaging of local electric fields during the excitation of single neurons in culture. Biophysical J., 69:299-310, 1995.
[3] J.A. Jamieson. Infrared Physics and Engineering. McGraw-Hill, New York, 1963.
[4] M. Gasden. Some statistical properties of pulses from photomultipliers. Applied Optics, 4:1446-1452, 1965.
[5] A. Grinvald, R.D. Frostig, E. Lieke, and R. Hildesheim. Optical imaging of neuronal activity. Physiol. Rev., 68:1285-1366, 1988.
[6] J. Morlet, G. Arens, E. Fourgeau, and D. Giard. Wave propagation and sampling theory - I and II. Geophysics, 47:203-236, 1982.
[7] A. Grossmann, R. Kronland-Martinet, and J. Morlet. Reading and Understanding Continuous Wavelet Transforms. In Wavelets: Time-Frequency Methods and Phase Space (J.M. Combes et al., Eds), Springer Verlag, Berlin, 1989.
[8] G. Strang. Wavelets and Dilation Equations: A Brief Introduction. SIAM Review, 31:614-627, 1989.
[9] J.L. Starck, A. Bijaoui, and F. Murtagh. Multiresolution Support Applied to Image Filtering and Restoration. Graphical Models and Image Processing, 57:420-431, 1995.
[10] W. Feller. An Introduction to Probability Theory and its Applications I and II. Wiley, New York, 1964.
Deterioration Detection in a Sequence of Large Images
O. Buisson, B. Besserer, S. Boukir & L. Joyeux
obuisson@gi.univ-lr.fr, bbessere@gi.univ-lr.fr, sboukir@gi.univ-lr.fr, ljoyeux@gi.univ-lr.fr
Laboratoire d'Informatique et d'Imagerie Industrielle (L3i)
Université de La Rochelle, avenue Marillac, F-17042 La Rochelle cedex 1
Abstract
This paper presents a robust technique to detect local deteriorations in old cinematographic films. The method relies on spatio-temporal information and combines two different detectors: a morphological detector, which uses the spatial properties of the deteriorations to detect them, and a dynamic detector based on motion estimation techniques. Our deterioration detector has been validated on several film sequences and has turned out to be a powerful tool for digital film restoration.
1. Introduction
Most of the techniques in use today for cinematographic film restoration are based on chemical and mechanical manipulations. Applying digital techniques to the field of film restoration lets us expect results beyond today's limitations, such as automated processing, correction of photographed dust spots and scratches (i.e. after film duplication), the removal of large defects, etc. Our research institute is involved, alongside the Laboratoires Neyrac Films company, in the European LIMELIGHT project, which aims at designing a complete digital processing chain suitable for restoring old films (film scanner, processing workstation, imaging device). Our main work concerns software development for the automatic detection of defects like dust spots, hair and small scratches. Because the processed picture will be imaged back to film, preserving the visual quality in the software process is essential. Thus, the scanning provides high-resolution images (2200 x 1640 pixels or 4000 x 3000 pixels). Of course, these resolutions are uncommon in classical computer vision problems. This involves great difficulties, especially when financial viability is aimed at by users of the LIMELIGHT chain. Keeping the processing time short is a significant problem which requires very fast algorithms. Many approaches to defect restoration can be found in previous papers [4], [6]. In these works, the authors consider the "blobs" as impulse distortions or noise. Thus, deteriorations are restored using filtering techniques. These "blind" filters are applied to the entire image, removing deteriorations, but also deteriorating the regions which are not corrupted. A solution to cope with this problem consists in first isolating the regions with defects and then treating only these regions [7]. The following sections describe our detection algorithms.
2. Dust and scratch detection based on a single image
What are the origins of a dust spot or hair that is visible on an image? Mainly, it is a dust particle on the film which shades light during a film-to-film copy operation or during film scanning. With a specific high-tech film scanning device ensuring a high resolution (less than 10 µm, approximately the film grain size), the digital "signature" left by a dust particle is slightly different from photographed detail of the image, even a sharp, well-defined one (light dispersion within the sensitive layers). Overall, the characteristics of the defects tend to be:
• Small surface (varying from 1 to 50 pixels, which is small in a 2200 x 1640 image),
• Edges with strong gradients.

2.1. Gray scale morphology
The four fundamental binary morphological transformations (erosion, dilatation, closing and opening) all extend to gray scale morphology, without thresholding, via the use of local maximum and minimum operations. Given a gray scale image I and a structuring element (SE) B, the following neighbourhood operators ⊕ and ⊖ form the basis of classical mathematical morphology [8], [9], [11]:

I(x, y) ⊕ B = MAX_{(u,v) ∈ B(x,y)} (I(u, v) + B(u, v))    Dilatation
I(x, y) ⊖ B = MIN_{(u,v) ∈ B(x,y)} (I(u, v) - B(u, v))    Erosion
(I(x, y)) • B = (I(x, y) ⊕ B) ⊖ B    Closing
(I(x, y)) ∘ B = (I(x, y) ⊖ B) ⊕ B    Opening
2.2. A morphological detector of local deteriorations
The closing operator has the attractive property of deleting local minima. Therefore, we can use it to detect black deteriorations. Similarly, the opening operator appears well suited to the detection of white deteriorations. Both morphological detectors of black and white deteriorations are then expressed as a simple difference between successive closing operations (or successive opening operations) and the original image:
D_black(I(x, y), B0, Bn) = ((((I(x, y) ⊕ B0) ⊕ Bn) ⊖ Bn) ⊖ B0) - I(x, y)
D_white(I(x, y), B0, Bn) = I(x, y) - ((((I(x, y) ⊖ B0) ⊖ Bn) ⊕ Bn) ⊕ B0)

where the SE B0 and Bn are defined as:

B0 =  0 0 0        Bn =  2n 2n 2n 2n 2n
      0 0 0              2n  n  n  n 2n
      0 0 0              2n  n  0  n 2n
                         2n  n  n  n 2n
                         2n 2n 2n 2n 2n

The use of the two SE B0 and Bn permits taking into account the slope n of the image curve gradients. Indeed, defects are generally characterized by very strong gradients. On the other hand, B0 allows the detection of defects having smoother gradients. For example, figure 2 (left) shows the result of the deterioration detection using n = 0 on the image depicted in figure 1. We can notice that for n = 0, i.e. without integrating curve gradient properties, the defect profiles are hardly distinguishable from their neighbourhood profiles. On the contrary, using n = 30, no ambiguity remains between peaks corresponding to "real" defects and other peaks (see figure 2, right). This result is very satisfactory and demonstrates the robustness of our morphological defects detector.

Figure 1: An ambiguous image part
Figure 2: Defects detection using n = 0 (left) and n = 30 (right)
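The black-defect detector (closing minus original) can be illustrated with a simplified sketch. This is not the paper's detector: it uses a single flat 3x3 SE rather than the successive B0/Bn closings, and border pixels simply use the in-bounds part of the neighbourhood. It shows the key property that an isolated dark pixel (a local minimum) produces a strong response while flat regions produce zero.

```python
B0 = {(dy, dx): 0 for dy in (-1, 0, 1) for dx in (-1, 0, 1)}  # flat 3x3 SE

def _apply(img, se, pick, sign):
    """Gray-scale neighbourhood operator: pick (max or min) of I +/- B."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [img[y + dy][x + dx] + sign * b
                    for (dy, dx), b in se.items()
                    if 0 <= y + dy < h and 0 <= x + dx < w]
            row.append(pick(vals))
        out.append(row)
    return out

def dilate(img, se):
    return _apply(img, se, max, +1)   # local maximum of I + B

def erode(img, se):
    return _apply(img, se, min, -1)   # local minimum of I - B

def closing(img, se):
    return erode(dilate(img, se), se)

def d_black(img, se):
    """Black-defect response: closing minus original, so local minima pop out."""
    c = closing(img, se)
    return [[c[y][x] - img[y][x] for x in range(len(img[0]))]
            for y in range(len(img))]
```

On a flat background of 10 with a single dark pixel of 2, the closing fills the minimum back to 10, so the detector responds with 8 at the defect and 0 elsewhere.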
3. A dynamic detector of local deteriorations
Working on a digitized film sequence gives us a great advantage, because we can use the information in the preceding and following frames. Our second defect detection algorithm uses this spatio-temporal information. Unlike long linear scratches, dust particles appear in a random manner. However, we can't use simple frame subtraction or "XORing" to detect them because, within a sequence, the camera and actors or scene elements move around, and objects may overlap other objects and/or background details (so-called occlusions or disocclusions). Our dynamic detector relies on both motion-flow estimation and grey-level conservation. There are two main methods to estimate the optical flow of a "noisy" sequence of images:
• pre-filter the "noisy" sequence and use a classical motion estimation (block matching, regression, etc.);
• develop a motion-estimation algorithm which is robust to noise or image alteration.
We have chosen the first solution for two reasons:
• It is difficult to know the real sensitivity of a motion estimator to noise or image alteration.
• In a high resolution image sequence, motion induces large displacements, up to 200 or 300 pixels. One of the best solutions to quickly estimate such motions is to use a hierarchical structure (image pyramid) [2], [3]. The filtering process is then included in the creation of this hierarchical structure.
Having organized the image information in a hierarchical manner, a recursive block matching technique is used to estimate the optical flow [5], [10].
3.1. Hierarchical structure
The basic idea is to create a pyramidal representation of an image [1] using the following algorithm"
if (x mod 2 = 0) and (y mod 2 = 0) then I^{l+1}(x/2, y/2, t) = (f * I^l)(x, y, t), where * denotes the convolution operator and f is a given filter. I^l is interpreted as a family of images, where l indicates the level of resolution (or scale). The larger l is, the more blurred the original image I becomes, finally showing only the larger structures in the image. Our hierarchical image structure is built using a low-pass filter such that film grain and deteriorations disappear at higher levels of the pyramid. Indeed, such high spatial frequencies disturb the motion estimation process.
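The filter-then-subsample rule above can be sketched as follows. This is an illustration only: a 3x3 box filter stands in for the unspecified low-pass filter f, and borders average over the in-bounds part of the neighbourhood.

```python
def box3(img):
    """3x3 box low-pass filter (a stand-in for the unspecified filter f)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[y + dy][x + dx]
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                    if 0 <= y + dy < h and 0 <= x + dx < w]
            out[y][x] = sum(vals) / len(vals)
    return out

def build_pyramid(img, levels):
    """Low-pass filter, then keep the even-indexed samples:
    I^{l+1}(x/2, y/2) = (f * I^l)(x, y) for even (x, y)."""
    pyr = [img]
    for _ in range(levels):
        f = box3(pyr[-1])
        pyr.append([[f[y][x] for x in range(0, len(f[0]), 2)]
                    for y in range(0, len(f), 2)])
    return pyr
```

Each level halves the image size, so single-pixel defects (high spatial frequencies) vanish after a level or two while large structures survive.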
3.2. Hierarchical motion estimation
Our method combines the principle of hierarchical motion estimation with a block matching algorithm. In the first step, the global motion is estimated, allowing only a coarse velocity field to result, and, at lower hierarchical levels, the details of the vector field are calculated as relatively small updates around the estimated vector resulting from the previous higher level. At each level, displacements are estimated using a recursive block matching algorithm [10]. For each pixel of the current grid, we search for the displacement vector which yields the minimum value of a common criterion based on the so-called displaced frame difference (DFD):
E(p, d, t) = Σ_{p_i ∈ W_p} (DFD(p_i, d, t))²   with   DFD(p, d, t) = I(p, t) - I(p - d, t - dt)

where W_p represents a neighbouring window of n x n pixels centred at pixel p, and d is the displacement of p from time t to t - dt. More formally, this recursive search consists of the following steps. First, the estimated displacement from the higher level is used as the prediction of the present location: d_0^l(p) = d^{l+1}(p) x 2. To economize the computational effort, rather than doing a full block matching search, we check only 5 vectors (around the predicted position) in the first step and at most 3 vectors in the following steps. Figure 3 illustrates this procedure. Our algorithm first selects the best displacement candidate δ_1 ∈ {(0,0), (-1,0), (1,0), (0,1), (0,-1)} according to the criterion E(p, d_0^l(p) + δ_1, t). The current displacement is then updated: d_1^l = d_0^l + δ_1. In the next search steps, 3 new candidates are evaluated. Their position depends on the best previous candidate:

δ_{i-1} = (1,0)  ⟹ δ_i ∈ {(0,0), (1,0), (0,1), (0,-1)}
Figure 3: Adaptive block matching search
δ_{i-1} = (-1,0) ⟹ δ_i ∈ {(0,0), (-1,0), (0,1), (0,-1)}
δ_{i-1} = (0,-1) ⟹ δ_i ∈ {(0,0), (1,0), (-1,0), (0,-1)}
δ_{i-1} = (0,1)  ⟹ δ_i ∈ {(0,0), (1,0), (-1,0), (0,1)}

where i denotes the search iteration number, and displacement (0,0) is related to the best previously selected candidate. Notice that candidates that have already been checked do not need further evaluation. The current displacement is then updated with the best candidate: d_i^l = d_{i-1}^l + δ_i. The updating process is stopped when the update falls under a threshold, or when the previously selected candidate remains the best (local minimum), or after a fixed number of iterations. Finally, we designed an adaptive search strategy that avoids checking all possible vectors and thus provides a fast block matching search. For a maximum displacement magnitude of ±3 pixels, this method checks only 20 candidates, while an exhaustive method checks 49. So, the processing speed increases by almost a factor of 2.5.
3.3. Detection of local deteriorations
Once the optical motion flow is correctly estimated, the next frame can be rebuilt without any deteriorations. The absolute value of the DFD is considered as a measure of the quality of the estimated motion. Outliers, usually corresponding to deteriorations, occlusions or disocclusions, are detected when this criterion is higher than a threshold S. These outliers are potential deteriorations.
To deal with occlusions and disocclusions, we use a third image in our estimation scheme. The same process as described above is performed between the image at time t and the image at time t + dt. Common spurious points from the two independent motion estimation and comparison processes are selected as deteriorations (fig. 4).
4. Combination of the two previous detectors
A very good detection rate can be achieved by combining the morphological and the dynamic detectors. The main problems of these detectors - false detections and threshold tuning - are bypassed with the double evidence provided by "ANDing" the results of the two detectors. Therefore, the thresholds are fixed at low values in order to detect every deteriorated pixel, but this also increases the number of wrong detections. However, these are not the same for the first and the second detector, and the double evidence eliminates them.
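The "ANDing" step reduces to a pixel-wise conjunction of the two binary detection masks. A minimal sketch, assuming both detectors have already been thresholded into boolean masks of equal size:

```python
def combine(morph_mask, dyn_mask):
    """Double evidence: a pixel is flagged as a defect only if BOTH the
    morphological and the dynamic detector flagged it ('ANDing'),
    which cancels each detector's independent false alarms."""
    return [[m and d for m, d in zip(mr, dr)]
            for mr, dr in zip(morph_mask, dyn_mask)]
```

A pixel flagged by only one detector (a likely false alarm) is dropped, while a pixel flagged by both survives.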
Figure 4: Frame I(t) of a film sequence (La belle et la bête, 1946), and defect detection on I(t)
5.
Summary and Conclusions
We have presented an efficient detector of local deteriorations in old films. It combines two different detectors: a morphological detector and a dynamic one. Using a standard criterion in the motion estimation step, we obtain a rate of 3% false detections and 5% undetected deteriorations. Defect detection is achieved in about 230 sec. per 2200 x 1640 frame on a standard workstation: 15 sec. for the morphological detection and about 215 sec. for the dynamic detection (which uses 3 images) in an early, unoptimized version. Future work will concern detection of oversized defects, intensity distortions, image instability and, of course, optimization and possibly parallelization of our algorithms.
6.
Acknowledgements
We thank François HELT for helpful assistance. Image reproduction by courtesy of NEYRAC FILM.
7.
References
[1] ANANDAN P. A computational framework and an algorithm for the measurement of visual motion, Int. Journal of Computer Vision, 2:283-310, 1989.
[2] BAAZIZ N. Approches d'estimation et de compensation de mouvement multirésolutions pour le codage de séquences d'images, PhD thesis, Université de Rennes I, October 1991.
[3] BURT P.J. Fast filter transform for image processing, CVGIP, 16:20-51, 1981.
[4] GEMAN S., GEMAN D. and McCLURE D.E. A nonlinear filter for film restoration and other problems in image processing, Graphical Models and Image Processing, 4, 1992.
[5] HAAN G. Motion estimation and compensation, PhD thesis, Delft University of Technology, Dept. of EE, Delft, the Netherlands, Sept. 1992.
[6] KLEIHORST R.P. Noise filtering image sequences, PhD thesis, University of Delft, 1994.
[7] KOKARAM A.C. Motion picture restoration, PhD thesis, University of Cambridge, May 1993.
[8] MUELLER S. and NICKOLAY B. Morphological image processing for the recognition of surface defects, Proceedings of the SPIE, 2249-58:298-307, 1994.
[9] SERRA J. Image analysis and mathematical morphology, Academic Press, 1982.
[10] SRINIVASAN R. and RAO K.R. Predictive coding based on efficient motion estimation, IEEE Trans. on Communications, COM-33(8):888-896, August 1985.
[11] STERNBERG S. Grayscale morphology, Computer Graphics and Image Processing, 35:333-335, 1986.
Invited Session U: COLOR PROCESSING
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K.
B.G. Mertzios and P. Liatsis (Editors)
© 1996 Elsevier Science B.V. All rights reserved.
Segmentation of multi-spectral images based on the physics of reflection

N.H. Kroupnova
Department of Electrical Engineering, University of Twente, The Netherlands

Abstract
The paper describes an algorithm for multi-spectral image segmentation that takes into account the shape of the clusters formed by the pixels of the same object in the spectral space. The expected shape of the clusters is based on the Dichromatic reflection model ([1]) and its extension ([2]) for optically homogeneous materials. Further, the influence of the illumination and of image formation by a color CCD camera is considered. Based on the expected shape of the clusters we propose a similarity/homogeneity criterion for the extended region merging algorithm. This criterion works successfully for objects of arbitrary shape illuminated by one or several sources of the same spectrum.
1
Introduction
To develop segmentation algorithms, it is important to understand how the reflection of light by different materials causes the changes of color and intensity in color images. The shape of the color clusters was also considered for segmentation purposes in [3], but the resulting algorithm was constructed for the case of one point-like light source and a scene composed of objects made from inhomogeneous materials. We consider the process of light reflection for a scene composed of several objects made from different materials, as is often the case for real images. We also analyze how image formation and the interactions of objects influence the shape of the color clusters. Based on the structure of the clusters in a color space we propose a similarity/homogeneity criterion for the region merging (RM) algorithm. The criterion works in 2D color spaces obtained by two different kinds of projections, which allow the influence of either highlights, or shadows and shape variations, to be eliminated from the segmentation results. The algorithm works successfully for objects of arbitrary shape illuminated by one or more sources of the same spectrum.
2
Expected and real shape of color clusters
2.1
Theoretically expected shape of color clusters
The expected shape of the clusters is based on the Dichromatic reflection model ([1]) and its extension ([2]) for optically homogeneous materials. According to this model, the reflected light can be described as a sum of two vectors, one accounting for the body reflection and one for the surface reflection. Both the specular and the body reflection are decomposed into two factors: an "intensity factor", which depends on geometry, and a "spectral factor", which depends on wavelength. So the power of light reflected by the surface towards the camera is given by

I(λ) = L(λ)(m_s(g)c_s(λ) + m_b(g)c_b(λ))    (1)

where L(λ) is the spectral power distribution of the incident light, g indicates dependence on the geometry, λ is the wavelength, m_s(g) and m_b(g) are the geometry-dependent factors, and c_s(λ) and c_b(λ) are the spectral factors of reflectance for the surface and body components respectively. Equation (1) works for both optically homogeneous and inhomogeneous materials, but the behavior of c_s(λ) and c_b(λ) differs. Metals have no body reflection component, so c_b for them is equal to zero. For dielectrics and most metals c_s(λ) is approximately constant over the visible wavelength range, so the surface reflection component is a vector in the direction of the incident light. The exception is colored metals like copper or gold, for which c_s(λ) varies considerably over the visible wavelength range, causing a color different from silver-grey. A color camera transforms the spectrum of the incoming light into a color space, for instance 3D RGB (red, green, blue) space. This process is called spectral integration [3]. The output of every sensor s_i with response function f_i can be written as
s_i = ∫ f_i(λ)I(λ)dλ    (2)
Or, substituting I(λ) from (1),

s_i = m_s(g) ∫ f_i(λ)L(λ)c_s(λ)dλ + m_b(g) ∫ f_i(λ)L(λ)c_b(λ)dλ    (3)

So, the output vector is a linear combination of two vectors: one is a scaled light-source vector in the basis f_i (if c_s(λ) is constant over the visible wavelength range) and the other is a scaled product of the spectral power distributions of the light and the body reflectance of the object in the same basis. In the ideal case (a single point-like light source, no noise or imaging artifacts) the color clusters for inhomogeneous materials consist of two lines, the matte line and the highlight line, and have the shape of a skewed "T" or "L", as described, a.o., in [3]. Because c_s(λ) for dielectrics and white metals is approximately constant in the visible wavelength range, highlight lines go in the direction of the illumination color. Metals have no matte line, since they have no body reflection. Colored metals such as copper have highlight lines in a different direction, determined also by c_s(λ), which varies considerably over the visible wavelength range. It should be noticed that, depending on the shape of the object and the illumination geometry, a color cluster can have several highlight lines or even "loops" instead of highlight lines. Diffuse illumination "spreads" the highlights, giving clusters that look more like an area in the dichromatic plane. A very rough object surface has the same effect.
2.2
Distortions of the theoretical shape
We consider theoretically and experimentally the influence of the illumination and of the image formation process on the shape of the clusters. It can be summarized as follows. In an ideal case of no noise and one or more light sources of the same spectrum, color clusters can have different shapes, varying from a line or skewed "T" to an area in the dichromatic plane, but the points of one cluster lie in the same dichromatic plane. Noise of the CCD camera makes this plane "thicker". The point spread function (PSF) of the camera, chromatic aberrations, and inter-reflections can cause small parts of the cluster to lie outside the dichromatic plane. Inter-reflections and the PSF cause "bridges" between different clusters.
3
Segmentation
3.1
Normalization on the white image
After subtraction of the dark current and white balancing, images are normalized on the white image. Since the white balance is performed so as to be correct in the middle of the field, while the gains of the sensors are set for the whole image, the normalization compensates for the site-dependent scaling due to the non-uniformity of illumination and for the non-uniformity caused by the beam splitter and fixed pattern noise.
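A minimal sketch of this normalization step, assuming a dark-current frame and a white-reference frame are available (function name and frames are ours):

```python
# Normalization on the white image (sketch): after dark subtraction, divide
# each site by the dark-corrected white reference, which cancels any
# site-dependent gain (uneven illumination, beam splitter, fixed pattern).
import numpy as np

def normalize_on_white(img, dark, white, eps=1e-6):
    num = img.astype(float) - dark
    den = np.maximum(white.astype(float) - dark, eps)  # guard against zeros
    return num / den
```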
3.2
Projections of the color clusters (2D spaces)
We want to design an algorithm for color image segmentation that takes into account the shape of the color clusters. The shape of the clusters is simplified by projecting the RGB space into a 2D color space. We consider two kinds of projections here, both onto the plane through (1,0,0), (0,1,0) and (0,0,1) (Fig. 1). M1 and M2 are two orthogonal axes perpendicular to the intensity axis; M1 goes through (1,0,0).
Figure 1: Projections of the color clusters

One projection is a parallel projection in the direction of the light source, as also described in [4]. In the coordinates M1, M2 the highlight lines project into points on the matte line, and the matte line projects into a line going to the light-source projection, or to (0,0) for normalized images. So in this projection the highlights are actually eliminated and do not disturb the segmentation process. However, the shadows and the shape variations of the objects still play a role.
The other projection is a perspective projection with its center in (0,0,0), the same as implemented when transforming an image into HSV color space. The matte line is projected into one point and the highlight line is projected into a line going to the light-source projection or, if images are normalized, to (0,0). In this projection the highlights still play a role, but the influence of the shadows and shape variations on the matte color vector is eliminated. One can see that these two kinds of projections are "complementary" in the sense of eliminating different influences on the segmentation result. It should be noticed that in both kinds of projections we use Cartesian rather than polar coordinates, to deal better with objects of low saturation.
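The two projections can be sketched as follows for normalized images, so the illuminant direction is assumed to be (1,1,1); the M1, M2 axes follow the construction above (plane through (1,0,0), (0,1,0), (0,0,1), M1 through the red corner):

```python
# The two complementary projections of the paper (sketch, white illuminant
# assumed): parallel projection along (1,1,1) removes highlights, central
# projection from the origin removes intensity (shading / shadows).
import numpy as np

M1 = np.array([2.0, -1.0, -1.0]) / np.sqrt(6.0)   # through the red corner
M2 = np.array([0.0, 1.0, -1.0]) / np.sqrt(2.0)    # orthogonal, in the plane

def parallel_proj(rgb):
    """Projection along (1,1,1): adding a white highlight leaves it unchanged."""
    return np.stack([rgb @ M1, rgb @ M2], axis=-1)

def perspective_proj(rgb):
    """Central projection from the origin: intensity/shading is divided out."""
    s = rgb.sum(axis=-1, keepdims=True)
    return parallel_proj(rgb / s)

matte = np.array([0.6, 0.3, 0.1])
shaded = 0.4 * matte                    # same colour, darker (shape/shadow)
highlight = matte + 0.5 * np.ones(3)    # matte colour plus white specular term
```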
3.3
Region merging using 2D color space criteria
We perform the segmentation by an RM algorithm using a quad-tree structure [6], as described in [5]. A distinction is made in the size of the regions with regard to the criterion used. When both regions are relatively large, so that we can speak about the distribution of feature vectors, the criterion should reflect a kind of distance between the two distributions. The regions R1 and R2 are merged when (see Fig. 2):

(μ1 − μ2)² / (σ1 + σ2) < threshold

where
μ1 is the average feature vector (M1, M2) of the first region,
μ2 is the average feature vector (M1, M2) of the second region,
σ1 is the standard deviation of R1 in the direction to μ2,
σ2 is the standard deviation of R2 in the direction to μ1.

Figure 2: Distance between two distributions
The measure used here ranges from zero, indicating definite merging, to infinity, indicating no merging. When one region is small and one is large, we take the Mahalanobis distance of the mean of the small region to the large region as the criterion of whether the two regions should be merged. When both regions are small, similarity and homogeneity criteria are applied and the results are combined with a logical 'and': R1 and R2 are merged if (μ1 − μ2)² < threshold_μ and σ_{R1∪R2} < threshold_σ. To calculate σ_{R1∪R2}, first the largest eigenvalue λ1 of the covariance matrix of R1 ∪ R2 is calculated, and then σ_{R1∪R2} is defined as the square root of λ1. It gives the largest variance of the resulting region. The RM is implemented using gradual relaxation of the merging criteria, which gives a hierarchical sequence of segmentations. This considerably decreases the dependence on the order of merging, but it also gives possibilities for the interpretation of images, since the algorithm first merges regions with strongly "overlapping" distributions, then with less and less "overlapping" ones. RM on the parallel projection tends to give shadows and parts with different orientation as separate segments; RM on the perspective projection tends to distinguish highlights. Depending on the application, the results obtained on both projections can be combined, giving a segmentation independent of either shadows and orientation, or highlights, or both. The common problem of the two projections is the difficulty of dealing with achromatic objects of different value, like black and white. To distinguish them, the intensity value has to be used.
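The three merging tests can be sketched as below. Note that the exact form of the large-region criterion is partly reconstructed from a garbled original, so the expression used here, (μ1 − μ2)² / (σ1 + σ2), is an assumption; function and threshold names are ours.

```python
# Sketch of the three merging tests (large-large, small-large, small-small);
# feature vectors are the 2-D projected colours.
import numpy as np

def merge_large_large(mu1, s1, mu2, s2, thr):
    """s1, s2: std of each region in the direction towards the other mean."""
    d = np.sum((np.asarray(mu1) - np.asarray(mu2)) ** 2)
    return d / (s1 + s2) < thr          # 0 = certain merge, infinity = never

def merge_small_large(mu_small, mu_large, cov_large, thr):
    diff = mu_small - mu_large
    maha = np.sqrt(diff @ np.linalg.inv(cov_large) @ diff)
    return maha < thr                   # Mahalanobis distance to the large region

def merge_small_small(pts1, pts2, thr_mu, thr_sigma):
    mu1, mu2 = pts1.mean(0), pts2.mean(0)
    both = np.vstack([pts1, pts2])
    lam1 = np.linalg.eigvalsh(np.cov(both.T)).max()   # largest eigenvalue
    return (np.sum((mu1 - mu2) ** 2) < thr_mu ** 2) and (np.sqrt(lam1) < thr_sigma)
```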
3.4
Segmentation example
Fig. 3 shows an image of several objects made from different materials: aluminum and copper cylinders, a blue plastic duck, and red and blue plastic caps on a red and yellow background. Figs. 4 and 5 show histograms in M1, M2 coordinates for both kinds of projections, reflecting how complex the cluster shapes are even for a comparatively simple scene. Figs. 6 and 7 show results of RM for two different threshold values. Note the difference in the segmentations. Fig. 8 shows the combination of the results to get a segmentation independent of the shadows, orientation and highlights.
4
Concluding remarks
In this paper we propose an algorithm for multi-spectral image segmentation that takes into account the shape of the clusters formed by the points of the same object in a color space. We provide a physical foundation for the algorithm, analyzing the influence of the image formation process, the illumination and the interactions of objects on the shape of the clusters. The proposed algorithm is RM with a similarity/homogeneity criterion that works on two different kinds of projections, allowing the influence of either highlights, or shadows and shape variations, on the segmentation result to be eliminated. The algorithm hierarchically merges less and less "overlapping"
Figure 3: Image of several objects of different materials
Figure 4: Parallel projection histogram
Figure 5: Perspective projection histogram
Figure 6: RM on parallel projection by different threshold values
Figure 7: RM on perspective projection by different threshold values
Figure 8: Projections combination
distributions of color vectors, thus first finding very dense clusters corresponding to uniform parts of objects and then less dense clusters formed by the parts of objects where the color is influenced by some factors. One future research topic is to investigate the possibilities for interpretation of the images that can be derived from the hierarchical sequence of segmentations. Another interesting topic is to use the differences and correspondences between the results of RM on the two kinds of projections for image interpretation.
References
[1] S. Shafer, "Using color to separate reflection components", Color Research and Application, Vol. 10, pp. 210-218, 1985.
[2] G. Healey, "Using color for geometry-insensitive segmentation", J. Opt. Soc. Am., Vol. 6, pp. 920-937, 1991.
[3] G. Klinker, S.A. Shafer, T. Kanade, "A physical approach to color image understanding", Int. Journal of Computer Vision, Vol. 4, pp. 7-38, 1990.
[4] S. Tominaga, "Surface identification using the dichromatic reflection model", IEEE Trans. PAMI, Vol. 13, pp. 658-670, 1991.
[5] N. Gorte-Kroupnova, B. Gorte, "Method for multi-spectral image segmentation in case of partially available spectral characteristics of objects", Proceedings of "Machine Vision Applications in Industrial Inspection IV" (IS&T/SPIE Symposium on Electronic Imaging), 28 January - 2 February 1996, San Jose, CA, USA.
[6] S.L. Horowitz, T. Pavlidis, "Picture segmentation by a tree traversal algorithm", J. ACM, Vol. 23, pp. 368-388, 1976.
Using Color Correlation To Improve Restoration Of Color Images

Daniel Keren, Anna Gotlib
Department of Mathematics and Computer Science, The University of Haifa, Haifa 31905, Israel
[email protected]

Hagit Hel-Or
Department of Psychology, Jordan Hall, Stanford University, CA 94305, USA
[email protected]

1
Abstract
The problem addressed in this work is restoration of images that have a few channels of information. We have studied color images so far, but hopefully the ideas presented here apply to other types of images with more than one channel. The suggested method is to use a probabilistic scheme which proved rather useful for image restoration, and incorporate into it an additional term, which results in a better correlation between the three color bands in the restored image. Initial results are good; typically, there's a reduction of 30% in the RMS error, compared to standard restoration carried out separately on each color band.
2
Introduction
A rather general formulation of the restoration problem is the following: given some partial information D on an image F, find the best restoration of F. Obviously, there are many possible ways in which to define "best". One way, which proved quite successful for a wide variety of applications, is probabilistic in nature: given D, one seeks the restoration which maximizes the probability Pr(F/D). Following Bayes' rule, this is equal to

Pr(D/F)Pr(F) / Pr(D).

The denominator is a constant once D is measured; Pr(D/F) is usually easy to compute. Pr(F) is more interesting, and more difficult to define. Good results have been obtained by following the physical model of the Boltzmann distribution, according to which the probability of a physical system being in a certain state is proportional to the exponent of the negative of the energy of that state; that is, low-energy, or "ordered", states are assigned higher probability than high-energy, or "disordered", states [3, 7]. It is common to define the energy of a signal by its "smoothness"; the energy of a one-dimensional signal F is often defined by ∫ F_xx² dx, etc. Such integrals are usually called "smoothing terms", as they force the resulting restoration to be smooth [5, 8, 4, 6]. Note that here "smooth" does not mean "infinitely differentiable", but "slowly changing".
3
Main Body
To see how the probabilistic approach naturally leads to restoration by so-called "smoothing", or regularization, let us look at the problem of reconstructing a two-dimensional image from sparse samples which are corrupted by additive noise. Suppose the image is sampled at the points {x_i, y_i}, the sample values are z_i, and the measurement noise is Gaussian with variance σ². Then

Pr(D/F) ∝ exp(−Σ_{i=1}^{n} [F(x_i, y_i) − z_i]² / 2σ²)

and, based on the idea of the Boltzmann distribution, one can define Pr(F) as being proportional to

exp(−λ ∫∫ (F_uu² + 2F_uv² + F_vv²) du dv)

so the overall probability to maximize is

exp(−(Σ_{i=1}^{n} [F(x_i, y_i) − z_i]² / 2σ² + λ ∫∫ (F_uu² + 2F_uv² + F_vv²) du dv))

which is, of course, equivalent to minimizing

Σ_{i=1}^{n} [F(x_i, y_i) − z_i]² / 2σ² + λ ∫∫ (F_uu² + 2F_uv² + F_vv²) du dv    (1)
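As an aside, a one-dimensional analogue of functional (1) can be minimized with plain gradient descent (the multigrid solver mentioned next is beyond this sketch); the grid size, sample layout and step size below are illustrative assumptions.

```python
# 1-D analogue of functional (1), minimized by gradient descent: a data
# term at the sample points plus a discrete second-derivative smoothness
# penalty.  Parameters are chosen for stability of the toy example.
import numpy as np

def restore_1d(n, idx, z, sigma=0.5, lam=1.0, steps=8000, lr=0.02):
    F = np.zeros(n)
    for _ in range(steps):
        g = np.zeros(n)
        g[idx] += (F[idx] - z) / sigma**2           # gradient of the data term
        d2 = F[:-2] - 2 * F[1:-1] + F[2:]           # discrete F''
        g[:-2] += 2 * lam * d2                      # gradient of sum(d2**2)
        g[1:-1] += -4 * lam * d2
        g[2:] += 2 * lam * d2
        F -= lr * g
    return F
```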
This leads, via calculus of variations, to a partial differential equation, which can be effectively solved using multigrid methods. Other problems, such as deblurring, can be posed similarly. First, let us look at the problem of deblurring a single-channel image (for instance, a gray-level image). One is given a gray-level image D, which is a corrupted version of the true image F, and the goal is to reconstruct this F. Typically, one assumes that F was blurred by convolution with a kernel H and corrupted by additive noise, which results in the mathematical model D = F ∗ H + N, where ∗ stands for the convolution operator and N is
additive noise. Proceeding as in the paradigm described above, one searches for the F which minimizes ||D − F ∗ H||² plus a smoothing term.
Let us proceed to describe briefly how this idea is extended to restoring multi-channel images. Suppose we are given a color image, with RGB channels, that underwent degradation by convolution with H (for simplicity's sake, assume it is the same H for all channels, although it does not have to be so in the general case). One obvious way to reconstruct the image is to run the deblurring algorithm described above for each of the separate channels, and combine the restored channels into a color image. Such an approach, however, does not work well in general. Usually, the resulting image is still quite blurry, and contaminated by false colors; that is, certain areas contain streaks of colors which do not exist in the original image. This problem is more acute in highly textured areas. The proposed solution to these problems is to incorporate into the probabilistic scheme a "correlation term", which results in a better correlation between the RGB channels. Formally, if C_{x,y} is the covariance matrix of the RGB values at a pixel (x, y), the probability of the combination of colors (R(x, y), G(x, y), B(x, y)) is proportional to

exp(−½ (R(x, y), G(x, y), B(x, y)) C_{x,y}^{−1} (R(x, y), G(x, y), B(x, y))^t).

Multiplying over all the pixels results in adding these terms in the exponent's power. Exactly as in the interpolation problem above, this exponential term combines with the other exponential terms, and we get a combined exponential that has to be maximized; therefore, we have to minimize the negative of the power, which simply results in adding the "correlation term"

∫∫ (R(x, y), G(x, y), B(x, y)) C_{x,y}^{−1} (R(x, y), G(x, y), B(x, y))^t dx dy

to the expression of Eq. 1 (after subtracting the averages of the RGB channels). In effect, this term makes use of the fact that, in natural and synthetic images, the RGB channels are usually highly correlated. The "correlation term" penalizes deviations from this correlation, thus "pushing" the restored image towards one whose channels are "correctly correlated". Therefore, the combined expression to minimize is

||D − F ∗ H||² + λ1 ∫∫ (R_xx² + 2R_xy² + R_yy²) dx dy + (similar smoothing terms for G and B) + λ2 ∫∫ (R, G, B) C_{x,y}^{−1} (R, G, B)^t dx dy
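The effect of the correlation term alone can be illustrated numerically: with the channel means subtracted and C the covariance of typical colour combinations, the term grows sharply for "false colour" combinations. The statistics below are synthetic assumptions, not derived from real images.

```python
# Sketch of the "correlation term": sum over pixels of
# (p - mean)^T C^{-1} (p - mean), which penalizes colour combinations
# that violate the usual inter-channel correlation.
import numpy as np

def correlation_term(img, mean, C):
    """img: (h, w, 3) RGB image; mean, C: colour statistics, shapes (3,), (3, 3)."""
    d = img.reshape(-1, 3) - mean
    Cinv = np.linalg.inv(C)
    return np.einsum('ij,jk,ik->', d, Cinv, d)   # sum_i d_i^T Cinv d_i
```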
We have implemented a simple iterative scheme for minimizing this functional. A substantial improvement was obtained using the "correlation term". A color photograph was blurred, and restored with and without the correlation term. When using this term, the resulting restoration is sharper, and contains fewer "false colors". Comparing it against the original image shows that the RMS error is about 30% smaller than when restoring each channel separately. We have also used the "correlation term" to solve the "demosaicing" problem, in which one has to reconstruct a color image given only one color band at each pixel [1, 2]. This was accomplished by incorporating the "correlation term" into the solution of the interpolation problem described above; usually, this also resulted in a reduction of about 30% in the RMS error.
4
Summary
An algorithm was suggested for restoring multi-channel images; it uses the correlation between the different channels to improve results. The algorithm was applied to color images and usually resulted in an improvement of about 30% in the RMS error as compared to standard restoration applied separately to each channel.
References
[1] D.H. Brainard. Bayesian method for reconstructing color images from trichromatic samples. In Proceedings of the IS&T Annual Meeting, 1994.
[2] W.T. Freeman and D.H. Brainard. Bayesian decision theory, the maximum local mass principle, and color constancy. In International Conference on Computer Vision, pages 210-217, Boston, 1995.
[3] S. Geman and D. Geman. Stochastic relaxation, Gibbs distribution, and the Bayesian restoration of images. IEEE Trans. on Pattern Analysis and Machine Intelligence, 6:721-741, June 1984.
[4] B.K.P. Horn and B.G. Schunck. Determining optical flow. Artificial Intelligence, 17:185-203, 1981.
[5] D. Keren and M. Werman. Probabilistic analysis of regularization. IEEE Trans. on Pattern Analysis and Machine Intelligence, 15:982-995, October 1993.
[6] J. Skilling. Fundamentals of MaxEnt in data analysis. In Maximum Entropy in Action, edited by B. Buck and V.A. Macaulay. Clarendon Press, Oxford, 1991.
[7] R. Szeliski. Bayesian Modeling of Uncertainty in Low-Level Vision. Kluwer, 1989.
[8] D. Terzopoulos. Multi-level surface reconstruction. In A. Rosenfeld, editor, Multiresolution Image Processing and Analysis. Springer-Verlag, 1984.
Colour Eigenfaces

Graham D. Finlayson†, Janet Dueck*, Brian V. Funt*, and Mark S. Drew*

† Department of Computer Science, University of York, York YO1 5DD, email [email protected]
* School of Computing Science, Simon Fraser University, Vancouver, Canada. {janet,funt,mark}@cs.sfu.ca
Abstract
Images of the same face viewed under different lighting conditions look different. It is no surprise, then, that face recognition systems based on image comparisons can fail when the lighting conditions vary. In this paper we address this failure by designing a new lighting-condition-independent face matching technique. We begin by demonstrating that the colour image of a face viewed under any lighting conditions is a linear transform from the image of the same face viewed under complex (3 lights at 3 locations) conditions. Our new matching technique solves for the best linear transform relating pairs of face images prior to calculating the image difference. For a database of 15 (complexly illuminated) faces and 45 test face images the new matching method delivers perfect recognition. In comparison, matching without accounting for lighting conditions fails 25% of the time.

I. INTRODUCTION
One of the most successful and widely used techniques for face recognition is the eigenface method of Turk and Pentland [9], [8]. The basic idea of that method is that the greyscale images of the same face seen in different circumstances should be quite similar. Recognition takes place by comparing the image of an unknown face with the face images stored in a database. The closest database image identifies the face. Because, in general, images are very large, image matching is very expensive. In order to reduce the matching cost, Turk and Pentland approximated each face image as a linear combination of a small set of basis faces called eigenfaces.

Unfortunately, images of the same face viewed under different lighting conditions rarely look the same, i.e. their shading fields will differ. This problem can be mitigated by viewing each face under a variety of lighting conditions and storing this variation in the face database [1], [3], [5]. The multiple image approach succeeds because each separate image encodes a certain amount of information about the shape of the face; that is, at an implicit level, the multiple image approach is concerned with matching shape. However, it is not clear how the notion of shape can be made explicit. We certainly do not want to solve for shape since, although this can be done [10], highly specialized calibrated conditions are needed. In this paper we show that shape information is easily obtained so long as face recognition is based on colour images. Specifically we show that: the implicit notion of shape is explicitly captured in a single 3-band colour image. This result follows from Petrov's [6] seminal work on the relationship between illumination, reflectance, shape and colour images, in which he demonstrated that, so long as a Lambertian surface is viewed under a complex illumination field (at least 3 spectrally distinct light sources at different locations), the rgb pixel triplets in an image are a linear transform from the scene surface normals: colour is a linear transform from shape. In our method each database face image is created with respect to a complex illumination field. Face recognition simply involves matching the image of an unknown face to the face database. Each database face image is first transformed (by a linear transform) to best match the image colours of the unknown face. Thereafter, the residual difference is calculated. The database face with the smallest residual difference overall identifies the face.
In line with Turk and Pentland, the cost of matching is reduced by approximating face images using a small number of colour eigenfaces.

II. FACE RECOGNITION USING EIGENFACES

Let us represent an n × n greyscale image by the function I such that I(x, y) denotes the grey-level at location x, y. Suppose we have a database M of m face images: M = {I_1, I_2, ..., I_m}. Face recognition is all about finding the image I_c in M which is closest to some unknown face image I_u. Mathematically we might define a function Φ which takes I_u and M as parameters and returns the closest match I_c:

Φ(I_u, M) = I_c : I_c ∈ M & ||I_c − I_u||_d ≤ ||I_i − I_u||_d  (i = 1, 2, ..., c−1, c+1, ..., m)    (1)

where ||.||_d is a distance measure (usually Euclidean) which quantifies the similarity of two images. To reduce computation a face image I can be represented (approximately) by a linear combination of basis faces (which
Turk and Pentland call eigenfaces):

I ≈ Σ_{i=1}^{n} β_i B_i    (2)

Here B_i is the ith (of n) eigenface and the β_i are weighting coefficients chosen to minimize

||I − Σ_{i=1}^{n} β_i B_i||_d    (3)

Clearly the error in the approximation defined in (3) depends on the set of eigenfaces used. In general the eigenfaces are selected to minimize the expected residual difference in (3). This is done using standard statistical techniques (e.g. principal component analysis [4]). However, eigenfaces based on other error criteria are sometimes used [7]. Turk and Pentland have shown that a small number of eigenfaces (just 7) renders the error in (2) reasonably small. Denoting eigenface approximations with the superscript ', the function Φ' is defined as:

Φ'(I_u, M) = I_c' : I_c' ∈ M' & ||I_c' − I_u'||_d ≤ ||I_i' − I_u'||_d  (i = 1, 2, ..., c−1, c+1, ..., m)    (4)

Because each of I' and I_u' is defined by just n numbers (the coefficients β in (2)), it is straightforward to show that the cost of each image comparison is proportional to n. Usually n is much smaller than the number of pixels in an image, so matching is very fast. Turk and Pentland [8] have shown that the function Φ' suffices for face recognition so long as the illumination conditions are not allowed to vary too much.
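A minimal numerical sketch of this eigenface machinery (PCA via the SVD): database images are approximated by a few basis faces, and matching compares only the coefficient vectors. The database size, image size and eigenface count are made-up assumptions.

```python
# Eigenface sketch: build a PCA basis from the (flattened) face database,
# store each face's coefficients, then match an unknown face by projecting
# it onto the basis and taking the nearest coefficient vector.
import numpy as np

def build_eigenfaces(db, n_e):
    """db: (m, pixels) matrix of flattened face images."""
    mean = db.mean(0)
    U, s, Vt = np.linalg.svd(db - mean, full_matrices=False)
    B = Vt[:n_e]                       # eigenfaces (principal directions)
    return mean, B, (db - mean) @ B.T  # database coefficient vectors

def match(query, mean, B, coeffs):
    q = (query - mean) @ B.T           # project the unknown face
    return int(np.argmin(np.linalg.norm(coeffs - q, axis=1)))
```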
III. COLOUR AND SHAPE
The light reflected from a surface depends on the spectral properties of the surface reflectance and of the illumination incident on the surface. In the case of Lambertian surfaces (these are the only kind we consider here), this light is simply the product of the spectral power distribution of the light source with the percent spectral reflectance of the surface. Illumination, surface reflection and sensor function, combine together in forming a sensor response"
P-~= e-'n=/w S=(A)E(A)R---(A)dA
(5)
where A is wavelength, _p is the 3-vector of sensor responses (rybpixel value) __Ris the 3-vector of response functions (red-, green and blue- sensitive), E (assumed constant across the scene) is the incident illumination and S ~ is the surface reflectance function at location x on the surface which is projected onto location ~ on the sensor array. The relative orientation of surface and light is taken in account by the dot-product of the surface normal vector n_~ with the light source direction _e (both these vectors have unit length). Let us denote fw S=(A)E(A)R_(A)dA as q=. It follows that (5) can be rewritten as:
p^x̄ = q e^t n^x   (6)

where t denotes vector transpose (e · n^x = e^t n^x). Now consider that a scene is illuminated by two spectrally distinct light sources at distinct locations. If we denote illumination dependence using the subscripts 1 and 2, then equation (6) becomes:

p^x̄ = [q_1 e_1^t + q_2 e_2^t] n^x   (7)
Assuming k lights incident at x:

p^x̄ = [Σ_{i=1}^{k} q_i e_i^t] n^x   (8)
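Equation (8) states that, for a single surface colour, the image colour is a fixed 3 × 3 linear map of the surface normal. A small numerical check (the light colours q_i and directions e_i below are invented for illustration, not taken from the paper) is:

```python
import numpy as np

# Three spectrally distinct lights from distinct directions (illustrative values):
# q_i is the sensor response to light i off the (single) surface colour,
# e_i is the unit direction of light i.
q = [np.array([0.9, 0.2, 0.1]),
     np.array([0.1, 0.8, 0.2]),
     np.array([0.2, 0.1, 0.9])]
e = [np.array([1.0, 0.0, 0.0]),
     np.array([0.0, 1.0, 0.0]),
     np.array([0.0, 0.0, 1.0])]

# Eq. (8): p = [sum_i q_i e_i^t] n, a single 3x3 matrix M.
M = sum(np.outer(qi, ei) for qi, ei in zip(q, e))
print(np.linalg.matrix_rank(M))               # 3: full rank with k = 3 lights
n = np.array([0.0, 0.6, 0.8])                 # a unit surface normal
p = M @ n                                     # the image colour at that point
print(np.allclose(np.linalg.solve(M, p), n))  # True: normals recoverable
```

With fewer than three lights the matrix is rank-deficient and the one-to-one correspondence claimed below breaks down.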
So long as k ≥ 3, the term [Σ_{i=1}^{k} q_i e_i^t] will define a 3 × 3 matrix of full rank. In this case there is a one-to-one correspondence between the colours in an image and the normal field of a scene: shape and colour are inexorably intertwined. It is important to note that the relationship between surface normal and camera response depends on the reflective properties of the observed surface and the particular set of illuminants incident at a point. Changing the reflectance or the illumination field changes the relationship between surface normal and image colour. Henceforth we will assume that faces are composed of a single colour and are illuminated by a homogeneous illumination field, so that a single 3 × 3 matrix relates all surface normals and image colours.

IV. FACE RECOGNITION USING COLOUR EIGENFACES
Let us represent an n × n colour image by the vector function I such that I(x, y) denotes the (r, g, b) vector at location (x, y) and records how red, green and blue a pixel appears. As before, let us suppose we have a database M of m images: M = {I_1, I_2, ..., I_m}. Crucially, we assume that each database face image is created with respect to a complex illumination field and is thus a linear transform from the corresponding normal field. This relationship is made explicit in (9), where N_i(x, y) is a vector function which returns the surface normal corresponding to I_i(x, y). The 3 × 3 matrix relating the normal field to image colours is denoted T_i:
I_i(x, y) = T_i N_i(x, y) ,  (i = 1, 2, ..., m)
(9)
Suppose I_u denotes the image of an unknown face viewed under arbitrary lighting conditions. Clearly,
I_u(x, y) = T_u N_u(x, y)
(10)
Suppose that I_j is an image of the same face (in M). It is unlikely that T_j will equal T_u. However, it is immediate from (9) and (10) that I_j must be a linear transform from I_u:
T_u T_j^{-1} I_j = I_u .
(11)
where ^{-1} denotes the matrix inverse. It follows that a reasonable measure of the distance between a database image I_i and I_u can be calculated as:
||T(I_i, I_u) I_i − I_u||   (12)
where T(I_i, I_u) is the 3 × 3 matrix which best maps I_i to I_u. In the experiments reported later, T() returns the matrix which minimizes the sum of squared errors and is readily computed using standard techniques [2]. Relative to (12), a closeness function Φ for colour face images can be defined as:
Φ(I_u, M) = I_c : I_c ∈ M, ||T(I_u, I_c) I_c − I_u|| < ||T(I_u, I_i) I_i − I_u||  (i = 1, 2, ..., c−1, c+1, ..., m)   (13)
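The matrix T(I_i, I_u) used in (12) and (13) minimizes the sum of squared errors; assuming each image is stored as an N × 3 array of rgb pixel values (our own layout choice, not specified in the paper), it can be computed with a standard least-squares solve:

```python
import numpy as np

def best_transform(I_i, I_u):
    """3x3 matrix T minimizing sum over pixels of ||T p_i - p_u||^2.

    I_i, I_u: (N, 3) arrays whose rows are corresponding rgb pixels.
    Stacked over pixels the problem reads I_i @ T.T = I_u, solved by lstsq.
    """
    T_t, *_ = np.linalg.lstsq(I_i, I_u, rcond=None)
    return T_t.T

# Synthetic check: build I_u as an exact 3x3 transform of I_i.
rng = np.random.default_rng(1)
I_i = rng.random((100, 3))
T_true = rng.random((3, 3))
I_u = I_i @ T_true.T
T_est = best_transform(I_i, I_u)
print(np.allclose(T_est, T_true))   # True
```

On real image pairs the fit is not exact, and the residual of this fit is precisely the distance (12).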
To reduce the computational cost of computing (13), we represent (in a similar way to the greyscale method) each band of a colour image as a linear combination of basis vectors:
I^a ≈ Σ_{i=1}^{n} β_i^a B_i^a ,  (a = r, g, b)   (14)
where r, g and b denote the red, green and blue colour bands, and the coefficients β^a (a = r, g, b) are chosen to minimize the approximation error. To derive the eigenfaces used in (14), a training set of colour face images is compiled. Each image is split into its 3 component band images, and principal component analysis is then performed on the entire band image set. Denoting colour eigenface approximations with the superscript ', the function Φ' is defined as:
Φ'(I_u', M) = I_c' : I_c' ∈ M', ||T(I_u', I_c') I_c' − I_u'|| < ||T(I_u', I_i') I_i' − I_u'||  (i = 1, 2, ..., c−1, c+1, ..., m)   (15)
It can be shown that the cost of calculating (15) is bounded by the square of the number of eigenfaces used: matching costs O(n^2) (instead of O(n) for black and white faces).

V. RESULTS
The colour images of 15 people (see Figure 1) viewed under 3 complex illuminations provide a training set for eigenface analysis. We found that 8 eigenfaces provide a reasonable basis set (the approximation in (14) is fairly good). The eigen approximations of the 15 faces viewed under one of the complex illuminations comprise the face database. A further 45 test images were taken: the same faces under 3 more, non-complex, illuminations.
Fig. 1. Colour Face Images

Each test image was compared with each database image using equation (15). The closest database image defines the identity of the face in the test image. We found that all 45 faces were correctly identified (a 100% recognition rate). Importantly, we found that faces were matched with good confidence; on average the second closest database face was at least twice as far from the test image as the correct answer.

We reran the face matching experiment in greyscale using Turk and Pentland's original eigenface method. Greyscale images were created from the colour images (described above) by summing the colour bands together (greyscale = red + green + blue). We found that 7 eigenfaces are sufficient to approximate the training set. As before, the face database comprises eigen approximations of each of the 15 faces viewed under a single illuminant. Test images were compared with the face database using (4). We found that only 32 of the faces were correctly identified (a recognition rate of 73%). This is quite poor given that the face database is quite small.

VI. CONCLUSION

Shape and colour in images are inexorably intertwined. A single-coloured Lambertian surface viewed under complex illumination conditions is a linear transform from the surface normal field. It follows that the image of a face observed under any lighting conditions is a linear transform from the same face viewed under a complex illumination field. We use this result in a new system for face recognition. Database faces are represented by colour images taken with respect to a complex illumination field. Matching takes place by finding the linear transform which takes each database face as close as possible to a query image. The closest face overall identifies the face in the query image. To speed computation, all faces are represented as a linear combination of a small number of eigenfaces.
Experiments demonstrated that the colour eigenface method delivers excellent recognition. Importantly, recognition performance, by construction, is unaffected by the lighting conditions under which faces are viewed. This is quite significant, since existing methods [9] require the lighting conditions to be held fixed (and fail when this requirement is not met).

REFERENCES
[1] Russell Epstein, Peter J. Hallinan, and Alan L. Yuille. 5±2 eigenfaces suffice: an empirical investigation of low-dimensional lighting models. In Workshop on Physics-Based Modelling, ICCV95, pages 108-116, 1995.
[2] G.H. Golub and C.F. van Loan. Matrix Computations. Johns Hopkins U. Press, 1983.
[3] Peter W. Hallinan. A low-dimensional representation of human faces for arbitrary lighting conditions. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 995-999, 1994.
[4] I.T. Jolliffe. Principal Component Analysis. Springer-Verlag, 1986.
[5] Shree K. Nayar and Hiroshi Murase. Dimensionality of illumination manifolds in eigenspace. Technical report, Columbia University, 1995.
[6] A.P. Petrov. On obtaining shape from color shading. COLOR research and application, 18(4):236-240, 1993.
[7] P.N. Belhumeur, J.P. Hespanha, and D.J. Kriegman. Eigenfaces vs fisherfaces: recognition using class specific linear projection. In The Fourth European Conference on Computer Vision (Vol I), pages 45-58. European Vision Society, 1996.
[8] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, March 1991.
[9] M. Turk and A. Pentland. Face recognition using eigenfaces. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 586-591, 1991.
[10] R.J. Woodham. Photometric method for determining surface orientation from multiple images. Optical Engineering, 19:139-144, 1980.
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K.
B.G. Mertzios and P. Liatsis (Editors)
© 1996 Elsevier Science B.V. All rights reserved.
Colour quantification for industrial inspection Maria Petrou and Constantinos Boukouvalas
Department of Electronic and Electrical Engineering, University of Surrey, Guildford, GU2 5XH, United Kingdom
Abstract

In this paper we discuss the application of some of the most recent advances in the Psychophysics of colour to the development of a colour grading system capable of replacing the human expert inspector in colour-based quality control of manufactured products. In particular, we discuss the problem of replacing the spectral sensitivity of the electronic sensor with that of the human visual system, so that agreement to sub-grey-level accuracy between the recordings of the electronic and the human sensors can be achieved. We demonstrate our methodology by automatically grading coloured ceramic tiles previously graded by human experts operating at the threshold of human colour perception.

1. Introduction

The greatest success of vision research has been in developing vision systems that perform a specific task, because by narrowing the field of operation, the quality of the performance can be greatly improved. Visual industrial inspection, however, has not yet become a matter of routine. Several industrial tasks have already been fully automated, but the aspect which seems to present most resistance to the process of automation is that of final product quality control. The reason is that automatic inspection, in order to be acceptable to the manufacturer, has to be at the level performed by trained human inspectors at the peak of their performance. Part of the inspection of the final product is the inspection of colour and in particular the categorisation of the products into grades of colour, i.e. into "lots" or "batches". To achieve this automatically, one has to overcome a series of problems:

- The distortion caused to the recorded colour by the temporal variation of the illumination. Indeed, experiments have shown [1] that while the colour differences one has to detect are of the order of half a grey level (in a full scale of 0 to 255), the temporal variation of illumination, even when it is controlled, could be several grey levels from one inspected object to the next.
- The distortion caused by the spatial variation of illumination. On a flat surface, like a tile, the illumination can vary by as much as 10 grey levels from one end of the object to the other [2].
- The thermal noise in the image capturing device, which can be random with a variance of several grey levels.
- The non-linear response of the sensors over the range of colours that might be present on the same object.
- The spectral response of each sensor, which is not a pure delta function and which is clearly different from the spectral response of the sensors used by the human observer whom an automatic system is expected to replace [4].
We have presented elsewhere the methodology for coping with the variations of the illumination, the non-linear responses of the camera and the thermal noise of the devices. Here we present a methodology that uses established results of the Psychophysics of Vision to cope with the demands of an industrial application, and allows the identification of colour grades that correspond to the threshold of human colour perception and are discriminated by human inspectors working at the peak of their performance. To achieve this, the proposed methodology had to be able to measure colour differences at least one order of magnitude smaller than the various types of noise involved in the process of colour recording. We shall demonstrate our methodology for the particular application of ceramic tile colour grading.

2. Colour Grading and the sensors' responses

The visible part of the electro-magnetic spectrum can be discretized and represented by the values at n equidistant wavelengths. Then the true spectral reflectance of a tile is given by a set of n (unknown) numbers, one for each sample wavelength chosen:

R(λ) = (r_1, r_2, ..., r_n)   (1)
where r_i is the reflectance at wavelength λ_i. Similarly, the spectrum of the illumination used can be represented by:

A(λ) = (a_1, a_2, ..., a_n)   (2)
Assume also that we have three sensors with known spectral sensitivities:

Q_i(λ) = (Q_i1, Q_i2, ..., Q_in) ,  i = 1, 2, 3   (3)
The three sensors will record the following values:

q_1 = r_1 Q_11 a_1 + r_2 Q_12 a_2 + ... + r_n Q_1n a_n
q_2 = r_1 Q_21 a_1 + r_2 Q_22 a_2 + ... + r_n Q_2n a_n
q_3 = r_1 Q_31 a_1 + r_2 Q_32 a_2 + ... + r_n Q_3n a_n   (4)
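Equation (4) is a discretized spectral integration. A sketch with n = 31 samples (the reflectance, illumination and Gaussian sensor curves below are invented for illustration) makes the under-determination concrete: three recordings constrain thirty-one unknowns.

```python
import numpy as np

n = 31                                    # sample wavelengths across the visible band
lam = np.linspace(400, 700, n)            # nm (illustrative range)

r = np.full(n, 0.5)                       # "unknown" tile reflectance (illustrative)
a = np.ones(n)                            # illumination spectrum A (illustrative)
# Three Gaussian sensor sensitivities Q1, Q2, Q3 (illustrative shapes).
Q = np.stack([np.exp(-((lam - c) / 40.0) ** 2) for c in (450.0, 550.0, 650.0)])

# Eq. (4): q_i = sum_j r_j * Q_ij * a_j for i = 1, 2, 3.
q = Q @ (r * a)
print(q.shape)   # (3,) -- only 3 recordings for n = 31 unknown reflectances
```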
In the above expressions, q_i, Q_ij and a_i are known and the r_i are unknown. As we only know the recordings of the three colour sensors, we have n − 3 degrees of freedom. Ideally, we would like to solve for the unknown reflectances and then blend them again using the sensitivities of the retina cones to work out what intensities the human sensor would have recorded from the same surface. A straightforward solution to this problem is not possible because it is under-determined, as typically n = 31. We make, however, the following assumption: the transformation between the three intensities recorded by the electronic sensors and the three intensities the human sensors would have recorded is affine. This assumption may not hold for the whole 28-dimensional space. However, as we are interested in surfaces which are very similar to each other, we are really concerned with a very small subspace of the colour space. No matter how complicated the relationship between the electronic and the human recordings is, locally it can always be approximated by a linear transformation. With the help of a spectrometer, we measured the reflectances of some typical tiles at the 31 wavelengths of interest. We then randomly chose hundreds of 31-tuples of intrinsic reflectances that complied with the restrictions of the sensor recordings and were confined to the colour subspace of interest as indicated by the spectrometer. For each one of them we found the signals expected
to be recorded by the electronic and by the human sensors. Thus, we created a large number of corresponding triplets of recordings. We then identified the elements of the affine transformation between the two sets of recordings in the least-square-error sense. This transformation can then be used to predict what the human sensor would have recorded, given what the electronic sensors have recorded. Knowing, however, what the human sensor records is not equivalent to knowing what the human brain sees. There is an extra non-linear process which converts the sensory recordings to perceptions. In Lab coordinates we know that the Euclidean distance between any two points is proportional to the perceived colour difference between the two colours represented by these two points. Thus, after the data have been spectrally corrected and the effects of the spatial and temporal variations of the illumination have been removed as described in [3], they are finally converted into the perceptually uniform colour space Lab, where the colour grading is performed by clustering.

3. Experimental results and Conclusions

The above process was applied to several series of tiles graded by human experts. Figures 1 and 2 illustrate the grading of two sets of uniformly coloured tiles. For the purpose of presentation, tiles classified to the same grade by the human observer are represented by the same symbol. Each tile is represented by its mean values in the Lab system. In panels a we show the tiles without the spectral correction proposed here, and in panels b after the proposed correction. In both panels the orientation of the axes is the same, and we can see that after the proposed correction the clusters identified by the humans become more distinct. This conclusion was confirmed by similar experiments with other sets of tiles.
We conclude by stressing that when the vision system developed has to replace human inspectors operating at the threshold of their vision ability, effects like the one discussed in this paper become significant and have to be taken into account.
Figure 1: Colour Shade Grading of Linz tiles. Tiles represented by the same symbol were classified to the same colour class by human experts.
Figure 2: Colour Shade Grading of Koala tiles.

Acknowledgements

This work was carried out under the ASSIST project, BRITE-EURAM II 5638. We also want to thank Dr. K. Homewood for his help in taking the spectrophotometric measurements.
References
[1] Boukouvalas, C., Kittler, J., Marik, R. and Petrou, M. (1994). "Automatic grading of ceramic tiles using machine vision". Proceedings of the 1994 IEEE International Symposium on Industrial Electronics, Santiago, Chile, pp. 13-18.
[2] C. Boukouvalas, J. Kittler, R. Marik & M. Petrou, "Automatic Colour Grading of Ceramic Tiles Using Machine Vision", to appear in IEEE Transactions on Industrial Electronics, February 1997.
[3] C. Boukouvalas, J. Kittler, R. Marik & M. Petrou, "Automatic Grading of Textured Ceramic Tiles", Machine Vision Applications in Industrial Inspection, SPIE 2423, San Jose, 1995.
[4] Wyszecki, G. & Stiles, W. S., "Color Science", 2nd Edition, Wiley, New York, 1982.
COLOUR OBJECT RECOGNITION USING PHASE CORRELATION OF LOG-POLAR TRANSFORMED FOURIER SPECTRA

A. L. Thornton, S. J. Sangwine
The University of Reading, England.
Abstract

The knowledge of the rotation, scaling and translation of an object in comparison with a reference object is important in the recognition process. The work which is described below uses Fourier transforms, log-polar coordinate transformations and phase correlation, together with a complex number representation for colour, to determine these variances and recognise a coloured object.
Introduction

Much research has been conducted into the recognition of objects in monochrome images using frequency domain processing. This, however, ignores the useful information that can be contained in colour representations. The work which is described in this paper uses a novel colour representation together with a new combination of Fourier and Log-Polar transforms to make possible colour object recognition with invariance to translation, scaling and rotation. The importance of phase in signals has been shown [1] and this has led to the idea of using phase to locate coloured objects. An established frequency domain technique for locating objects, Phase Correlation, is described and the advantages of combining the colour representation, the Log-Polar Transform and the phase correlation technique are demonstrated.
Overview

The Fourier transform can be thought of as a translation invariant algorithm, but it will not overcome problems associated with the scaling and rotation of an object in an image. One method to remove these variations is the use of the Fourier-Mellin Transform, which has been well documented in the literature [2]. This procedure consists of an FFT followed by a log-polar transform followed by another FFT. The first FFT removes translation variance since the spectrum of an object will be similar no matter where the object is located in the image. The Log-Polar Transform, [3,4], reduces rotation and scaling to translations which are then made invariant by the second FFT. To achieve recognition, the result is then correlated with another image which has undergone the same process. However, a disadvantage with this process is that it does not make the best use of the information available, as the result will only determine whether there is a similar object in both images. It would be more useful to be able to quantify the scaling, rotation and translation. The process of Phase Correlation, which is described below, has therefore been introduced and this is the main novel feature of the work reported in this paper. This processing is inspired by the Fourier-Mellin transform, but is able to quantify the rotation, scaling and translation and recognise object colour. The block diagram of the system is shown in figure 1 and will be discussed later.
Complex Log-Polar Transformation

The log-polar coordinate transformation is a method of sampling an image in such a way that if an object is rotated this causes the log-polar transformed image to move up or down in comparison with a reference image. In a similar manner, if an object is scaled this causes the log-polar transformed image to move right or left in comparison with a reference image. The amount of shift on either axis is indicative of the amount of scaling or rotation undergone by the object of interest. A constraint on the complex log-polar transform procedure is that the object of interest must be near the centre of the image. However, if a Fourier transform is calculated before the coordinate transformation this constraint is overcome, since the data is inherently centred in the spectrum. Thus, by applying a Log-Polar transformation to the spectrum, we avoid the need to locate the centre of the object of interest.
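A minimal nearest-neighbour sketch of the log-polar sampling described above (the grid sizes and the sampling scheme are our own choices, not the authors'):

```python
import numpy as np

def log_polar(img, n_rho=64, n_theta=64):
    """Sample a square image on a log-polar grid centred on the image centre.

    Rotation of the input becomes a shift along the theta axis; scaling
    becomes a shift along the log-radius axis. Nearest-neighbour sampling.
    """
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    # Radii spaced uniformly in log(r), from 1 pixel out to the image border.
    rho = np.exp(np.linspace(0.0, np.log(min(cy, cx)), n_rho))
    theta = np.linspace(0.0, 2 * np.pi, n_theta, endpoint=False)
    ys = cy + rho[None, :] * np.sin(theta[:, None])
    xs = cx + rho[None, :] * np.cos(theta[:, None])
    return img[np.round(ys).astype(int), np.round(xs).astype(int)]

img = np.zeros((65, 65))
img[20:45, 20:45] = 1.0          # a centred square test pattern
print(log_polar(img).shape)      # (64, 64)
```

Applying this to a magnitude spectrum rather than the spatial image gives the centring-free behaviour described in the text.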
(It should be remembered that the rotation and scaling of an object in an image causes rotation and inverse scaling of the components of the spectrum due to the object.)

[Diagram: each of the two input images passes through an FFT and a Log-Polar transformation; phase correlation of the transformed pair yields a rotation and scaling peak, used to rotate and scale the object image; a second phase correlation then yields the translation peak.]
Figure 1. Translation, Scaling and Rotation Invariance System Diagram

Phase Correlation

If object recognition is to take place, the location of the object in the image must be found. Phase correlation, [5,6], is a method for determining the translation of an object between one image and another. The result of the computation produces a peak corresponding to the spatial displacement of the object, which can be used to locate the object in an image. A reference image is compared with another, which we call the object image, by multiplying the FFT of one (G1) by the complex conjugate of the FFT of the other (G2*). The normalised cross-power spectrum is obtained, and from this the phase correlation surface (P) is calculated by taking the inverse Fourier transform (F^-1) of the spectrum. Assuming that an object is contained in both the reference and object image, the result is an intensity peak in the phase correlation surface (P) whose position can be used to determine the displacement between the reference and object image. The calculation is shown in eqn. 1.
P = F^-1 [ (G1 G2*) / |G1 G2*| ]   (1)
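Eqn. 1 translates directly into a few lines of FFT code. The sketch below (with an invented random test image, and a small epsilon added for numerical stability, both our own choices) recovers a known shift:

```python
import numpy as np

def phase_correlate(ref, obj):
    """Phase correlation surface P of eqn. 1; its peak gives the displacement."""
    G1 = np.fft.fft2(obj)
    G2 = np.fft.fft2(ref)
    cross = G1 * np.conj(G2)
    # Normalised cross-power spectrum; eps guards against division by zero.
    return np.real(np.fft.ifft2(cross / (np.abs(cross) + 1e-12)))

rng = np.random.default_rng(0)
ref = rng.random((64, 64))                          # reference image (random texture)
obj = np.roll(np.roll(ref, 5, axis=0), 9, axis=1)   # same content shifted by (5, 9)
P = phase_correlate(ref, obj)
peak = tuple(int(v) for v in np.unravel_index(np.argmax(P), P.shape))
print(peak)   # (5, 9)
```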
The same method that is used for the phase correlation of intensity images can be used for the phase correlation of colour images by using the IZ colour representation. This method, which has been more thoroughly discussed in [7], uses a complex number, Z, to represent the colour information, where hue is the argument of Z and a value related to saturation is the modulus of the complex number. The intensity, I, is represented separately. Because a complex number containing colour information is used to represent the image, the result of the phase correlation can discriminate between the different colours of similarly shaped objects. The argument of the displacement peak (which is complex) gives an angle whose value corresponds to the difference in colour between the object in the reference image and that in the object image. The advantage of using the complex colour representation is that the colour of the displaced object is calculated as part of the location procedure with no extra processing. If a monochrome image were used in the location procedure, the object would be found but it would not be possible to estimate its colour. Therefore, extra information is gained for no extra processing than a monochrome image would require.
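A sketch of this complex coding is given below. The exact hue/saturation definitions of [7] are not reproduced in this paper, so an HSV conversion is used here as a stand-in; the arithmetic of comparing two colours via the argument of a complex ratio is the point being illustrated.

```python
import numpy as np
import colorsys

def iz_encode(r, g, b):
    """IZ-style coding of one rgb pixel (sketch): intensity-like value I plus a
    complex Z whose argument is the hue and whose modulus relates to saturation."""
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    return v, s * np.exp(2j * np.pi * h)

I1, Z1 = iz_encode(1.0, 0.0, 0.0)   # red
I2, Z2 = iz_encode(0.0, 1.0, 0.0)   # green
# The angle between the two colours is the argument of the ratio Z1/Z2,
# analogous to reading colour off the argument of the correlation peak.
print(round(float(abs(np.angle(Z1 / Z2))), 3))   # 2.094 (i.e. 2*pi/3 of hue difference)
```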
Translation, Scaling and Rotation Invariance
The amount of scaling and rotation between the two images can be determined by combining phase correlation and the log-polar coordinate transformation. The block diagram of figure 1 illustrates the processes involved; the letters within circles in the diagram indicate images which appear as outputs
in figure 2. Some of the processing required to implement the block diagram can be performed before the capture of the object image. The reference image can be captured and stored in advance, and its FFT, Log-Polar transform and the FFT required for phase correlation can be calculated. This will save processing time when object images are to be compared to a reference image. As can be seen in figure 1, after each of the two images has undergone a Fourier transform and log-polar coordinate transformation, phase correlation of these two transformed images is calculated. Information about the position of the phase correlation peak can then be used to alter the object image so that scaling and rotation variance is removed. This is a more precise method than iteratively rotating the spectrum by small angles and altering the scaling until the result is found to match the spectrum of the reference image, [8]. Once the scaling and rotation differences have been removed, the translation of the object can be found by phase correlation between the original reference image and the corrected object image and, as discussed above, the colour of the object found. The outputs of these processes are shown in figure 2, where each letter indicates at which point in figure 1 the output was obtained.
Figure 2. Outputs from the processes described in figure 1
Figures 2a and 2c show example inputs to the system. Each spatial image is Fourier transformed and a log-polar coordinate transform applied, the outputs of which are shown in figures 2b and 2d. These outputs are then phase correlated so that the rotation and scaling difference between one image and the next can be found from the correlation peak (figure 2e). In this case the peak occurs at (14,18), which corresponds to a rotation of 25° and a scale change of about 0.78. Using this information, one of the spatial images is corrected for rotation and scaling (figure 2f) and the result of this is phase correlated with the untouched spatial input. The resultant correlation peak (figure 2g) indicates the translation of the reference object relative to the object in the second spatial image. In addition, the colour of the second object is found by calculating the argument of the complex correlation peak.
Conclusion

The research presented in this paper enables the quantification of rotation, scaling and translation between a reference object and another arbitrary image containing the object. The results show that it is possible to do this without having to perform iterative calculations to determine these values. It is also possible to determine whether the object is of the desired colour, due to the colour representation which is used.

References
1. Oppenheim A V, Lim J S, 1981, 'The importance of phase in signals', IEEE Proceedings, 69(5), 529-541.
2. Li Y, 1992, 'Reforming the theory of invariant moments for pattern recognition', Pattern Recognition, 25, 723-730.
3. Wilson J C, and Hodgson R M, 1992, 'A pattern recognition system based on models of aspects of the human visual system', IEE 4th Int. conf. on image processing and its applications, 258-261.
4. Reitbock H J, and Altmann J, 1984, 'A model for size and rotation invariant pattern processing in the visual system', Biological Cybernetics, 51, 113-121.
5. Kuglin C D, and Hines D C, 1975, 'The phase correlation image alignment method', Proc. IEEE conf. on cybernetics and society, 163-165.
6. Pearson J J, Hines Jr. D C, Golosman S, Kuglin C D, 1977, 'Video-rate image correlation processor', Proc. SPIE conf. on application of digital signal processing (IOCC), 119, 197-204.
7. Thornton A L, and Sangwine S J, 1995, 'Colour object location using complex coding in the frequency domain', IEE 5th Int. conf. on image processing and its applications, Heriot-Watt University, Edinburgh, UK, July 4-6 1995, 410, 820-824, Institution of Electrical Engineers, London 1995.
8. Lee D-J, Krile T F, Mitra S, 1988, 'Power cepstrum and spectrum techniques applied to image registration', Applied Optics, 27(6), 1099-1106.

Acknowledgment

This research is supported by The University of Reading Research Endowment Trust Fund.
SIIAC: Interpretation System of Aerial Color Images

Salim Mouhoub, Michel Lamure and Nicolas Nicoloyannis
URA 934 - CNRS, Université Claude Bernard Lyon 1, 43, bd du 11 Novembre 1918, 69622 Villeurbanne, France

1. Introduction
In this paper, a general methodology is presented to solve the problem of the interpretation of aerial color images. This problem must be divided into several levels of abstraction corresponding to different classes of methods, or low- and high-level algorithms. In our work, we are particularly interested in the high-level part.

2. General presentation
Given the diversity of the information (color, texture, geometry) used to identify the different objects contained in an image, and the importance of some types of information, such as color knowledge, which demand a particular process, we preferred to adopt a strategy based on the blackboard structure [1]. We thus associated with every type of information a knowledge source (KS) or "specialist". These specialists cooperate around a common facts base, called the blackboard, which contains all the data concerning the image. In SIIAC, the control is realized by the blackboard's monitor. SIIAC consists of three main parts: the knowledge sources, the blackboard and the control. Its general architecture is presented in Fig. 1.
[Diagram: the KS "Relaxation", KS "Color" and KS "Texture" arranged around the blackboard, together with the control module; control flow and data flow are indicated by arrows.]
Fig. 1 SIIAC architecture diagram.

2.1 The control
In SIIAC, the identification of an area is realized by the cooperation of the different KSs. These KSs are distributed into two classes:
- The first class contains the low-level KSs (KS "Color," KS "Texture" and KS "Geometry"). These KSs assign labels to the different areas of the image according to their low-level features, taking into account the geometry, the color and the texture of every area under consideration.
- The second class is represented by the KS "Relaxation". This KS uses the spatial arrangements of the different areas of the image in order to reduce the number of labels assigned to every area. The KS "Relaxation" is based on a system of constraint propagation (discrete relaxation) which allows a consistent labeling between the areas to be constructed.

2.2 The knowledge sources
The knowledge sources contain two parts: a condition part and an action part. The condition part specifies the conditions of application of the KS, and the action part contains the knowledge of the abstraction level for which this source is intended. The knowledge sources read and write information in the blackboard; they do not communicate directly
between them, but only through the common facts base. The SIIAC system consists of four knowledge sources: the KS "Color," the KS "Texture," the KS "Geometry" and the KS "Relaxation". In the following, we detail the KS "Color" and the KS "Relaxation", which are the most important KSs in SIIAC.

2.2.1 The KS "Color"

To define the color, we took four radiometric parameters: the minimum, the maximum, the average and the variance of the gray levels in the three bands R, G, B (min_ng, max_ng, ave_ng and var_ng). In order to construct the color recognition rules, we used 600 representative area samples distributed into six groups. These groups have the following denominations: "clear roof," "dark roof," "brown roof," "tarmac," "vegetation," "shadow." Each group contains 100 areas. We notice that every group is identifiable by a color and, vice versa, every color corresponds to a group. We have therefore, in all, six different colors. For every area we calculated the parameters previously cited in the three bands. We then extracted, for every color, the confidence intervals corresponding to the three basis components R, G, B. For example, the 12 confidence intervals corresponding to one of the colors are:
Band   Ave_ng         Min_ng       Max_ng       Var_ng
R      [51., 66.4]    [41., 54.]   [61., 86.]   [0.00543, 0.123]
G      [60., 69.]     [44., 55.]   [76., 92.]   [0.079, 0.1616]
B      [52.5, 63.]    [41., 55.]   [63., 78.]   [0.05825, 0.124]
where Ave_ng, Min_ng, Max_ng and Var_ng are respectively the average, the minimum, the maximum and the variance of the gray levels. Note that to every parameter correspond three intervals; these constitute a parallelepiped in the R3 representation space. In order to construct our recognition rules, we adopted a strategy based on production rules, bearing in mind that for every parameter (average, minimum, maximum and variance of the gray levels) a color is represented by a point in the R,G,B representation space. We have therefore associated with every parameter, and for every color, three rules (corresponding to the three basis components R, G and B). The following is an example of the rules (using only the average of the gray levels) which permit the green color recognition:

If R_ave_ng of the area >= 51. and R_ave_ng of the area <= 66.4
Then the area is R_Green

If G_ave_ng of the area >= 60. and G_ave_ng of the area <= 69.
Then the area is G_Green

If B_ave_ng of the area >= 52.5 and B_ave_ng of the area <= 63.
Then the area is B_Green
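A minimal sketch of these interval rules, using only the ave_ng intervals of the green color; an area receives the label when at least a threshold number of band conditions hold (the satisfaction threshold discussed in the text). All function and variable names are illustrative, not taken from the SIIAC implementation.

```python
# Interval rules for one color (green) over the ave_ng parameter in the
# three bands.  An area is labeled when at least `threshold` conditions hold.

GREEN_AVE_NG = {           # band -> (low, high) confidence interval
    "R": (51.0, 66.4),
    "G": (60.0, 69.0),
    "B": (52.5, 63.0),
}

def satisfied_conditions(ave_ng, intervals):
    """Count how many per-band interval conditions the area satisfies."""
    return sum(1 for band, (lo, hi) in intervals.items()
               if lo <= ave_ng[band] <= hi)

def has_label(ave_ng, intervals, threshold):
    """Assign the label when the satisfaction threshold is reached."""
    return satisfied_conditions(ave_ng, intervals) >= threshold

area = {"R": 58.0, "G": 65.0, "B": 70.0}     # the B average is out of range
print(satisfied_conditions(area, GREEN_AVE_NG))      # 2 of the 3 conditions
print(has_label(area, GREEN_AVE_NG, threshold=2))    # True
```

In the full system each color carries 12 conditions (4 parameters in 3 bands) and the threshold is fixed experimentally.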
R_Green, G_Green and B_Green are logical variables. After extracting the confidence intervals corresponding to all the colors, we constructed a set of 72 rules (24 rules per band). An area of the image receives the label of a given color if all the conditions of the 12 rules associated with this color are satisfied at the same time (i.e., if the 12 logical variables are all true at once). This constraint is difficult to achieve, and for this reason we defined a threshold called the satisfaction threshold. This threshold determines, for every color, how many condition parts must be satisfied for the selected area to receive the label of the corresponding color. We will see below how to fix this satisfaction threshold. Satisfaction threshold:
In order to fix the value of the satisfaction threshold experimentally, we applied our color recognition rules to a learning sample of 600 areas (100 areas of every type of object). We carried out several tests on this sample, varying the threshold each time. The satisfaction threshold chosen is the largest threshold t which verifies the condition ΔG.R.R(t) < 1%, where

ΔG.R.R(t) = [G.R.R(t) − G.R.R(t−1)] / G.R.R(t−1)
ΔG.R.R(t) is the relative difference of the good recognition rates, and G.R.R(t) is the Good Recognition Rate corresponding to the threshold t. The results are summarized in the following table.

[Table: good recognition rate, non-recognition rate and error rate (%) as a function of the satisfaction threshold, from 12 down to 3. The good recognition rate increases from 86.33% to 87.66% as the threshold decreases, and each column totals 100%.]
We note that: - An area is recognized if the system assigns it exactly one label, and the correct one (good recognition). - An area is not recognized if the system assigns it several labels, including the correct one (non-recognition). - We have a labeling error if the system assigns to the area one or several labels, none of which is the correct one. We remark that the good recognition rate increases when the satisfaction threshold decreases. In our case the value of the satisfaction threshold is fixed to eight, because it is the largest threshold which verifies the above condition. An area receives the label of a given color if at least eight condition parts among the twelve are satisfied at the same time. The detailed results corresponding to a satisfaction threshold of eight are given below:
[Fig. 12: Detailed results of the test, per object class (clear roof, dark roof, gray, dark chestnut, black, ...), for a satisfaction threshold of eight. Out of 300 test areas, 258 (86%) are well recognized, 23 (7.67%) are not recognized and 19 (6.33%) are labeling errors.]

In this article we do not present the KSs "Texture" and "Geometry", because their principle of construction is identical to that of the KS "Color". We present instead the KS "Relaxation," which allows the reduction of the multiple labeling.
2.2.2 The KS "Relaxation"
Every area may possess one or several labels (the unrecognized areas). The problem is therefore to reduce the multiple labeling by discarding the labels which are contradictory between adjacent areas. We are thus confronted with a constraint satisfaction problem. In our system, we chose discrete relaxation as the constraint propagation method because it is inexpensive and belongs to the family of parallel techniques. We present its principle below. We have a constraint graph in which every node (variable) is associated with a set of labels (the possible values of the variable) and whose edges are the (binary) constraints. The principle of discrete relaxation consists of finding a consistent, unambiguous labeling, which we call a labeling solution, in which every node has exactly one label. We define our problem as follows:
- We have an adjacency graph of areas consisting of n nodes representing the different areas of the image, R = {R1, ..., Rn}. - The nodes representing areas with a common boundary are connected by an edge which carries the boundary's length and the link's type. - To every node is associated a set L of labels representing the possible interpretations of the area, L = {L1, ..., L6}. - A set of local constraints C, where the constraint Ci is defined by Ci = (Lm, Ln) with m ≠ n. Example of adjacency constraint:
A shadow cannot be adjacent to a 3D object in the direction of the sun rays (our system must therefore exploit some global data on the image: the position of the sun, the position of the plane, etc.). The shadow plays a very important role in the detection of 3D objects (buildings, houses, hangars, etc.). Example of inclusion constraint:
A tree cannot be included in a house roof. After the construction of the constraint graph, the second stage of the problem consists of finding the largest consistent labeling. For this, we used the algorithm AC4 of R. Mohr and Henderson [2] to eliminate the inconsistent labels by local constraint propagation. 3 Conclusions
In this paper, we presented our methodology for the interpretation of aerial color images. In our work, we are particularly interested in the color information because it brings very important extra knowledge. The KS "Color" facilitated the recognition of some areas (vegetation, tiled roofs, tarmac) which, in grey-scale images, require much greater use of geometric features and contextual information, which complicates the interpretation considerably. To reduce the problem of the multiple labeling, we used discrete relaxation rather than a production rule-based system, because methods based on production rules are often sequential and of exponential complexity, whereas discrete relaxation belongs to the family of parallel techniques and is therefore considerably more efficient. 4 References [1] R. Engelmore and T. Morgan, "Blackboard Systems", Addison-Wesley, 1988. [2] R. Mohr and T.C. Henderson, "Arc and path consistency revisited", Artificial Intelligence, 28: 225-233, 1986.
Session V:
INDUSTRIAL APPLICATIONS
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K.
B.G. Mertzios and P. Liatsis (Editors)
© 1996 Elsevier Science B.V. All rights reserved.
Nodular Quantification in Metallurgy using Image Processing
Virginia Ballarin* - Emilce Moler* - Franco Pessana* - Sebastián Torres* - Manuel González*
*LPMS, Electronics Dept., Engineering School, University of Mar del Plata, J. B. Justo 4302 - (7600) Mar del Plata - Fax: +54-23-810046
e-mail: [email protected]
1. Introduction
In the process of classification and analysis of different alloys in metallurgy, the extraction of metric properties of the involved shapes is a very important step [1]. The analysis and the later classification depend on the accuracy of the obtained measures. The present work describes some feature extraction techniques in Digital Image Processing which give the final user a useful tool for alloy classification in metallurgy. Metric properties of certain particles are obtained and analysed to get relevant statistical distributions, in order to correlate them with some mechanical properties of the system [2]. Measurements obtained with higher accuracy allow for better results in elaborating an optimum analysis of the material. Therefore, precision is a very important issue in obtaining metric properties such as area, perimeter, inertial moments, shape factor and svelte factor. The images to process are photographs of different metallurgical alloys obtained through a microscope. The particles being quantified are called spheroids. The first step is the pre-processing of the images in order to isolate the particles of interest; to accomplish this task we used a morphological filter for noise suppression [3]. We describe the theoretical concepts, the computational difficulties, and the errors in measuring metric properties which are caused by the discrete nature of the digital image. We analyse in detail the main metric properties and show that they are a better choice than a set of commonly-used analytic properties [4].
2. Extraction of Features
Once the images have been filtered using the morphological filter discussed in the paper by Ballarin et al. (1995) [3], we are able to segment them into two regions: the background and the objects of interest. Thus the resulting images have only two gray levels, i.e., we work with binary images. Once the image is segmented, we choose one specific particle and extract the most significant features of the spheroids; that is to say, we are only interested in those features that allow us to achieve a later material classification. The main metric properties of the spheroids we want to measure are the area, perimeter, inertial moments, shape factor and svelte factor.
2.1. Metric Properties
Metric properties are based on the distance δ(V1, V2) between two points V1 and V2 in the image plane. This distance must be a real function of the co-ordinates (xi, yi). For this work we chose the Euclidean distance because it is the one that achieves the minimum error when going from continuous to discrete mathematics. Analytic properties, on the other hand, merely deal with the spatial content of the image without concern for any of the involved shapes. These properties consider the image as an n-dimensional vector whose elements are the invariant moments. Because of the analysis we want to perform on the image, the choice of the metric properties is straightforward.
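To make the choice of distance concrete, the following sketch compares the Euclidean distance used here with two other common digital-image distances (city-block and chessboard); the pixel pair is an illustrative example.

```python
import math

# Three distances between pixel centres v1 and v2: the Euclidean distance
# chosen in the text, and the city-block and chessboard alternatives, which
# deviate more from the continuous geometry for diagonal displacements.

def euclidean(v1, v2):
    return math.hypot(v1[0] - v2[0], v1[1] - v2[1])

def city_block(v1, v2):
    return abs(v1[0] - v2[0]) + abs(v1[1] - v2[1])

def chessboard(v1, v2):
    return max(abs(v1[0] - v2[0]), abs(v1[1] - v2[1]))

v1, v2 = (0, 0), (3, 4)
print(euclidean(v1, v2), city_block(v1, v2), chessboard(v1, v2))   # 5.0 7 4
```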
2.2. Area and Perimeter
As a first step of the characterisation of the spheroids in the image, we consider their areas and perimeters. These features allow a later classification of the particles in different alloys. As we have already mentioned, we first applied a morphological filter in order to remove the spurious particles. This filter was shown to preserve the shapes of the spheroids, allowing for a precise measurement of the area and perimeter of these particles. The area and perimeter can be calculated by the following continuous expressions:

A = ∬_{R(x,y)} dx dy,   P = ∫ √((dx/dt)² + (dy/dt)²) dt    (1)

To evaluate the metrics of the particles, the former expressions must be formulated in the discrete domain:

A = Σ_m Σ_n Pixels, (m,n) ∈ Shape    (2)

P = Σ_m Σ_n Pixels, (m,n) ∈ Contour    (3)
That is to say, both the area and the perimeter of the shape are numbers of pixels. This is the first difference between continuous and discrete mathematics: in the discrete domain, area and perimeter have the same units.
2.3. Svelte Factor
We relate the area and the perimeter of each object in the image by defining a coefficient called the svelte factor, fs:

fs = 4π · Area / (Perimeter)²    (4)

This coefficient gives an idea of the kind of shape we are working with, i.e., how thick or thin the particle is. fs is maximum for circles (fs = 1) and minimum for segments (fs = 0). This factor has the useful property of being invariant under linear transformations such as rotation, translation and scaling. The svelte factor is highly correlated with the nature of the spheroid under consideration; its invariance is therefore useful when dealing with images obtained from different alloy taps.
2.4. Form Factor
The form factor is another important metric property relevant to a later shape classification in metallurgy images. Some related expressions must be defined before we can introduce this feature.
2.4.1. Mass Centre
The mass centre of an arbitrary shape is defined as the point where all the mass of the form can be considered to be concentrated and where the resultant of all the forces applied to the particle is exerted. In the continuous domain it is defined as:

Xm = (1/A) ∬_{R(x,y)} x dx dy,   Yn = (1/A) ∬_{R(x,y)} y dx dy,   where A = ∬_{R(x,y)} dx dy    (5)

The equivalent discrete version of Eq. 5 is:

Xm = (1/A) Σ_m Σ_n m,   Yn = (1/A) Σ_m Σ_n n,   where A = Σ_m Σ_n Pixels    (6)
2.4.2. Inertial Moment
The inertial moment can be defined as the quantity that plays, in the equations of angular motion, the role that mass plays in the equations of linear motion. For a system of n particles, referred to a reference axis, it is:

I = Σ_{i=1}^{n} m_i r_i²    (7)

where I is the inertial moment, m_i is the mass of the i-th element and r_i is the vector radius of the i-th particle referred to the considered axis. If there are infinitely many particles, the sums become integrals. Furthermore, working with a constant density equal to one, the measure of the mass is equivalent to the measure of the area. The inertial moment can be generalised to any order. In the discrete domain we define the inertial moment of order pq as:

μ_{p,q} = Σ_m Σ_n (m − Xm)^p (n − Yn)^q,   (m,n) ∈ Form    (8)
2.4.3. Surface Orientation
The surface orientation can be considered as the principal orientation of a region or surface; it is the angle θ that minimizes the inertial moment. Consequently, the derivative of the inertial moment with respect to θ must be equated to zero in order to obtain this angle. As seen in Fig. 1, the inertial moment of the shape with respect to θ is:

I(θ) = Σ_m Σ_n D²(θ),   where D(θ) = (n − Yn) cos θ − (m − Xm) sin θ    (9)

Differentiating Eq. 9 with respect to θ and equating to zero, the following expression is obtained:

θ = (1/2) arctan( 2 μ_{1,1} / (μ_{2,0} − μ_{0,2}) )    (10)

where μ_{p,q} is obtained from Eq. 8.
2.4.4. Circumscribed Rectangle of a Shape
The circumscribed rectangle of an irregular form is the smallest rectangle that encloses the form and that is oriented in the principal direction θ of the object. Fig. 2 shows the circumscribed rectangle of a shape. A change of co-ordinates is needed in order to calculate the lengths of the sides of the rectangle: θ is calculated using Eq. 10, and the following transformation is applied to the points of the object:

x̂ = x cos θ + y sin θ,   ŷ = −x sin θ + y cos θ    (11)

Finally, we define the form factor as the quotient between the sides of the circumscribed rectangle of the shape, i.e.,

ff = Lrc / lrc    (12)

where Lrc and lrc are the lengths of the two sides of the rectangle.
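The chain of Eqs. 6, 8 and 10-12 can be sketched as follows. The pixel set (a 6×2 horizontal bar) is an assumed example whose principal orientation should be zero and whose side ratio is 5:1; atan2 is used for Eq. 10 (an assumption, to handle the quadrant and the μ20 = μ02 case robustly).

```python
import math

# Mass centre (Eq. 6), central moments (Eq. 8), principal orientation
# (Eq. 10), rotation into the principal frame (Eq. 11) and form factor
# from the circumscribed rectangle (Eq. 12).

def central_moment(shape, xm, yn, p, q):
    """mu_pq of Eq. 8 about the mass centre (xm, yn)."""
    return sum((m - xm) ** p * (n - yn) ** q for m, n in shape)

def orientation_and_form_factor(shape):
    area = len(shape)
    xm = sum(m for m, _ in shape) / area          # Eq. 6
    yn = sum(n for _, n in shape) / area
    mu11 = central_moment(shape, xm, yn, 1, 1)
    mu20 = central_moment(shape, xm, yn, 2, 0)
    mu02 = central_moment(shape, xm, yn, 0, 2)
    theta = 0.5 * math.atan2(2 * mu11, mu20 - mu02)   # Eq. 10
    # Eq. 11: rotate the pixels, then take the sides of the bounding rectangle
    rot = [(m * math.cos(theta) + n * math.sin(theta),
            -m * math.sin(theta) + n * math.cos(theta)) for m, n in shape]
    xs, ys = zip(*rot)
    sides = sorted((max(xs) - min(xs), max(ys) - min(ys)), reverse=True)
    return theta, sides[0] / sides[1]                 # Eq. 12

bar = {(m, n) for m in range(6) for n in range(2)}
theta, ff = orientation_and_form_factor(bar)
print(round(theta, 4), round(ff, 2))   # 0.0 5.0
```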
Fig. 1 Distance to the principal axis for the computation of the inertial moment.
Fig. 2 Circumscribed rectangle of a shape.
3. Example of Application
Fig. 3 shows a pre-processed image of a metallurgy alloy in which the spheroid particles can be seen. In Table 1, the values of the form factor and the svelte factor of some particles have been calculated. Ideally, both the svelte factor and the form factor must be equal to one for perfect spheroid particles. However, some differences can be seen due to the discretization of the image; they are analysed in the following section.
Table 1: Calculated svelte and form factors for 5 spheroids.

Spheroid   Svelte factor    Form factor
1          fe1 = 0.710210   ff1 = 1.083333
2          fe2 = 0.697099   ff2 = 1.083333
3          fe3 = 0.687223   ff3 = 1.000000
4          fe4 = 0.740874   ff4 = 1.111111
5          fe5 = 0.753692   ff5 = 1.076923

Fig. 3 Metallurgy image.
4. Error Analysis
The errors generated in measuring the metric properties originate in the discrete nature of the images. We also have to take into account that the particles are not geometrically perfect. However, neither consideration affects the material classification, because rather than the exact measures, a comparative analysis of the samples is performed.
4.1. Errors in the Svelte Factor
When going from continuous to discrete mathematics, it is unavoidable to incur errors. In a digital image, acquired either with a scanner or with a CCD, the metrics of an irregular shape must be calculated with a finite resolution. The resolution thus limits the accuracy in the measurement of these features. In Fig. 4a a circle with radius R = 3.5u is shown; Figs. 4b and 4c depict the same circle after digitization. From these figures the continuous and the discrete svelte factors are obtained:

fe_c = 4π A / P² = 4π (πR²) / (2πR)² = 1,   fe1_d = 4π A / P² = 4π · 37 / (16)² = 1.82    (13)

e = |fe_d − fe_c| / fe_c    (14)

From Eqs. 13 and 14 we calculate the error, obtaining e = 82%. If we double the resolution of the digitized circle, as in Fig. 4c, the svelte factor is

fe2_d = 4π · 148 / (44)² = 0.96    (15)
Now calculating the error by using Eq. 14, we get e = 3.93%. Comparing Eqs. 14 and 15, we see that doubling the resolution at the acquisition stage considerably decreases the error. In the limit, when the resolution goes to infinity (or becomes very high), the error approaches zero.
Fig. 4 Circles with different resolutions.
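The low-resolution case can be reproduced numerically. The 4-connected contour definition below is an assumption (the paper does not spell out its contour convention), but it yields the counts A = 37 and P = 16 quoted for the R = 3.5u circle.

```python
import math

# Digitise a circle of radius R = 3.5 pixel units on an integer grid and
# compute the discrete svelte factor from pixel counts (Eqs. 2-4, 13-14).

def digitise_circle(radius):
    r = math.ceil(radius)
    return {(x, y) for x in range(-r, r + 1) for y in range(-r, r + 1)
            if x * x + y * y <= radius * radius}

def discrete_svelte(shape):
    area = len(shape)                                  # Eq. 2: shape pixels
    contour = {(x, y) for (x, y) in shape              # Eq. 3: pixels with an
               if any((x + dx, y + dy) not in shape    # outside 4-neighbour
                      for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)))}
    perimeter = len(contour)
    return area, perimeter, 4 * math.pi * area / perimeter ** 2   # Eq. 4

a, p, fs = discrete_svelte(digitise_circle(3.5))
print(a, p, round(fs, 2))   # 37 16 1.82 -> an 82% error against the ideal fs = 1
```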
4.2. Errors in the Form Factor
Before we can calculate the form factor, we need to know the principal orientation of the shape. In order to evaluate this metric it is necessary to measure the centre of mass and the inertial moment. Here too we work with digital images, so we incur errors on account of the discretization. In Fig. 5 the principal orientation of the shape, computed using Eq. 10, is θd = 42.69°. For this shape, the real principal orientation is 45 degrees, so the error is 5.13%. After tracing the boundary and changing co-ordinates as in Eq. 11, we obtain the maximum and minimum values of the shape: Mmax = 6.09u, Mmin = −3.92u, Nmax = 3.10u, Nmin = −2.74u. With these values the form factor becomes

ff = (Mmax − Mmin) / (Nmax − Nmin) ≈ 1.71    (17)

Fig. 5 Principal orientation.
The error in the angle is always present because of the discretization. As in the former section, the error can be reduced by increasing the resolution of the images at the acquisition stage.
5. Conclusions
Digital image techniques allow for feature quantification of irregular shapes. Measures performed manually, i.e., without the use of DIP techniques, show large errors. The application of these techniques, while it does not eliminate the errors, considerably reduces them; working with a resolution of 600 dpi, the errors can be neglected. Obtaining the metric properties accurately allows for better results in elaborating an optimum material analysis. We especially discussed the form factor and the svelte factor, because they are invariant to linear transformations like rotation, translation and scaling. These properties make the two factors useful tools in material classification. We also discussed the unsuitability of the analytic properties, given the nature of the analysis to be performed. We developed the algorithms in C++ and obtained from them the quantitative information of the image for later material analysis. Many metric properties can be obtained with these techniques, but most of them are based on the two factors discussed here. This work is only a starting point towards the use of more sophisticated properties specific to each material under analysis.
6. References
[1] F.B. Pickering, "Basis of Quantitative Metallography", London: Institute of Materials Publication, 1994.
[2] F. Pessana, V. Ballarin, E. Moler, S. Torres, M. González, "Nodular Quantification by Metric Properties using Digital Image Processing", Proc. 3rd Internat. Computación Aplicada a la Industria de Procesos, Villa María, Argentina, 12-15 November 1996.
[3] V. Ballarin, E. Moler, M. González, S. Torres, "Noise Suppression by Morphological Filters", Proceedings of the 2nd International Workshop on Image and Signal Processing: Theory, Methodology, Systems and Applications, Budapest, Hungary, 8-10 November 1995, Vol. 1, pp. 128-132.
[4] M.K. Hu, "Visual Pattern Recognition by Moment Invariants", IRE Trans. Info. Theory, Vol. IT-8, pp. 179-187, August 1962.
[5] A.K. Jain, "Fundamentals of Digital Image Processing", Englewood Cliffs, NJ: Prentice-Hall, 1989, pp. 392-394.
Image Processing in the Measurement of Trash Contents and Grades in Cotton Boshra D. Farah School of Electrical Engineering, The University of New South Wales, Sydney 2052, Australia. (Phone: +61 (2)9385 5375; Fax: +61 (2)9385 5993, E-mail: [email protected])
ABSTRACT
This paper presents the application of digital image processing and analysis to the measurement of trash contents and grades in cotton fibre assemblies. A measuring system was constructed using digital image processing and analysis techniques. Three modes of measurement were investigated, namely transmitted light, reflected light and compound light (a combination of the two). The paper discusses the different modes and the optimum conditions for the measurements. MATLAB and its image processing toolbox were used for thresholding, segmentation, histogram analysis and the determination of the trash content by area. Mathematical expressions were derived to translate the trash content by area into the trash content by mass. The correlations and regressions between the trash content by mass and the different measurement modes were determined and discussed.
1. INTRODUCTION
The existing standardised methods and means for the measurement of trash content in cotton are mainly based on mechanical separation of the trash from the cotton fibres using the Shirley Analyzer [1] and weighing the trash relative to the total mass of the cotton sample. Currently, the determination of cotton grades still depends on the subjective assessment of cotton classers. These methods are time-consuming, labour-intensive and therefore expensive. Furthermore, they are not suited to on-line or field measurements. Attempts were made by Lieberman and Patil [2], using computer vision and pattern recognition techniques, to discriminate among three trash categories: bark, stick and leaf/pepper. However, their approach was not a feasible way of assessing the trash content and accordingly was not successful in assessing the grades of cotton.
2. MEASURING SYSTEM
This paper presents the application of digital image processing and analysis to the measurement of trash contents and grades in cotton fibre assemblies. The measurements may be carried out in the laboratory on selected samples, on-line for automatic quality control of fibre cleaning processes (e.g. ginning, blowroom processes, carding or combing), or in the field (e.g. for testing the cotton crop during harvesting). Initially a measuring system based on the patent [3] (the present author is the main inventor of this patent) was constructed. Fig. 1 shows a schematic diagram of the measuring system. Cotton specimens of known mass (2 g) were prepared and evenly distributed over a pre-assigned area (175 mm × 125 mm). Similarly, control samples for the background fibres and for the grading standards were prepared and tested. Three modes of measurement were investigated, namely measurements with transmitted light, reflected light and compound light, i.e. a combination of the two. An image was taken of the sample using a video camera interfaced to a computer through a video capture card. The image was digitised into pixels (256×256 or 512×512) and for each pixel the brightness level, from 0 (00H) for black to 255 (FFH) for white, was stored in a binary file. The hypothesis is that, for the control sample, which consists of clean cotton fibres (neither black nor white), the histogram of the brightness level will have a single peak, as shown in Fig. 2. On the other hand, the histograms of the actual samples, which include the fibres along with the trash, will
have more than one peak. The threshold is defined as the brightness level which discriminates between the trash and the background fibres. It is determined as the brightness level at which the histogram of the background fibres (control sample) and that of the trash are separated. If the two histograms are completely separated (as in Fig. 2), the threshold is easily determined and can be any brightness level lying between the two histograms. In the case of small or medium overlap (Figs. 3 and 4), it is still possible to determine the threshold, this time as the brightness level at the minimum number of pixels between the two histograms. If the two histograms overlap heavily (the likely case for the transmitted light mode, as seen later), the threshold cannot be determined in the same way; it is then taken as the brightness level which results in the highest correlation coefficient between the measured and the actual values of the trash contents. The trash content by area (Ta) can be estimated as the ratio between the area of the histogram representing the trash (i.e. the number of pixels occupied by the trash) and the total area of the scanned sample (i.e. 256×256 = 64k pixels or 512×512 = 256k pixels). The trash content by mass (Tm) can then be determined as:

Tm = Kp Ta / [1 + Ta (Kp − 1)]    (1)

where Kp is a calibration factor which can be determined experimentally by carrying out a calibration test and using:

Kp = Tm (1 − Ta) / [Ta (1 − Tm)]    (2)

If Kp ≈ 1 (i.e. Ta(Kp − 1) ≈ 0 in Eq. 1), then:

Tm ≈ Kp Ta    (3)

Through the correlations and the regression lines between Tm and Ta, the values of Kp can be determined for the three modes of measurement.
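Eqs. 1-3 can be sketched directly; the calibration values below are illustrative, not measured data from the paper.

```python
# Converting the trash content by area (Ta) into the trash content by mass
# (Tm) via the calibration factor Kp.

def kp_from_calibration(tm, ta):
    """Eq. 2: calibration factor from one known (Tm, Ta) pair."""
    return tm * (1 - ta) / (ta * (1 - tm))

def tm_from_ta(ta, kp):
    """Eq. 1: trash content by mass from the trash content by area."""
    return kp * ta / (1 + ta * (kp - 1))

kp = kp_from_calibration(tm=0.018, ta=0.022)   # calibrate on a known sample
print(round(tm_from_ta(0.022, kp), 3))          # 0.018: Eq. 1 inverts Eq. 2
# and for Kp close to 1, Tm is approximately Kp * Ta (Eq. 3)
```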
Fig. 1 Measurement System.
Fig. 3 Small Overlapping Histograms
Fig. 2 Histograms of the Brightness Level
Fig. 4 Medium Overlapping Histograms
3. EXPERIMENTAL RESULTS AND DISCUSSIONS
Extensive experimental work was carried out; the following is only a part of the results obtained recently. Within the course of Nissyrios' work [4] (a B.E. thesis under the supervision of the present author), tests were carried out on 8 samples: a control sample, 5 standard cotton grades - Good Middling (GM), Strict Middling (SM), Middling (M), Strict Low Middling Plus (SLMP) and Strict Low Middling (SLM) - and 2 actual samples, B1 and B2. Images of each of the 8 samples were taken under the three measuring modes, i.e. the transmitted, reflected and compound light modes. Using MATLAB and its image processing toolbox, the histogram of the brightness level for each image was produced and processed, and the trash content by area, Ta%, was determined.

Table 1: Measurements under the different modes

Grade       Ta% Comp   Ta% Refl   Ta% Trans   Threshold   Tm% Mass
Correl. r   0.8291     0.6586     0.7559      -           (1.000)
GM          0.4889     1.5459     0.3939      106.2       0.7921
SM          0.4427     0.6077     0.4547      105.5       0.468
M           0.7795     1.1024     0.6115      106.2       0.5934
SLMP        1.6285     1.1211     1.3222      98.7        1.1203
SLM         1.5137     1.4038     1.1183      106.4       1.7085
B1          1.9489     1.5358     1.993       105.5       2.179
B2          2.8774     1.3376     3.4164      105.5       1.799
Fig. 5 Transmitted Mode Histogram for B2
Fig. 6 Different Modes Compared with Tm%Mass
Fig. 8 Trash % from Ta% Reflected
Fig. 7 Trash% from Ta% Compound
Fig. 9 Trash % from Ta% Transmitted
The optimum threshold for both compound and reflected light mode measurements, at the brightness level of 158, was determined as the brightness level at which 99.9% of the control sample was brighter than the threshold. For the transmitted light mode measurements (see Fig. 5), due to the irregularity of the fibrous layer, the background had dark areas irrespective of the trash areas; the resulting histogram had multiple peaks and was shifted towards the low-brightness side. Therefore, the threshold for the transmitted light mode measurements was determined by the procedure explained previously (see Figs. 2, 3 and 4). Table 1 and Fig. 6 summarise the results obtained for the 5 cotton grades (GM, SM, M, SLMP and SLM) and the two actual samples (B1 and B2). The measurements included the trash content by area (Ta%) for the three modes, compared with the manual measurements of the trash content by mass (Tm%). The first row in Table 1 shows the correlation coefficient (r) between the Tm% by mass and the Ta% by area. Keeping the same conditions for the measuring environments, the compound mode measurement correlated best with the mass measurement (r = 0.83). Although the reflected mode measurement had a lower correlation coefficient (r = 0.66) than the transmitted mode (r = 0.76), the former may be more desirable than the latter in practice (e.g. for on-line and field measurements) because of its simplicity, needing fewer requirements in preparing the samples. Figs. 7, 8 and 9 show the regression lines and the coefficients of determination (R²) between the Ta% by area for each mode and the Tm% by mass: R² ≈ 0.6, 0.4 and 0.2 for the compound, reflected and transmitted modes respectively. Kp was determined from the linear regression equation in each case (y = Kp·x); accordingly, Kp = 0.82, 1.02 and 0.76 for the compound, reflected and transmitted modes respectively.
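The determination of Kp can be checked numerically: a least-squares fit of y = Kp·x through the origin gives Kp = Σxy / Σx². Using the compound-mode Ta% column of Table 1 against the Tm% mass measurements:

```python
# Regression through the origin, y = Kp * x, with x = Ta% (compound mode)
# and y = Tm% (mass); the closed-form least-squares slope is sum(xy)/sum(xx).

ta_comp = [0.4889, 0.4427, 0.7795, 1.6285, 1.5137, 1.9489, 2.8774]
tm_mass = [0.7921, 0.468, 0.5934, 1.1203, 1.7085, 2.179, 1.799]

kp = sum(x * y for x, y in zip(ta_comp, tm_mass)) / sum(x * x for x in ta_comp)
print(round(kp, 2))   # 0.82, the value reported for the compound mode
```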
The regression lines can be utilised to translate the trash content into the corresponding cotton grade. The cotton grade can thus be determined objectively, easily, more accurately and cost-effectively.
4. CONCLUSIONS
The measurements of the trash contents and the corresponding grades of cotton samples were successfully carried out by applying the thresholding technique to their brightness-level histograms. The compound light mode measurement correlated best with the trash content by mass (R² = 0.6). The reflected light mode had a lower coefficient of determination (R² = 0.4) but may be more desirable than the other modes for its simplicity and its fewer requirements in terms of sample preparation. The transmitted light mode measurement was possible upon selecting an appropriate threshold depending on the brightness-level histogram; however, it had the lowest coefficient of determination (R² = 0.2).
5. ACKNOWLEDGMENT Thanks to the Special Research Grant Committee, Faculty of Engineering, UNSW and also to the School of Electrical Engineering, UNSW, for their financial support.
6. REFERENCES
[1] D. S. Hamby, American Cotton Handbook. New York: Interscience Publishers, 1965.
[2] M. A. Lieberman and R. B. Patil, "Non-lint Material Identification Using Computer Vision and Pattern Recognition," SPIE Vol. 1836, Optics in Agriculture and Forestry, pp. 142-152, 1992.
[3] B. D. Farah, J. L. Woo, and D. H. Mee, "Measurement of Foreign Matter in Fibre Assemblies," Australian Patent Application No. 66403/86 - Australia 10/12/1986 (PH03871 13/12/1985), European Patent Application No. 86309563.4 - 9/12/86 and U.S. Patent Application No. 06/940,590.
[4] Nissyrios, "Microprocessor Based Measurements of Trash Contents and Grades in Cotton", B.E. (Electrical) Thesis, University of New South Wales, Sydney, Australia, 1995.
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
Automated Visual Inspection Based On Fermat Number Transform
J. Harrington, A. Bouridane
Department of Computer Science, The Queen's University of Belfast, Belfast BT7 1NN, Northern Ireland
Keywords: Image Processing, Pattern Recognition, Number Theoretic Transforms.
ABSTRACT
This paper describes the application of the Fermat Number Transform to the problem of automated visual inspection of 2-D images. The transform is defined over a field of integers, so all the computations involved are exact, eliminating errors due to rounding and/or truncation. The main features of the transform are that it has a periodic structure and can be computed using only bit-shift and add operations. This makes it well suited for use in pattern recognition problems. The technique has been implemented and tested successfully on a variety of image flaw types, with promising results.
INTRODUCTION
Automated visual inspection is a well-established subject [1,2]. Early work concentrated on locating and identifying objects by describing their shape and orientation. Yet it is still one of the fastest growing scientific areas, with applications ranging from printed circuit patterns, automobile parts and food products to microcircuit photomasks and printing industry products. The tasks of an automated visual inspection system are basically to remove the need for a trained operator to perform the recognition/inspection, or to enable recognition/inspection that would otherwise be impossible. Typically, this task involves repeatedly checking the same type of object to detect anomalies and hence to treat them. It normally consists of several distinct, computationally intensive processes. One of the first processes usually employed is the pre-processing of the image, so as to enhance the relevant features and aid the detection of defects. This operation may include filtering, segmentation and feature extraction. Analysis of the image is then carried out using previously established models to provide strategies and standards for the inspection process. Many techniques are currently in use for automatic visual inspection. Current inspection applications are characterised by stringent requirements in terms of speed of operation, reliability and conformity; thus the need arises for highly efficient algorithms for finding relevant regions of images [3]. The apparently straightforward issue of speed of operation, coupled with the current state of concurrent system design, indicates that algorithms which lend themselves to parallel implementation are promising candidates for future inspection and image processing system design. This paper is concerned with the development of a novel technique for the inspection of 2-D images, based on the Fermat Number Transform (FNT).
The FNT has a very regular structure and is thus highly suited for use in many pattern recognition applications. The transform has properties similar to those of the Discrete Fourier Transform (DFT) but is defined over a finite field of integers, thus providing exact computations. Moreover, its computation is very simple, consisting only of additions and bit shifts, and thus does not require costly multiplications. Finally, the technique is very sensitive in detecting small irregularities.
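The shift-and-add arithmetic can be illustrated for a modulus of the form 2^B + 1: since 2^B ≡ -1 (mod F), multiplying by a power of two reduces to a shift, a split and one subtraction. The following is our own minimal sketch (the function name is an illustration, not from the paper):

```python
B = 8
F = (1 << B) + 1  # third Fermat number, F3 = 257

def mul_pow2_mod(x, s):
    """Compute x * 2**s mod F using only shifts and adds/subtracts.

    Because 2**B = F - 1 = -1 (mod F), a product wider than B bits
    can be folded back by subtracting the high half from the low half."""
    s %= 2 * B                      # 2**(2B) = 1 (mod F)
    if s >= B:                      # 2**(B+r) = -2**r (mod F)
        return (F - mul_pow2_mod(x, s - B)) % F
    y = x << s                      # at most 2B bits wide
    hi, lo = y >> B, y & ((1 << B) - 1)
    return (lo - hi) % F            # y = hi*2**B + lo = lo - hi (mod F)
```

For example, `mul_pow2_mod(200, 5)` equals 200·32 mod 257 without any general multiplication being performed.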
FERMAT NUMBER TRANSFORM: AN OVERVIEW
The two-dimensional FNT of a sequence x(m,n) of size N×N for a Fermat number Ft is given by [4]:
X(k,l) = Σ_{m=0}^{N-1} Σ_{n=0}^{N-1} x(m,n) α^{mk} α^{nl} mod F_t

where F_t = 2^{2^t} + 1, and

x(m,n) = N^{-1} Σ_{k=0}^{N-1} Σ_{l=0}^{N-1} X(k,l) α^{-mk} α^{-nl} mod F_t

It can be shown that if F_t is chosen such that α^N ≡ 1 (mod F_t) and N^{-1} exists over the range (0..F_t - 1), then the above pair of equations has the DFT-like structure. Furthermore, if α is selected to be a power of 2, say α = 2^β (β integer), then N = 2^{t+1}/β, which allows for a very simple implementation using left and right shifts instead of costly multiplications. For values of β that are powers of 2, N is also a power of two and the Fast Fourier Transform algorithm is applicable. All computations are carried out over rings of integers, and as such the problem of rounding and/or truncation inherent in DFT computations is eliminated, providing exact results. The high sensitivity of FNTs arises because changes in x(m,n) are multiplied by a number 2^{mk}·2^{nl} (if α = 2) that can rise to be almost equal to the modulus F_t. We have shown that if an image x(m,n) is periodic with period Tr×Tc (along rows and columns, respectively), then the transformed image contains only a small number of non-zero pixel elements. In addition, these pixels are very regularly distributed along rows and columns and can be efficiently computed using a simple equation that takes Tr and Tc as inputs:

X(k,l) = (N/Tr)(N/Tc) Σ_{m=0}^{Tr-1} Σ_{n=0}^{Tc-1} x(m,n) α^{mk} α^{nl} mod F_t

with k = i·N/Tr, i = 0, 1, 2, ..., Tr-1 and l = j·N/Tc, j = 0, 1, 2, ..., Tc-1.
This pattern can be clearly seen as a regular and periodic distribution of simple patterns on a display device. If, however, this periodicity is broken (by, say, varying some data along rows and/or columns of the input image), then the above-mentioned structured pattern is also broken and can clearly be seen as a random distribution of patterns. Because of these characteristics of the FNT, it is possible to combine images and periodicity judiciously to obtain an efficient algorithm for flaw detection: if a defect-free image is compared with itself, a perfect periodicity is always shown. Table x-1 shows a 16×16 periodic array with Tr×Tc = 4×8 and Table x-2 illustrates its transform using F3 and α = 2. The results clearly exhibit the above periodicity. If, however, a single array value (i.e., a single defect) is altered in the array (see Table x-3), the periodic structure is destroyed, thus automatically highlighting the defect(s) (see Table x-4).
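The behaviour of Tables x-1 to x-4 can be reproduced with a direct reference implementation of the 2-D FNT over F3 = 257 with α = 2 (so N = 16). This is our own slow O(N⁴) sketch, assuming a randomly generated periodic test image, not the authors' code or data:

```python
import random

F = 257        # third Fermat number, F3 = 2**8 + 1
ALPHA = 2      # alpha = 2 gives transform length N = 16 (2**16 = 1 mod 257)
N = 16

def fnt2d(x):
    """Direct 2-D Fermat Number Transform (slow reference version)."""
    return [[sum(x[m][n] * pow(ALPHA, m * k + n * l, F)
                 for m in range(N) for n in range(N)) % F
             for l in range(N)] for k in range(N)]

# Build a 16x16 image that is periodic with Tr x Tc = 4 x 8.
random.seed(0)
block = [[random.randrange(256) for _ in range(8)] for _ in range(4)]
img = [[block[m % 4][n % 8] for n in range(N)] for m in range(N)]

X = fnt2d(img)
# Non-zero entries appear only at k = i*N/Tr (multiples of 4) and
# l = j*N/Tc (multiples of 2): the regular structure of Table x-2.
on_grid = all(k % 4 == 0 and l % 2 == 0
              for k in range(N) for l in range(N) if X[k][l])

# A single flawed pixel destroys the regularity (cf. Tables x-3 and x-4).
img[5][5] = (img[5][5] + 1) % 256
Xf = fnt2d(img)
off_grid = any(Xf[k][l] and not (k % 4 == 0 and l % 2 == 0)
               for k in range(N) for l in range(N))
```

Here `on_grid` is True for the periodic image, and `off_grid` becomes True once the flaw is introduced, since the change propagates a non-zero term to every transform coefficient.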
[Table x-1: Simulated 16×16 image pattern with Tr×Tc = 4×8.]
[Table x-2: Two-dimensional FNT of Table x-1; the regular pattern can be clearly seen.]
[Table x-3: Simulated 16×16 image pattern with Tr×Tc = 4×8, with a single flawed (noise) pixel.]
[Table x-4: Two-dimensional FNT of Table x-3; the regular pattern is destroyed.]
These equations may be further generalised for any α and any modulus satisfying the criteria mentioned earlier, allowing similar results for the more general Number Theoretic Transforms (NTTs). In this way, similar behaviour can be investigated using NTTs such as the Mersenne Number Transform (MNT). The modulus used with the FNT determines the maximum number of samples N that may be considered in one image, and it is advantageous to choose a modulus that is a prime number. Only the first five Fermat numbers are prime and it will be necessary to investigate the suitability of these primes for detection. However, the error detection and location property is common to all Fermat numbers, and it makes sense to investigate transforms other than the FNT to see if they have especially useful properties for the purpose in hand. Obviously, the choice is dictated by the resolution and size of the images used. The sequence length N may be varied by selecting appropriate values for α and t, as described in Table x-5. In this fashion, a wide variety of sequence lengths may be selected, and various image formats accommodated. In this work, both F3 and F4 were used in the analysis, with α = 2 and α = √2.

F_t        N (α = 2)   N (α = √2)
2^8 + 1       16           32
2^16 + 1      32           64
2^32 + 1      64          128
2^64 + 1     128          256

Table x-5 - Sequence lengths for different Fermat moduli
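The α = 2 column of Table x-5 can be checked by computing the multiplicative order of 2 modulo F_t. A short sketch (our own helper, for illustration):

```python
def order(a, mod):
    """Smallest k > 0 with a**k = 1 (mod mod)."""
    k, x = 1, a % mod
    while x != 1:
        x = x * a % mod
        k += 1
    return k

# For F_t = 2**(2**t) + 1 and alpha = 2, the transform length is 2**(t+1),
# because 2**(2**t) = -1 (mod F_t), so squaring once more gives 1.
assert order(2, 2**8 + 1) == 16     # F3
assert order(2, 2**16 + 1) == 32    # F4
```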
RESULTS
To gauge the effectiveness of the technique, extensive experiments were carried out using a number of images, all with a resolution of 512×484 8-bit pixels. The tests carried out included real grey-level images taken directly via a CCD camera. Simple flaws such as rectangles, squares, etc. were manually introduced into the images. The technique was then tested and flaws were successfully detected, assessed individually and classified according to their sizes (i.e., area and compactness). Different values of α can be selected in order to increase the sensitivity of the technique while retaining the applicability of the FFT algorithm for fast computation (i.e., α = 2^β with β = 1, 2, 4, 8).
For the sake of illustration, the following two sets of figures show the results achieved using simulated images of 16×16 and 32×32 pixels with 256 shades of grey, respectively. The analysis was carried out using the third Fermat number (F3) with α = 2 and α = √2, respectively. In both cases, it can clearly be seen that the regular pattern of the template (defect-free) object image (or the pattern achieved by merging an image with a template) produces and highlights a regular structure in the transform domain (Fig 1(a)-(b) and Fig 2(a)-(b)). If a defect(s) is (are) introduced (Fig 1(c) and Fig 2(c)), the regularity of the FNT of the respective patterns is destroyed, thus indicating the presence of flaw(s) (Fig 1(d) and Fig 2(d)). Finally, the defect(s) can be easily recovered by computing the inverse FNT of the difference of the patterns of Fig 1(b) and Fig 1(d), and Fig 2(b) and Fig 2(d), respectively. This clearly demonstrates the ability of the method to isolate flaw pixels in a test image. In these examples the reader will see both single-pixel flaws and multi-pixel flaws. It is possible to classify multi-pixel flaws by further assessing their shapes and sizes using various methods. Work is currently being carried out in that direction and results will be disseminated.
Fig l(a). Template image = A
Fig 1(b). FNT(A) The regular structure of the transform is clearly visible.
Fig 2(a). Template image = A
Fig 2(b). FNT(A) Again the regular pattern is obvious.
Fig 1(c). Flawed image = B
Fig l(d). FNT(B) The regular structure of the transform domain is destroyed by the single pixel flaw.
Fig 2(c). Flawed image = B
Fig 2(d). FNT(B) Regular structure destroyed.
Fig l(e). FNT(B) - FNT(A) = C
Fig l(f). Inverse FNT(C)
Fig 2(e). FNT(B) - FNT(A) = C
Fig 2(f). Inverse FNT(C)
16×16 images
32×32 images
CONCLUSION AND FURTHER WORK
This paper describes a novel technique, based on FNTs, for use in the automated inspection of flaws in 2-D images. The principle involves firstly applying the FNT to a combination of an error-free image (master) and a flawed image (test). The errors detected are then assessed individually to ascertain the extent of the flaw(s). The technique has been successfully applied to a number of both simulated and real images. Work is currently underway to further enhance the results already obtained and to develop a practical automated visual inspection demonstrator. A comparison with existing techniques will also be carried out with a view to assessing the relative merits of the technique.
ACKNOWLEDGMENTS The authors would like to acknowledge the financial support of the Nuffield Foundation under grant SCI/180/95/204/G.
REFERENCES
[1] Bouridane, A. and Curtis, M.K., "A Parallel Processing Engine for Multi-Gray Level Flaw Detection", Parallel Computing, North-Holland, pp. 577-584, 1992.
[2] Vernon, D., "Machine Vision: Automated Visual Inspection and Robot Vision", IFS Publications Ltd., 1991.
[3] Torres, C., "Computer Vision - Theory and Industrial Applications", Springer, Berlin, 1992.
[4] Bouridane, A., et al., "CMOS VLSI circuits of pipeline sections for 32 and 64-point Fermat number transformers", Integration, the VLSI Journal, 8 (1989), Elsevier Science Publishers, pp. 51-64.
SEGMENTATION OF BIRCH WOOD BOARD IMAGES
D. T. Pham and R. J. Alcock
Intelligent Systems Laboratory, School of Engineering, University of Wales Cardiff, PO Box 917, Newport Road, Cardiff, CF2 1XH, U.K.

ABSTRACT
This paper describes a segmentation system designed to automatically detect defects on birch wood boards. A modular approach is adopted, with each module dedicated to detecting different defect types. The modules are global thresholding, row-by-row adaptive thresholding, multi-level thresholding and vertical profiling. Results are given for segmenting a large number of birch wood boards.

Key words: Segmentation, Automated Visual Inspection, Wood Inspection, Computer Vision, Quality Control.

1 INTRODUCTION
Birch wood veneer boards have a variety of uses including furniture, flooring and vehicle sides. A veneer board is made up of layers of wood sheets. For economic and conservation reasons, it is important for veneer board manufacturers to produce the maximum number of high-quality wood sheets from a given quantity of raw material. The quality of a sheet depends upon the number, type and size of the defects which it contains. To produce boards of different qualities, the wood sheets which comprise the boards are manually graded into quality categories. Due to the stress of the task, human graders do not achieve a high degree of accuracy. To increase accuracy, attempts are being made to replace manual inspection by Automated Visual Inspection (AVI), which employs a camera and image processing routines. The operation of an AVI system for wood inspection can be decomposed into a sequence of processing stages: image acquisition, image segmentation, feature extraction and classification (Fig. 1). First, an image of the sheet is acquired. Second, it is segmented into two areas: clear wood and defects. Third, features are extracted from each located defective region. Fourth, each region is classified into a type using the extracted features and a classifier.
Finally, the sheet is given a grade. Integral to the process of AVI is segmentation, which is defined as "the process which subdivides an image into constituent parts or objects" [Gonzalez and Woods, 1992]. The defects which need to be found by the segmentation module are: coloured streaks, hard rot, holes, pin knots, rotten knots, sound knots, splits, streaks and worm holes. These are illustrated in Fig. 2 together with an example of clear wood.

2 SEGMENTATION TECHNIQUES
Many different techniques are available for segmentation. The most commonly used include edge detection, region growing and thresholding [Gonzalez and Woods, 1992]. Edge detection was tried on birch wood images but two problems were encountered. First, the edges were often found to be incomplete, and then some method of edge linking was required to join the parts of the edges. Second, many erroneous edges were generated, and it was not trivial to determine that these did not represent object boundaries. To implement region growing, it is necessary to decide upon appropriate seeds and similarity criteria for the task. One of the simplest criteria is to join pixels according to the closeness of their grey-levels. However, separate regions can overflow into one another and extra processing is needed to prevent this. The simplest and fastest of segmentation methods is thresholding, and this has been used widely in the field of defect detection on wood boards [Cho et al, 1990; Conners et al, 1989; Kim and Koivo, 1994; Kothari et al, 1991]. Thresholding applied to images of wood boards is based on the idea that defects are either significantly darker or lighter than clear wood areas. The technique is very fast and simple but there are two major problems. First, it is often difficult to determine, automatically or even manually, the points at which the thresholds should be placed.
Second, from practical experiments it was discovered that not all defects differ significantly in grey-level from clear wood. Therefore, even the best automatic threshold selection algorithm would only be suitable for segmenting certain defect types.

3 SEGMENTATION MODULES
It was found that different defects require different segmentation techniques, so four modules were designed to segment the images. Global adaptive thresholding was employed for hard rot; multi-level thresholding for holes, rotten knots and splits; row-by-row adaptive thresholding for coloured streaks, pin knots, sound knots and worm holes; and vertical profiling for streaks. The techniques are explained more fully in [Pham and Alcock, 1996; Alcock, 1996].
3.1 Global Adaptive Thresholding
For very large dark defects, such as hard rot, it is possible to apply the technique of global thresholding for segmentation. The threshold can be determined from the grey-level histogram of the image. There are two basic types of histogram: unimodal and multimodal. A unimodal histogram has one central peak, whereas a multimodal histogram has more than one peak. A special case of the multimodal histogram is the bimodal histogram, which has two peaks. In the case of a unimodal histogram, it would normally be expected that the clear wood pixels are represented by the central part of the peak, with dark defects represented by the tail on the left-hand side and brighter defects by the right-hand tail. An entirely defect-free board should also be represented by a unimodal histogram. With a bimodal histogram, the right peak should represent clear wood pixels and the left peak should represent dark defects. Very bright defects, such as splits and holes, would be denoted by a peak on the far right of a multimodal histogram. To determine the threshold, it is necessary to find the number of peaks contained within the histogram, because the threshold is determined differently depending upon whether the histogram is unimodal or multimodal. Finding the threshold from bimodal histograms has been widely studied, and researchers have reported that the threshold should be placed at the valley in the histogram in order to segment the image [Gonzalez and Woods, 1992]. As previously mentioned, in an image of wood, if the histogram is bimodal then the rightmost peak should represent the clear wood. If the histogram is multimodal then only the two leftmost peaks should be considered; peaks at the right of the histogram represent open defects such as splits or holes. Determining the best threshold from unimodal histograms is a more difficult task and no individual technique is recommended by all researchers.
In the case of the images of birch wood, it was noticed that, for an image containing solely clear wood, the histogram approximately forms a normal distribution. From statistical theory it is noted that in a normal distribution, 95% of values will fall within a range of ±2 standard deviations from the mean. Using this information, the threshold was set at the mean minus two standard deviations.
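The mean-minus-two-standard-deviations rule above can be sketched as follows. The function names are ours; the paper gives no code:

```python
import numpy as np

def global_adaptive_threshold(image):
    """Threshold at mean - 2*std of the grey-level image (unimodal case)."""
    return float(image.mean() - 2.0 * image.std())

def segment_dark_defects(image):
    """Boolean mask: True where a pixel is darker than the clear-wood range."""
    return image < global_adaptive_threshold(image)
```

For a clear-wood histogram that is approximately normal, roughly 2.5% of clear pixels fall below this threshold, while large dark defects such as hard rot fall well below it.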
3.2 Multi-level Thresholding
The second segmentation module used was multi-level thresholding. As well as clear wood pixels forming an approximately normal distribution, it was also observed experimentally that they fall within a range of grey-levels in the central part of the histogram. It would be extremely unlikely for a clear wood pixel to have a very small or very large grey-level; therefore, it can reasonably be assumed that such pixels represent defects. Very dark pixels denote defects such as pin knots, sound knot centres and rotten knots. Very bright pixels, obtained when a back-light is used, represent mainly splits and holes. The advantage of this method of thresholding is that it is extremely fast in hardware. It may be argued that using fixed thresholds could potentially produce incorrect results because clear wood could have grey-levels above or below the chosen thresholds. However, if the thresholds are set at sufficiently high and low values, respectively, then no problems should arise. Indeed, if clear wood pixels do fall outside these thresholds then it is likely that a failure has occurred in a part of the system, such as the lighting or the camera. In this case, it is unlikely that any method of segmentation would be able to segment the image reliably, and so automatic grading should be stopped until the defective part of the system has been repaired or replaced.
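The fixed-threshold scheme of Section 3.2 can be sketched as follows. The cut-off values here are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

# Illustrative fixed cut-offs for 8-bit images (assumed, not the paper's).
T_DARK, T_BRIGHT = 40, 215

def multilevel_threshold(image):
    """Classify each pixel as dark defect, clear wood or bright (open) defect."""
    dark = image < T_DARK        # e.g. pin knots, knot centres, rotten knots
    bright = image > T_BRIGHT    # e.g. splits and holes under back-lighting
    clear = ~dark & ~bright
    return dark, clear, bright
```

Because each pixel is only compared against constants, the scheme maps directly onto fast hardware comparators, which is the advantage noted above.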
3.3 Row-by-Row Adaptive Thresholding
The third segmentation module used was based on a row-by-row approach. In a real-time system, images from a conveyor belt are usually captured by a line-scan camera, so valuable processing time would be saved if this information could be processed immediately as it is scanned. Practical work was carried out with an area-scan camera with an image size of 512 x 512 pixels but, to simulate the operation of a line-scan camera, the image was split into 512 horizontal lines and these were processed in order from top to bottom. The technique operates by taking the grey-levels for the row in the image which is currently being scanned. First, very high and low grey-levels in the row are replaced with more intermediate values. Then, the grey-levels are smoothed by averaging. Finally, the difference between adjacent pixels is reduced to below a pre-specified value. This technique gives expected values of pixels, which can be compared with the actual grey-levels of the pixels. If the calculated expected value of a pixel differs widely from its actual grey-level, this gives evidence that the pixel represents a defect: the more a pixel's expected grey-level differs from its actual grey-level, the more likely it is that the pixel represents a defect. This method finds defects in the horizontal direction, but if there is a long horizontal defect then the expected values in the horizontal direction will not differ significantly from the actual values. For this reason, pixels were also compared with previous rows to check if they differed significantly.

3.4 Vertical Profiling
The fourth segmentation module used was called vertical profiling. The method is used to detect streaks, which are a defect caused in production by a notch in the peeling knife. The streak will be parallel to the direction of motion of the sheet on the conveyor belt.
Because the sheet is thin, if a strong backlight is used then variations in sheet thickness cause differences in brightness. Therefore, streaks show up on the image as a dark vertical line. Vertical profiling operates by summing the grey-levels of each vertical line in the image and creating an array of these values. A dark vertical line in an image can then be found by searching for a valley in the profile.

4 RESULTS AND DISCUSSION
Tests were carried out using 75 images of birch wood boards. These were grey-scale images of size 512 x 512 and were obtained using combined illumination. The method of image acquisition is shown schematically in Fig. 3. The accuracy of the segmentation system in locating the nine different defect categories is given in Table 1. The results show that 93% of the defects in the images were correctly located. Less than 0.1% of the clear wood in the images was detected as objects. The two categories which caused the most difficulty were sound knots and hard rot. It is suggested that an X-ray scanner could be used to detect these defect types.

5 CONCLUSION
To segment images of birch wood boards, considering the disadvantages of commonly used segmentation methods, a new modular segmentation system was developed. Tests showed a segmentation accuracy of 93%.

ACKNOWLEDGEMENTS
The authors would like to thank the EPSRC (Total Technology studentship no. 92311168) and the European Commission (BRITE/EURAM contract BRE2 CT92 0251) for funding this research.

REFERENCES
Alcock, R.J. (1996) Techniques for Automated Visual Inspection of Birch Wood Boards. PhD Thesis, School of Engineering, University of Wales, Cardiff.
Cho, T.H., Conners, R.W. and Araman, P.A. (1990) A Computer Vision System for Automated Grading of Rough Hardwood Lumber Using A Knowledge-Based Approach. Proc. IEEE Int. Conf. on Systems, Man and Cybernetics, Cambridge, MA., pp 345-350.
Conners, R.W., Ng, C.T., Cho, T.H. and McMillin, C.W. (1989) Computer Vision System for Locating and Identifying Defects in Hardwood Lumber. SPIE Vol. 1095, Applications of Artificial Intelligence VII, Columbia, SC., pp 48-63.
Gonzalez, R.C. and Woods, R.E.
(1992) Digital Image Processing (3rd Edn.). Addison-Wesley, Reading, MA.
Kim, C.W. and Koivo, A.J. (1994) Hierarchical Classification of Surface Defects on Dusty Wood Boards. Pattern Recognition Letters, Vol. 15, No. 7, pp 713-721.
Kothari, R. and Huber, H.A. (1991) A Neural Network Based Histogramic Procedure for Fast Image Segmentation. Proc. 23rd Sym. on System Theory, Columbia, S.C., pp 203-206.
Pham, D.T. and Alcock, R.J. (1996) Automatic Detection of Defects on Birch Wood Boards. Proc. Instn. Mech. Engrs, Vol. 210, pp 45-52.
[Fig. 1 - Automated Visual Inspection system framework for wood inspection: Wood Board → Image Acquisition → Image → Segmentation → Objects → Feature Extraction → Features → Classification → Defects → Grading → Grade]
Fig. 2 - Examples of the defects and clear wood
Fig. 3 - Schematic diagram of image acquisition

TYPE            NUMBER   FOUND   MISSED   %
Pin Knot            72      72       0   100
Sound Knot          15      11       4    73
Rotten Knot         14      14       0   100
Hole/Bark           39      39       0   100
Split               19      19       0   100
Worm Hole            8       8       0   100
Discoloration      150     138      12    92
Hard Rot            19      12       7    63
Streak              17      16       1    94
TOTAL              353     329      24    93

Table 1 - The results of segmentation
Techniques for Classifying Sugar Crystallization Images Based on Spectral Analysis and the Use of Neural Networks
Eloisa Susana González Palenzuela, Pastora I. Vega Cruz
Dpt. of Systems Engineering and Automatic Control, Faculty of Sciences, University of Valladolid, Spain. e-mail: [email protected]
Dpt. of Applied Physics, Faculty of Sciences, University of Salamanca, C/ Plaza de la Merced s/n, 37008 Salamanca, Spain. e-mail: [email protected]

Keywords: Image Analysis, Spectral Processing, Feature Extraction, Image Understanding, Neural Networks.

Abstract. This paper presents an image processing system for the automation of the visual supervision task carried out by operators in the sugar crystallization process. The system combines classical techniques of image pre-processing in the spatial domain, image recognition using spectral analysis of the image, and neural networks to form an intelligent sensor for the process. The paper focuses on the feature extraction which allows situations to be classified using a neural net, based on images taken from the real process. Two new techniques are presented in which the form of, and parameters derived from, the frequency spectrum of the image are employed as elements of classification. Some real examples are included, showing the system's reliability and suitability for the complete automation of supervision in the industrial process.

1. Introduction.
Sugar crystallization is one of the most significant stages within sugar manufacture, because it is there that the grain is obtained. At present, in most sugar factories, the different phases from the introduction of the syrup into the tank until the extraction of the full mass of sugar are automated.
During this process, a visual inspection carried out by the operator allows the detection of specific problems associated with the process; once any problem is detected, the process is put into manual and the necessary actions are executed to correct it. This task is needed in order to guarantee the adequate quality of the final product fixed by legal specifications (adequate size and homogeneous grains). It is therefore of interest to automate this part of the process, which is the aim of the present work. Vision techniques have been defined as a process of recognising objects of interest, which pre-supposes the understanding of the images [2], [3], [6]. A digital image processing system is divided into several stages. Segmentation, description and recognition are the most important: the first divides the image into its constituent objects; in the second, the characteristics are obtained in order to differentiate types of objects; and the third allows these objects to be labelled. In the extraction of the crystallization image features, it is more interesting to obtain a global understanding through some general image characteristics than to carry out a detailed description and recognition of them. This paper proposes a system, based on crystallization image processing together with the use of neural networks, that allows a qualitative analysis of the images to be carried out. Two techniques are presented in order to recognize the general image characteristics, both based on spectral analysis of the image. Given that the image's frequency spectrum is different in each of the situations presented, the spectrum form or spectrum-derived parameters can be employed as inputs to a neural network for the recognition of these situations. The utilization of neural networks in the processing and understanding of images has become extensive, especially associative learning networks [6], [7], [8].
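The kind of spectrum-derived parameters alluded to here can be illustrated with a short sketch. This is our own example, not the authors' method: the 2-D FFT magnitude is summarised as a radial energy profile, which could serve as a compact input vector to a classifier such as an LVQ network.

```python
import numpy as np

def radial_spectrum_features(image, n_bins=8):
    """Summarise the 2-D frequency spectrum of a grey-level image as a
    radial energy profile (fraction of spectral energy per frequency band)."""
    spec = np.abs(np.fft.fftshift(np.fft.fft2(image))) ** 2
    h, w = spec.shape
    cy, cx = h // 2, w // 2
    y, x = np.ogrid[:h, :w]
    r = np.hypot(y - cy, x - cx)
    r_max = r.max()
    features = np.empty(n_bins)
    for i in range(n_bins):
        mask = (r >= i * r_max / n_bins) & (r < (i + 1) * r_max / n_bins)
        features[i] = spec[mask].sum()
    return features / features.sum()
```

Intuitively, an image dominated by large homogeneous grains concentrates its energy in the low-frequency bins, whereas a fine, dense grain population spreads energy into the higher bins, giving distinguishable feature vectors for the situations described below.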
A Learning Vector Quantization neural network is used here in the classification of crystallization images. The paper is organized into six sections. First, a description of the industrial process and a statement of the problem to be solved are given. In the next section, a set of images is presented and a spectral analysis of them is carried out, from which are derived the features that allow the classification to be effected. Next, a brief explanation is given of the neural network employed and of the two techniques used for classification. Finally, some conclusions are given.

2. Industrial process description and problem presentation.
Sugar crystallization is produced in tanks named vacuum pans, where the sugar is separated from the syrup in the form of homogeneous and colourless crystals. In order to produce sugar, the vacuum pan is loaded with subsaturated syrup. This is concentrated by heating it under vacuum until conditions of oversaturation are reached. Then, sugar powder is fed into the
tank to seed the crystals, which are then made to grow by introducing more syrup and maintaining the conditions of oversaturation that allow the solution to yield the saccharose in excess over the seeded powder. When the maximum level in the pan has been reached and the crystals have a good shape, the tank is discharged and cleaned [1]. In addition to the instruments which measure the state of the vacuum pans, the process is also monitored through the properties of the mass within the tank, either by taking samples or by observing the growth of the crystals through a microscope located in the wall of the tank. The image acquisition is carried out through this microscope. Some vision indicators, which allow the development of the crystallization to be followed, have been established, and based on those indicators the following situations to be recognized in the images have been defined: a) to establish the initial density of sugar crystals, i.e. to indicate whether there is an adequate population of sugar grains at the beginning of the process. An initial excess of grains in the vacuum pans leads to a slow growth of the grains and consequently a very small size of sugar grains at the end of the process. b) to establish the distribution of grain sizes (i.e. a coarse size distribution) present at the beginning of the growth. At the end of the process, grains as homogeneous as possible are desired. In order to achieve this, similar grain sizes should be maintained from the beginning of the growth stage. c) to establish whether there are groups of crystals between the sugar grains. When the crystals are formed or grow side-by-side in a small space, neighboring faces of several crystals can join together, giving rise to groupings of crystals.
Once these composite grains have formed, they will grow like a normal crystal, although they are deformed, and at the end of the growth period they will have a greater size than the remainder of the grains, making the separation between the sugar and the syrup, in the last stage of sugar production, very difficult. It should be pointed out that the real images obtained in this process are complicated by the effect of non-uniform illumination, which produces zones of shade where the object and the background are not easily distinguished or where the contrast between the objects is very soft. It is assumed, nevertheless, that the best possible segmentation of the image has been obtained, which allows the recognition process proposed in this paper to proceed.
3. Spectral analysis of the images. A sample of real images showing the situations mentioned above, with the corresponding segmented images, is presented in Figs. 1 and 2. These images have been taken from a sugar factory located in Benavente (Spain). The images of Fig. 1 correspond to case a), i.e., the crystal population at the beginning of the process is adequate in Fig. 1a but too high in Fig. 1b. The first three images of Fig. 2 correspond to case b): the crystals present a quite uniform size in Fig. 2a, but in Fig. 2b and Fig. 2c several sizes of crystals can be observed. The fourth image, Fig. 2d, corresponds to case c), i.e., crystals with a very different form to the previous cases can be observed. Cases b) and c) can occur simultaneously during crystallization and are thus detected at the same time.
In order to extract the general image characteristics, a spectral analysis of these images is carried out. Given that the image's frequency spectrum is different in each of the situations presented, the spectrum form or spectrum-derived parameters (spectral maximum, high-energy bandwidth, total spectral energy, etc.) can be employed to distinguish them. For this purpose, the 2-D power spectral density functions, defined in [4], [5], are obtained from the Fourier transforms of the images. The power spectra of the images shown in Fig. 1 can be seen in Fig. 3. In general, there is a great difference in the form of the spectra. The image of Fig. 1a has less energy than the image of Fig. 1b, since the crystal quantity has been increased. Additionally, the power spectrum bandwidth is smaller for the former than for the latter. Figure 1: Population at the beginning of the process. In Fig. 4, the power spectra of the images corresponding to Fig. 2 are shown. Note that the maximum value of the power spectrum is higher for those images containing objects of bigger sizes.
Figure 2: Size distribution and group of crystals.
Figure 3: 2-D power spectra of Fig. 1 images. For these images, it can also be observed that the power spectrum concentrates more energy in a smaller number of frequencies (see Fig. 4c and Fig. 4a). The spectral energy is also higher for images with objects of bigger size.
Figure 4: 2-D power spectra of Fig. 2 images. The values of the spectrum-derived parameters for the images of Fig. 1 and Fig. 2 are given in Table 1. In particular, these parameters retain the image features and simplify the way the information contained in the spectrum is processed.

Table 1: Values of the spectrum-derived parameters.

Images                 Spectral    Right lim. of       Right lim. of       Spectral
                       maximum     frequency band w1   frequency band w2   energy
Adequate population      1.8602      2.86 cm^-1          3.64 cm^-1         147.645
Excess of population     7.9538      7.8 cm^-1           8.32 cm^-1         921.3818
Homogeneous size        28.1786      4.42 cm^-1          5.72 cm^-1         896.6836
Two sizes              193.3685     16.12 cm^-1         15.34 cm^-1        2596.9
Three or more sizes    158.5024     16.38 cm^-1         16.38 cm^-1        3555.4
Deformed crystals      546.8036     14.30 cm^-1         16.38 cm^-1        4033.0
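To illustrate how such spectrum-derived parameters might be computed, the following sketch (our own illustration, not code from the paper; the 95% energy threshold defining the high-energy band limit is an assumption, since the paper does not state its band definition) obtains the 2-D power spectrum of a grey-scale image and extracts a spectral maximum, a band limit and the total spectral energy:

```python
import numpy as np

def spectrum_parameters(image, pixel_size_cm=1.0, energy_fraction=0.95):
    """Sketch of spectrum-derived feature extraction for a grey-scale image.
    Returns (spectral maximum, right limit of the high-energy frequency
    band in cycles/cm, total spectral energy)."""
    f = np.fft.fftshift(np.fft.fft2(image - image.mean()))
    psd = np.abs(f) ** 2 / image.size                 # 2-D power spectrum

    # radial frequency (cycles per cm) of every spectral sample
    ny, nx = image.shape
    fy = np.fft.fftshift(np.fft.fftfreq(ny, d=pixel_size_cm))
    fx = np.fft.fftshift(np.fft.fftfreq(nx, d=pixel_size_cm))
    r = np.hypot(*np.meshgrid(fy, fx, indexing="ij"))

    spectral_max = psd.max()
    total_energy = psd.sum()

    # band limit: smallest radius enclosing energy_fraction of the energy
    order = np.argsort(r.ravel())
    cum = np.cumsum(psd.ravel()[order])
    idx = min(np.searchsorted(cum, energy_fraction * total_energy), cum.size - 1)
    band_limit = r.ravel()[order][idx]
    return spectral_max, band_limit, total_energy
```

The three returned values correspond to the "Spectral maximum", band-limit and "Spectral energy" columns of Table 1, up to the (unstated) normalisation used by the authors.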
4. Image classification using a Learning Vector Quantization network. In this section the proposed image classification techniques are described. Both of them use a Learning Vector Quantization (LVQ) neural network, which consists of two layers: a competitive layer, where the neurons, distributed in the input space, recognize input vectors and classify them into subclasses, and a linear layer, which transforms the previous associations into target classifications defined by the user [10]. The main difference between the proposed techniques is the inputs selected for the net.
4.1. Classification based on the spectrum form. In this case, a representative region of the image spectrum is directly selected and used for training. The network architecture is similar to the one shown in Fig. 5, which has been implemented particularly for the case of crystal size distribution. The numbers of neurons in the hidden and output layers must be chosen adequately for each classification problem, and the number of inputs should be selected to include the distinguishing information of the image spectrum.
In order to illustrate the use of this method, the images shown in Fig. 2 were considered, the aim being to obtain the size distribution and to detect the presence of crystal grouping. In this example, the low frequency band, ±1 cm^-1, was selected because it included all the data of interest. Given that the power spectrum is symmetric, only half of it was used. This resulted in 1681 input parameters. The hidden (competitive) layer and output layer have four neurons, because this is the number of target situations.
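The LVQ training described above can be sketched as follows (our own NumPy illustration; the paper used the MATLAB Neural Network Toolbox [9], and the learning rate, epoch count and prototype initialisation here are our assumptions). The prototype vectors play the role of the competitive layer, and the fixed prototype-to-class assignment plays the role of the linear layer:

```python
import numpy as np

def train_lvq(X, y, n_classes, protos_per_class=1, lr=0.1, epochs=50, seed=0):
    """LVQ1 sketch: move the winning prototype towards inputs of its own
    class and away from inputs of other classes."""
    rng = np.random.default_rng(seed)
    W, labels = [], []
    for c in range(n_classes):
        Xc = X[y == c]
        W.append(Xc[rng.choice(len(Xc), protos_per_class, replace=False)])
        labels += [c] * protos_per_class
    W, labels = np.vstack(W).astype(float), np.array(labels)
    for epoch in range(epochs):
        a = lr * (1.0 - epoch / epochs)                    # decaying learning rate
        for xi, yi in zip(X, y):
            w = np.argmin(np.linalg.norm(W - xi, axis=1))  # competitive layer
            step = a * (xi - W[w])
            W[w] += step if labels[w] == yi else -step
    return W, labels

def predict_lvq(W, labels, X):
    """Linear layer: report the class of the nearest prototype."""
    d = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)
    return labels[np.argmin(d, axis=1)]
```

For the spectrum-form technique, each row of X would hold the 1681 selected spectral samples and n_classes would be four, one per target situation.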
Fig. 5: Architecture of the LVQ network for classification with the spectrum form.

Table 2: Answers of the LVQ networks.

Images   Classification results
2b       Two sizes
2d       Deformed crystals
2a       Homogeneous size
2c       Three or more sizes
Some experimental results are presented in Table 2, illustrating the effectiveness of this technique which, in this case, was able to classify all the situations correctly.
4.2. Classification based on spectrum-derived parameters. A possible variation for carrying out the classification tasks is to use a set of parameters derived directly from the power spectrum. As mentioned before, this significantly reduces the dimension of the input vector.
The network architecture is shown in Fig. 6; in this particular case it has been selected to establish the initial density of sugar crystals (Fig. 1). The number of inputs is the same for all the situations (a, b, c), although the number of neurons in the hidden and output layers varies from one to another according to the objectives, as in the previous technique. In this application, two neurons were used to distinguish between an adequate initial population and an excess of crystal population. Given the reduction obtained in the dimension of the network, this variation produces the results much faster. Fig. 6: Architecture of the LVQ network for classification with the spectrum-derived parameters.
Table 3: Answers of the LVQ networks.

Images   Classification results
1b       Population excess
1a       Adequate population

Some experimental results are presented in Table 3, illustrating the effectiveness of this technique which, in this case, was able to classify all the situations correctly. 5. Conclusions.
This paper is focused on the solution of a real industrial problem, the automation of the monitoring task in the sugar crystallization process. In the current system, the supervision task is carried out by the operators through simple visual inspection. The paper presents two techniques that allow automatic supervision to be carried out, acting as an intelligent sensor of this process. The system was developed by combining classical techniques of digital image preprocessing with neural networks, and the proposed techniques were applied to real images taken from a sugar factory located in Benavente (Spain), showing their effectiveness and suitability for real implementation. The basis of the system is the extraction of global image features. This allows the detection of problems associated with sugar crystal growth, guaranteeing the correct performance of the system and, consequently, the final product quality. 6. References.
[1] Corujo, J. and García, A., "Crystallization: Theory and Practice". SGAE, Factory of Benavente, Spain.
[2] Marr, D., 1985, Vision.
[3] González, R. C. and Woods, R. E., 1992, Digital Image Processing. Addison-Wesley Publishing Company, Inc.
[4] Kay, S. M., 1988, Modern Spectral Estimation: Theory and Application. Englewood Cliffs: Prentice-Hall.
[5] Lynn, P. A., 1989, An Introduction to the Analysis and Processing of Signals, 3rd ed., Macmillan Education.
[6] Kulkarni, A. D., 1994, Artificial Neural Networks for Image Understanding. Van Nostrand Reinhold.
[7] Martínez, M., Cardeñoso, V. and Natowicz, R., Nov. 1994, "Level Flock Digital Image Segmentation Using Kohonen's Self-Organizing Feature Maps", in Proceedings of the AMCA/IEEE International Workshop on Neural Networks Applied to Control and Image Processing, NNACIP'94, pp. 53-61.
[8] González, E. S. and Prada, C., Mar. 1996, "Algorithm for the object position determination in an image based on associative learning networks". 5th International Congress of Informatic Technology News in Havana, Informatic'96, Cuba.
[9] Demuth, H. and Beale, M., Jun. 1994, Neural Network Toolbox for Use with Matlab: User's and Reference Guides. Natick, Mass: The MathWorks, Inc.
Large-scale Electrical Tomography Sensing System to Study Mixing Phenomena M. Wang 1, R. Mann 1, F. J. Dickin 2, T. Dyakowski 1.
1Department of Chemical Engineering, UMIST, Manchester M60 1QD (UK) 2Department of Electrical & Electronics Engineering, UMIST, Manchester M60 1QD (UK)
Abstract A large-scale electrical tomography system has been designed and recently installed at UMIST to better understand imperfect mixing and to improve the design of stirred vessels and their impeller configuration at plant scale [1]. The sensing system was constructed with 8 planes of sensing rings, each containing 16 electrodes, and installed in a 2.7 m3 polypropylene vessel fitted with a standard Rushton turbine. The signal processing components, together with an example of an application, are presented in this paper.
Introduction Over the last two years, UMIST has designed and constructed an ERT data acquisition system (DAS) [2] for applications monitoring the conductivity distribution inside process vessels and pipelines [3]. Resistance tomography can detect local changes in conductivity [4], so the technique can be used, for example, to look at the mixing of a strong salt solution into a weaker background brine. The basic feasibility of using ERT to measure pulse tracer tests and semi-batch addition for miscible single-phase fluid mixing has already been established at the 30 dm3 semi-tech scale [5]. However, the complexity of mixing processes, and as a consequence the difficulty in scaling up, requires the design and development of a tomographic sensing system to interrogate mixing processes at full plant scale. Two parts of the initial studies, the sensor system and its application to mixing, are summarised in the following sections.
Electrical resistance tomographic sensing system Unlike medical ERT systems, in which the sensors are in contact with human skin, the sensors in process ERT must be in continuous electrical contact with the electrolyte inside the process vessel. The effects of electrode size, the object's 3-D orientation, ac common voltage and dc electrode potentials at the electrode-electrolyte interface are all significant in applications of ERT for process engineering. The size and the material of the electrodes are both important in producing and sensitively measuring the electric field distribution. The electrode size effects were simulated by grouping a number of boundary nodes using a 2-D grouped-node finite element method (FEM), with a low conductivity object (0.0011 mS/cm) placed in a 0.11 mS/cm background FEM mesh [6]. The results obtained from the simulation, using different numbers of grouped nodes from one to four, indicate that the smaller the size of the electrode, the higher the sensitivity of the voltage measurement at the measurement electrodes and the higher the common voltage at the current-driven electrodes, when measurement and current excitation use the same electrode system [1]. The electrode angle was selected as 8.2°, as used in previous laboratory-scale experiments, which for the 1.5 m vessel diameter gives 10 cm wide and 4 cm high plates made of stainless steel. The boundary measurements of the electric field from a single plane of sensors are not governed solely by the properties in the 2-D cross section. To obtain a 3-D field distribution, an 8-plane set of sensing rings, each containing 16 electrodes, has been designed and constructed for the 2.7 m3 plant-scale
stirred vessel of principal dimension 1.5 m (Figure 1). An image of three 1.5 litre cone-flask phantoms, positioned differently inside the vessel, was obtained by the sensing system. The resultant image clearly shows their positions inside the stirred vessel when a 2-D modified sensitivity coefficient algorithm (Equation 1) was used to reconstruct these images (Figure 2).
where the relative conductivity value, σ_m, at pixel m is represented by a column matrix P, [η]_n represents the relative value of the boundary measurement, M is the number of pixels, N is the number of boundary measurements, and σ₀ is the conductivity in the homogeneous case used for the reference measurement. The normalized sensitivity matrix [κ]_{m,n} is given by equation (2), using the sensitivity coefficients adopted from Kotre's method
[7]:

κ_{m,n} = a_{m,n} / a_m        (2)

Figure 1: Sensor construction
Figure 2: Detection of three phantoms
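Equation (1) is not fully legible in this reproduction, but a sensitivity-weighted backprojection of the kind described can be sketched as follows (our own illustration only: the per-pixel normalisation of the sensitivity matrix below is an assumption standing in for equation (2)):

```python
import numpy as np

def backproject(eta, S, sigma0=1.0):
    """Sensitivity-weighted backprojection sketch (in the spirit of Kotre [7]).
    eta    : (N,) relative boundary measurements
    S      : (M, N) sensitivity coefficients (pixel m vs. measurement n)
    sigma0 : reference conductivity of the homogeneous case
    Returns the (M,) vector of relative pixel conductivities."""
    K = S / S.sum(axis=1, keepdims=True)   # normalised sensitivity matrix
    return sigma0 * (K @ eta)
```

With this normalisation, a uniform set of boundary measurements backprojects to a uniform conductivity image, which is the behaviour expected of the reference (homogeneous) case.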
An electrode used in ERT is a transducer which converts the electric current in a wire into an ionic current in an electrolyte. The behaviour at the electrode-electrolyte interface is predominantly an electrochemical reaction. If a metal electrode is immersed in an electrolyte in process resistance tomography, ions will diffuse into and out of the metal electrode. An equilibrium will be established which gives rise to a dc electrode potential whose magnitude depends on both the nature of the metal and the electrolyte. The values of electrode potentials for some commonly used metals, measured with reference to a standard hydrogen electrode [8], are listed below: Iron −440 mV, Lead −126 mV, Copper +337 mV, Platinum +1190 mV.
Figure 3: Schematic equivalent circuit for the electrode-electrolyte interface
It might be thought that, as two electrodes are used in the ERT sensing strategy, the electrode potentials should cancel, but in practice the cancellation is not perfect, and the residual can range from several mV to several hundred mV. The differential voltages of the dc electrode potentials can be very much bigger than the amplitude of the ac measurement signals. The reasons for this are, firstly, that the two electrodes are not identical and, secondly, that the electrode potentials change with time [8]. A simplified equivalent circuit for an electrode-electrolyte interface, ignoring the ion diffusion effect in the majority of applications, has been suggested by Hill and Dolan [9], as shown in Figure 3, where V represents the dc electrode potential, and R_c and C_H are the charge transfer resistance and the double layer capacitance. In practice, both the charge transfer resistance and the double layer capacitance can be ignored and the bulk resistance plays the main role at high frequencies, when a current injection measurement strategy is adopted. A simple buffered ac coupling circuit was designed to cancel the electrode dc potentials and speed up the data acquisition. The front-end circuit for measurement is shown in Figure 4. A total of 1024 such switches are used for a sensing system with 128 electrodes. Because hundreds of metres of coaxial cable are used to lead the electrodes, a screen-driven technique was adopted to reduce the cable stray capacitance to less than 5% of the inherent value. Only two measurement channels are driven at any one time, in order to minimise the power consumed and reduce cross-talk. All channels can be independently operated by software control, which leaves the sensing strategies flexible to adapt to different measurement strategies. To reduce the common voltage, a current source pair and a grounded floating measurement (GFM) are adopted [2].
Figure 4: Front-end circuit

A data acquisition speed of 40 ms per frame at a 9.6 kHz signal frequency was achieved using this circuit. The total data acquisition time for the 8-plane sensing system is 0.32 s at a signal frequency of 9.6 kHz.
Application to miscible fluid mixing A sequence of pulse injection mixing images was obtained for a standard Rushton turbine stirring at 75 rpm, when 10 litres of concentrated salt solution (12.4 mS/cm) was injected into the stirred vessel filled with mains tap water (0.1 mS/cm), as shown in Figure 5. Images were reconstructed with the algorithm given in Equation (1), which gives an absolute conductivity with errors of less than ±5% in the homogeneous case. The value obtained after the mixing was completed was used as a mixing index, presenting the mixing as a sequence of isosurface solid-body images interpolated from the eight 2-D image slices. During the first 14 s of the mixing pulse test, a strong radial outflow and a strong swirl in an anticlockwise sense are demonstrated in Figure 5. At 24 s, more of the brine has been mixed and swirled asymmetrically into the lower part of the vessel, and back to the upper part of the vessel at 34 s. Finally, after 44 s the mixing is close to completion. The final conductivity, measured using a CIBA-CORNING conductivity meter after the mixing was completed, was 0.1625 mS/cm.
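The step of assembling the eight tomogram planes into a solid-body isosurface can be sketched as follows (our own illustration: simple linear interpolation along the vessel axis followed by thresholding at the mixing-index value; the interpolation density z_factor is an assumption, as the paper does not specify its interpolation scheme):

```python
import numpy as np

def isosurface_mask(slices, iso=0.155, z_factor=4):
    """slices : (n_planes, H, W) conductivity tomograms in mS/cm.
    Linearly interpolates between adjacent sensing planes and thresholds at
    the mixing-index isosurface value, returning a boolean 3-D volume."""
    n = slices.shape[0]
    zi = np.linspace(0.0, n - 1.0, (n - 1) * z_factor + 1)
    lo = np.floor(zi).astype(int)          # plane below each sample height
    hi = np.minimum(lo + 1, n - 1)         # plane above each sample height
    t = (zi - lo)[:, None, None]           # interpolation weight
    volume = (1.0 - t) * slices[lo] + t * slices[hi]
    return volume >= iso
```

Applied to the brine-pulse data, the True region of the mask would correspond to the solid bodies shown in Figure 5 at the 0.155 mS/cm isosurface value.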
Figure 5: Brine tracer pulse mixing for a standard Rushton turbine at 75 rpm (10 litres of brine (12.4 mS/cm) was injected into a 0.1 mS/cm background. The mixing index/isosurface value for these solid-body images was 0.155 mS/cm)
Conclusions A large-scale electrical tomography system for monitoring mixing processes has been successfully constructed in a 2.7 m3 plant-scale stirred vessel using 8 planes of sensing rings, each containing 16 electrodes. The 3-D electrical resistance distribution can be assembled in 0.3 s using a set of eight tomograms obtained by the sensing system. A typical application to miscible fluid mixing demonstrates that the system can be used as a tool to validate data from chemical engineering CFD modelling and also to monitor mixing processes at an industrial scale, although so far only qualitative images have been reconstructed.
References
[1] Mann, R., Wang, M., Dickin, F.J., Dyakowski, T., Forrest, A.E., Holden, P.J. and Edwards, R.B., 1996, Resistance tomography imaging of stirred vessel mixing at plant scale, in Fluid Mixing 5, IChemE Symposium Series No. 140, pp. 155-166.
[2] Wang, M., Dickin, F.J. and Beck, M.S., 1993, Improved electrical impedance tomography data collection system and measurement protocols, in Tomographic Techniques for Process Design and Operation, edited by Beck, M.S., Campogrande, E., Morris, M., Williams, R.A. and Waterfall, R.C., Computational Mechanics Publications, pp. 75-88.
[3] Dickin, F.J. and Wang, M., 1996, Electrical resistance tomography for process applications, Meas. Sci. Technol., 7, pp. 247-260.
[4] Barber, C.D., Brown, B.H. and Freeston, I.L., 1983, Imaging spatial distributions of resistivity using applied potential tomography, Electron. Lett., 19, pp. 933-935.
[5] Williams, R.A., Mann, R., Dickin, F.J., Ilyas, O.M., Ying, P., Edwards, R.B. and Rushton, S., 1993, Application of electrical impedance tomography to mixing in stirred vessels, A.I.Ch.E. Symp. Series, 293, 8.
[6] Wang, M., Dickin, F.J. and Williams, R.A., 1995, Group-node technique as a means of handling large electrode surfaces, Physiol. Meas., 16, Suppl. 3A, pp. 219-226.
[7] Kotre, C.J., 1994, EIT image reconstruction using sensitivity weighted filtered backprojection, Physiol. Meas., 15, 2A, pp. 125-136.
[8] Brown, B.H. and Smallwood, R.H., 1981, Medical Physics and Physiological Measurement, Blackwell Scientific Publications, Oxford.
[9] Hill, D.W. and Dolan, A.M., 1982, Intensive Care Instrumentation, Academic Press, London.
Session W: IMAGE ANALYSIS II
Structural Indexing of Infra-red Images using Statistical Histogram Comparison Benoit Huet¹ and Edwin R. Hancock² Department of Computer Science, University of York, York YO1 5DD, UK Abstract This paper aims to develop simple statistical methods for indexing into an aerial image database using a cartographic model. The images contain severe distortions due to the acquisition process and the data cannot be recovered by applying a simple Euclidean transform to the model. The underlying representation of the infra-red image map within the database is based on histograms of line-segment pair relative angles. We investigate several alternative methods for histogram comparison and conclude that statistical distance measures (such as Bhattacharyya, Matusita and Divergence) provide significant performance improvements over the standard L1 and L2 norms.

1. Introduction
This paper aims to evaluate the effectiveness of various statistical distance measures in the indexing of images using pairwise geometric histograms. The idea of image indexing using attribute histograms was first exploited in the colour domain by Swain and Ballard [9]. It has since been shown to be effective for indexing according to geometric attributes. For instance, Dorai and Jain [1] have exploited the idea to index range images using a histogram of surface normal orientation. Despite providing an effective means of indexing, these techniques all rely on comparing histograms using a simple L2 norm. Our aim in this paper is to compare the effectiveness of five different probabilistic distance measures in gauging histogram similarity. In particular, we compare the L1 norm, the L2 norm, the Bhattacharyya distance, the Matusita distance and the divergence.
2. The Application Domain
The application vehicle for our study is the problem of indexing into a database of remotely sensed images using a digital map. The problem of accurately and rapidly performing content-based retrieval of images is very complex and is currently an active area of research [5, 6, 2, 8, 7, 4]. The image database used in our study consists of 22 infra-red line scan images. These images are of both rural and urban areas. The images are formed by a line-scan process in the horizontal direction and by aircraft motion in the vertical direction. The main features are man-made road structures which radiate strongly at night time in the infra-red band. These features present themselves as intensity ridges in the infra-red images and are extracted using a relaxational line-finder [3]. Straight-line segments extracted from the infra-red images are used to compute a histogram of pairwise angle differences (see figure 1). We compare the data histograms with those extracted from the cartographic information for road networks in the digital map. Because of the line-scan process used to generate the infra-red images, there is a significant barrel distortion in the horizontal direction. In other words, there is distortion of the data histograms with respect to the model. Some images in the database (figure 3) cover the same area as the digital map (sout60) but are taken at a different aircraft altitude (sout170). Furthermore, images sout60_07.0, sout60_10.0, ..., and sout60_25.0 belong to the same original infra-red image (sout60) but different feature extraction processes have been used to obtain their model data. This gives rise to significant variation in the structure of the line-set.
3. The experiment
The main examples included here illustrate the effectiveness of alternative distance measures when compared to the more common distance norms. The normalised histograms P_D(i) and P_M(i) are composed of 18 distinct bins, each representing the frequency of occurrence of an angle between the lines within each line-set. The smallest angle created by the intersection of two lines ranges between 0 (for collinear lines) and π/2. Since our histogram contains 18 bins, each bin i covers an angle range of π/36.

¹huetb@minster.york.ac.uk   ²erh@minster.york.ac.uk
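The construction of this pairwise-angle histogram can be sketched as follows (our own illustration; the representation of segments as endpoint coordinates is an assumption):

```python
import numpy as np

def pairwise_angle_histogram(segments, n_bins=18):
    """segments: array (K, 4) of line segments (x1, y1, x2, y2).
    Returns the normalised n_bins-bin histogram of the smallest angle between
    every pair of segments, binned over [0, pi/2] in steps of pi/(2*n_bins)."""
    d = segments[:, 2:] - segments[:, :2]
    theta = np.arctan2(d[:, 1], d[:, 0])        # segment orientations
    i, j = np.triu_indices(len(theta), k=1)     # all unordered pairs
    diff = np.abs(theta[i] - theta[j]) % np.pi
    diff = np.minimum(diff, np.pi - diff)       # smallest angle, in [0, pi/2]
    hist, _ = np.histogram(diff, bins=n_bins, range=(0.0, np.pi / 2))
    return hist / hist.sum()
```

Because only relative angles between segment pairs are used, the resulting histogram is invariant to rotation and scaling of the line-set, which is the property exploited for indexing.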
• L1 Norm:                L1(P_D, P_M) = Σ_i |P_D(i) − P_M(i)|

• L2 Norm:                L2(P_D, P_M) = √( Σ_i (P_D(i) − P_M(i))² )

• Bhattacharyya Distance: B(P_D, P_M) = −ln Σ_i √(P_D(i) × P_M(i))

• Matusita Distance:      M(P_D, P_M) = √( Σ_i (√P_D(i) − √P_M(i))² )

• Divergence:             D(P_D, P_M) = Σ_i (P_D(i) − P_M(i)) ln( P_D(i) / P_M(i) )
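For reference, the five measures above can be written directly in Python/NumPy (a sketch; the epsilon guarding empty histogram bins in the divergence is our addition, since the formula is undefined when a bin is zero):

```python
import numpy as np

def l1_norm(p, q):
    return np.sum(np.abs(p - q))

def l2_norm(p, q):
    return np.sqrt(np.sum((p - q) ** 2))

def bhattacharyya(p, q):
    return -np.log(np.sum(np.sqrt(p * q)))

def matusita(p, q):
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def divergence(p, q, eps=1e-12):
    # guard empty bins so the logarithm stays finite
    return np.sum((p - q) * np.log((p + eps) / (q + eps)))
```

All five take a pair of normalised 18-bin histograms and return zero (up to floating-point error) when the histograms are identical.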
Figure 1: A typical infra-red image going through the processing steps leading to the histogram representation. To provide some illustrative examples of our methodology, Figure 1 shows the sequence of processing steps from an infra-red radar map image leading to the extraction of the histogram. Figure 1(a) is the raw image. Figure 1(b) is the result of applying straight-line detection to the output of a probabilistic relaxation line detector. Finally, Figure 1(c) shows the computed histogram based on line-pair angles. Table 1 presents the response of the various distance measures stated above to a digital map image (see figure 2(a)) corresponding to infra-red images sout60, sout60_07.0, sout60_10.0, ..., sout60_25.0 and sout170 (with some viewpoint variations). At first sight, since all the techniques identify sout60 as having the closest histogram similarity to the model data, we might conclude that they deliver comparable performance. However, it should also be noted from these results that the L1 and L2 norms do not perform as well as the other metrics. In particular, the L1 and L2 norms are the only measures providing an incorrect classification of sout170. A graphical representation of the classification made by the Bhattacharyya measure on the database can be seen in figure 3 (only the first 20 best matches, excluding the multiple representations of sout60). The images are ordered according to their goodness of match with the query image (orig60, figure 2(a)). In figure 3, the images are ordered from left to right, top to bottom, with the top-left image having the most similar content to the digital map. The system was also asked to find the most similar infra-red images compared to infra-red image sout90 (figure 3). All distance methods responded with the correct first two matches, sout90 and sout190 (higher aircraft altitude), except the L1 and L2 norms, which found a correct best match (sout90) but failed to provide the true second best match.
Figure 2: Typical images used by the system

4. Conclusion and future work
We have described a way of indexing infra-red aerial images using a digital map based on structure histograms. We have demonstrated that the L1 and L2 norms are not particularly well suited to our histogram matching problem. However, the Bhattacharyya, Matusita and Divergence distance measures are able to correctly index the database of infra-red images using either digital data (digital maps) or real data (infra-red images) as the query. We have shown that a histogram constructed from line-pair angle measurements can be used, together with suitable distance measures, to retrieve by similarity from an image database with invariance to scale and rotation. Furthermore, this technique is not affected by the output of the line extraction process, since images processed using different line extraction parameters have similar rankings at retrieval. Among the future developments of the system, we are currently investigating the use of multidimensional histograms. This would allow the system to compare images based on other invariant structural measurements.

Table 1: Distance measure responses to the query image depicted in figure 2(a). Each column lists the database images ranked by the corresponding measure, with the measured value alongside.

L1 norm: sout20 0.6760, sout100 0.6701, sout10 0.6700, sout200 0.6629, sout210 0.6407, sout140 0.6258, sout70 0.6151, sout190 0.5784, sout40 0.5626, sout80 0.5620, sout220 0.5562, sout110 0.5519, sout120 0.5446, sout30 0.5063, sout160 0.5011, sout150 0.4997, sout50 0.4958, sout90 0.4885, sout180 0.4745, sout60_10.0 0.4685

L2 norm: sout210 0.2302, sout100 0.2263, sout200 0.2258, sout10 0.2109, sout20 0.2025, sout140 0.1944, sout120 0.1903, sout190 0.1893, sout30 0.1799, sout40 0.1759, sout70 0.1732, sout110 0.1703, sout80 0.1660, sout160 0.0544, sout220 0.1596, sout170 0.159(?), sout50 0.1578, sout90 0.1562, sout180 0.1515, sout130 0.1505

Bhattacharyya: sout10 0.1271, sout20 0.1154, sout210 0.0914, sout100 0.0795, sout140 0.0757, sout200 0.0748, sout40 0.0748, sout70 0.0680, sout190 0.0677, sout110 0.0600, sout80 0.0584, sout120 0.0573, sout50 0.0565, sout160 0.3256, sout220 0.0543, sout30 0.0491, sout150 0.0475, sout90 0.0474, sout130 0.0459, sout180 0.0437

Matusita: sout10 0.4886, sout20 0.4669, sout210 0.4180, sout100 0.3911, sout140 0.3819, sout200 0.3797, sout40 0.3797, sout70 0.3627, sout190 0.3618, sout110 0.3414, sout80 0.3369, sout120 0.3338, sout50 0.3314, sout160 0.4342, sout220 0.3253, sout30 0.3096, sout150 0.3046, sout90 0.3045, sout130 0.2997, sout180 0.2924

Divergence: sout10 0.9948, sout210 0.7366, sout20 0.6982, sout100 0.6272, sout40 0.6093, sout140 0.6037, sout200 0.5894, sout70 0.5460, sout190 0.5381, sout110 0.4807, sout80 0.4647, sout50 0.4569, sout120 0.4550, sout160 0.1649, sout220 0.4317, sout30 0.3913, sout150 0.3785, sout90 0.3778, sout130 0.3703, sout180 0.3491
References
[1] C. Dorai and A.K. Jain. View organisation and matching of free-form objects. IEEE Computer Society International Symposium on Computer Vision, pages 25-30, 1995.
[2] T. Gevers and A.W.M. Smeulders. Enigma: An image retrieval system. International Conference on Pattern Recognition (ICPR) 1992, pages 697-700, 1992.
[3] Edwin R. Hancock. Resolving edge-line ambiguities using probabilistic relaxation. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'93), pages 300-306, 1993.
[4] A.K. Jain and A. Vailaya. Image retrieval using color and shape. Pattern Recognition, 29(8), pages 1233-1244, August 1996.
[5] W. Niblack, R. Barber, W. Equitz, M. Flickner, E. Glasman, D. Petkovic, P. Yanker, C. Faloutsos, and G. Taubin. The QBIC project: Querying images by content using color, texture and shape. Image and Vision Storage and Retrieval, 1993.
[6] A.P. Pentland, R.W. Picard, and S. Sclaroff. Photobook: tools for content-based manipulation of image databases. Storage and Retrieval for Image and Video Databases II, pages 34-47, February 1994. San Jose, California.
[7] R.W. Picard. Light-years from Lena: Video and image libraries of the future. IEEE International Conference on Image Processing, 1:310-313, 1995.
[8] M.J. Swain. Interactive indexing into image databases. Image and Vision Storage and Retrieval, 1993.
[9] M.J. Swain and D.H. Ballard. Indexing via colour histograms. Third International Conference on Computer Vision, pages 390-393, 1990.
Figure 3: Result of querying the Infra-Red Radar Map Database with the Digital Map (see Figure 2(a)) using the Bhattacharyya distance measure
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
A MODEL-BASED APPROACH FOR THE DETECTION OF AIRPORT TRANSPORTATION NETWORKS IN SEQUENCES OF AERIAL IMAGES
D. Sarantis and C.S. Xydeas
Division of Electrical Engineering, School of Engineering, University of Manchester, Dover St., Manchester M13 9PL, UK
tel.: +44 (161) 275 4511, fax: +44 (161) 275 4528
e-mail: [email protected]

ABSTRACT
The detection of Airport Transportation Networks (ATNs) in sequences of aerial images is an important task in applications related to the autonomous navigation and landing of aircraft. It is also a complex and demanding problem, due to "variability" in the structure and characteristics of ATNs and to the various forms of "noise" which may corrupt the input sequences. In addition, the complexity of the task increases considerably when perspective effects are present in the images. In order to solve this problem effectively, additional information, in terms of both the aircraft navigation parameters and the airport area, should be "fused" together with the input visual information. This paper describes such an approach for detecting ATNs in airport aerial image sequences obtained by a forward looking airborne camera. Computer simulation results are presented for the system operating on real airport aerial image sequences, and demonstrate clearly the advantage of fusing information obtained from other image independent data sources with the input images. System performance depends on the quality of the input image and the image independent data. At a range of 3.4 to 2.9 km the scheme detects and tracks main ATN elements, such as the main runway, with a 0.36° mean absolute orientation error and a relative length error of 0.2%.
1. INTRODUCTION Currently, advanced work on Avionic imaging systems is focused on solutions that provide the pilot and possibly other aircraft systems, with accurate information on ground objects and structures of specific interest present in aerial video sequences. This is a complex and demanding problem due to (i) the variability in the characteristics of these structures and (ii) the various types of "noise" which usually corrupt the input visual information. Furthermore, the complexity of the problem increases considerably when the input aerial images exhibit strong perspective effects. Schell and Dickmanns [1] proposed an aircraft related computer vision system which locates "runway" image regions in synthetic or real images. These regions were defined initially, using a detailed runway geometric model and were then tracked throughout the video sequence using an Extended Kalman Filter (EKF) approach. Another model-based technique which employed a-priori object geometric model information and camera position data, as provided by the aircraft's instruments, is described in [2]. In this case image regions, in semi-synthetic Passive Millimeter-Wave image sequences, are examined using features extracted from these regions, in order to identify runway instances. Furthermore, recent reports [3,4] on similar systems which are able to cope with perspective images but operate on still aerial images, also suggest the use of other "image independent" information sources. Within this theme, a model-based approach for detecting and tracking Airport Transportation Network (ATN) elements in sequences of obliquely captured airport aerial images is presented in this paper together with experimental results from tests using real data. 
The proposed approach combines information, that is extracted from grey-level CCD image sequences, with that obtained from the following sources: (i) Information related to the imaging system (CCDCD) and specified in terms of the CCD camera focal length, its position on the aircraft, its orientation relative to the ground, and the intrinsic parameters of
the camera/video cassette/digitiser system. (ii) The aircraft position and orientation, which is supplied by the Inertial Navigation System (INS) and the Global Positioning System (GPS) of the aircraft. (iii) A Digital Terrain Elevation Map (DTEM) of the viewed area. (iv) A simple Target Model (TM), which consists of a list of objects that appear in a large-scale map of the airport area. (v) General ATN structural characteristics, which include general design information associated with ATN international standards [5,6]. The architecture of the proposed system is described in the following section, which explains how the above image independent information is used in assisting the image analysis and understanding procedures. The final section of the paper provides computer simulation results, which highlight the performance of this system when it is tested with real airport aerial video sequences.

2. SYSTEM DESCRIPTION
ATN structures, in this work, include runways and cross-roads and can be described adequately by pairs of Straight Line Segments (SLSs). The Hough Transform (HT) approach [7] may be used to group edge pixels into straight lines, in a way which effectively exploits its robust operation when noise and discontinuities are present in the image. However, in the current application, the complexity of the image data precludes the use of a standard HT technique, and led to the development of a more powerful HT scheme that utilises the local properties of the structures of interest [8]. Thus, randomly selected pairs of edgels, which satisfy certain similarity and proximity constraints, are used within the HT framework in order to define possible ATN related line directions. In this way "noise", usually present in the HT parameter space (HS), is significantly reduced, which in turn allows image features to be detected more accurately.
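The pairwise edgel-voting idea described above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the function name, the proximity threshold `max_dist`, the edge-direction tolerance `angle_tol` and the accumulator bin counts are all assumed values.

```python
import math
import random
from collections import Counter

def pairwise_hough(edgels, n_pairs=5000, max_dist=40.0, angle_tol=0.2,
                   theta_bins=180, rho_bins=200, rho_max=512.0, seed=0):
    """Vote for (theta, rho) line parameters using randomly drawn edgel
    pairs, keeping only pairs that satisfy proximity and edge-direction
    similarity constraints.  Each edgel is (x, y, gradient_angle)."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_pairs):
        (x1, y1, a1), (x2, y2, a2) = rng.sample(edgels, 2)
        dx, dy = x2 - x1, y2 - y1
        if math.hypot(dx, dy) > max_dist:              # proximity constraint
            continue
        if abs(math.atan2(math.sin(a1 - a2), math.cos(a1 - a2))) > angle_tol:
            continue                                    # similarity constraint
        theta = math.atan2(dy, dx) + math.pi / 2        # normal of the joining line
        rho = x1 * math.cos(theta) + y1 * math.sin(theta)
        t_bin = int((theta % math.pi) / math.pi * theta_bins) % theta_bins
        r_bin = int((rho + rho_max) / (2 * rho_max) * rho_bins)
        if 0 <= r_bin < rho_bins:
            votes[(t_bin, r_bin)] += 1
    return votes  # peaks in this accumulator are candidate line directions
```

Because only constrained pairs vote, accumulator peaks are far less cluttered than with the standard one-pixel-one-curve HT, which is the effect the text describes.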
Specifically, line directions which are likely to correspond to parts of ATN elements are selected on the basis of (i) their distance from the projection of the ATN Target Model (ATN-TM) into the HS and (ii) their "strength" in this parameter space. Thus, ATN-TM elements are described parametrically by pairs of SLSs, with a degree of uncertainty that depends on the accuracy of the large-scale airport site map information. This information is effectively projected to HS, using INS/GPS data, camera related parameters and the transformation functions defined in [8] and [9]. The distance between the line directions of the parts of the m-th ATN-TM element v_m,TM and a line direction v_HS identified by a HS peak takes the form of the Mahalanobis distance measure [10]:

d(v_m,TM, v_HS) = (v_m,TM − v_HS)^T Σ⁻¹_(v_m,TM − v_HS) (v_m,TM − v_HS)    (1)
where Σ⁻¹_(v_m,TM − v_HS) is the inverse covariance matrix required in evaluating this measure. This covariance matrix is estimated by projecting TM related location uncertainties, as originally defined in a world co-ordinate system, into the above HS. The projection is achieved via transformation functions (see [8,9]). Only line directions satisfying:
A_m(D) = { v_HS ∈ HS : d(v_m,TM, v_HS) < D }    (2)

for the m-th ATN-TM element are considered. From this set of qualified line directions, instances v_m,c that satisfy:

B_m = { v_m,c : v_m,c^s > v_i,HS^s, ∀ i ∈ A_m(D) }    (3)
are defined as those corresponding to parts of the m-th ATN-TM element in the HS. The index s in Equation (3) denotes the "strength" of the qualified line direction in the HS domain. Furthermore, a SLS is defined for each of the selected line directions according to (i) the projection in the image of the associated ATN-TM element line, and (ii) SLS length information formulated in this specific direction during the HT "voting" process. ATN-TM knowledge is also instrumental in producing ATN related pairs of SLSs. Those SLSs corresponding to parts of a specific ATN-TM element are then coupled together, forming pairs of SLSs which are likely to correspond to ATN elements. This is possible since ATN elements are described by parallel SLSs whose lengths, and the distances between them, are predetermined according to international airport design standards [5,6]. Also, SLSs in the image domain should have opposite edge directions. These spatial length, width and edge direction relationships represent general ATN structural characteristics and manifest themselves in terms of system constraints. However, because these constraints are violated when perspective
effects are present [3,4], the system applies them in a Cartesian world co-ordinate system. This is achieved after "backprojecting", with the aid of INS/GPS, DTEM and CCDCD data, extracted SLSs from the image to this co-ordinate system. Specifically, SLS pairs are tested for compatibility with the available general ATN structural characteristics as defined in both the world co-ordinate system and in the image. Thus, a number of constraint rules are used by the proposed system:
• In the world co-ordinate system: (i) SLSs are parallel, (ii) there is an overlap in the projections of SLSs to a common direction, (iii) their lengths are within a predefined range, (iv) the distance defined between the directions of the two SLSs is within a certain range, (v) the aspect ratio of a hypothetical parallelogram that is formed by the two SLSs is much greater than 1.
• In the image domain: (vi) the two SLSs have opposite edge directions.
The system then proceeds with qualified pairs of SLSs which correspond to the same ATN-TM element and consistently appear in consecutive image frames. These are examined in order to select the "best" pair that will finally represent a particular ATN-TM element. A distortion measure is defined for this purpose over f successive image frames using n qualified pairs P_k of SLSs, which takes into account the structural characteristics, in the world co-ordinate system, of a candidate pair P_i of SLSs:
DM(P_i) = (1 / (4·n)) · Σ_{k=1..n} [ Σ_{j=1..3} w_j · |p_i^j − p_k^j| + w_4 · d(p_i^4, p_k^4) ]    (4)
Index j in Equation (4) denotes the length (j=1), width (j=2), orientation (j=3) and the position of the centroid (j=4), in the world co-ordinate system, of the previously mentioned parallelogram; d(p_i^4, p_k^4) is the Euclidean distance between the i-th and k-th pair centroids, and w_j are weights associated with the above structural characteristics. The pair of SLSs in a given frame which results in the minimum distortion measure is thus selected to represent the ATN SLSs in all the f frames of the input video sequence.

3. COMPUTER SIMULATION RESULTS AND CONCLUSIONS
The proposed model-based approach has been tested using real airport aerial image sequences, each containing three ATN elements (MR, CRA, CRB) of different contrast and importance in terms of landing an aircraft. Detection and False Alarm rates measured in these cases on a "per frame" basis, i.e. without utilising the distortion measure defined by Equation (4), are illustrated in Figure 1.a. The system has the ability to locate ATN elements even in cases where the quality of the images is particularly poor and where inexperienced observers have difficulties in correctly identifying these elements, i.e. all ATN elements in the Airport02 sequence and the CRB element of the Airport04 sequence. Notice that the overall system performance depends on the accuracy of the image independent information, particularly that of the INS/GPS data and the CCD camera parameters. The robustness of the system with respect to the above data has been examined with the Airport02 sequence, where both the INS/GPS and camera parameters were highly corrupted. In this case the system can still identify the main runway (MR) for most of the time and provides low False Alarm rates. Figure 1.b illustrates ATN element detection rates for the more accurate aerial video sequence case of the Airport04 sequence, as a function of the number of frames f used in the verification process. This multi-frame scheme offers zero False Alarms and even higher detection rates, when compared to the "per frame" case. Medium and high contrast ATN elements, such as the MR and CRA, result in reasonably high detection rates when the number of frames f used by the multi-frame scheme is f > 5. However, in this case the detection of low contrast ATN elements, like CRB, is poor due to inconsistencies in detecting parts of this structure throughout a large number of successive frames.
In addition to the above typical ATN detection performance, experiments were also carried out in order to determine the accuracy of the system in estimating correctly the structural characteristics of detected ATN elements. Thus the Mean Absolute Differences (MADs) were measured for ATN elements which are aligned with (i.e. the MR ATN element) or which are perpendicular to the flight path of the aircraft (i.e. the CRA ATN element). The figures quoted below were measured with the aircraft being within the range of 3420 to 2960 meters from the airfield's reference point. In the Airport04 sequence the minimum MAD orientation
FIGURE 1. (a) ATN elements detection and False Alarms rates measured in the "per frame" case, for three ATN elements in two input video sequences. (b) ATN elements detection rates of the proposed multi-frame system, for the same ATN elements in the Airport04 sequence, as a function of the number of frames f used. No False Alarms are observed in this case.
observed was 0.36° (MR case) and the maximum 3.6° (CRA case). A MAD length of 3.5 meters was measured for the MR case, which corresponds to a 0.2% relative length error. For the CRA element, the MAD length was 15.5 meters, which corresponds to a 1.2% relative length error. The maximum absolute width difference observed was 28 meters, at a distance of 3380 meters from the reference point, and the MR and CRA MAD widths were 5.3 and 2.6 meters, resulting in relative width errors of 11.4% and 5.6%, respectively. These performance characteristics are typical of the system operating with corrupted image independent information. Notice that the fidelity of this information is the enabling factor that allows the system to operate at the maximum required performance in critical applications, such as the autonomous landing of aircraft.
ACKNOWLEDGEMENTS This work was supported by the Military Aircraft Division of British Aerospace (Defence) Ltd. and the Engineering and Physical Sciences Research Council (EPSRC).
REFERENCES
[1] Schell, F.R. and Dickmanns, E.D. "Autonomous Landing of Airplanes by Dynamic Machine Vision", Proc. IEEE Workshop on Applications of Computer Vision, Nov./Dec. 1992.
[2] Tang, Y.L., Devadiga, S., Kasturi, R. and Harris Sr., R.L. "Model-Based Approach for Detection of Objects in Low Resolution Passive Millimeter-Wave Images", Proc. SPIE: Image and Video Processing II, Vol. 2182, Feb. 1994, pp. 320-330.
[3] McGlone, J.C. and Shufelt, J.A. "Projective and Object Space Geometry for Monocular Building Extraction", Proc. IEEE Conf. on Computer Vision and Pattern Recognition, June 1994, pp. 54-61.
[4] Jaynes, C., Stolle, F. and Collins, R. "Task Driven Perceptual Organization for Extraction of Rooftop Polygons", Proc. 23rd ARPA Image Understanding Workshop, Vol. I, Nov. 1994, pp. 359-368.
[5] Horonjeff, R. and McKelvey, F.X. "Planning and Design of Airports", 4th ed., McGraw-Hill Inc., 1994.
[6] International Civil Aviation Organisation "Aerodrome Design Manual, Part 1: Runways", 2nd ed., Canada, 1984.
[7] Leavers, V.F. "Which Hough Transform?", CVGIP: Image Understanding, Vol. 58, No. 2, Sept. 1993, pp. 250-264.
[8] Sarantis, D. and Xydeas, C.S. "A Methodology for Detecting Man-Made Structures in Sequences of Airport Aerial Images", Proc. Int. Conf. on Digital Signal Processing, Cyprus, Vol. 2, June 1995, pp. 565-570.
[9] Bryson, N.F. "FuseNTS - Fusion of Navigation, Terrain, and Sensor Data: Phase I: Work Package W2 - Model-Based Feature Analysis", Technical Report, School of Engineering, Division of Electrical Engineering, University of Manchester, UK, May 1993.
[10] Mahalanobis, P.C. "On the Generalized Distance in Statistics", Proc. National Inst. of Science of India, Vol. II, No. 1, April 1936, pp. 49-55.
[11] Smith, R.C. and Cheesman, P. "On the Representation and Estimation of Spatial Uncertainty", Int. Journal of Robotics Research, Vol. 5, No. 4, Winter 1986, pp. 56-68.
CONTEXT DRIVEN MATCHING IN STRUCTURAL PATTERN RECOGNITION

S. Gautama and J.P.F. D'Haeyer
Vakgroep Telecommunicatie en Informatieverwerking, Universiteit Gent

ABSTRACT
In this paper we examine the problem of structural pattern recognition using graph structures. To speed up the correspondence problem, we propose a histogram technique which characterizes the context of a primitive within a pattern and allows indexing in the model database with polynomial complexity.

INTRODUCTION
The current research on image understanding in real-world applications is dominated by knowledge-based systems, where knowledge from low-level image processing procedures to high-level image interpretation is gathered and programmed into an expert system [1,2]. The disadvantage of such a system is that it becomes highly application dependent, making the redesign of an existing system for a new application impractical. In environments where expert knowledge is hard to formalize, or where the size of the problem can benefit from automation, efficient use could be made of automated learning tools during the design. In this paper we examine a representation, generated using basic primitives, and an iterative matching technique which can make efficient use of this representation to guide the recognition process. As it applies to probabilistic graph structures, the method serves as a basis for the incorporation of incremental learning. The object models and the scene that needs to be interpreted are described by primitives and relationships between these primitives. They are mathematically represented by a (probabilistic) hypergraph structure in which n-ary relations are represented by a hyperedge connecting the n primitives in the argument list. These hyperedges, encoding topological and geometrical relations, contain important information that is needed to constrain the large space of possible mappings between primitives.
To restrict the number of relations that are generated, a neighbourhood system is imposed on the scene. In this way only relations between a primitive and its nearest neighbours are allowed, reducing the size of the scene hypergraph. Within a neighbourhood, relations are measured, after which they are passed through a quantiser, generating a discrete set of 'relation labels'. Thus, after preprocessing the scene, each scene hyperedge carries a single label. No use is made of unary measurements on the primitives, other than the midposition used to determine the neighbourhood. Object models are induced from object instances which are generalised into a probabilistic hypergraph. Model hyperedges contain a probability distribution of labels, to capture the variability in shape of the object instances. The model primitives contain the mean midposition of the corresponding instance primitives, defining the neighbourhood system over the set of model primitives. To solve the correspondence problem, the notion of a context histogram is introduced [3]. This histogram, calculated for each primitive, gathers the occurrence frequencies of the quantised relation labels in the support set of a target primitive. The support set is the set of relations (hyperedges) that contain the target primitive in their argument list. The characterisation by means of the support set bears resemblance to the Q-coefficient used in probabilistic relaxation techniques [4]. A histogram, however, while increasing the memory requirements, does allow a more detailed characterisation than a single coefficient, meaning more complex similarity measures can be used.
CONTEXT DRIVEN MATCHING
In this section, definitions and mathematics are introduced that form the base of the recognition process. Attributed hypergraphs are used as the representation for higher-order structural patterns. An attributed hypergraph I, defined on a set of labels Λ, consists of two parts: 1) H, which denotes the structure of hyperedges, and 2) A: H → Λ, which describes the attribute values of the hyperedge set. A hyperedge of order v with index k is denoted as I_k^v. Primitives in the hypergraph correspond to hyperedges of order 0 and are denoted by I_k, dropping the superscript to ease the notation.
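The attributed first-order hypergraph and its support sets might be organised as in the following sketch. The class and method names (`AttributedHypergraph`, `add_relation`, `support`) are illustrative choices, not names from the paper.

```python
from collections import defaultdict

class AttributedHypergraph:
    """Primitives are hyperedges of order 0; an order-1 hyperedge links two
    primitives and carries a discrete relation label."""

    def __init__(self):
        self.primitives = set()
        self.edges = {}                   # (i, j) -> relation label
        self._support = defaultdict(set)  # primitive -> hyperedges containing it

    def add_primitive(self, k):
        self.primitives.add(k)

    def add_relation(self, i, j, label):
        """Add an order-1 hyperedge (i, j) with attribute `label`."""
        self.primitives.update((i, j))
        self.edges[(i, j)] = label
        self._support[i].add((i, j))
        self._support[j].add((i, j))

    def support(self, k):
        """S(I_k): the hyperedges whose argument list contains primitive k."""
        return self._support[k]

g = AttributedHypergraph()
g.add_relation('a', 'b', 3)
g.add_relation('a', 'c', 5)
print(len(g.support('a')))  # → 2
```

Keeping the support sets indexed as relations are added makes the later context-histogram computation a single pass over `support(k)` rather than a scan of all hyperedges.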
A random hypergraph M represents a random family of attributed hypergraphs, thereby serving as a model description which captures the variability present within a class. It consists of two parts: 1) H, which denotes its structure, and 2) P: H × Λ → [0,1], which describes the random elements. Associated with each possible outcome I of M and graph correspondence T: I → M there is a probability P(I ≺ M_T) of I being an instance of M through T. Correspondence between a scene primitive I_k and a model primitive M_Tk proceeds by comparing the support sets of both primitives. The support set S of a primitive I_k is defined as the set of hyperedges that contain I_k in their argument list: S(I_k) = { I^v : I_k ∈ Arg(I^v) }, where Arg(I^v) denotes the argument list of the hyperedge I^v. Built over the support set is the context histogram, which is used to characterize scene and model primitives. For a scene primitive I_k and label α, the context histogram gathers the occurrence frequency of the label α in the support set of I_k and is defined as:

C(I_k, α) = |{ I^v ∈ S(I_k) : Λ(I^v) = α }| / |S(I_k)|
The denominator normalises the total mass of the context histogram to unity. Calculated on a random hypergraph, a context histogram is defined as containing the expected occurrence frequencies of the labels, modified by a hedge function F which encodes prior knowledge of the correspondence between scene and model primitive:

C(I_k ≺ M_Tk, α) = Σ_{M¹ ∈ S(M_Tk)} P(Λ(M¹) = α) · F(I_k ≺ M_Tk, α, M¹) / |S(M_Tk)|
The hedge function weights the contribution of each hyperedge within the support set of the model primitive, by taking into account the support that the primitives in the argument list of the hyperedge receive. This is modelled after the Q-coefficient in probabilistic relaxation. For binary relations this coefficient is expressed as:

Q(I_k ≺ M_Tk) = Π_{I¹_{k,l} ∈ S(I_k)} Σ_{M¹_{Tk,Tl} ∈ S(M_Tk)} P(Λ(M¹_{k,l}) = Λ(I¹_{k,l})) · P(I_l ≺ M_Tl)
where the subscripts in the first order hyperedge I¹_{k,l} denote its arguments. This function can be viewed as calculating the probability of occurrence of the context vector { Λ(I¹_{k,l}) : I¹_{k,l} ∈ S(I_k) } in the support set of the model segment M_Tk, where the scene graph is taken as an AND-graph, while the model graph is taken as an OR-graph with independent and mutually exclusive primitives. Each occurrence probability P(Λ(M¹_{k,l}) = Λ(I¹_{k,l})) is additionally weighted with the support of its argument P(I_l ≺ M_Tl). For first order hypergraphs, the hedge function F is taken as:
F(I_k ≺ M_Tk, α, M¹_{Tk,Tl}) = max_{I¹_{k,l} ∈ S(I_k)} P(I_l ≺ M_Tl)
Similarity between a scene primitive I_k and a model primitive M_Tk is defined as:

S(I_k, M_Tk) = Σ_α min( C(I_k, α), C(I_k ≺ M_Tk, α) )

which can be used again as a prior estimate, thereby establishing an iterative recognition scheme. Figure 1 illustrates the basic elements of the process.
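A minimal sketch of the scene-side context histogram and the histogram-intersection similarity defined above. The helper names are hypothetical, and the model-side histogram with the hedge function is omitted; a plain label list stands in for the support set.

```python
from collections import Counter

def context_histogram(support_labels):
    """Normalised occurrence frequencies of relation labels over the
    support set of a primitive: C(I_k, a)."""
    n = len(support_labels)
    counts = Counter(support_labels)
    return {label: c / n for label, c in counts.items()}

def similarity(scene_hist, model_hist):
    """Histogram intersection: S = sum_a min(C_scene(a), C_model(a))."""
    labels = set(scene_hist) | set(model_hist)
    return sum(min(scene_hist.get(a, 0.0), model_hist.get(a, 0.0))
               for a in labels)

h1 = context_histogram([0, 0, 1, 2])   # {0: 0.5, 1: 0.25, 2: 0.25}
h2 = context_histogram([0, 1, 1, 3])
print(round(similarity(h1, h2), 3))    # → 0.5
```

Since both histograms have unit mass, the intersection lies in [0, 1] and equals 1 only for identical label distributions, which is what makes it usable directly as a match probability in the iterative scheme.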
Figure 1 Illustration of the construction of the context histogram for scene and model primitive
To illustrate the technique, we examine the recognition of crossroad structures within a digital image. Fig. 2a presents part of the city of Ghent, generated using CorelDraw, after which it has been segmented into line segments. The line segments form the basic primitives of the representation. Binary relations are generated using the relative angle between line segments, resulting in a first order hypergraph. The neighbourhood of a primitive is set to a region within radius 25 (image size 512x512) and the quantisation level is set to 8, resulting in 8 discrete relation labels. No use is made of unary measurements to characterize a line segment, as the matching process relies solely on the information offered by the angle relations. The model is an extract from the scene which has to be localized. Model and scene graph representations are generated independently of each other. Recognition is more complex than a 2-class problem (i.e. structure and scene noise), since each primitive within a structure needs to be correctly mapped onto the corresponding primitive. After two iterations, placing a threshold of 50% on the match probability to suppress scene noise and retaining the best match, the results are summarised in table 1 (exp. 1). The scene primitives that pass the threshold (i.e. that are recognized as being model primitives) are highlighted in fig. 2a. Fig. 2b shows the unthresholded correspondence map after two iterations, which presents the match probabilities of the scene primitives (horizontal axis) onto the model primitives (vertical axis).
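The relation quantiser and neighbourhood test used in this experiment might look as follows. The function names are hypothetical, and folding the relative angle into [0, π/2] before binning is our assumption about how the orientation difference between undirected segments is measured.

```python
import math

def relation_label(angle1, angle2, levels=8):
    """Quantise the relative angle between two line segments (orientations
    in radians, taken modulo pi) into `levels` discrete relation labels."""
    rel = abs(angle1 - angle2) % math.pi   # orientation difference
    rel = min(rel, math.pi - rel)          # fold into [0, pi/2]
    bin_width = (math.pi / 2) / levels
    return min(int(rel / bin_width), levels - 1)

def in_neighbourhood(mid1, mid2, radius=25.0):
    """Two segments interact only if their midpoints lie within `radius`
    (radius 25 on a 512x512 image in the experiment above)."""
    return math.dist(mid1, mid2) <= radius

print(relation_label(0.0, math.pi / 2))  # → 7 (perpendicular segments)
print(relation_label(0.2, 0.25))         # → 0 (nearly parallel segments)
```

Using only relative angles makes the labels invariant to translation, rotation and scale of the structure, which is why no unary measurements are needed.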
Figure 3 (a) scene with recognized primitives highlighted, (b) correspondence map, (c) model 1, (d) model 2, (e) original image
Fig. 3e presents a digitized image of a roadmap of Bologna. The result after initial segmentation into line segments is shown in fig. 3a, containing 214 segments. Two structures need to be identified in the image, fig. 3b and 3c, containing respectively 24 and 20 segments. The same conditions hold as for experiment 1 and the results are summarised in table 1 (exp. 2).

                       Experiment 1        Experiment 2
                       Total     %         Total     %
model segments:
  correct mapping        27     62.8         30     68.2
  wrong mapping           1      2.3          1      2.3
  missing segments       15     34.9         13     29.5
scene noise:
  false alarm             1      0.5          0      0
  suppressed noise      204     99.5        213    100

Table 1: Summary results of structural matching
CONCLUSIONS
We have presented a new iterative matching technique based on a histogram of structural context information. Experiments show a good noise suppressing ability while retaining adequate recognition results with minimal false alarms. Since scene primitives are structurally mapped onto the model, orientation and scale can be hypothesised from a match, whereas the model can be used to direct a search for missing information, thereby improving or rejecting the match. This will be the subject of further work.

[1] V. Hwang, L. Davis, T. Matsuyama, "Hypothesis integration in image understanding systems," Computer Vision, Graphics and Image Processing, CVGIP 36, 1986, pp. 321-371.
[2] J. Van Cleynenbreugel, "Tapping multiple knowledge sources to delineate road networks on high resolution satellite images," Master's thesis, KUL, 1992.
[3] S. Gautama, J.P.F. D'Haeyer, "Automatic induction of relational models," in Hybrid Image and Signal Processing V, Proc. SPIE vol. 2751, 1996, pp. 253-263.
[4] W.J. Christmas, J. Kittler, M. Petrou, "Structural matching in computer vision using probabilistic relaxation," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 17, No. 8, 1995, pp. 749-764.
An Efficient Box-Counting Fractal Dimension Approach for Experimental Image Variation Characterization

Aura Conci 1,2 and Claudenize F. J. Campos 2
1 Comp. Apl. e Automação - CAA - Pós-Grad. Eng. Mecânica - PGMEC - UFF, CEP 24210-240, Niterói, RJ, Brazil - [email protected]
2 Dep. Eng. Mecânica - Pontifícia Universidade Católica do Rio de Janeiro - PUC-Rio, CEP 22953-900, Rio de Janeiro, RJ, Brazil - [email protected]
Abstract
Many applications of fractal concepts rely on the ability to estimate the fractal dimension (FD) of objects. FD is an attempt to quantify how densely a fractal occupies the space in which it lies. This characteristic has been used in texture classification, segmentation and other problems. An efficient algorithm to estimate the FD of images is proposed in this paper. We suggest its use to identify on-line image deviation from a standard pattern. We report on some experiments on textile failings and a comparison with four other methods.

Introduction
The FD of a set A in Euclidean n-space can be derived from

1 = N_r · r^FD, or FD = log(N_r) / log(1/r)    (1)

where N_r is the number of non-overlapping copies of A, scaled down by a ratio r, whose union covers A. However, it is difficult to compute FD using these equations directly. Peleg et al. [1] extended FD to images, which can be viewed as a terrain surface whose height is proportional to the image gray value (figure 1). The reticular cell counting estimator has been proposed by Gangepain and Roques-Carmes [2], but this estimator cannot be used when the range of the actual FD of an image is 2.0-2.5. Keller et al. [3] proposed an approach which presents satisfactory results up to FD=2.75. Pentland [4] suggested a method of estimating FD by using the Fourier power spectrum of the image intensity surface; such a method gives satisfactory results but, since Fourier transformation computation is included, it is slower than the others. Sarkar and Chaudhuri [5] described an efficient approach, named Differential Box-Counting (DBC), that uses differences in computing N_r and gives satisfactory results over the whole range of FD. N_r in the DBC method is counted in a different manner from the other box-counting methods [6]. Consider that an image of MxM pixels has been partitioned into grids of sxs pixels and scaled down to r = s/M. If G is the total number of gray levels then G/s' = M/s. On each grid there is a column of boxes of size sxsxs'.
Assign numbers 1, 2, ..., n to the boxes as shown in figure 1. If the minimum gray level of the image in the (i,j)th grid falls in box number k, and the maximum gray level falls in box number l, then in the DBC approach

n_r(i,j) = l − k + 1    (2)

is the contribution of n_r from the grid (i,j). Taking contributions from all grids,

N_r = Σ_(i,j) n_r(i,j)    (3)

Then the FD can be estimated from the least squares linear fit of log(N_r) versus log(1/r), where N_r is counted for different values of r and s.

DBC Modification
Although the DBC method gives a very good estimate of FD, some simplifications in computation and improvements in efficiency are possible using the following modifications of the original method. If a set A ⊂ R³ is covered by just-touching boxes of side length (1/2)^n, equation (1) can be rewritten as

FD = lim_(n→∞) log(N_n) / log(2^n)    (4)
where N_n denotes the number of boxes of side length (1/2)^n which intersect the set A [6]. In our proposed algorithm, the division of the image into boxes of different lengths is processed in a new manner compared with other box counting variations [6] and the original DBC method [5]. Consider an image of size MxM pixels; we take M to be a power of 2 and take the range of light intensity to be integers from 0 to 255. All images are enclosed in a big box of size MxMx256. We consider the image divided into boxes of side length nxnxn' for n = 2, 4, 8, ..., 2^m and n' = 2, 4, 8, ..., 2^m' for each image subdivision. N_n is counted as

N_n = Σ_(i,j) n_n(i,j),  n_n(i,j) = int(Gray_max(i,j)/n') − int(Gray_min(i,j)/n') + 1    (5)

where int(..) is the integer part of a division. These changes make the implementation faster and simpler than the original DBC algorithm. The image file is read only once: in the first division of the image into boxes, the bitmap of MxM pixels need not be saved; when the image is read we save only two matrices of M/2xM/2, Gray_max and Gray_min (saving MxM/2). This first calculation of n_n using equation (5) corresponds to dividing the image into boxes of 2x2 pixels. For boxes of 4x4 pixels there will be M/4xM/4 elements in Gray_max and Gray_min, and each new element (i_new, j_new) is obtained by consulting only the four elements (i,j), (i+1,j), (i,j+1) and (i+1,j+1) of the Gray_max and Gray_min matrices. If the algorithm begins at i=j=0, in each new iteration the Gray_max and Gray_min matrix elements (i_new=i/2 and j_new=j/2), for the next division of the image, can be saved in the same space. Then using (4) we estimate FD from the mean of log(N_n)/log(2^n). It can easily be shown [5] that the computational complexity of other approaches, including the original DBC, is much higher than with the above suggested modifications.
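A sketch of the modified DBC estimator described above, assuming an MxM grayscale array with M a power of two and 256 gray levels. The function name and the use of NumPy are our own; the paper's in-place storage trick is replaced here by successive 2x2 max/min pyramids, which reuse the previous level's matrices with the same effect.

```python
import numpy as np

def fd_dbc_modified(image):
    """Modified differential box-counting estimate of fractal dimension.
    `image` is an MxM uint8 array with M a power of two."""
    g_max = image.astype(float)
    g_min = image.astype(float)
    M = image.shape[0]
    log_n, log_scale = [], []
    n = 2
    while n <= M // 2:
        # collapse 2x2 blocks: each level reuses the previous max/min level
        g_max = np.maximum.reduce([g_max[0::2, 0::2], g_max[0::2, 1::2],
                                   g_max[1::2, 0::2], g_max[1::2, 1::2]])
        g_min = np.minimum.reduce([g_min[0::2, 0::2], g_min[0::2, 1::2],
                                   g_min[1::2, 0::2], g_min[1::2, 1::2]])
        n_prime = max(256 * n // M, 1)        # box height so that G/n' = M/n
        nn = (g_max // n_prime).astype(int) - (g_min // n_prime).astype(int) + 1
        log_n.append(np.log(nn.sum()))        # log N_n over all grids
        log_scale.append(np.log(M / n))       # log(1/r) with r = n/M
        n *= 2
    # least-squares slope of log(N_n) against log(1/r) gives the FD
    slope, _ = np.polyfit(log_scale, log_n, 1)
    return slope
```

As a sanity check, a constant image yields nn = 1 in every grid, so N_n = (M/n)^2 and the fitted slope is exactly 2, the dimension of a flat surface.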
Experiments
The proposed DBC modifications have been used in several experiments. Our first goal was to examine the accuracy of the approach for FD estimation, so at first we used only images having known fractal dimensions. Test data for this first group of experiments came from synthetic textures (9 Brownian images and 9 Takagi surfaces generated on a 256×256 grid with 256 gray levels and FD varying from 2.1 to 2.9 in steps of 0.1; not reported here) [7]. For experiments with natural images we took Brodatz's textures (figure 2 and table 1). The fact that our modifications return accurate values on images with known dimension motivates questions on the possibility of using FD to identify variations in images. The remainder of our experiments investigate this possibility. Experiments on estimating changes in images are shown in figures 3 and 4. These figures represent fabrics with different patterns and kinds of defects. As can be seen, slight changes in the image modify the FD computed for each image.
Figure 1 - Determination of nr or nn.
Table 1 - FD of natural textures (image numbers correspond to Brodatz's book)

  image   DBC modif.   DBC [5]   Pentland [4]   Peleg et al. [1]   Keller et al. [3]
  D04     2.66         2.66      2.55           2.72               2.68
  D05     2.45         2.45      2.38           2.52               2.57
  D09     2.58         2.58      2.49           2.65               2.65
  D24     2.44         2.44      2.46           2.59               2.57
  D28     2.55         2.55      2.48           2.61               2.62
  D55     2.48         2.48      2.37           2.60               2.59
  D68     2.53         2.53      2.44           2.63               2.60
  D84     2.61         2.61      2.47           2.68               2.65
  D92     2.50         2.50      2.38           2.59               2.59
Figure 2 - Brodatz's natural textures (FD in table 1)
DF=2.60 (top) and 2.62 (bottom); DF=2.51 (top) and 2.53 (bottom); DF=2.59 (top) and 2.57 (bottom). Figure 3 - Usual textile imperfections on drill (left), cotton (centre) and carpet (right).
Figure 4 - Jeans without defects, FD=2.43 (top left); stained jeans, FD=2.41 (top centre); and non-uniform dye, FD=2.47 (top right). Silk without imperfections (bottom left), FD=2.18; with a single imperfection (bottom centre), FD=2.09; and with many imperfections (bottom right), FD=2.40.

Conclusions
The main goal of this paper is to present a simple approach to computing FD on images. Elementary experiments demonstrated that variations between an original image pattern and its reproductions can affect the respective FD. A statistical analysis, related to each specific class of images, should be carried out in order to assess the applicability of FD as a means of finding imperfections in practical settings. The encouraging conclusion is that this approach is faster and simpler than the usual ones, and it can readily be extended to 3-D images as well.

References
[1] S. Peleg, J. Naor, R. Hartley and D. Avnir, "Multiple resolution texture analysis and classification", IEEE Trans. Pattern Anal. Machine Intell., Vol. 6, 1984, pp. 518-523.
[2] J. Gagnepain and C. Roques-Carmes, "Fractal approach to two dimensional and three dimensional surface roughness", Wear, Vol. 109, 1986, pp. 119-126.
[3] J. Keller, R. Crownover and S. Chen, "Texture description and segmentation through fractal geometry", Computer Vision Graphics Image Processing, Vol. 45, 1989, pp. 150-160.
[4] A. P. Pentland, "Fractal based description of natural scenes", IEEE Trans. Pattern Anal. Machine Intell., Vol. 6, No. 6, 1984, pp. 661-674.
[5] N. Sarkar and B. B. Chaudhuri, "An efficient differential box-counting approach to compute fractal dimension of image", IEEE Trans. Syst. Man Cybern., Vol. 24, No. 1, 1994, pp. 115-120.
[6] M. F. Barnsley, Fractals Everywhere, Academic Press, 1988.
[7] Q. Huang, J. R. Lorch and R. C. Dubes, "Can the fractal dimension of images be measured?", Pattern Recognition, Vol. 27, No. 3, March 1994, pp. 339-350.
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
An Identification Tool to Build Physical Models for Virtual Reality

Jean LOUCHET*†, Li JIANG‡

* ENSTA, Laboratoire d'Electronique et d'Informatique, 32 boulevard Victor, 75739 PARIS 15, France
† INRIA, projet SYNTIM, BP 153, Rocquencourt, 78153 LE CHESNAY, France
‡ LAFORIA, Université Pierre et Marie Curie, tour 46-00, 1 Place Jussieu, 75005 PARIS, France
e-mail: [email protected]
Keywords
Computer vision, motion analysis, motion modelling, image animation, artificial evolution, multimedia, virtual reality.

Virtual reality applications including visual and gestural man-machine interaction require the use of particle-based deformable and articulated models. However, relevant model-building methods are lacking to ensure realistic behaviours based on observations of real-world objects. To this end, our research aims at finding reliable physical model building mechanisms which use experimental kinematic data of real objects as input. Building a physics-based model of an object thus divides into two main steps:
• The first step is the capture of kinematic data. It may consist of extracting characteristic point trajectories from the images of a real scene, involving image sequence analysis techniques, or other ad-hoc experimental means [6].
• The second step consists of using these kinematic data in a specific physical model implementation scheme.
This paper focuses on the second step. We propose an original method to automatically identify physical models built using local masses and generalised springs.
1 The Physical Model.

The general principles of the physical models we developed are presented in [4]. Particles are described by their masses, positions and speeds. Motion results from forces applied by internal bonds or the environment (gravitation...). Four bond types are considered:
• unary bonds, between a particle and the medium (e.g. viscosity, gravitation...);
• binary bonds, between particle pairs;
• ternary ("flexion") bonds, between three particles;
• quaternary ("torsion") bonds, between four particles.
Each bond type may consist of two components:
• the first component is defined as the gradient of an energy potential which depends on the relative positions of the particles involved. Forces applied to particles derive from this potential. This allows an easy (if tedious) transposition to ternary and quaternary bonds of elementary mechanical conservation principles about resulting torques and forces [9].
• the second component generates damping forces, which depend on the particles' mutual positions and speeds.
2 An Evolutionary Identification Strategy.

The task of the identification algorithm is to find the bonds' parameters from given kinematic data. The basic idea is to consider identification as an optimisation problem. To this end, a cost function evaluates the quality of candidate physical models (i.e. object description files). This cost function measures a generalised distance between the trajectory predicted by the model to be evaluated and the real given trajectory. The problem consists in finding, among all possible models, the one with the lowest cost. We observed that conventional numerical optimisation techniques are unsuccessful on these cost functions, probably because the functions lack the desirable mathematical regularity properties. Even a stochastic technique like simulated annealing becomes extremely slow, and in practice cannot cope with objects containing more than about half a dozen particles. This is why we developed an "Evolutionary Strategy", a stochastic optimisation technique the principles of which are inspired (like
genetic algorithms) by biological evolution. It consists of creating a random initial population of models and letting it evolve naturally through the mechanisms of selection, crossover and mutation, under the control of the cost function values of the population's individuals. More generally, evolutionary strategies are known for their robustness and outstanding ability to optimise functions with multiple local minima. However, they are often slow, difficult to tune, and may benefit from problem-specific implementations. We found that a conventional evolutionary scheme succeeds in identifying small particle systems (i.e. objects with simple behaviours), but loses efficiency and precision with increasing particle numbers. Therefore we introduced some novel characteristics into the evolutionary algorithm. Our philosophy is to exploit the problem's specific semantics as much as possible, but also to make sure that no a priori, implicit information on the parameters is given to the algorithm. First, we improved the robustness of the identification algorithm by designing a cost function which calculates short-term (rather than long-term) trajectory differences: the cost function is the quadratic sum of differences between the real reference trajectory and the points extrapolated from the preceding time step. This has important consequences on the shape of the cost function. Second, we exploited a topological property: the position of a particle at a given time only depends on the recent history of its neighbourhood. This led us to split the cost function into the sum of each particle's contribution. We call these contributions, attached to individual particles, the "local cost functions". The local cost function of a particle is defined as the temporal sum of the prediction errors concerning its coordinates. These local cost functions play a key role in the identification algorithm.
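The per-particle decomposition of the short-term cost can be sketched as follows; `model_step` (a one-step predictor fed with the true state at time t) and the data layout are illustrative assumptions, not the authors' interface:

```python
def local_costs(model_step, reference):
    """Short-term prediction cost, split per particle ("local cost functions").

    `reference[t][i]` is the observed position (a coordinate tuple) of
    particle i at time t; `model_step(positions, t)` predicts all positions
    at t+1 from the true state at t.  Each particle accumulates the squared
    error on its own coordinates; the global cost is simply sum(costs).
    """
    n = len(reference[0])
    costs = [0.0] * n
    for t in range(len(reference) - 1):
        predicted = model_step(reference[t], t)
        for i in range(n):
            costs[i] += sum((p - r) ** 2
                            for p, r in zip(predicted[i], reference[t + 1][i]))
    return costs
```

Because every prediction restarts from the true state, a bad estimate in one region of the object cannot pollute the cost attributed to particles elsewhere.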
Indeed, let us remark that the position of a particle at time step (t + 1) only depends on its own position and speed and on its neighbours' positions at time step t (the neighbours of a given particle are defined as the particles which share a bond with it). Therefore, the local cost functions corresponding to the extremities of a bond are more relevant than the global cost function for evaluating the estimation quality of this bond. We use these local cost functions widely in the crossover and mutation processes of the evolutionary identification scheme. The guidance of the evolutionary mechanisms by local cost functions means that the algorithm converges on each region of the object independently of its convergence on other regions: the local cost functions are not influenced by remote regions as the global cost function would be. The fundamental consequence is that the number of generations (calculation steps) required for convergence no longer depends on the number of unknowns or the object's complexity. Third, we provided the algorithm with a self-adaptive behaviour [2]: the variance of mutations, which is an important internal parameter of the algorithm, is controlled by the input data. To this end, besides each (mechanical) parameter in an individual's code, we introduced an extra "mutability" parameter, which controls the standard deviation of the mutations which can be applied to the corresponding mechanical parameter. The mutability parameters have no direct effect on the behaviour or the cost of the individuals, but as part of the genetic code they are subject to the same evolutionary mechanisms (mutations, crossover) as the mechanical parameters. The mutabilities thus evolve and stay fitted to the algorithm's needs.
Thus, the balance of robustness and precision, which is often a difficult point in evolutionary algorithms, remains under the control of the input data and cost function values: this results in a much better precision of the parameter estimates at the end of the algorithm. An apparent side effect of this self-adaptivity is to double the representation size of an individual, but this has no real consequence on execution speed, as the calculation of cost function values, which is the dominant part, is unchanged.
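The self-adaptive mutation described above can be sketched as follows; the log-normal step-size update is a standard evolution-strategy choice [2], and all names and the value of `tau` are illustrative, not the authors' exact operator:

```python
import math
import random

def mutate(individual, tau=0.3):
    """Self-adaptive mutation: each mechanical parameter carries its own
    'mutability' (mutation standard deviation).  Mutabilities are mutated
    first (log-normally, so they stay positive), then applied to the
    parameters -- thus the step sizes themselves evolve with the population.
    `individual` is a (params, mutabilities) pair of equal-length lists.
    """
    params, sigmas = individual
    new_sigmas = [s * math.exp(tau * random.gauss(0.0, 1.0)) for s in sigmas]
    new_params = [p + random.gauss(0.0, s)
                  for p, s in zip(params, new_sigmas)]
    return new_params, new_sigmas
```

Selection acts only on the cost of the mechanical parameters, yet the mutabilities hitch-hike with successful individuals, which is what keeps them fitted to the search's current needs.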
3 Some application examples

3.1 Identification of the general linear model and noise resistance

In order to test the identification algorithm's performance, we chose an experimental protocol which consists of:
• building a catalogue of bonds;
• building an object by defining masses and positioning instantiations of the bonds between these masses;
• calculating a trajectory of the object, from arbitrary initial conditions.
Then, we 'forget' the object parameters' values and use the evolutionary strategy described above in order to recover these values from the given kinematic data. The curves below show typical convergence results, on an elastic object consisting of 15 masses, 10 different linear bond types (20 parameters) and 31 installed bonds. The identification algorithm uses 100 consecutive images and a population of 100 individuals. In order to test the algorithm's robustness, we added Gaussian noise with different standard deviations to the given kinematic data.
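The identification loop used in this protocol can be sketched as a minimal evolutionary strategy; this is illustrative only (the authors' algorithm adds crossover, local cost functions and self-adaptive mutabilities), and all names and tuning constants are our choices:

```python
import random

def evolve(cost, init, generations=200, pop_size=30, sigma=0.1, seed=0):
    """Minimal elitist evolutionary loop: keep the best half of the
    population each generation and refill with mutated copies of random
    survivors.  `cost` maps a parameter list to a scalar; `init` is a
    starting guess for the parameter vector.
    """
    rng = random.Random(seed)
    pop = [[g + rng.gauss(0.0, sigma) for g in init] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)                      # selection by cost
        parents = pop[:pop_size // 2]           # survivors kept unchanged
        pop = parents + [[g + rng.gauss(0.0, sigma)
                          for g in rng.choice(parents)]
                         for _ in range(pop_size - len(parents))]
    return min(pop, key=cost)
```

Keeping the parents unchanged (elitism) guarantees that the best cost found never worsens from one generation to the next.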
Log(precision of parameter estimation) as a function of the number of generations, with several noise levels on the trajectories. The thick line represents the algorithm's convergence without noise.

Tests show that convergence is very good after 500 to 1000 generations, independently of the number of parameters to be identified. In the typical convergence results shown above, the average parameter estimation error is lower than 0.01% for the three lowest noise levels shown.

3.2 Turbulent fluids

In the general framework presented above, the objects have a fixed structure of masses and bonds. However, particle-based models have long been used to simulate fluid objects like clouds or smoke. The authors of [8] use the Cordis-Anima approach [9] to model turbulent fluid flows, using point masses and conditional bonds between particles. Each of these bonds becomes active whenever the distance between its extremity particles goes under a given threshold. The authors obtained visually convincing simulations, using several particle and conditional bond classes. Our aim is again to examine how our identification algorithm can be used to induce the internal characteristics of such a turbulent fluid model from given kinematics. We implemented a similar viscous fluid model (see images below), using simultaneously several types of conservative bonds ("springs"), which depend on the relative positions:

  if (distance < threshold_s) then (force = f_s(x_2 - x_1))

and energy-dissipative bonds ("dampers"), which depend on the relative speeds:

  if (distance < threshold_a) then (force = f_a(ẋ_2 - ẋ_1))

where f_s and f_a are non-linear functions.
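A conditional bond of this kind can be evaluated, in one dimension and with linear f_s and f_a for brevity (the paper allows non-linear functions), roughly as follows; all names are illustrative:

```python
def bond_force(x1, x2, v1, v2, k_s, k_a, threshold_s, threshold_a):
    """Force exerted on particle 1 by a conditional spring + damper bond
    (1-D sketch).  Each component is active only while the inter-particle
    distance is under its own threshold, as in the turbulent fluid model.
    """
    distance = abs(x2 - x1)
    force = 0.0
    if distance < threshold_s:       # conservative ("spring") component
        force += k_s * (x2 - x1)
    if distance < threshold_a:       # dissipative ("damper") component
        force += k_a * (v2 - v1)
    return force
```

Because bonds switch on and off with distance, the set of active interactions changes from frame to frame, which is what lets a fixed catalogue of bond types represent a fluid.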
Two 2-dimensional images of a jet penetrating into a fluid (frames nos. 300 and 400 from a sequence). The main difference with the general model is that the number of bonds (and therefore the computational load of the cost function) is very high compared to standard flexible objects (about 80000 in the example above), even if not all
are activated at the same time due to the distance threshold. After 1000 generations, the algorithm yields very good estimates of the initial parameters:

  bond parameter                                               initial parameter   parameter's estimate
  viscosity coefficient 1 (between matrix particles)           2.5                 2.499
  viscosity coefficient 2 (non-linearity)                      1.25                1.248
  distance threshold for viscosity activation                  1.5                 1.500
  viscosity coefficient 1 (between matrix and jet particles)   4                   3.962
  viscosity coefficient 2 (non-linearity)                      2                   2.052
  distance threshold for viscosity activation                  2                   2.000
  elasticity coefficient                                       2                   2.000
  distance threshold for elastic bond activation               1.1                 1.100
3.3 Other object types and future extensions

The same identification procedure has been successfully tested on a cloth animation model [7] with a similar experimental protocol, and allowed a realistic reconstruction of a cloth image sequence. Here, the object is periodic and all bonds include a non-linearity factor through the introduction of an "elongation rate". The next step of our research project will consist of validating our approach using real-world kinematic data, coming from processing real images of articulated solid objects [5], fabrics [7] and turbulent fluid flows [8].

4 Conclusion

Particle-based physical models appear to be a promising general framework for building realistic and efficient models for simulation and animated image synthesis. They are being used increasingly to model smoke [10, 12], elastic or articulated bodies [1], fabrics [13]..., but they need to be associated with identification methods both to deserve the "physical model" qualification and to ensure their behavioural realism. The evolutionary parameter identification technique proposed in this paper has proven successful in reconstructing model internal parameters from their kinematic outputs, including non-linear, conditional, elastic and viscous bonds in periodic or non-periodic objects, and thus provides the particle-based modelling tool with the 'measuring instrument' which should always be associated with a physical model.

References

[1] W. W. Armstrong, M. W. Green, "The dynamics of articulated rigid bodies for purposes of animation", The Visual Computer, Vol. 1, pp. 231-240, 1985.
[2] T. Bäck, "Evolution Strategies: an alternative evolutionary algorithm", Artificial Evolution 95, Brest, September 1995, Springer 1996.
[3] D. E. Goldberg, "Genetic Algorithms in Search, Optimization and Machine Learning", Addison-Wesley, 1989.
[4] J. Louchet, "An Evolutionary Algorithm for Physical Motion Analysis", British Machine Vision Conference, York, September 1994.
[5] J. Louchet, M. Boccara, "Detecting rotating regions in image sequences", Image'Com 96, Bordeaux, May 1996.
[6] J. Louchet, M. Boccara, D. Crochemore, X. Provot, "Building new tools for Synthetic Image Animation by using Evolutionary Techniques", Artificial Evolution 95, Brest, September 1995, Springer 1996.
[7] J. Louchet, X. Provot, D. Crochemore, "Evolutionary identification of cloth animation models", Eurographics Workshop on Animation and Simulation, Maastricht, September 1995.
[8] A. Luciani, A. Habibi, A. Vapillon, Y. Duroc, "A Physical Model of Turbulent Fluids", Eurographics Workshop on Animation and Simulation, pp. 16-29, Maastricht, September 1995.
[9] A. Luciani, S. Jimenez, J. L. Florens, C. Cadoz, O. Raoult, "Computational Physics: a Modeller Simulator for Animated Physical Objects", Proc. Eurographics Conference, Vienna, September 1991, Elsevier.
[10] W. T. Reeves, "Particle Systems - A Technique for Modelling a Class of Fuzzy Objects", Computer Graphics (Siggraph), Vol. 17, No. 3, pp. 359-376, July 1983.
[11] N. Szilas, C. Cadoz, "Physical Models That Learn", International Computer Music Conference, Tokyo, 1993.
[12] J. Stam, E. Fiume, "Turbulent wind fields for gaseous phenomena", ACM Computer Graphics (Siggraph 93), pp. 369-376, August 1993.
[13] D. Terzopoulos, J. Platt, A. Barr, K. Fleischer, "Elastically Deformable Models", Proc. Siggraph 87, Computer Graphics, Vol. 21, No. 4, pp. 205-214, 1987.
CUE-BASED CAMERA CALIBRATION AND ITS APPLICATION TO DIGITAL MOVING IMAGE PRODUCTION

Yuji NAKAZAWA, Takashi KOMATSU, and Takahiro SAITO

Department of Electrical Engineering, Kanagawa University, 3-27-1 Rokkakubashi, Kanagawa-ku, Yokohama, 221, JAPAN
Tel: +81-45-481-5661 Ext. 3119  Fax: +81-45-491-7915  Email: [email protected]

ABSTRACT
One of the keys to new-generation digital image production is to construct simple methods for estimating the camera's motion, position and orientation from a moving image sequence observed with a single TV camera. For that purpose, we present a method for camera calibration and estimation of focal length. The method utilizes four definite coplanar points, e.g. the four vertices of an A4-size paper, as a cue. Moreover, we apply the cue-based method to the digital image production task of making up a moving image sequence from a synthetic 3-D CG image sequence and a real moving image sequence taken with a TV camera. The cue-based method works well for the task.

1. INTRODUCTION - BACKGROUND AND MOTIVATION

Recently some research institutes have started studying digital production of a panoramic image sequence from an observed moving image sequence, construction of a virtual studio with 3-D CG technology and so on, with the intent to establish the concept and schema of the new-generation digital image production technology. Such an image production technology, utilizing information about the camera's motion, position, orientation and so on, integrates consecutive image frames to produce an enhanced image such as a high-resolution panorama, and/or makes up a moving image sequence from a synthetic 3-D CG image sequence and a real moving image sequence taken with a TV camera. The key to the new-generation digital image production is to develop simple methods for estimating the camera's motion, position and orientation from a real moving image sequence observed with a single TV camera [1].

In this paper, to render it feasible to perform such 3-D estimation of the camera's motion, position and orientation when we use a single handy TV camera whose camera parameters are not given in advance, we present a method for performing camera calibration along with accurate estimation of the focal length of the camera, by using four definite coplanar points, which usually correspond to four vertices of a certain quadrilateral plane object such as an A4-size paper, as a cue. The practical computational algorithms for the cue-based method of camera calibration are composed of simple linear algebraic operations and arithmetic operations, and hence they work well enough to provide accurate estimates of the camera's motion, position and orientation stably. Furthermore, in this paper, we apply the cue-based camera calibration method to the image production task of making up an image sequence from a synthetic 3-D CG image sequence and a real moving image sequence taken with a TV camera, according to the recovered estimates of the camera's motion, position and orientation.

2. CUE-BASED CAMERA CALIBRATION

In this paper, we assume the following situation: while moving the single TV camera by hand [2], we image a scene which includes not only the objects of interest but also four definite coplanar points P1 ~ P4, whose relative positions are known in advance and which usually correspond to four vertices of a certain quadrilateral plane object with known shape; these are used as a cue for camera calibration. We perform camera calibration, that is to say, determination of the camera's position and orientation at each image frame, from the 2-D spatial image coordinates of the four definite coplanar cue points, which are tracked temporally over consecutive image frames by our recently presented feature tracking algorithm [3]. Under such conditions, we perform camera calibration and estimate the focal length f of the camera at the same time.

2.1 Image Coordinate System
Here for each image frame we define the 3-D viewing coordinate system o'-x'y'z', which is associated with the 2-D image coordinate system O-XY as shown in figure 1. In figure 1, we represent the 3-D viewing coordinates and the 2-D image coordinates with (x' y' z') and (X Y) respectively. We represent the 3-D viewing coordinates of the four coplanar cue points P1 ~ P4 with { p_i' = (x_i' y_i' z_i')^t ; i = 1, 2, 3, 4 }, and we represent the 2-D image coordinates of the imaged coplanar cue points, perspectively projected onto the image plane, with { p_i = (X_i Y_i)^t ; i = 1, 2, 3, 4 }.

2.2 Camera Calibration

The problem of camera calibration is to recover the geometrical transformation of the 3-D world coordinates of an arbitrary point in the imaged scene into its corresponding 2-D image coordinates, from given multiple pairs of 3-D world coordinates and their corresponding 2-D image coordinates. The camera calibration problem is concisely formulated with the homogeneous coordinate systems. Given both the 4-D homogeneous world coordinates a = (x y z 1)^t of an arbitrary point in the imaged scene and its corresponding 3-D homogeneous image coordinates b = h·(X Y 1)^t, the foregoing transformation is represented as the linear transformation

  (x' y' z')^t = M·(x y z 1)^t = (m_1 m_2 m_3 m_4)·(x y z 1)^t = x·m_1 + y·m_2 + z·m_3 + m_4   (1)

  X = (x'/z')·f ,  Y = (y'/z')·f   (2)

where the focal length f is explicitly handled. Here the camera calibration problem is defined as the problem of recovering the 3×4 matrix M and the focal length f of equation 1 from given multiple pairs of homogeneous world coordinates and their corresponding homogeneous image coordinates. Equation 1 means that the 3-D viewing coordinates (x' y' z') are expressed as a linear combination of the three vectors { m_1 m_2 m_3 }, and hence we may regard the three vectors { m_1 m_2 m_3 } as the basis vectors of the 3-D world coordinate system o-xyz. On the other hand, the vector m_4 is the displacement vector shifting from the origin of the 3-D viewing coordinate system to that of the 3-D world coordinate system.

Here we imagine a plane quadrilateral whose four vertices are given by the four definite coplanar cue points, and we refer to this plane quadrilateral as the cue quadrilateral. As a common coordinate system to all image frames, we define the 3-D world coordinate system o-xyz whose x-y cross section contains the cue quadrilateral, that is to say, whose z-axis is normal to the cue quadrilateral. Moreover, without loss of generality, we put the origin of the 3-D world coordinate system o-xyz at one of the coplanar cue points, e.g. P1. In this case, we can represent the 3-D world coordinates of the four coplanar cue points P1 ~ P4 with

  p_1 = (x_1 y_1 z_1)^t = 0 ,  p_i = (x_i y_i z_i)^t = (x_i y_i 0)^t ; i = 2, 3, 4   (3)

Assuming that the focal length f of the camera has been accurately estimated in some way, which will be described in the next section, we can easily recover the 3×4 transformation matrix M of equation 1 from the four pairs of the 3-D world coordinates p_i = (x_i y_i 0)^t of each cue point P_i and its corresponding image coordinates p_i = (X_i Y_i)^t. Substituting the four coordinate pairs into equation 1, we reach the simultaneous equations

  (x_i' y_i' z_i')^t = N·(x_i y_i 1)^t ,  X_i = x_i'/z_i' ,  Y_i = y_i'/z_i' ; i = 1, 2, 3, 4   (4)

  N = ( n_11 n_12 n_14
        n_21 n_22 n_24
        n_31 n_32 n_34 )

where the focal length f is implicitly included in the expression of the 3×3 matrix N, and the matrix N is related to the matrix M as follows:

  ( m_11 m_12 m_14         ( n_11/f n_12/f n_14/f
    m_21 m_22 m_24 )  = k·   n_21/f n_22/f n_24/f    (5)
    m_31 m_32 m_34           n_31   n_32   n_34   )

The simultaneous equations given by equation 4 are linear with respect to the nine unknown variables { n_11, ..., n_34 }, and hence we can easily solve them. Their solution is expressed with a scale factor k. Moreover, given the focal length f of the camera, we can recover the column vectors { m_1 m_2 m_4 } of the matrix M by applying the relation of equation 5. With regard to the column vector m_3 of the matrix M, we should employ a vector which is normal to both of the two column vectors { m_1 m_2 }, e.g.

  m_3 = |m_1|·(m_1 × m_2) / |m_1 × m_2|   (6)

Thus we can recover the 3×4 transformation matrix M of equation 1.

2.3 Estimation of Focal Length

Once we recover the foregoing transformation matrix N of equation 4, we can estimate the relative depth z_i' of each coplanar cue point P_i as follows:

  z_i' = m_31·x_i + m_32·y_i + m_34 = n_31·x_i + n_32·y_i + n_34   (7)

Thus we get an estimate of the 3-D viewing coordinates p_i' = (x_i' y_i' z_i')^t of each coplanar cue point P_i as

  p_i' = z_i'·(X_i/f  Y_i/f  1)^t   (8)

The lengths of the four sides of the cue quadrilateral are assumed to be known in advance; furthermore, taking into consideration the fact that the ratio of the lengths of two sides arbitrarily chosen out of the four sides is invariant irrespective of the definition of the 3-D coordinate system, we get the relation

  |p_2' - p_1'|² / |P_2 - P_1|² = |p_4' - p_1'|² / |P_4 - P_1|²   (9)

Substituting equation 8 along with equation 7 into equation 9, we obtain a quadratic equation with respect to the focal length f. Its solution is given by

  f = sqrt( (r·C - A) / (B - r·D) )   (10)

where r is the known ratio of the squared side lengths appearing in equation 9, and A, B, C, D are scalar coefficients arising from the substitution.

2.4 Comparison with the Existing Camera Calibration Method

Most of the usual camera calibration methods do not employ any cues, and some of them involve the computation of an eigenvector with the minimum eigenvalue for a certain positive-definite matrix, which is often sensitive to noise [4]. On the other hand, the proposed cue-based camera calibration method requires only one solution of linear simultaneous equations with nine unknown variables and simple scalar arithmetic operations, and hence its computational algorithm works numerically stably. In addition, the proposed cue-based camera calibration method explicitly identifies the focal length f of the camera.
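The linear step of equation 4 amounts to fitting a plane-to-image homography from the four cue points. A sketch of that solve, assuming NumPy and fixing the scale by setting n_34 = 1 (our normalisation choice, valid when the cue plane does not pass through the camera origin; the function name is illustrative):

```python
import numpy as np

def estimate_N(world_xy, image_XY):
    """Solve the linear system of equation (4) for the 3x3 matrix N,
    up to its scale factor, from four coplanar cue points.

    Each correspondence (x, y) -> (X, Y) gives two linear equations via
    X = (n11 x + n12 y + n14) / (n31 x + n32 y + n34), and similarly for Y.
    """
    A, b = [], []
    for (x, y), (X, Y) in zip(world_xy, image_XY):
        A.append([x, y, 1, 0, 0, 0, -X * x, -X * y]); b.append(X)
        A.append([0, 0, 0, x, y, 1, -Y * x, -Y * y]); b.append(Y)
    n = np.linalg.lstsq(np.asarray(A, float), np.asarray(b, float),
                        rcond=None)[0]
    return np.array([[n[0], n[1], n[2]],
                     [n[3], n[4], n[5]],
                     [n[6], n[7], 1.0]])
```

From the recovered N, the relative depths of equation 7 follow directly from its third row, after which the focal-length step of section 2.3 can be applied.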
3. DIGITAL MOVING IMAGE PRODUCTION

We have imaged the scene in our laboratory while manually moving an 8-mm handy TV camera for home use, and then we have applied the foregoing cue-based camera calibration method to the moving image sequence, each image frame of which is composed of 720 × 486 pixels. In the imaged scene we have put an A4-size paper on the floor of the laboratory, and we have used the four vertices of the A4-size paper as the four coplanar cue points. Moreover, we have tracked certain feature points designated in the scene with the existing standard feature tracking algorithm, which computes the position of a square feature window minimizing the sum of the squares of the intensity differences over the feature window from one image frame to the next [5]. Furthermore, we have performed the digital image production task of making up a moving image sequence from a synthetic 3-D CG image sequence of rotating and shifting bricks and the real moving image sequence of our laboratory, according to the recovered estimates of the camera's motion, position and orientation. Figure 2 shows an image frame chosen from the resultant compound moving image sequence. As shown in figure 2, we can hardly identify artificial distortions in the compound image sequence, which demonstrates that the cue-based camera calibration method works satisfactorily for the foregoing digital image production task.

4. CONCLUSIONS

In this paper, we have presented a method for performing camera calibration along with accurate estimation of the focal length of the camera by using four definite coplanar points as a cue. The practical computational algorithms for the cue-based method of camera calibration are composed of simple linear algebraic operations and arithmetic operations, and hence they work well enough to provide accurate estimates of the camera's motion, position and orientation stably. Moreover, we have applied the cue-based camera calibration method to the digital moving image production task of making up an image sequence from a synthetic 3-D CG image sequence and a real moving image sequence taken with a TV camera, according to the recovered estimates of the camera's motion, position and orientation. Experimental simulations demonstrate that the cue-based camera calibration method works satisfactorily for the digital moving image production task.
REFERENCES

[1] K. Deguchi: "Image of 3-D Space: Mathematical Geometry of Computer Vision," Shoukodo Press, Tokyo, Japan, 1991.
[2] H. C. Longuet-Higgins: "A Computer Algorithm for Reconstructing a Scene from Two Projections," Nature, vol. 293, pp. 133-135, 1981.
[3] Y. Nakazawa, et al.: "A Robust Object-Specified Active Contour Model for Tracking Line-Features and Its Practical Application," submitted to IEEE ICIP '96 (accepted).
[4] R. Horaud, et al.: "An Analytic Solution for the Perspective 4-Point Problem," Computer Vision, Graphics, and Image Processing, vol. 47, pp. 33-44, 1989.
[5] C. J. Poelman and T. Kanade: "A Paraperspective Factorization Method for Shape and Motion Recovery," Lecture Notes in Computer Science, vol. 801, pp. 97-108, 1994.
Figure 1 - Coordinate systems.

Figure 2 - Image frames chosen from the resultant compound moving image sequence.
Session X: SIGNAL PROCESSING II
A Novel Approach to Phoneme Recognition using Speech Image
M Ahmadi, N J Bailey, B S Hoyle
The University of Leeds, Department of Electronic & Electrical Eng., Leeds, LS2 9JT, England
Tel: +44 (113) 233 2016  Fax: +44 (113) 233 2032  E-mail: <[email protected]>

Abstract- In this paper a novel feature extraction technique based on the two-dimensional DCT (Discrete Cosine Transform) of the spectrogram is proposed. This is in contrast to conventional approaches based on one-dimensional analysis such as LPC, cepstral, or FFT features. In order to demonstrate the novel approach, two tasks, word recognition and phoneme recognition, were conducted. The word recognition task was carried out as a preliminary study, using a small database of 30 names spoken by 15 speakers. For the phoneme recognition task, a series of experiments was conducted on the voiced stops ('b', 'd', 'g') of the TIMIT [1] database, uttered by 630 speakers (male & female). The extracted data form the basis of input patterns for training two types of neural networks: a semi-dynamic network, the Time-Delay Neural Network (TDNN), and a static network, the Multilayer Perceptron (MLP). For the word recognition task, a recognition rate of 86 percent was achieved for 7 names using the TDNN. For the phoneme recognition task, the highest recognition rates of 77.5 and 72.4 percent were recorded for the TDNN and MLP respectively. The results for phoneme recognition contrast with the 72 percent quoted by Hwang et al. [2] for the same phonemes spoken by 40 females.

1. Introduction
Since the advent of neural networks there has been a growing interest in automatic speech recognition. Many reviews have been carried out on different approaches to training a neural network [3,4]. The ultimate aim of this challenging task is to develop a system for speaker- and text-independent speech recognition. However, the current status of research falls well short of a comprehensive solution to this problem. This paper aims to be a step forward in that direction.
The proposed idea has shown significant improvements in recognition accuracy as well as in convergence rate for cross-validation and training. Fig. 1 illustrates the overall system model. In the following section the input data are defined. Section 3 explains the processing and feature extraction algorithm. In section 4 the neural network structures are discussed, and the results of the different neural networks on the TIMIT database are quoted.
Fig. 1 The overall system: Input Speech → Pre-processing → Feature Extraction (image processing) → Neural Networks (TDNN & MLP) → Recognised Speech.
2. Data Collection
2.1. Word Database
The objective of this task was to recognize a number of names of personnel within our department. The names were recorded under room conditions with a noisy background, by means of a good tape recorder and a dynamic microphone, in order to produce samples such as may occur in a real application. Thirty names were recorded by male and female speakers. The recordings were then transferred to a Sun workstation in μ-law 8-bit .au format, sampled at 8 kHz, and edited to a length of 750 msec. Names of shorter length were padded to fit the fixed length. The data were then converted to 16-bit linear format at the same sampling rate, and finally divided into training and testing sets ready for processing. The recorded speech signals were not centered or time-aligned, so that they would correspond to real data.
2.2. Phoneme Database
The voiced stop ('b', 'd', 'g') data were extracted from natural continuous speech uttered by 630 speakers from 8 different regions of the TIMIT [1] database. Over 2000 utterances were selected for training the neural networks and about 1250 utterances in total were selected for cross-validation. The training and cross-validation (i.e. testing) sets contain the same number of male and female speakers.
3. Data Processing & Feature Extraction
Feature selection refers to the choice of certain attributes of an image that are required for recognition/classification purposes. A fundamental principle in digital image processing is the ability to represent the image in a space in which the attributes of the picture are not correlated. An orthogonal transform has two such distinct properties: I. it decorrelates the signal in the transform domain; II. it packs the most variance into the fewest transform coefficients. The DCT [5] is the best sub-optimal orthogonal transform in comparison with the KLT (Karhunen-Loeve Transform), which is referred to as the optimal transform. Fig. 2 illustrates the MSE of different orthogonal transforms [6] versus block size. Smaller blocks are chosen rather than the entire image for three major reasons. Firstly, to exploit the redundancy in a set of pixels. Secondly, processing a number of smaller blocks is computationally less intensive, which relaxes the real-time constraint for most practical purposes. Finally, any one pixel in a picture is likely to be closely related to the four pixels that surround it, and similarly each of these four pixels is likely to bear the same relation to its respective neighbours, but the original pixel is unlikely to be related to one a long distance away. Therefore, by splitting the image into a number of smaller blocks we hope to form groups of pixels that are statistically related, with a consequently high level of redundancy. The generated wide-band spectrogram was broken into a number of P×Q (P = Q = 8) pixel blocks, as shown in Fig. 3, where R and C are the dimensions of the spectrogram.
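The block-wise feature extraction described above can be sketched as follows: split the spectrogram into P×Q blocks, take the 2-D DCT of each block, and keep the DC (top-left) coefficient as the key feature. This is a minimal numpy sketch under the paper's stated parameters (P = Q = 8); the spectrogram itself is assumed given.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix: C @ x computes the 1-D DCT of x."""
    C = np.cos(np.pi * (2 * np.arange(n)[None, :] + 1)
               * np.arange(n)[:, None] / (2 * n))
    C[0] /= np.sqrt(2)
    return C * np.sqrt(2.0 / n)

def block_dct_features(spectrogram, P=8, Q=8):
    """Split an RxC spectrogram into PxQ blocks, take the 2-D DCT of each
    block and keep the DC coefficient as the block's key feature."""
    Cp = dct_matrix(P)
    R, Ccols = spectrogram.shape
    feats = []
    for r in range(0, R - P + 1, P):
        for c in range(0, Ccols - Q + 1, Q):
            block = spectrogram[r:r + P, c:c + Q]
            coeffs = Cp @ block @ Cp.T    # separable 2-D DCT
            feats.append(coeffs[0, 0])    # DC (top-left) element
    return np.array(feats)
```

For the orthonormal DCT the DC coefficient of a P×P block equals P times the block mean, so a constant-valued 8×8 block yields a DC feature of 8 times that value.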
Fig. 2 Mean square error versus block size for different orthogonal transforms.
Fig. 3 Image segmentation.
Fig. 4 Processed P×Q block.

A 2D-DCT of each 8×8 block was calculated and the m key features were extracted, as shown in Fig. 4. In the case of word recognition, blocks of 16×16 were selected. As shown in Fig. 4, the frequency increases along the vertical and horizontal axes, starting at the dc element situated at the top left corner (the first element) and ending at the 64th element at the bottom right, with the highest frequency. Most of the information in each processed block is stored in the low frequency region. The dc component was selected as the key feature from each individual block and stored in a pattern file for training the neural networks. The overall system consists of three main sections, as shown in Fig. 5. In the pre-processing stage the analogue data are converted to 16-bit linear data. The second stage performs the image processing and key feature extraction, and finally in the last section the generated patterns are used to train and test the two neural networks.
4. Neural Network Structure and Results
The selected data form the basis of input patterns for training the neural networks. In this study a semi-dynamic neural network (Time-Delay Neural Network, TDNN) and a static network (Multilayer Perceptron, MLP) are trained for recognition purposes. These two networks were used in order to investigate whether the processed spectrogram needs a network adapted to the dynamic behaviour of the speech signal, or whether the extracted features are adequate for a simple static network.
4.1. Word recognition
A simple structure was used: a 16×14 input layer with three sliding windows, a first hidden layer of size 14×8 with five sliding windows, a second hidden layer of size 10×3, and finally N output nodes, where N is the number of names to be recognized. The network was trained for speaker-independent recognition of 3, 5, and 7 names. After training, the performance of the TDNN was tested with test patterns. The networks with 3 and 5 outputs showed 100% accuracy in word recognition. The network with 7 classes achieved a lower accuracy of 86% on the test data set. The result for the 7 names could be improved by introducing time alignment and centering the signal.
4.2. Phoneme recognition
The proposed procedure reduces the number of input nodes in the training patterns and at the same time provides more prominent features in the data set. For example, in this experiment the input features are reduced to 72 (8×9), compared to the 240 (16×15) reported by Waibel [1] for the same task. For a TDNN, the reduction in the number of input units translates to a smaller number of hidden nodes (reducing the total number of connections), which in turn results in shorter training time and a better convergence rate.
The TDNN used for this experiment has an input layer of 8×9 (72) nodes with three sliding windows (8×9/3), a first hidden layer of 6×7 (42) nodes with four sliding windows (6×7/4), a second hidden layer of 3×3 (9) nodes, and finally 3 outputs, i.e. 8×9/3 - 6×7/4 - 3×3 - 3. In the case of the MLP the same numbers of inputs and outputs were used, i.e. 72 and 3 respectively, but only one hidden layer of 20 nodes was used, in comparison with the two hidden layers of the TDNN. The full set of results is given in Table 1. The highest recognition rates of 77.5 and 72.4 percent were recorded for the TDNN and MLP respectively. These results contrast with the 72 percent quoted by Hwang et al. [2] for the same phonemes spoken by only 40 female speakers.
Fig. 5 The speech recognition system: analog speech signal → ADC (8 kHz samples, 8-bit mu-law) → convert to 16-bit linear → spectrogram → choose P, Q → divide into L segments of P×Q → 2D-DCT of each segment → take m features from each segment (zigzag scanning) → store as pattern file → classifier (TDNN) → recognised phoneme/word.

Table 1 Results on the TIMIT database (recognition rates, percent).

    NN Type | Training | Testing
    TDNN    | 95       | 77.5
    MLP     | 89       | 72.4
References
[1] Waibel A. H., Hanazawa T., Hinton G., Shikano K., Lang K., "Phoneme Recognition Using Time-Delay Neural Networks", IEEE Trans. on ASSP, Vol. 37, No. 3, March 1989.
[2] Hwang J., Li H., "Interactive Query Learning for Isolated Speech Recognition", Proc. of IEEE Neural Networks for Signal Processing II, Denmark, 31 Aug.-2 Sep. 1992, pp. 93-102.
[3] Lippmann R. P., "Review of Neural Networks for Speech Recognition", Neural Computation, Vol. 1, pp. 1-38, MIT Press, 1989.
[4] Rabiner L., Juang B. H., "Fundamentals of Speech Recognition", Prentice-Hall International Inc., 1993.
[5] Ahmed N., Natarajan T., Rao K. R., "Discrete Cosine Transform", IEEE Trans. on Computers, Jan. 1974.
[6] Rao K. R., Yip P., "Discrete Cosine Transform: Algorithms, Advantages, Applications", Academic Press Inc., 1990.
MODIFIED NLMS ALGORITHM FOR ACOUSTIC ECHO CANCELLATION
M. MEDVECKY
SLOVAK TECHNICAL UNIVERSITY, FACULTY OF ELECTRICAL ENGINEERING AND INFORMATION TECHNOLOGY, DEPARTMENT OF TELECOMMUNICATIONS, ILKOVICOVA 3, 812 19 BRATISLAVA, SLOVAKIA
Abstract
This paper presents a modified version of the normalised least-mean-square algorithm (M-NLMS). The M-NLMS algorithm was developed for efficient weight adaptation of the modified finite impulse response (M-FIR) filter, which is especially suitable for modelling systems with a very long, decaying and time-varying impulse response in finite-precision arithmetic, in applications such as the hardware realisation of an acoustic echo canceller. The derivation of the M-NLMS algorithm, an investigation of its parameters and simulation results are presented, together with a comparison of acoustic echo cancellers using an adaptive FIR filter adapted by the NLMS algorithm and an adaptive M-FIR filter adapted by the M-NLMS algorithm. The better performance of the acoustic echo canceller with the M-NLMS algorithm is demonstrated by simulation results.
1. Introduction
The main problem of acoustic echo cancellation is the precise identification of the acoustic impulse response. The actual structure of the echo path is usually modelled by an FIR filter. Such a filter has the advantages of guaranteed stability during adaptation and a unimodal mean square error (MSE) surface, and the principle of FIR filter response computation is similar to the acoustic echo origination process. The FIR filter may require up to thousands of adaptive weights to correctly identify the acoustic impulse response. Such a number of taps results in a large arithmetic complexity, which is a crucial problem for real-time implementations such as handsfree telephones or videoconferencing. The solution is a hardware realisation of the acoustic echo canceller, which enables a parallel processing implementation. Unfortunately, the hardware realisation brings a problem with computation precision, and arithmetic precision has a profound impact on realisation complexity. Hardware realisations of the acoustic echo canceller in VLSI or ASIC circuits use, for the sake of simpler realisation, fixed-point arithmetic, which yields poorer performance than floating-point arithmetic. Furthermore, the room impulse response has a very specific shape: a short delay followed by an exponentially decaying tail. Modelling such a characteristic by a transversal filter with fixed-point arithmetic leads, due to round-off error, to a gradual degradation of the filter weight accuracy towards the end of the filter [1]. This yields a poorer performance of the acoustic echo canceller. The problem can be solved by partitioning the impulse response in the time or frequency domain [2]; the filter weights may then have a different gain in each partition. Since the room impulse response changes, the partitions and gains should be changed as well.
To solve this problem, a modified structure of the FIR filter (M-FIR) and a modified version of the NLMS algorithm (M-NLMS) have been derived and are presented here.
2. Modified FIR filter
When we use an adaptive transversal filter to model an acoustic echo, the output y_k of the transversal filter is obtained as a sum of weighted contributions from the last N samples of the input signal x. It can be written as follows:

    y_k = Σ_{i=0}^{N-1} y_ik = Σ_{i=0}^{N-1} x_{k-i} w_ik        (1)

where w_ik are the weights of the adaptive filter at time k. If d_k is a sample of the desired signal at time k, then the error e_k is given as

    e_k = d_k - y_k = d_k - (y_0k + y_1k + ... + y_(N-1)k) = (d_k - y_0k) - y_1k - ... - y_(N-1)k        (2)
Applying the following substitutions

    d_0k = d_k        (3)
    e_ik = d_ik - y_ik        (4)
    d_(i+1)k = e_ik        (5)

we obtain a new topology of an adaptive filter. The block diagram of the Modified FIR filter is shown in Fig. 1.
Fig. 1. Modified FIR filter.

As can be seen, the values of the errors e_ik, as well as the values of the weights w_ik, decrease with increasing i. Since the values decrease, we can increase the computation accuracy by multiplying (bit shifting) the e_ik and w_ik when their values fall under specific levels (thresholds). The multiplying can be made adaptive. Choosing the multiplier coefficients from the range 2^n enables the multiplication to be realised very simply by rotation. The hardware realisation of such an adaptive acoustic echo canceller can have a computation complexity equal to that of the classical FIR filter. More details about the M-FIR filter implementation and internal data representations can be found in [3].
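The cascade topology of Equations (1)-(5) can be sketched as follows: each stage subtracts its partial product from the running desired signal, so the final stage's error equals the classical FIR error. This is a floating-point sketch for illustration; the point of the M-FIR filter, the per-stage fixed-point shifting, is omitted here.

```python
import numpy as np

def mfir_stage_errors(x, w, d, k):
    """One time step k of the M-FIR cascade (Equations 1-5)."""
    N = len(w)
    dk = d[k]                      # d_0k = d_k            (3)
    for i in range(N):
        y_ik = x[k - i] * w[i]     # partial output y_ik
        dk = dk - y_ik             # e_ik = d_ik - y_ik (4); d_(i+1)k = e_ik (5)
    return dk                      # final error e_(N-1)k

rng = np.random.default_rng(0)
x = rng.standard_normal(32)
w = rng.standard_normal(8)
d = rng.standard_normal(32)
k = 20
e_cascade = mfir_stage_errors(x, w, d, k)
# Classical FIR error e_k = d_k - y_k (Equation 2) for comparison.
e_direct = d[k] - sum(x[k - i] * w[i] for i in range(8))
assert np.isclose(e_cascade, e_direct)
```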
3. Modified NLMS algorithm
Consider the situation where a real-valued signal and a filter with real-valued coefficients must be realised on a system with fixed-point arithmetic. In this case, scaling of data and coefficients to best fit the dynamic range of the system is required. In practical terms, scaling reduces to selecting a crest factor appropriate for the signal characteristics and the precision used in storage and arithmetic. The internal scaling can then be realised by normalisation to a value δ that is the signal power level rounded to the nearest 2^n value; the variable δ represents a basic shift in the internal number representation. Considering the above implementation, the convergence factor μ for the M-NLMS algorithm can be expressed as

    μ_k = α* / (γ* + x_k^T x_k)        (6)

where α* and γ* are the coefficients α and γ from the NLMS algorithm shifted by the value δ. The term x_k^T x_k in equation (6) can be obtained at time k by the iteration
    x_k^T x_k = x_{k-1}^T x_{k-1} - x_{k-N-1}^2 + x_k^2        (7)

and in our case, for the internally scaled (δ-shifted) samples x*,

    x*_k^T x*_k = x*_{k-1}^T x*_{k-1} - (x*_{k-N-1})^2 + (x*_k)^2        (8)
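For reference, the floating-point baseline that the M-NLMS algorithm builds on can be sketched as follows: the normalised step of Equation (6) with the sliding power update of Equation (7). The δ-scaled fixed-point shifts that distinguish M-NLMS are intentionally omitted; parameter names are illustrative.

```python
import numpy as np

def nlms(x, d, N, alpha=1.0, gamma=1e-6):
    """Baseline NLMS: w <- w + mu_k * e_k * x_k, with the input power
    x_k^T x_k maintained by a sliding recursion rather than recomputed."""
    w = np.zeros(N)
    power = 0.0
    e_hist = np.zeros(len(x))
    for k in range(len(x)):
        xk = np.array([x[k - i] if k - i >= 0 else 0.0 for i in range(N)])
        # sliding update of x_k^T x_k: add newest sample, drop the expired one
        power += x[k] ** 2
        if k - N >= 0:
            power -= x[k - N] ** 2
        e = d[k] - w @ xk
        mu = alpha / (gamma + power)   # convergence factor, Equation (6)
        w += mu * e * xk
        e_hist[k] = e
    return w, e_hist
```

In a noiseless system-identification run the weights converge to the true echo path, which is the behaviour the fixed-point M-NLMS variant is designed to preserve under limited precision.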
The new values of the filter weights are obtained as follows. Let us define

    w*_i(k+1) = (w_ik δ) / δ_wk + ((μ_k e_k x_{k-i}) δ) / δ_e        (9)

and

    w**_i(k+1) = (w*_i(k+1) δ_w(k+1)) / δ        (10)

where δ_wk is an additional shift of the weight w_i at time k compared to the basic shift δ, and δ_w(k+1) is an additional shift of the filter weight w_i at time k+1 compared to the basic shift δ. Here e_k is the final error (see Fig. 1) obtained at time k and δ_e is the additional shift of this error. The brackets in equations (9) and (10) define the sequence of arithmetic operations, which guarantees that each entire term can be retained without prematurely truncating its precision and without resorting to extended-width internal buses.
Next we define the following six conditions:

    C1 = 1 if |w**_i(k+1)| > W_Mw        (11a)
    C2 = 1 if |w**_i(k+1)| > W_Hw        (11b)
    C3 = 1 if δ_w(k+1) > 1        (11c)
    C4 = 1 if |w**_i(k+1)| < W_Lw        (11d)
    C5 = 1 if |w**_i(k+1)| <> 0        (11e)
    C6 = 1 if δ_w(k+1) < δ_Mw        (11f)
where δ_Mw is the maximum acceptable additional shift of the weight w_i. The threshold W_Mw is the maximum value of w_i. The thresholds W_Hw and W_Lw define high/low levels such that, when they are crossed, the weight w_i can be increased/decreased (shifted up/down) by the unit step δ_1. Now we define four logical functions:

    f1 = C1        (12a)
    f2 = C2 ∧ C3 ∧ ¬f1        (12b)
    f3 = C4 ∧ C5 ∧ C6 ∧ ¬f2        (12c)
    f4 = ¬f3        (12d)

If f1 = 1 (True), then

    δ_w(k+1) = 1        (13)
    w_i(k+1) = w*_i(k+1) / δ        (14)

else, if f2 = 1 (True), then

    δ_w(k+1) = δ_w(k+1) / δ_1        (15)
    w_i(k+1) = w**_i(k+1) / δ_1        (16)

if not, then if f3 = 1 (True)

    δ_w(k+1) = δ_w(k+1) δ_1        (17)
    w_i(k+1) = w**_i(k+1) δ_1        (18)

and if f4 = 1 (True), then

    w_i(k+1) = w**_i(k+1)        (19)

and δ_w(k+1) does not change. The value of δ_w can be computed from two one-bit (Boolean) variables δ_U and δ_D, which indicate the shift of δ_w in stage i relative to the shift in stage i-1. This approach saves storage space and is implemented in the M-FIR filter. The new values of δ_U and δ_D are set up as follows:

    δ_U,i = ¬f1 ∧ f3        (20)
    δ_D,i = ¬f1 ∧ f2        (21)
4. Implementation
The hardware realisation of the M-NLMS algorithm can decrease the computation complexity to that of the classical NLMS algorithm. The conditions (11) can be evaluated simply by comparators. The functions (12), (20) and (21) can be generated by logic elements (AND gates) or by look-up tables. When the thresholds and the unit step δ_1 are set as powers of two, the multiplications and divisions in equations (6)-(21) can be realised by multiplexer, demultiplexer or simply by bit shifting.
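The power-of-two multiply/divide mentioned above reduces to a bit shift; a minimal sketch (the 16-bit, 11-fractional-bit representation is the one used in the simulations below, the helper name is illustrative):

```python
def shift_mul(value, delta_exp):
    """Multiply or divide a fixed-point integer by 2**delta_exp using
    shifts, as a hardware bit-shift unit would; a negative exponent
    divides (arithmetic right shift)."""
    if delta_exp >= 0:
        return value << delta_exp
    return value >> -delta_exp

# 16-bit integer with 11 fractional bits: basic shift delta = 2**11 = 2048.
DELTA = 1 << 11
w_fixed = int(0.37 * DELTA)          # quantise a weight to this format
assert shift_mul(w_fixed, 1) == w_fixed * 2      # shift up by delta_1 = 2
assert shift_mul(w_fixed, -1) == w_fixed // 2    # shift down by delta_1 = 2
```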
5. Simulation results
To verify the better performance of the Modified NLMS algorithm for acoustic echo cancellation, the following computer simulations were carried out. In the simulations, the impulse response of a real teleconference room with a length of 4000 samples was to be suppressed. Two adaptive filters with 4000 coefficients were used: an FIR filter adapted by the NLMS algorithm and an M-FIR filter adapted by the M-NLMS algorithm. The filter parameters were the same. In both cases the real data were scaled and internally represented by 16-bit integers with 11 bits for the fractional part (the basic shift δ = 2048). The unit step for the M-NLMS algorithm was δ_1 = 2. The measuring signal was white Gaussian noise. A total of 50000 iterations was made for each experiment. The convergence characteristics of the NLMS and M-NLMS algorithms are shown in Fig. 2. As can be seen, while the NLMS algorithm reaches only 30 dB echo suppression, the M-NLMS algorithm overcomes the 40 dB level defined by the ITU-T.
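The echo-suppression levels quoted above can be measured, for instance, as echo return loss enhancement (ERLE); a minimal sketch (the paper does not specify its exact measurement formula, so this is an illustrative assumption):

```python
import numpy as np

def erle_db(d, e):
    """Echo return loss enhancement in dB: power of the microphone (echo)
    signal d over power of the residual error e after cancellation."""
    d = np.asarray(d)
    e = np.asarray(e)
    return 10.0 * np.log10(np.sum(d ** 2) / np.sum(e ** 2))

# A residual attenuated by a factor of 100 in amplitude corresponds to
# 40 dB, the ITU-T level mentioned above.
d = np.ones(1000)
e = 0.01 * d
```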
Fig. 2. Comparison of the convergence characteristics of echo cancellers adapted by the NLMS and M-NLMS algorithms.

The dependence of the level of acoustic echo cancellation on the initial value of the normalised adaptation coefficient α, for the NLMS and M-NLMS algorithms, is shown in Fig. 3. As can be seen, the higher level of acoustic echo cancellation is reached by the M-FIR filter adapted by the M-NLMS algorithm. Furthermore, while the NLMS algorithm reaches its maximum level of acoustic echo cancellation for higher values of the normalised convergence parameter, α ≈ 2, and its fastest convergence for α ≈ 1, the M-NLMS algorithm reaches both its maximum level of acoustic echo cancellation and its fastest convergence for the same value α ≈ 1. (Note: the parameter α can be in the range 0 < α < 2.) Therefore, the choice α = 1 can decrease the computation complexity of the M-NLMS algorithm.
Fig. 3. Dependence of the level of acoustic echo cancellation on the normalised adaptation coefficient α.
6. Summary
In this paper a new modified version of the NLMS algorithm (M-NLMS) is presented. It is shown that the M-NLMS adaptive algorithm can achieve better performance for acoustic echo cancellation than the NLMS algorithm with the same parameters. The implementation of the weight-shifting algorithm enables better exploitation of the dynamic range given by the number of bits used for data representation. The effect of adaptive weight shifting is similar to floating-point arithmetic, but its hardware implementation is much simpler. The hardware realisation achieves the same computation complexity as the classic FIR filter and NLMS algorithm.
References
[1] Treichler, J. R., Johnson, C. R., Larimore, M. G.: Theory and Design of Adaptive Filters. John Wiley & Sons, New York, 1987.
[2] Borrallo, J. M. P., Otero, M. G.: "On the Implementation of a Partitioned Block Frequency Domain Adaptive Filter (PBFDAF) for Long Acoustic Echo Cancellation". Signal Processing, Vol. 27, 1992, pp. 301-315.
[3] Medvecky, M.: "Improvement of Acoustic Echo Cancellation in Hands-free Telephones". In: 1st International Conference on Telecommunications Technologies TELEKOMUNIKACIE'95, Bratislava, 31.5-1.6.1995, Vol. 1, 1995, pp. 127-132. (in Slovak)
Matrix Polynomial Computations Using the Reconfigurable Systolic Torus
T. H. Kaskalis*  K. G. Margaritis
Department of Informatics, University of Macedonia
156 Egnatia str., 54006 Thessaloniki, Greece
E-mail: {kaskalis,kmarg}@macedonia.uom.gr
Abstract
A wide range of matrix functions, including matrix exponentials, inversions and square roots, can be transformed to matrix polynomials through Taylor series expansions. The efficient computation of such matrix polynomials is considered here, through the exploitation of their recursive nature. The Reconfigurable Systolic Torus is proposed for its ability to implement iterative equations of various forms. Moreover, a detailed example of the matrix exponential realization is presented, together with the scaling and squaring method. The general design concepts of the Reconfigurable Systolic Torus are discussed and the algorithmic steps needed for the implementation are presented. The Area and Time requirements, together with the accomplished utilization percentage, conclude the presentation.
1 Introduction
The solution of various types of equations, appearing in many mathematical models, dynamic probabilistic systems and in stochastic and control theory, often requires the calculation of distinct matrix functions [1, 2, 6, 13]. Such functions include matrix exponentials (e^A), matrix inversions (A^{-1}), matrix roots (A^{1/2}, A^{-1/2}) or functions of the form cos A, sin A, log A, etc. As a result, the transformation of these polynomials to iterative algorithms, and their consequent efficient implementation, becomes an important issue. The Reconfigurable Systolic Torus [7] is a structure designed to implement iterative equations of various forms and is, therefore, a good candidate for the realization of matrix polynomial computations.
2 Systolic Implementation Example: The Matrix Exponential
In order to present the distinct steps for the implementation of a particular matrix function, we will focus on the matrix exponential example. Given a matrix A, the exponential e^A can be formally defined by the following convergent power series:

    e^A = I + A + A^2/2! + A^3/3! + ...        (1)
The straightforward algorithmic approach for calculating the exponential of a matrix is the Taylor series approximation technique:

    e^A ≈ T_k(A) = Σ_{p=0}^{k} A^p / p!        (2)
However, such an algorithm is known to be unsatisfactory, since k is usually very large for a sufficiently small error tolerance. Furthermore, the round-off errors and the computing costs of the Taylor approximation increase as ||A|| increases [10]. We can surpass these difficulties by using the fundamental property:
    e^A = (e^{A/m})^m        (3)
Moreover, if we employ the scaling and squaring method, we choose m to be a power of two, such that:

    m = 2^l  and  ||A||/m = ||A||/2^l ≤ 1        (4)
This approach guarantees a satisfactory Taylor approximation. We use Equation 2 for the calculation of e^{A/m} and then e^A is formed by l successive squarings [2]. For a given error tolerance ε and magnitude ||A||, Table 1 summarizes the optimum (k, l) associated with [T_k(A/2^l)]^{2^l} [10]. According to the statements made above, the calculation of the exponential of a matrix A can be implemented following this algorithm:
*Supported by the Greek National Scholarship Foundation.
Table 1: The optimum (k, l) parameters for the Taylor series approximation.

    ||A||   ε=10^-3   ε=10^-6   ε=10^-9   ε=10^-12   ε=10^-15
    10^-2   1, 0      2, 1      3, 1      4, 1       5, 1
    10^-1   3, 0      4, 0      4, 2      4, 4       5, 4
    10^0    5, 1      7, 1      6, 3      8, 3       7, 5
    10^1    4, 5      6, 5      8, 5      7, 7       9, 7
    10^2    4, 8      5, 9      7, 9      9, 9       10, 10
    10^3    5, 11     7, 11     6, 13     8, 13      8, 14

1. Given a matrix A and an error tolerance ε, obtain the optimum values (k, l) from Table 1.
2. Calculate the matrix B = A/2^l.
3. Calculate T_k(B).
4. Calculate e^A ≈ [T_k(B)]^{2^l} through l successive squarings of T_k(B).
Steps 3 and 4 of the above algorithm carry the main computational burden of the whole problem. As a result, we should try to implement these operations efficiently. Step 3 of the algorithm is a matrix polynomial computation, which can be expressed by:
    X^(0) = I        (5)
    X^(p+1) = (1/(k-p)) B X^(p) + I ,  p = 0, 1, ..., k-1        (6)
In a more general form, the above mentioned expressions comprise the following iterative equation:
    X^(0) = I        (7)
    X^(p+1) = S^(p) X^(p) + Y ,  p = 0, 1, ..., k-1        (8)
where:

    S^(p) = (1/(k-p)) B        (9)
    Y = I  (constant)        (10)
Step 4 of the algorithm is a successive matrix squaring, given by the following recurrence:

    X^(0) = T_k(B)        (11)
    X^(p+1) = X^(p) X^(p) ,  p = 0, 1, ..., l-1        (12)

Obviously:

    X^(l) = [T_k(B)]^{2^l}        (13)
and the calculation of the matrix exponential is completed. The underlying computational structure for both the Taylor series evaluation and successive matrix powers can be a systolic network allowing for repeated matrix-matrix Inner Product Steps (IPS) [12]. Obviously, this systolic network should implement efficiently such operations, i.e. the input of each new computation should be the output of the previous computation and, possibly, some new matrix.
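The two recurrences, the Horner-style polynomial iteration of Equation 6 and the squaring iteration of Equation 12, can be sketched as a plain sequential program (what the systolic array computes, not the array itself):

```python
import numpy as np

def expm_taylor_squaring(A, k, l):
    """Scaling-and-squaring matrix exponential: the Horner recurrence
    for T_k(B) with B = A / 2**l, followed by l successive squarings."""
    n = A.shape[0]
    B = A / 2.0 ** l                        # step 2: scaling
    X = np.eye(n)                           # X^(0) = I              (5)
    for p in range(k):                      # X^(p+1) = BX/(k-p) + I (6)
        X = B @ X / (k - p) + np.eye(n)
    for _ in range(l):                      # X^(p+1) = X X          (12)
        X = X @ X
    return X                                # X^(l) ~ e^A            (13)
```

For ||A|| = 1 and ε = 10^-9, Table 1 gives (k, l) = (6, 3); with those parameters the result agrees with the exact exponential to well within that tolerance on a small diagonal test matrix.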
3 Reconfigurable Systolic Torus

The Reconfigurable Systolic Torus [7] is a structure designed to perform general iterative operations. The Area and Time requirements are kept low and flexibility is also retained. The design consists of iWarp-like [11] 4-input and 4-output Inner Product Step cells, which are interconnected in a reconfigurable manner. The term reconfigurable does not mean that the interconnection network changes during the computation phase, since that would violate the regularity requirement of the systolic structure [8, 9, 12]. Rather, the network can be configured in some predefined manner prior to the application of a recursive algorithm. In this design, 2 I/O channels of each IPS cell are allocated to the horizontal data stream and the other 2 I/O channels belong to the vertical data stream. One horizontal and one vertical I/O channel permanently follow the usual (neighbouring) systolic interconnection. Each of the other two channels can be in one of two predetermined configurations, presented below. These I/O channels serve the correct dissemination of the data items at the beginning of a new iteration. We will now examine the horizontal data stream, but the same statements also apply to the vertical data stream by simply interchanging the indices i, j.
Figure 1: The Reconfigurable Systolic Torus (n = 4): (a) static I/O channels with the vertical straight configuration; (b) the horizontal shuffle configuration for the 4 rows.

In the first case, which is called the "straight configuration", each cell (i, j) is connected in the normal systolic manner with its neighbours. That is, it accepts input from cell (i-1, j) and transmits output to cell (i+1, j). Data travelling through this channel are not processed or used at all and are transmitted as-is in the next time step. Considering the boundary IPS cells, cell (1, j) accepts input from outside and cell (n, j) discards its output. Finally, during the first IPS cycle of each new iteration, the data received in the previous time step through this channel are transmitted through the other (static) I/O channel. Next, we describe the second case of interconnection, called the "shuffle configuration". This interconnection serves the need to disseminate the newly produced x_ij^(p+1) elements to the correct IPS cells. As soon as cell (i, j) produces x_ij^(p+1), it outputs this value to cell (i', j), where
    i' = (i - j + 1) mod n + 1        (14)

and, accordingly, it accepts input from cell (i'', j), for which:

    (i'' - j + 1) mod n + 1 = i        (15)
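The shuffle mapping of Equations 14 and 15 can be checked directly in code: for every column j the map i → i' is a permutation of the rows, so each newly produced element reaches exactly one IPS cell.

```python
def shuffle_target(i, j, n):
    """Destination row i' of the shuffle channel for cell (i, j),
    per Equation 14 (1-based indices, as in the paper)."""
    return (i - j + 1) % n + 1

n = 4
for j in range(1, n + 1):
    targets = [shuffle_target(i, j, n) for i in range(1, n + 1)]
    # bijection: every row appears exactly once as a destination
    assert sorted(targets) == list(range(1, n + 1))

# Equation 15: the source cell i'' of cell (i, j) is the unique row whose
# shuffle target is i.
i, j = 2, 3
src = next(s for s in range(1, n + 1) if shuffle_target(s, j, n) == i)
assert shuffle_target(src, j, n) == i
```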
Figure 1 presents the implementation of Equation 8 for the n x n matrices X, S, Y and for n = 4, on a Reconfigurable Systolic Torus. The Y matrix is assumed preloaded in the IPS cells, where it remains throughout the whole operation. Moreover, X^(0) is also preloaded and then the structure begins functioning as if X^(0) had just been calculated. Figure 1.a represents the static I/O channels together with the vertical straight configuration. The dotted lines correspond to the horizontal configuration, which follows the shuffle principle and is presented in detail in Figure 1.b for the 4 rows of the systolic array. The static I/O channel is, obviously, the outer line, while the shuffle configuration is represented by the inner line. Note that the row cells follow the numbering of the x_ij elements they produce. In general, the Reconfigurable Systolic Torus performs the operations presented in Table 2, where the respective configurations are also listed. Equation 8 is solved if we employ the straight configuration for the vertical data stream and the shuffle configuration for the horizontal data stream (operation #1). After k iterations we reconfigure the vertical data stream to the shuffle configuration, load the null matrix Y = O and let the RST function for l recursions, in order to compute Equation 12 (operation #5). The Area and Time requirements of one of the operations presented in Table 2, with the preloading and unloading time intervals considered, are:

A = n² IPS cells    (16)

T = (k + 3)n time steps    (17)
and the utilization (efficiency) of the array is:

U = (array active computing time) / ((number of cells) x (number of time steps)) = 1 / (1 + 3/k)    (18)
Index  Equation Type                   Matrices Used    Horizontal      Vertical        Prior
                                                        Configuration   Configuration   Addition of Y
1      X^(p+1) = S^(p) X^(p) + Y       S, X, Y          shuffle         straight        -
2      X^(p+1) = S^(p) (X^(p) + Y)     S, X, Y          shuffle         straight        yes
3      X^(p+1) = X^(p) S^(p) + Y       S^T, X^T, Y^T    shuffle         straight        -
4      X^(p+1) = (X^(p) + Y) S^(p)     S^T, X^T, Y^T    shuffle         straight        yes
5      X^(p+1) = X^(p) X^(p) + Y       X^T, X, Y        shuffle         shuffle         -
6      X^(p+1) = X^(p) (X^(p) + Y)     X^T, X, Y        shuffle         shuffle         yes
7      Z^(p) = S^(p) X^(p) + Y         S, X, Y          straight        straight        -
Table 2: Configuration of the RST for the implementation of various operations.

while the respective values for the two operations considered in our problem are:

A = n²    (19)

T = (k + l + 4)n    (20)

U = 1 / (1 + 4/(k + l))    (21)

where the time interval for the reconfiguration of the vertical data stream is not considered.
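The Area, Time and utilization figures for the two chained operations can be checked numerically; the sketch below (rst_cost is a hypothetical helper of ours) computes A = n², T = (k + l + 4)n and U = 1/(1 + 4/(k + l)), ignoring the reconfiguration interval as the text does:

```python
def rst_cost(n, k, l):
    """Area (IPS cells), time (steps) and utilization of the RST for the
    two chained iterative operations; the vertical-stream
    reconfiguration interval is not counted."""
    area = n * n                         # A = n^2
    time = (k + l + 4) * n               # T = (k + l + 4) n
    util = 1.0 / (1.0 + 4.0 / (k + l))   # U = 1 / (1 + 4/(k+l))
    return area, time, util
```

For example, n = 4 with k = 8 iterations and l = 4 recursions gives 16 cells, 64 time steps and a utilization of (k + l)/(k + l + 4) = 0.75.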
4. Conclusions
In this paper we considered the efficient solution of matrix polynomial computations appearing in many practical problems. This solution depends on the calculation of iterative equations. These iterations are the most computationally expensive part of the overall algorithms, so their systolic implementation becomes important. The Reconfigurable Systolic Torus is used to implement the recursive equations appearing in such algorithms. The problem is solved in several consecutive iterative operations in a straightforward and regular manner, keeping the overall Area and Time requirements low and, therefore, providing high utilization. Further research includes the use of the Reconfigurable Systolic Torus for the implementation of several other types of matrix functions involving recursions in their calculation.
Proceedings IWISP '96; 4-7 November 1996; Manchester, U.K. B.G. Mertzios and P. Liatsis (Editors) © 1996 Elsevier Science B.V. All rights reserved.
Real-Time Connected Component Labeling on One-Dimensional Array Processors based on Content-Addressable Memory: Optimization and Implementation
Eril Mozef, Serge Weber, Jamal Jaber, and Etienne Tisserand
Laboratoire d'Instrumentation Electronique de Nancy (L.I.E.N), University of Nancy I, BP 239, 54506 Vandoeuvre Cedex, FRANCE
Tel: (33) 83 91 20 71, Fax: (33) 83 91 23 91
Abstract
Connected component labeling is not easy to process due to its local and global features. These features make the labeling operation extremely time costly when a sequential architecture has to be used, because of its local operation principle. In order to reduce the processing time, labeling should be done in parallel using both local and global operations. Unfortunately, this solution is very expensive, particularly for two- or three-dimensional array processors. In order to find a trade-off between processing time and hardware cost, we propose an efficient parallel architecture dedicated to connected component labeling based on Content-Addressable Memory (CAM). For an n x n image, the optimized architecture merely requires n/2 - 1 PEs and n²/4 CAM modules through a 4-pixels grouping technique. The proposed algorithm, based on a divide-and-conquer technique, leads to a complexity of O(n log n) with a small multiplicative constant factor of the order of ½. The global communication is reconfigurable and ensured in O(log n) units of propagation time by a tree structure of switches. Hence, through this performance, this architecture reaches a quasi-optimal processor-time product in labeling. Moreover, the architecture permits sequential processing, perfectly adapted to labeling in one image scan from any interlaced-mode video signal. For a practical image of 512 x 512 pixels, the labeling time is estimated at the order of 204.8 μsec. Hence, the labeling can be performed in real time at video rate, i.e., 50 frames per second.
1. Introduction
Connected component labeling is amongst the most fundamental tasks in intermediate-level vision. This operation is not easy to process as it possesses local as well as global features. In other words, connectedness of image regions implies that labels can be propagated locally among adjacent pixels. However, the label assigned to a pixel may be the same as that of a pixel at a relatively distant location within the image. The significance of connected component labeling has incited a large amount of research work and led to numerous algorithms and architectures. Sequential machines, which generally obtain, for an n x n image, algorithm complexities of O(n⁴) [1], are not suitable for this purpose because of their local operation principle. In order to reduce processing time, labeling should be done in parallel using both local and global operations. Unfortunately, this solution is very expensive, particularly for two- or three-dimensional array processors. It follows that a processor-time trade-off should be found. Hence, an architecture categorized as a two-dimensional array processor with O(n²) PEs (Processing Elements), CLIP [2], using a local approach, yields an algorithm complexity of O(n²). Another architecture of the same type, Polymorphic-Torus [3], performing labeling based on a global approach, permits the reduction of the complexity to O(n) [4]. In spite of their high performance, particularly for neighborhood processing, the processor-time performance for labeling of these architectures is not efficient. Moreover, the data propagation in the interconnection network is not negligible and may be O(n) units of propagation time. This can limit the operation frequency and decrease the architecture performance. Yet another similar architecture, Meshes with Reconfigurable Buses [5], labeling based on a list-traversal technique, leads to a complexity of O(log n) [6].
Unfortunately in this case, for a complex boundary of a connected component containing O(n²) pixels, the data propagation in its global communication is O(n²) units of propagation time. Another type of architecture, this time categorized as a one-dimensional array processor with orthogonal-access memory, ACMAA [7], contains O(n) PEs and yields a complexity of O(n²). Although its communication diameter is efficient, its processor-time labeling performance is not. Much effort has been made to find a processor-time optimal labeling [1], [11]. A parallel algorithm for a given problem is said to be processor-time optimal if the product of the number of processors and the parallel execution time is equal to the sequential complexity of the problem. Labeling in optimal logarithmic time on an EREW PRAM is presented in [12], [13]. However, the large multiplicative constant factors in the time complexity of these algorithms limit their practical implementation. A processor-time optimal labeling is presented in [11], based on combining parallel and sequential algorithms. In order to find a processor-time trade-off, we propose another efficient parallel architecture based on Content-Addressable Memory. Categorized as a one-dimensional array processor, this architecture has O(n) PEs and leads to an algorithm complexity of O(n log n) with a small multiplicative constant factor of the order of ½. Through these performances, this architecture presents a quasi-optimal processor-time product in labeling. Moreover, the data propagation in global communication is ensured in O(log n) units of propagation time by a tree structure of switches. This architecture, which exploits a global approach and a divide-and-conquer technique, is very simple to realize. There are no complex pointer manipulations or data routing schemes compared to the above architectures. For an n x n image, the architecture merely requires n/2 - 1 PEs and n²/4 CAMs (Content-Addressable Memories) through a 4-pixels grouping technique.
Furthermore, the architecture permits sequential processing, perfectly adapted to labeling in one image scan from any interlaced-mode video signal. For a practical image of 512 x 512 pixels, the labeling time is estimated at the order of 204.8 μsec. Hence, the labeling can be performed in real time at video rate, i.e., 50 frames per second. Through these performances, the proposed architecture is well suited to applications in industrial vision, document analysis, etc.
Fig. 1. The organization of the proposed architecture.
Fig. 2. The module of memories.
Fig. 3. The Processing Element.
2. The Proposed Architecture

2.1. The Memory Modules
The memory modules consist of 2 planes denoted M1[i,j] (0 ≤ i,j ≤ n-1) and M2[g,h] (0 ≤ g,h ≤ n/2 - 1), where i = 2g and j = 2h (Fig. 1). The M1[i,j] plane consists of n² simple registers of 1 bit. It is used for storing an initial binary image. Through a 4-pixels grouping technique (see following paragraph), the M2[g,h] plane has merely n²/4 CAMs (Content-Addressable Memories). This plane consists of n/2 rows, each containing n/2 CAMs, and is used for storing either intermediate labels or definitive labels. To simplify the memory architecture and to speed up processing time, the pipeline principle is used. In this case, M1[i,j] and M2[g,h] operate in a left-shift circular FIFO mode. This mode allows the data to be transferred to the PEs. The CAM provides the PSMU (Parallel Search and Multiple Update) operation. This operation allows an update of CAM rows of any length in merely one clock cycle, or O(1). This module is similar to the MUCAM module (Multiple Update CAM) presented in [8]. The CAM module (Fig. 2) consists of one register and one identity comparator, both of 2 log(n/2) bits. A value of R2[g,h] is compared to a value from the address bus. If these values are identical, the content of the register is updated by a value from the data bus. This is why the CAM is called "content-addressable". The register R2 has 2 other inputs, a FIFO input and a row-major-order input corresponding to the row-major-order position of the CAM. In order to reduce the number of required CAMs, four adjacent registers in M1 are associated to a CAM. The CAM value depends on that of these registers and vice versa. This is what we define as a 4-pixels grouping technique. Using this technique the number of required CAMs is reduced by a factor of 4. This technique can be described as follows.
In the initialization phase, if at least one of the 4 adjacent pixels has a binary value 1, the corresponding CAM is initialized by its row-major order. This value is processed at the intermediate phase and can change. At the end of processing, each CAM obtains its definitive value, called a label. This label automatically belongs to each pixel possessing a binary value 1 among the 4 adjacent pixels. In order to get back the labels from M2[g,h], M1[i,j] is sequentially scanned; at each pixel having a binary value 1, the corresponding value in M2[g,h] is then multiplexed to the output bus (not shown in schematic).
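As an illustration of the 4-pixels grouping and the PSMU operation, a behavioural sketch (function names init_cam and psmu are our own, and labels are 1-based by assumption; the hardware performs the PSMU update in a single clock cycle, whereas the loop below merely models its effect):

```python
def init_cam(image):
    """4-pixels grouping: each CAM cell (g, h) covers the 2x2 pixel block
    at (2g, 2h) of the n x n binary image and is initialized with its
    row-major order (1-based here) if any covered pixel is 1, else 0."""
    n = len(image)
    m = n // 2
    cam = [[0] * m for _ in range(m)]
    for g in range(m):
        for h in range(m):
            if any(image[2 * g + di][2 * h + dj]
                   for di in (0, 1) for dj in (0, 1)):
                cam[g][h] = g * m + h + 1
    return cam

def psmu(cam, old, new):
    """Parallel Search and Multiple Update: rewrite every CAM cell whose
    content matches `old` with `new`. The hardware does this for CAM
    rows of any length in one clock cycle; the loop only models it."""
    for row in cam:
        for h in range(len(row)):
            if row[h] == old:
                row[h] = new
```

Grouping a 2x2 pixel block per CAM is what cuts the CAM count from n² to n²/4 while still giving every foreground pixel a label through its covering cell.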
2.2. The Processing Element (PE)
The architecture consists of n/2 - 1 PEs denoted PE_r (0 ≤ r ≤ n/2 - 2) (Fig. 1). The PEs operate in SIMD mode. Each PE possesses merely one principal function, the MinMax function. This function multiplexes two values of two adjacent CAMs in the first column, M2[0,h] and M2[0,h+1]; the larger value is multiplexed to the address bus and the smaller to the data bus (Fig. 3). These are then transferred to a global bus for the updating of 2 adjacent rows of CAMs, M2[*,h] and M2[*,h+1], or two adjacent regions, in merely one clock cycle, or O(1). We remark that a region consists of more than 2 adjacent rows of CAMs (see merging operation in section 3). In an active state, the PE is connected to the global bus; in an inactive state, it is disconnected. Activation of each PE depends on the merging operation: in the first merge, PE0, PE2, PE4, ..., PE(n/2-2) are activated whilst the other PEs are in an inactive state. In the second merge, PE1, PE5, PE9, ..., etc. are activated, in the third, PE3, PE11, PE19, ..., etc., and so forth.

2.3. The Communication Network
The communication network consists of a local bus of 2 log(n/2) bits and a global bus consisting of a data bus and an address bus, each of 2 log(n/2) bits. The local bus connects the CAMs to the PEs in the first column. However, the global bus simultaneously connects two adjacent rows or regions of CAMs to the PEs during the merging operation through a tree structure of switches. This structure is constituted of (n/4) - 1 switches, denoted SW_k (0 ≤ k ≤ n/4 - 2). Each SW_k is
constructed from unidirectional three-state buffers. The number of required switches is, in this case, optimal. This structure consists of log(n/2) - 1 stages. Thus it ensures the data transfer in O(log n) units of propagation time. The stages are activated one after the other; each switch in the active stage definitively connects 2 global buses.
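The PE's MinMax function and its activation schedule over the successive merges can be modelled as follows (a behavioural sketch; the closed-form activation rule r = 2^(m-1) - 1 stepping by 2^m is our own inference from the PE0, PE2, PE4, ... / PE1, PE5, PE9, ... / PE3, PE11, PE19, ... pattern described in the text):

```python
def pe_minmax(a, b):
    """MinMax function of a PE comparing two first-column CAM values.
    The larger label goes to the address bus (the value to be searched)
    and the smaller to the data bus (the replacement), so that after
    the PSMU update the smallest label survives in both regions."""
    return max(a, b), min(a, b)   # (address_bus, data_bus)

def active_pes(num_pes, merge):
    """Indices of the PEs active during the given (1-based) merge step:
    PE0, PE2, PE4, ... in the first merge, PE1, PE5, PE9, ... in the
    second, PE3, PE11, PE19, ... in the third, and so forth."""
    return list(range(2 ** (merge - 1) - 1, num_pes, 2 ** merge))
```

At each successive merge half as many PEs remain active, mirroring the stages of the switch tree that progressively join rows into larger regions.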
3. Connected Component Labeling Algorithm
Labeling consists in assigning a unique label to each connected component in the image whilst ensuring a different label for each distinct object [9]. In utilizing CAM, the algorithms in [8] and [10] yield a complexity of O(n²), whilst the algorithm in [14] and [15] described in this section leads to a complexity of O(n log n). This algorithm, based on a divide-and-conquer technique, is similar to that presented in [4]. Here, instead of two-directional merging, vertically and horizontally, our algorithm proceeds uniquely in the horizontal direction. After sequential loading, the initial binary image is supposed available in M1[i,j]. In this algorithm, 4-connectivity is assumed.
Initialization: First, each CAM in M2[g,h] is assigned its row-major order if at least one of its 4 inner associated adjacent pixels, i.e., M1[2g,2h], M1[2g+1,2h], M1[2g,2h+1] and M1[2g+1,2h+1] (where i = 2g and j = 2h), has a binary value 1. This can be done in O(1) time.
Row processing: 2 adjacent CAMs, M2[0,h] and M2[1,h], are then tested. If a positive non-zero value is detected in both and their 4 outer associated adjacent pixels, i.e., M1[1,2h], M1[2,2h], M1[1,2h+1] and M1[2,2h+1], are connected, the left value, M2[0,h], is propagated to the right, M2[1,h]. Otherwise, there is no operation. This operation is undertaken in parallel for each row, and is repeated at the following iterations by first shifting each row to the left. At the end of this stage, all objects in each row are labeled according to the smallest value. The complexity for this stage is O(n).
Merging: Here, 2 adjacent rows are simultaneously merged. This can be done as follows. Each active PE compares the 2 values of its adjacent CAMs in the first column, M2[0,h] and M2[0,h+1], in parallel. The smaller value is multiplexed to the data bus whilst the larger is multiplexed to the address bus. Both are transferred to the global bus in order to update all CAMs in the corresponding rows. This can be done in O(1) time (see PSMU operation in section 2.1). The 2 merged rows then become a region. The same operation is repeated at the following iterations by activating the following stage of the tree structure of switches, but this time merging 2 adjacent regions instead of 2 adjacent rows. The merging is reiterated until the last stage of the tree structure of switches. The total number of stages, corresponding to the total number of merges, reaches log(n/2) - 1. As each merge takes n/2 iterations, it follows that this phase takes O(n log n).
The algorithm requires n/2 iterations for row processing and (n/2)(log(n/2) - 1) iterations for merging, hence representing a complexity of O(n log n) with a multiplicative constant factor equal to ½. We remark that this complexity is independent of the shape of the object.
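The row-processing stage can be sketched behaviourally as follows (our own simplification: 4-connectivity between adjacent pixel groups is supplied as a boolean list, the circular left shifts of the hardware are replaced by an in-place sweep, and the smaller label is kept; in the hardware the propagated left value is already the smaller one because of the row-major initialization):

```python
def row_process(labels, connected):
    """Row-processing stage for one CAM row. labels[h] is the nonzero
    intermediate label of CAM h (0 = empty cell); connected[h] is True
    when the pixel groups of CAMs h and h+1 are 4-connected. Sweeping
    until stable leaves every object within the row carrying its
    smallest label. The hardware reaches the same fixed point in n/2
    iterations by circularly shifting the row, so that each comparison
    takes place at column 0."""
    m = len(labels)
    for _ in range(m):                      # enough sweeps to stabilize
        for h in range(m - 1):
            if labels[h] and labels[h + 1] and connected[h]:
                small = min(labels[h], labels[h + 1])
                labels[h] = labels[h + 1] = small
    return labels
```

After this stage, only inter-row label conflicts remain, which is exactly what the merging phase resolves with one PSMU update per comparison.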
4. Implementation
The labeling algorithm was tested by both software and hardware simulation before validation by FPGA implementation for a small image of 32 x 32 pixels [16]. We are currently in the process of implementing a VLSI prototype, using a hardware description language, VHDL, capable of labeling a 512 x 512 image, corresponding to 255 PEs and 256 x 256 CAMs of 16 bits. Using a 0.6 μm CMOS technology, a PE requires 986 transistors while a CAM and its 4 associated registers require 1180 transistors. For labeling a 512 x 512 image, the number of required transistors is rather cumbersome to implement as a single ASIC chip. However, the regularity of its row structure and the simplicity of its processing in one column make this architecture easy to slice into chips, with each chip containing r x 256 CAMs. For r = 16, we estimate the number of required transistors to be 5 million, instead of 4 times this amount in the non-optimized architecture presented in [14]. This number remains extremely high. This is due to the choice of a bit-parallel implementation. From a theoretical point of view, optimization may still be considered with the use of a bit-serial implementation. However, this will increase the multiplicative constant factor of the algorithm complexity as well as the processing time.
5. Discussion
For a practical binary image of 512 x 512 pixels, a 50 nsec clock cycle of M1 and a 100 nsec clock cycle of M2, the labeling time is estimated at the order of (512/2) x (log(512/2)) x 100 nsec = 204.8 μsec. Since the lapse of time between the first image and the following image is 1600 μsec, i.e., 25 lines x 64 μsec in the CCIR standard, the labeling can easily be executed in real time at a video rate of 50 frames per second. This can be done by starting the labeling just after loading a complete image, i.e., first frame and second frame in the case of interlaced mode. For a larger image, e.g., 2048 x 2048 pixels, labeling is terminated in 1024 μsec, thus real-time labeling is still possible. In Table 1, the performances of various architectures are compared. Our architecture is categorized as a one-dimensional array processor with O(n) PEs and yields a complexity of O(n log n). Compared to the other architectures, it is obvious that the proposed architecture provides an efficient trade-off among the algorithm complexity, the number of processors and the data propagation time. Through its efficient global communication and its PSMU operation, this architecture is well suited for connected component analysis, e.g., determination of area and perimeter [15], as well as for intermediate-level vision tasks, e.g., [17]. Moreover, labeling from any standard interlaced mode in one image scan is feasible by adding the possibility of configuring the switches in a linear structure (not in a tree structure). The labeling process will be done as follows.
- Labeling from an interlaced video signal: The images from lines L1,1 to L1,n/2 of the first frame and from lines L2,1 and L2,2 of the second frame are sequentially loaded into M1. During the loading of line L2,3 of the second frame, row R1 (which handles L1,1 and L2,1) and R2 (which handles L1,2 and L2,2) in M2 are merged to form the first labeled image region, denoted S1. During the loading of line L2,4, merging of R3 and S1 is undertaken to form the second region, S2. This process is iterated until line L2,n/2. The labeled image is obtained in M2 at the end of the second frame.
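The timing figures quoted in this section follow directly from the iteration count (n/2) x log2(n/2) and the M2 clock period; a small sketch reproducing them (labeling_time_us is a hypothetical helper of ours):

```python
import math

def labeling_time_us(n, m2_clock_ns=100):
    """Estimated labeling time in microseconds for an n x n image:
    n/2 row-processing iterations plus (n/2)(log2(n/2) - 1) merging
    iterations, i.e. (n/2) * log2(n/2) cycles of the M2 clock in all."""
    half = n // 2
    return half * math.log2(half) * m2_clock_ns / 1000.0
```

With a 100 nsec M2 cycle this gives 204.8 μsec for a 512 x 512 image and 1024 μsec for 2048 x 2048, matching the estimates above.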
- Labeling from a non-interlaced video signal: Lines L1, L2, L3, and L4 are sequentially loaded into M1. During the loading of line L5, row R1 (which handles L1 and L2) and R2 (which handles L3 and L4) in M2 are merged to form the first labeled image region, denoted S1. After the loading of line L6, and during the loading of line L7, merging of R3 and S1 is undertaken to form the second region, S2. This process is iterated until line Ln. The labeled image is obtained in M2 at the last line of the image. Both of these sequential algorithms take O(n²) with a multiplicative constant factor equal to 1. The use of CAM makes these sequential algorithms very simple to implement, as there are no complex manipulations of equivalence tables compared to most conventional sequential labeling algorithms.
Architecture                                         Processors   Algorithm     Data propagation time in
                                                                  complexity    communication network
1 processor (sequential approach)                    O(1)         O(n⁴)         O(1)
CLIP [2] (2-D array, local technique)                O(n²)        O(n²)         O(n)
CAAPP [10] (2-D array, global technique)             O(n²)        O(n²)         O(n)
Polymorphic-Torus [4] (2-D array, global technique)  O(n²)        O(n)          O(n)
Reconfigurable Mesh [6] (2-D array, list-traversal)  O(n²)        O(log n)      O(n²)
ACMAA [7] (1-D array, orthogonal)                    O(n)         O(n²)         O(1)
Proposed architecture (1-D array, non-orthogonal)    O(n)         O(n log n)    O(log n)

Table 1. Comparison of performances of various architectures.
6. Conclusion
A parallel architecture dedicated to connected component labeling and the evaluation of its VLSI implementation have been presented. The use of CAM has led to an efficient algorithm complexity of O(n log n) with a small multiplicative constant factor of ½. Optimization of the architecture using the 4-pixels grouping technique has led to significant advantages, namely a reduction in the number of required transistors and in processing time, real-time possibilities, and possibilities of sequential processing from any interlaced-mode video. Compared to other architectures, this architecture obtains the best performance in terms of an efficient trade-off among the algorithm complexity, the number of processors and the data propagation time. Future work will account for the simulation of the complete system as well as the reoptimization and the complete implementation of this architecture.
References
[1] H.M. Alnuweiri and V.K. Prasanna, "Parallel architectures and algorithms for image component labeling," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, no. 10, pp. 1014-1034, Oct. 1992.
[2] M.J. Duff and T.J. Fountain, Cellular Logic Image Processing, New York: Academic, 1986.
[3] H. Li and M. Maresca, "Polymorphic-Torus architecture for computer vision," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, no. 3, pp. 233-243, March 1989.
[4] M. Maresca, H. Li, and M. Lavin, "Connected component labeling on Polymorphic-Torus architecture," IEEE Int. Conf. Computer Vision and Pattern Recognition, Ann Arbor, pp. 951-956, 1988.
[5] R. Miller and V.K. Prasanna-Kumar, "Meshes with reconfigurable buses," Proc. of the 5th MIT Conf. Advanced Research in VLSI, pp. 163-178, March 1988.
[6] S. Olariu, J.L. Schwing, and J. Zhang, "Fast component labelling and convex hull computation on reconfigurable meshes," Image and Vision Computing, vol. 11, no. 7, pp. 447-455, 1993.
[7] P.T. Balsara and M.J. Irwin, "Intermediate-level vision tasks on a memory array architecture," Machine Vision and Applications, vol. 6, pp. 50-65, 1993.
[8] Y-C. Shin, R. Sridhar, V. Demjanenko, P.W. Palumbo, and S.N. Srihari, "A special-purpose content addressable memory chip for real-time image processing," IEEE Journal of Solid-State Circuits, vol. 27, no. 5, pp. 737-744, May 1992.
[9] A. Rosenfeld and J.L. Pfaltz, "Sequential operations in digital picture processing," Journal of the Association for Computing Machinery, vol. 13, no. 4, pp. 471-494, 1966.
[10] C.C. Weems, S.P. Levitan, A.R. Hanson, E.M. Riseman, D.B. Shu, and J.G. Nash, "The image understanding architecture," International Journal of Computer Vision, vol. 2, pp. 251-282, 1989.
[11] H.M. Alnuweiri, "Optimal image computations on reduced processing parallel architectures," in Parallel Architectures and Algorithms for Image Understanding (V.K. Prasanna Kumar, Ed.), Academic Press, 1991, pp. 157-183.
[12] R.J. Anderson and G.L. Miller, "Deterministic parallel list ranking," in VLSI Algorithms and Architectures: Proc. 3rd Aegean Workshop Comput., June 1986, pp. 81-90.
[13] R. Cole and U. Vishkin, "The accelerated centroid decomposition technique for optimal parallel tree evaluation in logarithmic time," Algorithmica, vol. 3, pp. 329-346, 1987.
[14] E. Mozef, S. Weber, J. Jaber, and E. Tisserand, "Architecture dédiée à l'algorithme parallèle O(n log n) d'étiquetage de composantes connexes," (in French), 3èmes Journées Adéquation Algorithme Architecture, Toulouse, France, pp. 83-89, Jan. 1996.
[15] E. Mozef, S. Weber, J. Jaber, and E. Tisserand, "Parallel architecture dedicated to connected component analysis," IEEE 13th International Conference on Pattern Recognition, Vienna, Austria, August 1996.
[16] E. Mozef, S. Weber, J. Jaber, and G. Prieur, "Parallel architecture dedicated to image component labelling in O(n log n): FPGA implementation," SPIE International Symposium on Lasers, Optics, and Vision, Besançon, France, June 1996.
[17] E. Mozef, S. Weber, J. Jaber, and E. Tisserand, "Design of Linear Array Processors with Content-Addressable Memory for Intermediate Level Vision," ISCA 9th Inter. Conf. on Parallel and Distributed Computing Systems, Dijon, France, Sept. 1996.
A 2-D window processor for modular image processing applications and its VLSI implementation
by P. Tzionas, Ch. Mizas and A. Thanailakis
Laboratory of Electrical & Electronic Materials Technology, Department of Electrical and Computer Engineering, School of Engineering, Democritus University of Thrace, GREECE

ABSTRACT
This paper presents the design and VLSI implementation of a 2-D window processor that is capable of loading specific blocks (2-D windows) of the image space, depending on a set of external control signals, onto any type of parallel architecture. The 2-D window is capable of moving along both dimensions of the image space and of providing at its output the image data stored in it. Parallel architectures for image processing usually require a number of clock cycles in order to complete the specific tasks assigned to them by the host computer. The proposed processor takes advantage of this latency by preparing and loading the necessary data for the next operation of the parallel architecture while the architecture is still operating on the data of the previous operation. This is achieved through a predictor system, capable of predicting the position of the next block of image that the parallel architecture will require, based on the history of the previous window movements. The processor was implemented in a VLSI chip using a 1 μm CMOS Double Layer Metal (DLM) technology. The area of the chip, including the pads, is 6.102 mm x 4.845 mm = 29.56 mm² and the maximum frequency of operation, obtained after loaded simulations, was found to be 41.67 MHz.

1. INTRODUCTION
Parallel architectures proposed for image processing applications are, usually, capable of processing only parts of the image space at the same time, and processing of the overall image space is achieved by a sequence of parallel operations [1,2]. However, for many image processing applications, such as edge detection [3], region growing techniques [4], nearest neighbour applications [5], etc., it is sufficient to process only specific areas of interest in the image, rather than the whole of the image space.
Thus, the capability to direct (focus) processing to specific parts of the image space is of great importance, since all available processors are utilised in the regions of interest rather than in regions of non-interest, and the overall processing time is accelerated. This paper presents the design and VLSI implementation of a 2-D window processor that is capable of loading specific blocks (windows) of the image space, depending on a set of external control signals, onto any type of parallel architecture. The 2-D window is capable of moving along both dimensions of the image space and of providing at its output the image data stored in it. Thus, the proposed processor is able to focus on specific parts of the image space. Additionally, parallel architectures for image processing usually require a number of clock cycles in order to complete the specific tasks assigned to them by the host computer. The proposed processor takes advantage of this latency by preparing and loading the necessary data for the next operation of the parallel architecture while the architecture is still operating on the data of the previous operation. This is achieved through a predictor system, capable of predicting the position of the next block of image that the parallel architecture will require, based on the history of the previous window movements. Thus, data are loaded from memory on clock cycles during which the host was normally idle, and the overall processing speed is improved while bandwidth demands, mainly parallel memory input/output operations, are reduced. The proposed processor acts as a pre-processor that interfaces a parallel architecture and a host computer, by providing an efficient memory management scheme. The processor was implemented in a VLSI chip using a 1 μm CMOS Double Layer Metal (DLM) technology.
The area of the chip, including the pads, is 6.102 mm x 4.845 mm = 29.56 mm² and the maximum frequency of operation, obtained after loaded simulations, was found to be 41.67 MHz.

2. DESCRIPTION OF THE ARCHITECTURE
For the purposes of this paper it is assumed that the dimensions of the image space are 512 x 512 pixels and that the dimensions of the 2-D window are 8 x 8 pixels. Additionally, the proposed architecture is applied to 8-bit (grey scale) image data, but it can be easily adapted to operate on 24-bit data (colour images). It is also assumed that the overall image is stored in external RAM (8-bit words, 2^18 entries). Moreover, it should be noted that a relative co-ordinate scheme is used for the window movement. Thus, the next window position is determined with respect to the current window position, rather than using an equivalent absolute co-ordinate
scheme, fixed to a specific origin. The specification of relative window movements is expected to be more economical when focusing on specific parts of the image. More specifically, a Cartesian co-ordinate system (x,y) is defined for the image such that 0 ≤ x ≤ 511 and 0 ≤ y ≤ 511.
Figure 1: Block diagram of the processor
2.1 Description of the basic units

Two 10-bit registers X, Y store the desired number of pixels that the 2-D window has to move, in the x and y directions respectively, and the direction of that movement. The data stored in the input registers are fed to the WinMove unit. This unit calculates the new data address that corresponds to the new position of the window, determined by the external MoveX, MoveY, SelectX, SelectY signals. Assuming that the window reference point lies at co-ordinates (x, y) on the image space, which corresponds to memory address X_H, window movements along the axes are implemented according to the following equation:

X'_H = X_H ± α ± (β × 512)
(1)
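As an illustration, the relative-move address calculation of equation (1) can be sketched in software. This is a behavioural model only; the function and parameter names are illustrative, not taken from the chip, and the direction flags stand in for the external SelectX/SelectY signals:

```python
IMAGE_WIDTH = 512  # the image space is assumed to be 512 x 512 pixels

def move_window(xh, alpha, beta, move_left=False, move_up=False):
    """New memory address of the window reference point after a
    relative move of `alpha` pixels horizontally and `beta` pixels
    vertically (equation 1): X'_H = X_H +/- alpha +/- beta * 512."""
    xh += -alpha if move_left else alpha
    xh += -beta * IMAGE_WIDTH if move_up else beta * IMAGE_WIDTH
    return xh
```

For example, moving the window 3 pixels right and 2 lines down from address 0 yields 3 + 2 x 512 = 1027.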
where α is the desired number of pixels for a horizontal movement, β is the desired number of pixels for a vertical window movement and X'_H is the new memory address for the window reference point. The Window Movement Control system controls window movement and flags a Bad Move signal whenever the whole or part of the window lies outside the image space. The control system checks the relative position of the 2-D window with respect to the image space and consists of two 9-bit registers, two adder/subtractor circuits, four comparators and a control unit that implements the necessary conditions which decide whether the whole or part of the window lies outside the image space. The units described in the previous paragraphs calculate the memory address for the window reference point. The window scanning mechanism WinScan calculates the addresses for the rest of the pixels that belong to the window. It must be noted that pixels lying on the same window line (same y co-ordinate) differ by one memory position. Also, the memory address of the last pixel of a specific line of the window differs by 505 memory positions from the first pixel of the next line of the window. The WinScan unit consists of a parallel 2-to-1 multiplexer, an 18-bit register, two binary up-counters, an arithmetic unit and a control unit.
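The address sequence generated by WinScan follows directly from these two offsets (same-row pixels differ by 1; the last pixel of a row and the first pixel of the next row differ by 512 - 8 + 1 = 505). A minimal sketch, with a hypothetical helper name:

```python
def window_addresses(xh, win=8, width=512):
    """Generate the 64 memory addresses of an 8x8 window whose
    reference point (top-left pixel) is at address `xh`."""
    addrs = []
    addr = xh
    for row in range(win):
        for col in range(win):
            addrs.append(addr)
            addr += 1            # pixels on the same window line differ by 1
        addr += width - win      # skip ahead to the start of the next line
    return addrs
```

With xh = 0, the 8th address is 7 (end of the first line) and the 9th is 512 (start of the second line), a difference of 505 as stated above.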
2.2 Description of the Predictor System

In order to develop a fairly general prediction scheme, capable of operating in applications where successive examination of neighbouring blocks in the 2-D image space is required and, still, simple enough to be integrated in VLSI, the prediction principle proposed in this paper can be summarised as follows: "The block most likely to be required by the next operation of the parallel architecture is determined by the history of previous window movements." The predictor system keeps track of the number of previous window movements along the axes of the 2-D grid (history of window movements) and suggests, as a prediction, the next block in the direction that is most heavily represented in the history of window movements. Four counter circuits are used to implement the window history mechanism. These counter circuits determine which one of the 8 possible neighbouring windows (both direct and indirect neighbours are included, in accordance with nearest neighbour principles [6]) should be loaded in internal memory, by counting the number of window movements along the axes of the grid according to the values of the external control signals SelectX and SelectY. The counter outputs are compared to each other in pairs, using two parallel comparator circuits. A decision on which neighbouring window is to be loaded in internal memory is reached with respect to the outputs of the comparator circuits. The data of the predicted block are stored in memory internal to the predictor unit. The internal memory system, used for storing the predicted block of image data, consists of 8 identical RAM cells. Each RAM cell can store 8 23-bit words. Thus, the 8 RAM cells are capable of storing the data for a complete 2-D window. The memory addressing scheme used for the internal RAM cells is based on the Direct Mapping method [7], which is usually used for the efficient addressing of cache memory.
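A behavioural sketch of this counter-based history mechanism is given below. The class and method names are illustrative only: the hardware uses four counters and two pairwise comparator circuits, which are modelled here with a dictionary and sign comparisons:

```python
class MovePredictor:
    """Predict the next window move direction from the history of
    previous moves, using four counters (one per half-axis)."""
    def __init__(self):
        self.counts = {'+x': 0, '-x': 0, '+y': 0, '-y': 0}

    def record(self, dx, dy):
        """Count one window movement along each axis (SelectX/SelectY)."""
        if dx > 0: self.counts['+x'] += 1
        elif dx < 0: self.counts['-x'] += 1
        if dy > 0: self.counts['+y'] += 1
        elif dy < 0: self.counts['-y'] += 1

    def predict(self):
        """Compare the counters in pairs; the dominant half-axis on each
        axis selects one of the 8 neighbouring windows (a diagonal,
        i.e. indirect, neighbour when both axes have a dominant sign)."""
        px = (self.counts['+x'] > self.counts['-x']) - (self.counts['+x'] < self.counts['-x'])
        py = (self.counts['+y'] > self.counts['-y']) - (self.counts['+y'] < self.counts['-y'])
        return px, py
```

After three recorded moves that all go right but cancel out vertically, the predictor suggests the right-hand direct neighbour, (1, 0).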
According to this method, part of the memory address used for accessing the external memory is also used to access the internal memory. The 18-bit memory address is divided into two parts: the first part contains the 3 least significant bits of the address and is called the "index", and the second part, containing the remaining 15 bits of the memory address, is called the "tag". Each 23-bit memory word stored in the internal RAM consists of two parts: the 8-bit data plus the corresponding tag (8-bit pixel data + 15-bit address tag = 23 bits). When new data are requested by the parallel architecture, a check is performed to decide whether the prediction was successful. In the case of a successful prediction, data are loaded to the architecture from internal memory. Otherwise, a request is made to load data from external memory. The main advantage of the proposed prediction system is that internal memory has a considerably faster access time than external memory and uses a more efficient direct-access addressing method. Additionally, the width of the data bus for data transfer between the proposed pre-processor and the parallel architecture can be varied in order to obtain very high data transfer rates.
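The direct-mapped hit/miss check described above can be modelled in a few lines. The class below is a hypothetical software analogue of one internal RAM cell, splitting the 18-bit address into the 3-bit index and 15-bit tag as stated:

```python
INDEX_BITS = 3  # the 3 least significant bits of the 18-bit address

class DirectMappedCell:
    """One internal RAM cell: 8 words of (15-bit tag, 8-bit data)."""
    def __init__(self):
        self.lines = [None] * (1 << INDEX_BITS)

    def store(self, address, data):
        index = address & 0b111          # 3-bit index selects the word
        tag = address >> INDEX_BITS      # remaining 15 bits form the tag
        self.lines[index] = (tag, data)

    def load(self, address):
        index = address & 0b111
        tag = address >> INDEX_BITS
        entry = self.lines[index]
        if entry is not None and entry[0] == tag:
            return entry[1]              # hit: the prediction was successful
        return None                      # miss: request data from external memory
```

Two addresses that share the same 3-bit index but differ in the tag (e.g. 42 and 50) map to the same line, so the stored tag is what distinguishes a successful prediction from a miss.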
2.3 VLSI Implementation

The proposed processor was implemented in a VLSI chip using a 1 µm CMOS Double Layer Metal (DLM) technology. The CADENCE CAD tools for VLSI design were used for the implementation of the chip and the Standard Cells design methodology was adopted, using the CMOS libraries provided by European Silicon Structures (ES2). The area of the chip, including the input/output pads, is 6.102 mm x 4.845 mm = 29.56 mm² and the maximum frequency of operation, obtained after loaded simulations, was found to be 41.67 MHz.
3. CONCLUSIONS

The design and VLSI implementation of a 2-D window processor, capable of loading specific blocks (2-D windows) of the image space, depending on a set of external control signals, onto any type of parallel architecture, was presented in this paper. The 2-D window is capable of focusing on specific parts of the image space and of providing at its output the image data stored in it. A predictor system, capable of predicting the position of the next block of image data that the parallel architecture will require, based on the history of the previous window movements, was also implemented and integrated in the proposed processor. The processor was implemented in a VLSI chip using a 1 µm CMOS Double Layer Metal (DLM) technology. The area of the chip, including the pads, is 6.102 mm x 4.845 mm = 29.56 mm² and the maximum frequency of operation, obtained after loaded simulations, was found to be 41.67 MHz.
REFERENCES
[1] K. Hwang (1987), 'Advanced parallel processing with supercomputer architectures', Proceedings of the IEEE, vol. 75, pp. 1348-1379.
[2] P. Tzionas, Ph. Tsalides and A. Thanailakis (1992), 'Design and VLSI implementation of a pattern classifier using Pseudo-2D Cellular Automata', IEE Proceedings-G, vol. 139, no. 6, pp. 661-668.
[3] A.K. Jain (1989), 'Fundamentals of digital image processing', Prentice Hall Information and System Sciences Series, ch. 9, New Jersey.
[4] A. Waks and O. Tretiak (1990), 'Robust detection of region boundaries in a sequence of images', Proceedings of the 10th International Conference on Pattern Recognition, New Jersey, USA, IEEE Computer Society Press, pp. 947-952.
[5] B.V. Dasarathy (ed.) (1991), 'Nearest Neighbor (NN) norms: NN Pattern Classification techniques', IEEE Computer Society Press.
[6] T.P. Yunck (1976), 'A technique to identify Nearest Neighbors', IEEE Transactions on Systems, Man and Cybernetics, vol. SMC-6, no. 10, pp. 678-683.
[7] J.E. Smith and J.R. Goodman (1985), 'Instruction cache replacement policies and organizations', IEEE Transactions on Computers, vol. C-34, no. 3, pp. 234-241.
Due to unavoidable circumstances the next paper, which belongs in session H, is printed at the end of this book.
Rate conversion of compressed video for matching bandwidth constraints in ATM networks

Pedro Assunção* and Mohammad Ghanbari
Dep. of ESE, University of Essex, Wivenhoe Park, Colchester CO4 3SQ, England
Tel: +44 1206 872448; fax: +44 1206 872900
e-mail: paaass@essex.ac.uk, ghan@essex.ac.uk
Abstract
A system capable of dynamically matching compressed video bit streams to ATM multiplex channels is needed, and no general solution to this requirement has been devised so far. Open-loop schemes, such as simple requantisation of the compressed bit stream or dropping of higher-frequency coefficients, are not suitable for this task, since these methods cause drift. In the present work we show that drift can lead to a significant drop in quality after only a small number of predicted frames. Hence we derive a drift-free rate converter, working entirely in the frequency domain and capable of meeting ATM transmission demands. The low complexity and low delay of our rate converter, combined with its drift-free property, make it a perfect system for dynamic matching of bandwidth constraints in ATM networks.
1 Introduction
The majority of the traffic in future broadband networks is expected to come from video and multimedia services [1]. It is also expected that most video services will be based on the MPEG2 standard [2], many of them using recorded bit streams. The characteristics of the forthcoming video services demand a new level of adaptation between video sources and transmission networks, or even between different networks. This level should be capable of dealing with compressed bit streams in order to match the network requirements to the coded video. When compressed video is recorded, the characteristics of the channel through which it will be transmitted are assumed to be known beforehand. Therefore a great lack of flexibility arises in the transmission of these streams when channels of diverse characteristics are used. If the same video programme is to be transmitted to several users through channels with different capacities, the service provider needs to keep several copies of that programme, each one encoded according to the corresponding channel characteristics. In ATM networks there is always a possibility of overflow in the switch buffers, which leads to cell loss. Compressed video is very vulnerable to any loss, thus any traffic flow control capable of reducing network congestion is highly desirable. Reactive congestion control methods presented in the literature rely on the assumption that a real-time encoder is available for responding to network demands [3, 4]. However, not all video bit streams come directly from real-time encoders; video servers and workstations, for instance, are examples where pre-encoded video is used and the flow control has to act dynamically on the compressed bit stream. Another environment where rate control of compressed video is needed is video multicasting, where either heterogeneous networks or links with different capacities are used for transmission of the same bit stream.
It is fair to expect that the quality of the video received by each end-user should depend only on the constraints of its own path. In order to achieve such a goal, one should provide means of rate control at the switching nodes [5]. This enables the nodes with limited resources to transcode the input bit stream into lower rates according to their output link constraints. This type of rate control is mainly static, since it depends more on the characteristics of the output link, which do not change during the connection, than on traffic congestion, which is due to buffer overflow in the switch. In this paper we show that it is only necessary to keep a single copy of the compressed video at its highest possible quality and still be able to cope with all possible transmission constraints. By transcoding the main bit stream into lower bit rates with a minimum delay, any bandwidth constraint can be met. The transcoder,

* This work has been supported by Instituto Politécnico de Leiria and Programa PRAXIS XXI - JNICT, Portugal
working as a rate converter, is capable of responding to network demand in the shortest possible time to prevent cell loss. In previous work we derived a low-delay post-processing system for transmission of coded video at lower bit rates [6]. In this paper we report on a modification that simplifies the transcoder further by avoiding the use of the Discrete Cosine Transform (DCT) and its inverse (IDCT). By extending the method developed in [7] for multiplexing multiple video bit streams into a single bit stream, we have developed a generic rate converter working entirely in the frequency domain.
2 Matching coded video to network demand

A generic rate converter should accept a standard compressed bit stream and produce another standard bit stream at a lower rate. Therefore syntax compatibility at both input and output bit streams is mandatory. Also, pictures decoded from the converted bit stream should have nearly the same quality as if they were originally encoded at the converted bit rate. Such a system should also introduce very low delay in order to be capable of dynamically matching the compressed bit streams to any constrained channel. Figure 1 depicts the generic environment where the rate converter is employed. In the figure we consider the video source to be any system transmitting video bit streams, whether on-line encoders, video servers or workstations. Furthermore, we do not make any specific assumption about the input and output channels to the rate converter (A, B). They may accept constant or variable bit rate video. The rate converter can even be part of the video source (no A channel), or can be placed anywhere along the transmission path.
Figure 1: Generic video communication system

The amount of bandwidth reduction of the input bit stream is dictated by an external signal 0 < S(t) < 1, which can be either a dynamic network feedback signal (e.g. reactive congestion control) or a static signal for a constant reduction of the input bit rate. The input and output rates are given by equations 1 and 2.
R_in(t) = R_c(t) + R_v(t)    (1)

R_out(t) = R_c(t) + S(t) · R_v(t)    (2)
where R_c(t) and R_v(t) are the non-reducible and variable parts of the input bit stream R_in(t) respectively, and R_out(t) is the total output rate. Since we do not consider new motion estimation for the rate conversion, R_c(t) comprises all the header information as well as that of the motion vectors, whereas R_v(t) comprises the bits used to code the transform coefficients.
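Equations (1)-(2) amount to scaling only the reducible part of the rate by the control signal. A one-line sketch (the function name is illustrative):

```python
def output_rate(r_c, r_v, s):
    """Converted output rate per equations (1)-(2):
    R_in = R_c + R_v and R_out = R_c + S * R_v.
    R_c (headers, motion vectors) is non-reducible; only the
    coefficient bits R_v are scaled by the control signal S."""
    return r_c + s * r_v
```

For instance, with R_c = 2 Mbit/s of headers and motion vectors, R_v = 6 Mbit/s of coefficients and S = 0.5, the output rate is 5 Mbit/s rather than half of the 8 Mbit/s input, because the non-reducible part passes through unchanged.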
3 Rate conversion of coded video
For bit streams generated by the standard codecs, simple rate conversion of compressed video into lower bit rates can be done by requantisation of the DCT coefficients or by just dropping some of them. However, this will lead to picture drifting distortions, because the decoded pictures are different from those used as predictions at the encoder [8]. Therefore the ideal rate converter would decode the bit stream into reconstructed pixels and re-encode them again. This inevitably increases the cost and delay of transcoding enormously. We propose the scheme depicted in figure 2 for rate conversion of coded video. The basic structure can be derived from a cascaded decoder-encoder, as we have demonstrated in previous work [6]. In this paper we have implemented the motion compensation in the DCT domain. This further simplifies the transcoder by not employing DCT and IDCT modules. In this rate converter the only processing delays are those introduced by VLC decoding and encoding, quantisation and motion compensation in the DCT domain. Since these delays are only dependent on the implementation of the respective functions, they can be limited to negligible values. Generally two buffers are needed in the feedback loop of the rate converter (buffer 1 and buffer 2 of fig. 2), simply because of the interpolated frames (B). Since B frames are not used as predictions, by accepting some temporary distortions on these frames one can reduce the cost of the system by using only one prediction buffer. Using only one buffer still prevents drift through predicted frames (P), which are the important ones, since intra frames (I) reset the error accumulation.
Figure 2: Rate converter for coded video
3.1 MC in DCT domain
In general, when a data block is predicted from a previous frame it does not match the block structure of that frame, unless the motion vectors are multiples of the block size. Therefore one block in the current frame can cover an area of 4 adjacent blocks in the previous frame. When operating in the pixel domain there is no need to keep track of the frame block structure. However, in the DCT domain it cannot be avoided, because of the inherent block structure of the DCT in most video coding algorithms. In order to reconstruct a predicted block in the DCT domain from the 4 DCT blocks that contribute to the prediction, the overlapping sub-blocks have to be extracted [7]. The extraction of these sub-blocks for reconstruction of the predicted block B_R can be done by applying equation 3 in the pixel domain,

B_R = Σ_{i=1}^{4} H_{i1} · B_i · H_{i2}    (3)
where B_i are the overlapping blocks in the previous frame, and H_{i1}, H_{i2} are fixed matrices dependent on the size of each sub-block. Also, since the DCT transform is distributive over matrix multiplication, the transform block T(B_R) can be obtained from equation 4. Note that, since the H_{i1}, H_{i2} matrices are fixed, their transforms can be pre-computed and stored in memory.

T(B_R) = Σ_{i=1}^{4} T(H_{i1}) · T(B_i) · T(H_{i2})    (4)
In case the block is aligned either horizontally or vertically, only 2 rather than 4 blocks are needed. Furthermore, for perfect alignment (i.e. both horizontal and vertical) equation 4 simplifies to T(B_R) = T(B_i). Since motion compensation in MPEG2 is performed with half-pixel accuracy, this method has to be applied to the shifted blocks too. The final block is obtained by averaging the two extracted blocks. Recent work has shown that very high computational savings can be achieved by applying motion compensation in the frequency domain [9].
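The distributivity argument behind equation 4 can be checked numerically. The sketch below, assuming an orthonormal 8x8 DCT-II and a purely vertical shift (horizontally aligned, so only two blocks contribute and H_{i2} = I), is illustrative rather than the paper's implementation:

```python
import numpy as np

N = 8  # DCT block size

def dct_matrix(n=N):
    """Orthonormal DCT-II matrix C; the block transform is T(B) = C B C^T."""
    k, i = np.mgrid[0:n, 0:n]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

C = dct_matrix()

def T(b):
    return C @ b @ C.T

def shift_matrices(s, n=N):
    """Fixed H matrices extracting the sub-blocks for a vertical shift
    of s rows: the bottom n-s rows of B1 and the top s rows of B2."""
    h1 = np.zeros((n, n))
    h2 = np.zeros((n, n))
    for r in range(n - s):
        h1[r, r + s] = 1.0           # rows s..n-1 of B1 -> top of B_R
    for r in range(n - s, n):
        h2[r, r - (n - s)] = 1.0     # rows 0..s-1 of B2 -> bottom of B_R
    return h1, h2

rng = np.random.default_rng(0)
B1, B2 = rng.random((N, N)), rng.random((N, N))
H1, H2 = shift_matrices(3)

BR = H1 @ B1 + H2 @ B2                   # equation 3 in the pixel domain
TBR = T(H1) @ T(B1) + T(H2) @ T(B2)      # equation 4 in the DCT domain
assert np.allclose(T(BR), TBR)           # the two computations agree
```

The identity holds because C is orthogonal, so T(H B) = C H B C^T = (C H C^T)(C B C^T) = T(H) T(B); as the text notes, T(H1) and T(H2) can be pre-computed since the H matrices are fixed.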
4 Performance
The main problem of rate conversion of compressed video is the drift in the decoded pictures. In fact, if there is any mismatch between the encoder and decoder loops, the quality of predicted pictures will decrease continuously. This is due to the error accumulation in the decoder. In order to evaluate the drift-free property of the rate converter presented in this work, we compare its output to those of a single-stage encoder and an open-loop requantiser, all three systems working at the same quantisation step size. One GOP with 30 frames of the MOBILE sequence was VBR encoded using a fixed quantiser step of Q1 = 8. Then the resultant bit stream was requantised and converted with the scheme of figure 2, with Q2 = 14, which reduces the bit rate by about 36%. Figure 3 shows the PSNR of the decoded video for the 3 cases: encoding the original frames using Q1 = Q2 = 14, open-loop requantisation and rate conversion, both using Q2 = 14. The effect of drift in open-loop requantisation is very obvious, whereas the difference between the rate converter and a single-stage encoder is roughly constant at about 1 dB along the whole sequence. Note that this is the worst case, since we are comparing a system which encodes the original video (single-stage encoder) with one which transcodes compressed video. Even so, the difference is not significant.
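For reference, the open-loop requantisation used as the baseline can be sketched per coefficient under a simplified uniform-quantiser model (not the exact MPEG2 quantiser, which also involves weighting matrices):

```python
def requantise(level, q1, q2):
    """Open-loop requantisation of one coefficient: reconstruct with the
    original step q1, then quantise with the coarser step q2.  Used alone
    this causes drift; the proposed converter adds DCT-domain motion
    compensation around it to cancel the encoder/decoder mismatch."""
    return round(level * q1 / q2)
```

Under this model a level of 10 coded with q1 = 8 becomes round(80 / 14) = 6 when requantised with q2 = 14; the rounding error of such steps is what accumulates through predicted frames in the open-loop case.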
Figure 3: Performance of the rate converter
5 Conclusions
We have described a rate conversion scheme for matching bandwidth constraints in ATM networks. By implementing the motion compensation operation in the DCT domain we have designed a new system working entirely in the frequency domain. We have shown that no drift is introduced with our rate conversion system and that the quality of the transcoded pictures is very close to that of pictures encoded from the original video. Furthermore, the delay introduced by the rate conversion process is low, which is a mandatory requirement for any system working in an ATM environment. Applications of such a system include reactive congestion control for pre-encoded video, video multicasting and video on demand.
References

[1] H. J. Stuttgen, "Network evolution and multimedia communication," IEEE Multimedia, vol. 2, pp. 42-59, Fall 1995.
[2] ISO/IEC 13818-2, "Generic coding of moving pictures and associated audio, Recommendation H.262," March 1994.
[3] Y. Omori, T. Suda, G. Lin, and Y. Kosugi, "Feedback-based congestion control for VBR video in ATM networks," in Sixth International Workshop on Packet Video, (Portland, Oregon, USA), September 1994.
[4] H. Kanakia, P. P. Mishra, and A. R. Reibman, "An adaptive congestion control scheme for real time packet video transport," IEEE/ACM Transactions on Networking, vol. 3, pp. 671-682, December 1995.
[5] P. Assuncao and M. Ghanbari, "Multi-casting of MPEG-2 video with multiple bandwidth constraints," in 7th International Workshop on Packet Video, (Brisbane, Australia), March 1996.
[6] P. Assuncao and M. Ghanbari, "Post-processing of MPEG2 coded video for transmission at lower bit rates," in IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, (Atlanta, USA), pp. 1999-2002, May 1996.
[7] S.-F. Chang and D. G. Messerschmitt, "Manipulation and compositing of MC-DCT compressed video," IEEE Journal on Selected Areas in Communications, vol. 13, pp. 1-11, January 1995.
[8] D. G. Morrison, M. E. Nilsson, and M. Ghanbari, "Reduction of bit-rate of compressed video while in its coded form," in Sixth International Workshop on Packet Video, (Portland, Oregon, USA), pp. 392-406, September 1994.
[9] N. Merhav and V. Bhaskaran, "A fast algorithm for DCT-domain inverse motion compensation," in International Conference on Acoustics, Speech, and Signal Processing, (Atlanta, Georgia, USA), May 1996.
Index of authors Abdi, H., 415 Acheroy, M., 329 Aguilar, P.L., 133 Ahmadi, M., 679 Ait-Boudaoud, D., 223 Aitsab, O., 3 Alberola-Lopez, C., 195 Alcock, R.J., 637 Amoroux, T., 333 Ancona, N., 577 Antoine, J.-P., 53, 65 Arp, F., 219 Asada, M., 257 Asselin de Beauville, J.P., 419 Assunr P., 701 Baginyan, S.A., 105 Bailey, N.J., 679 Ballarin, V.L., 625 Ballosino, N., 205 Baofen, Z., 127 Bardos, A.J., 317 Bartkowiak, M., 27 Battle, J., 299 Bauer, S., 245 Belbachir, M.F., 511 Bercebel, G., 69 Besserer,B., 593 Bi, D., 419 Birecik, S., 353 Bischoff, R., 299 Bock, F., 215 Bojkovic, Z., 541 Bolsens, I., 521 Bormans, J., 521 Bortolozzi, F., 479 Bosse, E., 377 Bouattoura, D., 461 Boukir, S., 593 Boukouvalas, C., 611 Bouridane, A., 23, 633 Brunetta, S.G., 577 Buisson, O., 593 Butler, D., 15 Campell, D.C., 119 Camposs, C.F.J., 665 Canagarajah, N., 183 Carrion, R.G., 337 Casar-Corredera, J.R., 195 Casas, A., 337
Cavagnino, D., 205 Cazaguel, G., 11 Cernadas, E., 337 Cetiner, B., 187 Chang, K., 397 Chen, H., 227 Chen, S., 7 Chi, S.-Y., 295 Chitti, Y., 589 Chiuderi, A., 365 Choi, T.W., 439 Christopoulos, C.A., 557 Cieplinski, L., 443 Cirredu, M., 209 Conci, A., 665 Cooklev, T., 69 Cornelis, J. 427, 557 Cuaj, Z., 267 Cuhadar, A., 37 Cziho, A., 11
Dabrowski, A., 81 Danya, Y., 127 Davidson, T.N., 385 Davis, C.J., 573 De Natale, F.G.B., 209 Deprettere, E.F., 525 D'Haeyer, J.P.F., 661 Dickin, F.J., 647 Dill, J.C., 393 Djebbari, A., 511 Djebbari, AI., 511 Domanski, M., 27, 277 Donlagic, D., 267 Doulamis, A., 155, 561 Doulamis, N., 561 Downtown, A.C., 37 Draye, J.-P., 115 Drew, M.S., 607 Dueck, J., 607 Dugelay, J.-L., 569 Durrani, T.S., 119 Dyakowski, T., 647
Edwards, M.D., 43 Efstratiadis, S.N., 341 Engels, M., 521 Eranosian, A.V., 253 Erwan, L., 249
706 Facon, J., 479 Faez, K., 361 Fantini, J., 111 Farah, B.D., 629 Fazekas, K., 173, 545 Fettweis, G., 517 Finlayson, G.D., 607 Fraser, D.A., 585 Fukuda,K., 281 Funt, B.V., 607 Fyfe, C., 141 Gaillard, P., 461 Garcia-Campos, R., 299 Garcia-Garduna, V., 447 Garcia-Ugalde, F., 447 Gatica-Perez, D., 447 Gautama, S., 661 Gerken, P., 27 Ghanbari, M., 549, 701 Ginsto, D.D., 209 Girolami, M., 141 Gogan, P., 589 Gomez, L., 337 Gonzalez, M., 625 Gonzalez-Palenzuela, E.S., 641 Gotlib, A., 603 Goulermas, J.Y., 169 Goulermas, Y.P., 347 Grgic, M., 245 Guillaume, M., 333 Hamada, T., 455 Hamamoto, K., 325 Hancock, E., 653 Hara, S., 409 Haris, K., 341 Harrington, J., 633 Hayasaka, R., 231 Hayes, G.M., 123 He, Z., 7 Heidari, S., 389 Hekstra, G.J., 525 Hel-Or, H., 603 Heusdens, R., 525 Hosticka, B.J., 475 Hostika, B.J., 451 Hoyle, B.S.; 679 Huet, B., 653 Hussain, A., 119
Ibatici, K., 137 Ibrahim, M.K., 531 Ivanov, V.V., 89, 97 Jabar, J., 691 Jagadeesh Kumar, V., 469 Jaitly, R., 585 Jedrzejek, C., 271,443 Jin, Q., 377, 381 Jiang, J., 15 Jiang, L., 669 Jones, W.W., 393 Joyeux, L., 593 Jun, W., 147 Jurkovic, F., 267 Juuso, E., 137, 423 Kang, D.W., 235 Karamyan, H.L., 253 Karras, D.A., 191 Karkanis, S.A., 191 Kaskalis, T.H., 687 Kato, K., 161 Kawanaka, A., 281 Keren, D., 603 Kettaf, F.Z., 419 Kida, T., 489, 507 Kida, Y., 489 Kim, C.R., 439 Kim, H., 263 Kim, J.C., 439 Kinsner, W., 405 Kisel, I., 101 Kivikunnas, S., 137 Kollias, S., 155, 561 Kollias, S.D., 31 Komatsu, T., 581,673 Konotopskaya, E., 101 Kouboulis, F.N., 493 Kovalenko, V., 101 Kovesi, M., 173,545 Koziris, N., 433 Kroupnova, N., 599 Kurugollu, F., 353 Lado, M.J., 465 Lafruit, G., 521 Lakany, H.M., 123 Lam, C.L., 165 Lamure, M., 619 Langevin, F., 461
707 Langi, A., 405 Lecornu, L., 271 Lemanhieu,I., 329 Levy, J.B., 369 Li, C.C., 263 Li, W., 241 Li, Y., 227 Liatsis, P., 169, 347 Lihui, P., 127 Lin, X., 397 Lina, J.-M., 61 Louchet, J., 669 Lovanyi, I., 11 Lu, C.-Y., 19 Luo, Z.M., 377 Maglaveras, N., 341 Mann, R., 647 Margaritis, K.G., 687 Marino, F., 311 Markovic, M., 541 Martinez, P., 133 Mastonardi, G., 311 Matsuda, T., 409 Matsumoto, S., 455 Matsushita, Y., 231 Medvecky, M., 683 Mendez, A.J., 465 Mertens, M., 427 Mertzios, B.G., 191,347, 357, 493 Mikulec, J., 177 Milovanovic, D., 285 Min, W., 147 Mitzias, D., 357 Mizas, C., 695 Mizuno, K., 321 Mkrttchian, V.S., 253 Moler, E., 625 Monari, M., 525 Moon, K.-A., 295 Moreno, J., 133 Morgan, S., 23 Morinaga, N., 409 Morita, S., 199 Morris, D.T., 43 Morris, T., 303 Mouhoub, S., 619 Mozef, E., 691 Murtovaara, S., 423 Nakajima, M., 497 Nakazawa, Y., 673 Namazi, M., 361
Newell, Z., 303 Neyt, X., 329 Nicoloyannis, N., 619 Nikias, C.L., 389 Nishimura, T., 325 Nixon, M.S., 573 Oh, W.-G., 295 Ohta, T., 231 Ososkov, G.A., 105 Otake, T., 281 Ovsenik, L., 173,545 Paindavoine, M., 415 Panagiotidis, N.G., 31 Panas, S.M., 307 Papademitriou, R.C., 501 Papakonstantinou, G., 433 Pappas, C., 341 Parhi, K.K., 535 Park, J.-W., 295 Paul, J.S., 469 Pavisic, D., 115 Perez, A.M., 133 Pes, P., 209 Pessana, F., 625 Petrou, M., 611 Pham, D.T., 187, 637 Philips, W., 557 Pichler, O., 451,475 Planinsic, P., 267 Polidori, E., 569 Porter, R., 183 Potocnik, B., 151 Prabhakar, R., 241 Pyndiah, R., 3 Qiu, G., 7, 553 Qizhi, D., 147 Reddy, M.R.S., 469 Refregier, P., 333 Ricny, V., 177 Rigas, A.G., 483 Roche, S., 569 Rodriguez, P.G., 337 Rouvaen, J.M., 511 Roux, C., 11 Ruiz-Alzola, J., 195 Ryu, D.H., 439
708 Sahli, H., 427 Saito, T., 581,673 Sampson, D.G., 37, 549 Sandic, D., 285 Sangwine, S.J., 317, 615 Sankur, B., 353 Santos, S.G., 493 Sarantis, D., 657 Sawada, K., 257 Sawai, H., 321 Scarpetis, M.G., 493 Schempp, W., 73 Sezgin, M., 353 Shao, G.-F., 161 Shiina, T., 325 Shimazu, Y., 231 Silva, A., 133 Silva, E. da, 549 Skodras, A.N., 557 Solaiman, B., 3, 11 Soraghan, J.J., 119 Souto, M., 465 Spiliotis, I.M., 347 Strintzis, M.G., 289 Sutinen, R., 423 Swierczynski, R., 277 Tahoces, P.G., 465 Takahashi, H., 497 Tamaki, A., 161 Tanaka, M., 199 Teoh, C., 369 Teran, R., 115 Teuner, A., 451,475 Thanailakis, A., 695 Thornton, A.L., 615 Tisserand, E., 691 Tolias, Y.A., 307 Torres, S., 625 Tsanakas, P., 433 Tsapatsoulis, N., 155
Tsiodras, A., 561 Turan, J., 173,545 Tzionas, P., 695 Tzovaras, D., 289 Vandergheynst, P., 65 Vanhoof, B., 521 Vega-Cruz, P.I., 641 Venetsapoulos, A.N., 69 Villon, P., 461 Walter, H., 215 Wang, C.-Y., 535 Wang, M., 647 Watabe, K., 321 Watanabe, S., 581 Weber, S., 691 Weiss, M., 517 Wen, K.A., 19 Wickerhauser, M.V., 47 Wilde, M., 215 Wong, K.M., 377,381,385 Wu, J., 381 Wu, X., 227 Xu, C., 535 Xydeas, C.S., 657 Yang, F., 415 Yao, F.-H., 161 Yuasa, K., 321 Yuen, S.Y., 165 Zazula, D., 151 Zhao, J., 231 Zhijie, X., 127 Zovko-Cihlar, B., 245 Zrelov, P.V., 97