ABSTRACT

Title of Document: HIGH-PERFORMANCE 3D IMAGE PROCESSING ARCHITECTURES FOR IMAGE-GUIDED INTERVENTIONS

Omkar Dandekar, Ph.D., 2008

Directed By: Professor Raj Shekhar (Chair/Advisor), Professor Shuvra S. Bhattacharyya (Co-advisor), Department of Electrical and Computer Engineering

Minimally invasive image-guided interventions (IGIs) are time- and cost-efficient, minimize unintended damage to healthy tissues, and lead to faster patient recovery. Advanced three-dimensional (3D) image processing is a critical need for navigation during IGIs. Achieving the on-demand performance that IGIs require, however, is challenging for software-only implementations because of the sheer size of 3D images and the memory- and compute-intensive nature of the operations. This dissertation is therefore geared toward developing high-performance 3D image processing architectures that will enable improved intraprocedural visualization and navigation capabilities during IGIs. In this dissertation we present an architecture for real-time implementation of 3D filtering operations that are commonly employed for preprocessing of medical images. This architecture is approximately two orders of magnitude faster than corresponding software implementations and is capable of processing 3D medical images at their acquisition speeds. Combining complementary information through registration between pre- and intraprocedural images is a fundamental need in the IGI workflow. Intensity-based deformable registration, which is completely automatic and locally accurate, is a promising approach to achieving this alignment. These algorithms, however, are extremely compute intensive, which has prevented their clinical use. We present an FPGA-based architecture for accelerated implementation of intensity-based deformable image registration.
This high-performance architecture achieves over an order of magnitude speedup compared with a corresponding software implementation and reduces the execution time of deformable registration from hours to minutes while offering comparable image registration accuracy. Furthermore, we present a framework for multiobjective optimization of finite-precision implementations of signal processing algorithms that takes into account multiple conflicting objectives, such as implementation accuracy and hardware resource consumption. The evaluation that we have performed in the context of FPGA-based image registration demonstrates that such an analysis can be used to enhance automated hardware design processes and to efficiently identify a system configuration that meets given design constraints. In addition, we outline two novel clinical applications that can directly benefit from these developments and demonstrate the feasibility of our approach in the context of these applications. These advances will ultimately enable integration of 3D image processing into the clinical workflow.

HIGH-PERFORMANCE 3D IMAGE PROCESSING ARCHITECTURES FOR IMAGE-GUIDED INTERVENTIONS

By Omkar Dandekar

Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, 2008

Advisory Committee: Professor Raj Shekhar, Chair/Advisor; Professor Shuvra S. Bhattacharyya, Co-advisor; Professor Rama Chellappa; Professor Manoj Franklin; Professor Yang Tao

© Copyright by Omkar Dandekar 2008

Dedication

To my Mother, Mangal, Himani, and the memories of my late Father, for their love and support

Publications

• W. Plishker, O. Dandekar, S. Bhattacharyya, and R. Shekhar, "Towards a heterogeneous medical image registration acceleration platform," IEEE Transactions on Biomedical Circuits and Systems, (in preparation), 2008.
• R. Shekhar, O. Dandekar, V. Bhat, R. Mezrich, and A. Park, "Development of CT-guided minimally invasive surgery," Surgical Innovation, (in preparation), 2008.
• O. Dandekar, W. Plishker, S. S. Bhattacharyya, and R. Shekhar, "Multiobjective optimization for reconfigurable implementation of medical image registration," International Journal of Reconfigurable Computing, (under review), 2008.
• P. Lei, O. Dandekar, D. Widlus, P. Malloy, and R. Shekhar, "Incorporation of PET into CT-guided liver radiofrequency ablation," Radiology, (under revision), 2008.
• O. Dandekar, W. Plishker, S. S. Bhattacharyya, and R. Shekhar, "Multiobjective optimization of FPGA-based medical image registration," presented at IEEE Symposium on Field-Programmable Custom Computing Machines, 2008.
• O. Dandekar and R. Shekhar, "FPGA-accelerated deformable image registration for improved target-delineation during CT-guided interventions," IEEE Transactions on Biomedical Circuits and Systems, vol. 1 (2), pp. 116-127, 2007.
• O. Dandekar, C. Castro-Pareja, and R. Shekhar, "FPGA-based real-time 3D image preprocessing for image-guided medical interventions," Journal of Real-Time Image Processing, vol. 1 (4), pp. 285-301, 2007.
• W. Plishker, O. Dandekar, S. Bhattacharyya, and R. Shekhar, "Towards a heterogeneous medical image registration acceleration platform," presented at IEEE Biomedical Circuits and Systems Conference, 2007, pp. 231-234.
• O. Dandekar, K. Siddiqui, V. Walimbe, and R. Shekhar, "Image registration accuracy with low-dose CT: How low can we go?," presented at IEEE International Symposium on Biomedical Imaging, 2006, pp. 502-505.
• C. R. Castro-Pareja, O. Dandekar, and R. Shekhar, "FPGA-based real-time anisotropic diffusion filtering of 3D ultrasound images," in SPIE Real-Time Imaging, 2005, pp. 123-131.
• S. Venugopal, C. R. Castro-Pareja, and O. Dandekar, "An FPGA-based 3D image processor with median and convolution filters for real-time applications," in SPIE Real-Time Imaging, 2005, pp. 174-182.
Acknowledgements

I would like to express my sincere gratitude to Dr. Raj Shekhar for his guidance and financial support throughout my graduate education at The Ohio State University, the Cleveland Clinic, and the University of Maryland. He has been the perfect mentor for my doctoral research, and a person from whom I have learnt a lot during the past five years. He not only ensured that I mastered the intricacies of medical image processing, but also put a strong emphasis on developing the qualities necessary for effective dissemination of scientific research and results. Beyond doubt, he has played the most important role in shaping my technical writing and presentation skills. Throughout my graduate career he has made himself available at any moment when I needed his input and feedback on my work or anything else. The time spent working with Dr. Shekhar has truly been the most rewarding career experience of my life. I would also like to thank my dissertation committee members, Prof. Shuvra Bhattacharyya, Prof. Rama Chellappa, Prof. Manoj Franklin, and Prof. Yang Tao, for their cooperation and support. I would especially like to thank Prof. Bhattacharyya for his help and collaboration in the work that resulted in an important chapter of this dissertation. The research work reported in this dissertation was partly supported by the U.S. Department of Defense (TATRC) under grant DAMD17-03-2-0001. My research at the University of Maryland would not have been possible without the support and encouragement of Drs. Rueben Mezrich, Adrian Park, Eliot Siegel, Khan Siddiqui, Nancy Knight, Faaiza Mahmoud, Steve Kavic, and all clinical staff at the University of Maryland and the Baltimore VA. Whenever I requested, they spared valuable time from their busy schedules for discussions with me, and they have been immensely helpful, especially during the clinical validation studies. Dr. Siddiqui, in particular, was instrumental in providing clinical perspective on some of the research problems I have explored. I would like to thank Prof. Jogikal Jagadeesh for allowing me to work in his lab during my first two years at The Ohio State University, and for his crucial guidance during the early stages of my graduate education. I am also thankful to Dr. Carlos R. Castro-Pareja, Dr. Vivek Walimbe, Dr. William Plishker, Dr. Jianzhou Wu, Peng Lei, and Venkatesh Bhat from Dr. Shekhar's research group for providing valuable help and input at various times during my research. My parents have always been the biggest source of inspiration in my life. They have always stressed the importance of education and instilled in me the virtues of honest and dedicated effort, for which I will forever be indebted to them. Their love and constant encouragement have been an important driving force throughout my life. I would like to thank my sister, Mangal, for her kind words of encouragement from time to time during the last few years. I would like to especially mention my long-time friends Mukta, Prashant, Sandip, Siddharth, Rahul, Rakhi, and Vinayak, for always being there with me. Last, and most importantly, I would like to thank Himani, my wife and my best friend. She has shown incredible patience and understanding throughout the course of my graduate studies. I would not have been able to successfully complete my doctoral program without the constant encouragement and motivation she provided. My achievements are my tribute to her unconditional love and support.

Table of Contents

Dedication .... ii
Publications .... iii
Acknowledgements ....
v
Table of Contents .... vii
List of Tables .... x
List of Figures .... xiii
Chapter 1: Introduction .... 1
1.1. Overview .... 1
1.2. Contributions of this Dissertation .... 3
1.2.1. Real-time 3D Image Preprocessing .... 4
1.2.2. Hardware-Accelerated Deformable Image Registration .... 5
1.2.3. Framework for Optimization of Finite Precision Implementations .... 6
1.3. Outline of this Dissertation .... 7
Chapter 2: Background and Related Work .... 9
2.1. Image-Guided Interventions .... 9
2.1.1. Role of Preprocedural Imaging .... 10
2.1.2. Need for Image Registration .... 12
2.2. Classification of Image Registration .... 13
2.2.1. Image Registration using Extrinsic Information .... 14
2.2.2. Image Registration using Intrinsic Information .... 15
2.3. Intensity-Based Image Registration .... 17
2.3.1. Transformation Models .... 18
2.3.2. Image Similarity Measures .... 22
2.3.3. Optimization Algorithms .... 26
2.4. Image Preprocessing .... 28
2.4.1. Anisotropic Diffusion Filtering .... 30
2.4.2. Median Filtering .... 31
2.5. Optimization of Finite Precision Implementations .... 32
2.5.1. Optimal Wordlength Formulation .... 33
2.5.2. Simulation-Based Optimal Wordlength Search .... 34
2.5.3. Multiobjective Optimization .... 35
2.6. Related Work .... 37
2.6.1. Real-Time Image Preprocessing .... 37
2.6.2. Acceleration of Image Registration .... 40
2.6.3. Optimization of Finite Precision Implementations .... 43
Chapter 3: Real-time 3D Image Processing .... 47
3.1. Motivation .... 47
3.2. Filtering Algorithms .... 50
3.2.1. Anisotropic Diffusion Filtering .... 50
3.2.2. Median Filtering .... 51
3.3. Architecture .... 52
3.3.1. Memory Controller and Brick-caching Scheme .... 54
3.3.2. 3D Anisotropic Diffusion Filtering .... 58
3.3.3. Median Filtering .... 64
3.4. Implementation and Results .... 68
3.4.1. Effects of Finite Precision Representation .... 69
3.4.2. Hardware Requirements .... 73
3.4.3. Filtering Performance .... 75
3.5. Summary .... 78
Chapter 4: Hardware-Accelerated Deformable Image Registration .... 80
4.1. Motivation .... 80
4.2. Algorithm for Deformable Image Registration .... 83
4.2.1. Calculating MI for a Subvolume .... 85
4.3. Acceleration Approach .... 86
4.4. Architecture .... 88
4.4.1. Voxel Counter .... 89
4.4.2. Coordinate Transformation .... 90
4.4.3. Partial Volume Interpolation .... 92
4.4.4. Image Memory Access .... 94
4.4.5. Updating Mutual Histogram .... 99
4.4.6. Entropy Calculation .... 105
4.4.7. Operational Workflow .... 108
4.5. Implementation and Results .... 111
4.5.1. Execution Speed .... 114
4.5.2. Performance Comparison .... 117
4.5.3. Qualitative Evaluation of Deformable Image Registration .... 122
4.6. Summary .... 124
Chapter 5: Framework for Optimization of Finite Precision Implementations .... 126
5.1. Motivation .... 126
5.2. Multiobjective Optimization .... 129
5.2.1. Problem Statement .... 129
5.2.2. Parameterized Architectural Design .... 131
5.2.3. Multiobjective Optimization Framework .... 134
5.3. Experiments and Results .... 142
5.3.1. Metrics for Comparison of Pareto-optimized Solution Sets .... 146
5.3.2.
Accuracy of Image Registration .... 148
5.3.3. Post-synthesis Validation .... 150
5.4. Summary .... 154
Chapter 6: Clinical Applications .... 156
6.1. Radiation Dose Reduction .... 157
6.1.1. Motivation .... 157
6.1.2. Dose Reduction Strategy .... 158
6.1.3. Evaluation of Registration Accuracy with Low-Dose CT .... 159
6.1.4. Experiments .... 163
6.1.5. Results .... 164
6.1.6. Summary .... 167
6.2. Incorporation of PET into CT-Guided Liver Radio-Frequency Ablation .... 168
6.2.1. Motivation .... 168
6.2.2. Registration of PET and CT .... 170
6.2.3. Experiments .... 171
6.2.4. Results .... 174
6.2.5. Summary .... 178
Chapter 7: Conclusions and Future Work .... 180
7.1. Conclusion .... 180
7.2. Future Work .... 185
Bibliography .... 189

List of Tables

Table 2.1: Broad classification of image registration in the context of IGI .... 13
Table 3.1: Software execution time of 3D anisotropic diffusion filtering and 3D median filtering of 8-bit images for common kernel sizes (N) .... 48
Table 3.2: Average error in intensity per voxel for a Gaussian filtered image resulting from fixed-point representation of Gaussian coefficients .... 69
Table 3.3: Average error per sample of diffusion function resulting from fixed-point representation of diffusion coefficients employed in the presented architecture .... 70
Table 3.4: Average error in intensity per voxel for anisotropic diffusion filtered images resulting from fixed-point representation of Gaussian coefficients and the diffusion function .... 72
Table 3.5: Hardware requirements of the architecture for real-time 3D image preprocessing .... 73
Table 3.6: Hardware requirements for the components of the linear systolic implementation of the 3D median filtering .... 74
Table 3.7: Execution time of 3D anisotropic diffusion filtering and 3D median filtering .... 75
Table 3.8: Performance comparison of the 3D anisotropic diffusion filtering kernel .... 76
Table 3.9: Performance comparison of the 3D median filtering kernel .... 78
Table 4.1: Configurations of LUT-based entropy calculation module that were considered in the presented architecture .... 106
Table 4.2: Operational workflow for performing volume subdivision-based deformable image registration using the presented architecture .... 109
Table 4.3: Comparison of mutual information calculation time for subvolumes at various levels in volume subdivision-based deformable registration algorithm .... 115
Table 4.4: Execution time of deformable image registration .... 116
Table 4.5: Performance comparison of the presented FPGA-based implementation of intensity-based deformable image registration with an equivalent software implementation and prior approaches for acceleration of intensity-based registration .... 121
Table 5.1: Design variables for FPGA-based architecture. Integer wordlengths are determined based on application-specific range information, and fractional wordlengths are used as parameters in the multiobjective optimization framework .... 136
Table 5.2: Number of solutions explored by search methods .... 142
Table 5.3: Parameters used for the EA-based search .... 143
Table 5.4: Validation of the objective function models using post-synthesis results. The wordlengths in a design configuration correspond to the FWLs of the design variables identified earlier .... 151
Table 6.1: Execution time for deformable image registration using low-dose CT .... 167
Table 6.2: Execution time for deformable image registration using intraprocedural CT and preprocedural PET images .... 176
Table 6.3: Interexpert variability in landmark identification across 20 image pairs. PET ALGO corresponds to the software implementation of the algorithm .... 177
Table 6.4: Interexpert variability in landmark identification across 20 image pairs. PET ALGO corresponds to the FPGA-based implementation of the algorithm .... 178

List of Figures

Figure 1.1: A typical IGI workflow and the scope of this dissertation work .... 3
Figure 2.1: Two examples of pre- and intraprocedural image pairs. The arrows indicate the targets that are visible in preprocedural images but not visible in intraprocedural images .... 11
Figure 2.2: An example of volumetric image guidance using intraprocedural multislice CT and preprocedural MR .... 12
Figure 2.3: Flowchart of image similarity-based image registration .... 17
Figure 2.4: Example of preprocessing techniques employed prior to intensity-based image registration .... 29
Figure 2.5: Pareto front in the context of multiobjective optimization .... 36
Figure 3.1: A median filtering example using majority voting technique .... 52
Figure 3.2: Block diagram of the FPGA-based real-time 3D image preprocessing system .... 53
Figure 3.3: Typical voxel access pattern for neighborhood operations-based image processing ....
54
Figure 3.4: Block diagram showing the input image memory and the input buffer configuration .... 56
Figure 3.5: Pictorial representation of the notation used in the brick-caching scheme .... 57
Figure 3.6: Top-level block diagram of 3D anisotropic diffusion filtering. This diagram indicates paths that are executed in parallel .... 60
Figure 3.7: Block diagram of the embedded Gaussian filter bank (for N = 7, corresponding Gaussian kernel size is 5) .... 61
Figure 3.8: Pipelined implementation of an individual Gaussian filter element (Gaussian kernel size = 5) .... 62
Figure 3.9: A single stage (processing element) of the linear systolic median filtering kernel .... 65
Figure 3.10: Linear systolic array architecture for median filter kernel using majority voting technique .... 68
Figure 4.1: Pictorial representation of hierarchical volume subdivision-based deformable image registration and associated notation .... 83
Figure 4.2: Pictorial representation of the acceleration approach .... 86
Figure 4.3: Top-level block diagram of the architecture for accelerated implementation of deformable image registration .... 88
Figure 4.4: Functional block diagram of voxel counter .... 89
Figure 4.5: Functional block diagram of coordinate transformation unit .... 91
Figure 4.6: Fundamentals of interpolation schemes .... 92
Figure 4.7: Functional block diagram of partial volume interpolation unit .... 94
Figure 4.8: Voxel access patterns of the reference and floating images encountered during image registration .... 95
Figure 4.9: Organization of the reference image memory .... 97
Figure 4.10: Organization of the floating image memory .... 98
Figure 4.11: Pipelined implementation of MH accumulation using dual port memory .... 100
Figure 4.12: Preaccumulate buffers to eliminate RAW hazards in MH accumulation pipeline .... 102
Figure 4.13: A flow diagram of steps involved in calculating MH_Rest .... 103
Figure 4.14: Error in entropy calculation corresponding to the two configurations of the multiple LUT-based implementation .... 107
Figure 4.15: Qualitative validation of deformable registration between iCT and preCT images performed using the presented FPGA-based solution .... 122
Figure 4.16: Qualitative validation of deformable registration between iCT and PET images performed using the presented FPGA-based solution .... 123
Figure 5.1: Examples of parameterized architectural design style .... 133
Figure 5.2: Framework for multiobjective optimization of FPGA-based image registration .... 134
Figure 5.3: Comparison of the area values predicted by the adopted area models with those obtained after physical synthesis .... 141
Figure 5.4: Pareto-optimized solutions identified by various search methods .... 144
Figure 5.5: Qualitative comparison of solutions found by partial search and EA-based search .... 145
Figure 5.6: Quantitative comparison of search methods using the ratio of non-dominated individuals (RNI) .... 147
Figure 5.7: Quantitative comparison of search methods using cover rate .... 148
Figure 5.8: Relationship between MI calculation error and resulting image registration error .... 149
Figure 5.9: Results of image registration performed using the high-speed, FPGA-based implementation for design configurations offering various registration errors .... 153
Figure 6.1: Integration of deformable registration into IGI workflow .... 156
Figure 6.2: Important steps for evaluating registration accuracy with low-dose CT .... 159
Figure 6.3: Low-dose CT images generated by the dose-simulator .... 160
Figure 6.4: Comparison of techniques for preprocessing low-dose CT images .... 162
Figure 6.5: Qualitative comparison of registration accuracy with low-dose CT .... 165
Figure 6.6: Average registration error with respect to dose using software and FPGA-based implementations .... 166
Figure 6.7: Graphic illustration of the quantitative validation approach used in the context of deformable registration between intraprocedural and preprocedural PET .... 173
Figure 6.8: Registration of intraprocedural CT and preprocedural PET images using the FPGA-based implementation of deformable image registration
.... 176

Chapter 1: Introduction

1.1. Overview

Image-guided interventions (IGIs), including surgeries, biopsies, and therapies, have the potential to improve patient care by enabling new and faster procedures, minimizing unintended damage to healthy tissue, improving the effectiveness of the procedures, producing fewer complications, and allowing for clinical intervention at a distance. As a result, IGIs have been identified by clinical experts as having a significant impact on the future of clinical care [1]. With further invention and development of imaging and image processing techniques, innovative minimally invasive image-guided interventions will replace conventional open and invasive techniques. Continuous three-dimensional (3D) imaging and visualization for intraprocedural navigation, critically important to the success of IGI, has been technologically difficult until recently. However, advances in medical imaging technology and visualization capabilities, leading to improved imaging speed and coverage, have prompted developments in imaging protocols and enabled volumetric image-guided procedures. The efficiency and efficacy of IGIs are critically dependent on accurate and precise target identification and localization. Lack of clear target delineation can lead to lengthy procedures, larger-than-necessary safety margins, and unintended damage to healthy tissue, factors that undermine the very motivation behind IGIs. Intraprocedural imaging techniques provide a rich source of accurate spatial information that is crucial for navigation but often suffer from poor signal-to-noise ratio (SNR) and poor target definition from background healthy and/or benign tissue.
As in most clinical protocols, IGIs are preceded by one or more preprocedural images, containing additional information, such as contrast-enhanced structures or functional details such as metabolic activity, which are used for diagnosis, treatment/navigation planning, etc. Combining this functional and/or contrast information with intraprocedural morphological and spatial information, through co-registration between pre- and intraprocedural images, has been shown to improve intraprocedural target delineation [2-6]. Achieving this registration between intraprocedural and preprocedural images is a fundamental need during the IGI workflow. Moreover, given the on-demand nature of IGIs, this alignment should be achieved sufficiently fast so as not to affect the clinical workflow.

Earlier approaches to meet this need primarily employed the rigid body approximation, which can be less accurate because of non-rigid tissue misalignment between these images. Intensity-based deformable registration is a promising option to correct for this misalignment. These algorithms are automatic, which is an important aspect that enables their easy integration into many applications; however, the long execution times of these algorithms have prevented their use in the clinical workflow. In addition, since this technique is based on intensity-based alignment between images, it is sensitive to the SNR of the images to be registered. Consequently, the images (in particular, intraprocedural images that are characterized by poor SNR) need to be preprocessed and de-noised before they can be registered. This workflow for providing improved visualization during IGIs is illustrated in Figure 1.1.

The overall goal of this dissertation work is to improve the identification and localization of targets during image-guided interventions through automatic, fast, and accurate deformable image registration between preprocedural and intraprocedural images.
With this accurate registration and fusion of complementary information, an interventionist will be able to visualize accurately aligned anatomical structures (such as vasculature) and/or functional (metabolic) activity not natively present in the routine intraprocedural scans, thereby improving the targeting capability.

1.2. Contributions of this Dissertation

The specific goal of this dissertation work is to develop and validate the core components of this advanced image processing system, which will enable improved visualization and target delineation during image-guided procedures. These core components are identified in Figure 1.1.

Figure 1.1: A typical IGI workflow and the scope of this dissertation work.

First, we employ a reconfigurable hardware platform to develop an architecture for real-time implementation of image preprocessing techniques commonly used in the context of IGI. Second, we develop a field-programmable gate array (FPGA)-based architecture for accelerated implementation of intensity-based deformable image registration. Third, we propose a multiobjective optimization framework to analyze conflicting tradeoffs between accuracy and hardware complexity of finite-precision implementations of signal processing applications, such as the one presented in this work. Finally, we demonstrate the feasibility of developing novel IGI applications leveraging the aforementioned components. In the following sections we elaborate further on the main contributions of this dissertation.

1.2.1. Real-time 3D Image Preprocessing

Image preprocessing, which consists of filtering and de-noising, is a prerequisite step in many image processing applications. Especially in the context of IGI, where intraprocedural images are characterized by poor signal-to-noise ratio, image preprocessing is required prior to advanced image analysis operations such as registration, segmentation, and volume rendering.
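To make this preprocessing operation concrete, the following is a simple software sketch of 3D median filtering, one of the kernels discussed in this dissertation. It is purely illustrative and is not the hardware architecture presented here, which implements such kernels on an FPGA:

```python
import numpy as np

def median_filter_3d(volume, k=3):
    """Apply a k x k x k median filter to a 3D volume with edge replication."""
    pad = k // 2
    padded = np.pad(volume, pad, mode="edge")
    out = np.empty_like(volume)
    for z in range(volume.shape[0]):
        for y in range(volume.shape[1]):
            for x in range(volume.shape[2]):
                # Median over the local k x k x k neighborhood
                window = padded[z:z + k, y:y + k, x:x + k]
                out[z, y, x] = np.median(window)
    return out

# An isolated noise spike is removed by the median kernel.
vol = np.zeros((5, 5, 5))
vol[2, 2, 2] = 100.0
assert median_filter_3d(vol)[2, 2, 2] == 0.0
```

The triple loop makes the software version extremely slow on clinically sized volumes, which is exactly the motivation for the hardware implementation: the FPGA architecture evaluates such neighborhoods in a streaming, parallel fashion.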
Moreover, the interactive nature of IGIs necessitates equivalent image processing speed so that these operations can be performed in a streamlined manner without any additional processing latency. Most reported techniques for accelerated implementation of image processing algorithms have primarily focused on one-dimensional (1D) or two-dimensional (2D) cases [7-11]. These techniques do not adequately address the need for accelerating these operations in 3D, which is required for providing volumetric image guidance during minimally invasive procedures. Furthermore, some of the earlier techniques used for acceleration cannot be extended to 3D, whereas for some others the 3D extension is nontrivial.

This dissertation presents a novel FPGA-based architecture for accelerated implementation of common image preprocessing operations. This architecture is reconfigurable and supports multiple filtering kernels, such as 3D median filtering and 3D anisotropic diffusion filtering, within the same framework. The architecture presented in this work is faster than earlier reported techniques, supports larger kernel dimensions, and is capable of meeting the real-time data processing needs of most IGIs. Although developed in the context of IGIs, this architecture is general-purpose and can be applied to meet the preprocessing needs of many medical as well as non-medical applications.

1.2.2. Hardware-Accelerated Deformable Image Registration

Image registration between preprocedural images (acquired for diagnosis and treatment planning) and intraprocedural images (acquired for up-to-date spatial information) is an inherent need in the IGI workflow. Accurate and fast registration between these images will enable the fusion of complementary information from these two image categories and can enable improved treatment site identification, localization, and navigation during the procedure.
Several fiducial or point-based, mechanical alignment-based, and intensity-based rigid alignment techniques [12-16] have been proposed for this purpose. Some of these techniques are not automatic, and almost all of them employ the rigid body approximation, which is often not valid due to tissue deformation between the two images. Deformable image registration techniques can compensate for both local deformation and large-scale tissue motion and are the ideal solution for achieving the aforementioned image registration. Some studies, in particular, have independently underlined the importance of deformable image registration for IGIs [17-19]. However, despite their advantages, deformable image registration algorithms are seldom used in current clinical practice due to their computational complexity and associated long execution times (which can be up to several hours).

This dissertation presents a novel FPGA-based architecture for accelerated implementation of a proven, automatic deformable image registration algorithm, specially geared toward improving target delineation during image-guided interventions. This architecture accelerates the calculation of image similarity, a necessary and the most time-consuming step in image registration, by more than an order of magnitude, thereby reducing the time required for deformable registration from hours to minutes. This design is tuned to offer registration accuracy comparable to that achievable using a software implementation. Furthermore, we validate this high-speed design and demonstrate its feasibility in the context of clinical applications such as computed tomography (CT)-guided interventional applications. This accuracy, coupled with the speed and automatic nature of this approach, represents a first significant step toward assimilation of deformable registration in the IGI workflow.

1.2.3.
Framework for Optimization of Finite Precision Implementations

An emerging trend in image processing, and medical image processing in particular, is custom hardware implementation of computationally intensive algorithms for achieving high-speed performance. The work presented in this dissertation has a similar spirit in the context of the advanced image processing required during IGIs. For reasons of area efficiency and performance, these implementations often employ finite-precision datapaths. Identifying effective wordlengths for these datapaths while accounting for tradeoffs between design complexity and accuracy is a critical and time-consuming aspect of this design process. Having access to optimized tradeoff curves can equip designers to adapt their designs to different performance requirements and target specific devices while reducing design time.

This dissertation proposes a multiobjective optimization framework developed in the context of FPGA-based implementation of medical image registration. Within this framework, we compare several search methods and demonstrate the applicability of an evolutionary algorithm-based search for efficiently identifying superior multiobjective tradeoff curves. In comparison with some earlier reported techniques, this framework allows non-linear objective functions and multiple fractional precisions, supports a variety of search methods, and thereby captures more comprehensively the complexity of the underlying multiobjective optimization problem. We also demonstrate the applicability of this framework for the image registration application through synthesis and validation results using Altera Stratix II FPGAs. This strategy can easily be adapted to a wide range of signal processing applications, including areas of image and video processing beyond the medical domain.

1.3.
Outline of this Dissertation

The rest of the dissertation is organized as follows: Chapter 2 provides background on image-guided interventions, image preprocessing, and image registration, and presents related work in the context of the contributions of this dissertation. In Chapter 3, an FPGA-based architecture for real-time implementation of 3D image processing techniques such as median filtering and anisotropic diffusion filtering is presented. Chapter 4 deals with deformable image registration. We outline the intensity-based deformable image registration algorithm and present a novel architecture for accelerated implementation of this algorithm. In Chapter 5, a framework for multiobjective optimization of finite-precision implementations of signal processing algorithms is presented. Chapter 6 introduces some novel image-guided procedures and demonstrates the feasibility of our approach in the context of these applications. Finally, in Chapter 7, conclusions and future work are presented.

Chapter 2: Background and Related Work

2.1. Image-Guided Interventions

IGIs began to emerge in the last quarter of the 20th century, picked up pace in the 1990s, and may become routine in the 21st century. Minimal invasiveness is the defining characteristic of these procedures. This feature can lead to less patient morbidity, time- and cost-efficient procedures, faster recovery, and improved procedure outcomes. During these procedures, the internal anatomy is accessed through a single or a few small holes in the patient's skin rather than through large incisions. The interventionist introduces the appropriate tool (electrode, biopsy needle, and/or endoscope) through this port and tries to navigate his/her way to the target (typically a malignant spot) in order to deliver a localized treatment or take out a sample for further investigation.
Because the access to the internal anatomy is through a single port, the only way to visualize the location, orientation, and path of approach of the tool is by using external imaging techniques (that is, there is no direct visual feedback). Any intraprocedural imaging technique used must be near real-time, allowing tracking of the underlying anatomy and of flexible instruments and catheters as and when required ("on-demand" performance) during the procedure. 2D ultrasound (US) and CT fluoroscopy have conventionally been used to guide placement of biopsy needles and therapy delivery devices during IGIs [18, 20, 21]. However, technological improvements such as multi-slice CT scanners, interventional MR, 3D ultrasound (US), isocentric C-arms, and other advanced imaging systems have enabled the application of IGI to clinical domains such as interventional radiology, neurosurgery, orthopedics, ENT surgery, cranio- and maxillofacial surgery, and other surgical specialties [22-24]. For example, Philips Medical Systems, one of the leading medical imaging equipment manufacturers, has announced a 256-slice CT scanner [25] whose higher imaging speed (up to 8 volumes/s) and coverage (8 cm) make it ideally suited for performing CT-guided procedures. The availability of easy-access MR scanners, such as the open-configuration MR scanner from GE Healthcare [26], along with their improved imaging speed, has enabled development of MR-guided procedures. The image quality and acquisition speed of 3D ultrasound have also been enhanced through use of the latest transducer technology and digital reconstruction, and 3D ultrasound can now be used to provide image guidance during procedures. Moreover, real-time volumetric visualization capabilities, which enable interactive display of images during the procedure, are also now available [27, 28]. As a result, an emerging trend in the IGI workflow is to use volumetric imaging modalities for providing real-time intraprocedural guidance.
This dissertation, therefore, focuses on 3D image processing and registration in the context of IGIs.

2.1.1. Role of Preprocedural Imaging

Intraprocedural imaging techniques provide a rich source of accurate spatial information, which is crucial for navigation, but offer poor target identification from the background healthy and/or benign tissue (see Figure 2.1). Most image-guided procedures are preceded by a preprocedural image which is used for diagnosis, treatment/navigation planning, etc. These preprocedural images are primarily acquired under a different (often slow) imaging protocol and typically contain additional information, such as contrast-enhanced structures or functional information such as metabolic activity, which is used for diagnosis and tissue differentiation prior to the treatment. Figure 2.1(a) shows contrast-enhanced structures in a preprocedural image which are not clearly visible in intraprocedural images. Figure 2.1(b) illustrates the metabolic activity shown in the PET scans, which can be used to identify cancerous tumors. This functional and contrast information from the preprocedural images can be used to augment the purely morphological and spatial information from the intraprocedural images, which will greatly improve intraprocedural target delineation [2-6, 29]. Therefore, there is a clear need to combine this complementary information from the pre- and intraprocedural images to facilitate this task.

Figure 2.1: Two examples of pre- and intraprocedural image pairs. The arrows indicate the targets that are visible in preprocedural images but not visible in intraprocedural images.

2.1.2. Need for Image Registration

Aligning or registering the intraprocedural images with the preprocedural image is a fundamental need in the IGI workflow. In fact, image registration has been identified as an enabling technology for image-guided surgical and therapeutic applications [30].
Figure 2.2 shows an example of volumetric image guidance using image registration between volumetric CT and magnetic resonance imaging (MRI) scans for a neurosurgical application. There are, however, many technological and logistic challenges in achieving this image registration. First, the intra- and preprocedural images to be registered are acquired at different times and using different scanners. As a result, there is invariably misalignment of anatomical structures between these two images. This misalignment is caused by systematic offsets in scanner coordinate systems and by non-rigid anatomical changes arising from pose and diurnal variations at the time of image acquisition. Second, the images to be combined can be of two completely different modalities (such as PET and CT). Furthermore, given the on-demand nature of IGI applications, this registration should be achieved reasonably fast. In summary, accurate, multi-modal, and fast image registration is essential for IGIs [17, 31]. The following section provides an overview of image registration.

Figure 2.2: An example of volumetric image guidance using intraprocedural multislice CT and preprocedural MR.

2.2. Classification of Image Registration

Medical image registration is the process of aligning two images that represent the same anatomy at different times, from different viewing angles, or using different imaging modalities. Image registration is an active area of research, and over the last several decades there have been numerous publications outlining various methodologies to perform image registration and its applications. Maintz and Viergever [32] and Hill et al. [33] have presented comprehensive summaries of the entire gamut of the image registration domain. In general, image registration can be classified based on image dimensionality, nature of registration basis, nature of transformation models, type of modalities involved, etc.
In the context of IGI, however, we broadly classify image registration into two main approaches: first, techniques based on extrinsic information, and second, techniques based on information that is intrinsic to the image. We briefly describe these two categories and outline some popular image registration methods in each. A summary of this classification is also presented in Table 2.1.

Table 2.1: Broad classification of image registration in the context of IGI.

Registration Basis    | Method based on         | Retrospective | Automatic | Deformable | Compute Intensive
Extrinsic Information | Fiducial                | N             | Y         | N          | N
                      | Stereotactic            | N             | Y         | N          | N
Intrinsic Information | Landmark                | Y             | N         | Y          | N
                      | Segmentation or Surface | Y             | N         | Y          | N
                      | Intensity               | Y             | Y         | Y          | Y

2.2.1. Image Registration using Extrinsic Information

Methods based on extrinsic information rely on information that is not natively a part of the medical image. This includes artificial external objects that may be attached to the patient and are within the field of view of the image. These objects are designed such that they are clearly visible and accurately detectable in all of the pertinent modalities that are to be registered. As a result, the registration of the acquired images is usually easy, fast, and can be automated with relative ease. In addition, because the registration involves simply establishing correspondence between external objects, it can be achieved explicitly without a need for complex optimization techniques. One major limitation of these methods, however, is that they are not retrospective. This means that advanced planning is required and provisions must be made at the time of preprocedural imaging for that image to be used at a later point. Furthermore, due to the nature of the registration, these methods are mostly limited to rigid transformation models. A stereotactic frame is another commonly used external object.
There are many reported image registration applications, especially in the context of neurosurgery, that employ a stereotactic frame to establish spatial correspondence between images [34, 35]. These methods employ a frame screwed rigidly to the patient?s skull that is 15 usually fitted with imaging markers that are visible in imaging modalities such as CT, MRI, and X-ray. Visibility of these markers in both pre- and intraprocedural images will then allow registration of these images using a least-square based alignment technique. These techniques have been shown to be relatively accurate for rigid anatomy such as the brain [36], but are relatively more invasive. Less invasive techniques using markers attached to the skin have also been reported [37], but they tend to offer less accurate image registration because skin can move. More recently, there have also been efforts toward developing systems based on optical tracking methods that will allow frameless stereotaxy [38]. Despite these advances, these methods are fundamentally limited to providing only rigid alignment between a pair of images. 2.2.2. Image Registration using Intrinsic Information These methods are based on intrinsic properties and contents of patient- generated images. Registration may be based on a limited set of identified salient points (landmarks), on the alignment of segmented anatomical structures (segmentation or feature based) such as organ surfaces or directly based on the image intensity values (voxel property based). Landmark-based registration [35, 39, 40] involves identification of the locations of corresponding points within different images and determination of the spatial transformation with these paired points. These landmarks are usually identified by a user in an interactive fashion. Landmark-based methods are often used to find rigid or affine transformations. However, if the sets of points are large enough, they may be used for more complex non-rigid transformations as well. 
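The least-squares alignment idea underlying both marker-based and landmark-based rigid registration can be sketched in a few lines: given paired points, the optimal rotation and translation are recovered in closed form via the singular value decomposition (the classic Kabsch/Procrustes solution). This is a generic textbook method shown only for intuition, not code from this dissertation:

```python
import numpy as np

def rigid_align(src, dst):
    """Least-squares rigid transform (R, t) mapping paired src points onto dst."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)                  # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.linalg.det(Vt.T @ U.T)])   # guard against reflection
    R = Vt.T @ D @ U.T
    t = dst_c - R @ src_c
    return R, t

# Synthetic "markers": rotate and translate a point set, then recover the motion.
rng = np.random.default_rng(1)
pts = rng.random((6, 3))
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
moved = pts @ R_true.T + np.array([1.0, -2.0, 0.5])
R, t = rigid_align(pts, moved)
assert np.allclose(pts @ R.T + t, moved)
```

With noise-free correspondences the recovery is exact; with noisy marker localizations the same formula gives the least-squares optimum, which is why marker visibility and placement accuracy dominate the accuracy of these methods.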
Registration methods based on landmark identification can be retrospective, but they are not fully automatic because they require user interaction. Segmentation-based image registration methods are based on extracting matching features and organ surfaces from the two images to be registered. These features and organ surfaces are then used as the only input for the alignment procedures. The alignment between the features/surfaces can be either based on rigid transformation models or achieved using deformable mapping. The rigid model-based approaches are more popular, and a "head-hat" registration method based on this approach has been successfully applied to the registration of multimodal images such as PET, CT, and MR [41-43]. Popular segmentation-based techniques that involve deformable mapping of surfaces, such as the ones based on snakes or active contour models, have been shown to be effective in intersubject and atlas registration, as well as for registration of a template to a mathematically defined anatomical model [44, 45]. Segmentation-based techniques are retrospective, support multi-modal registration, and are computationally efficient. However, the accuracy of registration is dependent on the segmentation accuracy. Moreover, these methods are not fully automatic, as the segmentation step is often performed semi-automatically. Voxel property-based methods, which are based on image intensity values, are the most interesting methods in current research. Theoretically, these are the most flexible of the registration methods since they use all of the available information throughout the registration process. In addition, these methods can be completely retrospective, fully automatic, allow multi-modal registration, and generally are more accurate. The following section provides a detailed overview of intensity-based image registration.
Although these methods have existed for a long time, their extensive use in clinical applications with 3D images has been limited because of the associated computational costs. This dissertation work addresses this aspect through the use of hardware acceleration.

2.3. Intensity-Based Image Registration

Image registration that is based on voxel intensities is the most versatile, powerful, and inherently automatic way of achieving the alignment between two images. This method attempts to find the transformation \hat{T} that optimally aligns a reference image RI, with coordinates x, y, and z, and a floating image FI under an image similarity measure F. This process is summarized in the following equation and is represented pictorially in Figure 2.3:

\hat{T} = \arg\max_{T} F\big(RI(x, y, z),\, FI(T(x, y, z))\big)   (2.1)

Figure 2.3: Flowchart of image similarity-based image registration.

In the case of intensity-based registration, the similarity measure F, which provides a numerical value to indicate the degree of misalignment between the images, is completely based on voxel intensities in the reference and the floating images. The image transformation T maps the reference image voxels into the floating image space. Depending on the transformation model employed, this mapping is either rigid, affine, or deformable. The optimization algorithm, on the other hand, searches for the best transformation parameters that optimally align the given two images. These three components form an integral part of intensity-based image registration and are described in the following sections.

2.3.1. Transformation Models

A transformation model provides a way to describe the misalignment between the reference and the floating images. The ability of image registration to accurately represent and recover this misalignment is fundamentally limited by the nature of the transformation model employed.
For example, a rigid transformation model typically offers inferior image registration accuracy as compared with computationally intensive, non-rigid transformation models if the underlying misalignment is non-rigid. A comprehensive survey on image transformation models can be found in [46, 47]. The following subsections describe the transformations most commonly used in intensity-based image registration.

2.3.1.1. Rigid and Affine Models

Affine or linear registration is a combination of rotation, translation, scaling, and shear parameters that map the reference image voxels into the floating image space. Voxel scaling and shearing factors are constant for rigid registration, which is a special case of affine transformation, and as such are excluded from the optimization process. Both of these transformation models can be represented using a 4 × 4 transformation matrix. For example, a rigid transformation matrix T_{global} can be constructed as:

T_{global} = \begin{bmatrix} r_{xx} & r_{xy} & r_{xz} & d_x \\ r_{yx} & r_{yy} & r_{yz} & d_y \\ r_{zx} & r_{zy} & r_{zz} & d_z \\ 0 & 0 & 0 & 1 \end{bmatrix}   (2.2)

where the r_{ij} entries represent the components of the rotation matrix, while the d_i entries represent the translation parameters. The coordinate transformation of a reference image voxel v_r into floating image space (v_f) can then simply be achieved through matrix multiplication:

v_f = T_{global} \cdot v_r.   (2.3)

Techniques based on rigid and affine transformation models have been successfully employed previously [47-49]. These techniques, however, offer limited degrees of freedom in the transformation model.

2.3.1.2. Deformable Models

The strength of deformable transformation models comes from the large number of degrees of freedom they offer for representing the misalignment between images. This allows modeling of not only gross misalignment between the images, but also local deformations.
As a result, image registration techniques based on deformable transformation models are inherently capable of correcting for local misalignments and therefore are more accurate.

Methods based on physical models perform image transformation by considering a set of internal and external forces and obtaining the corresponding deformation by applying these forces to a given model based on differential equations. Some examples of physical models used for image transformation found in the literature are elastic body [50], viscous fluid [51], and incompressible flow (optical flow) [44]. Some methods based on finite element models have also been employed for image registration, which apply predefined physical models to represent deformation in the images [52, 53]. The key idea is to divide the image into subsets, each with some defined physical properties. For example, a subset can be labeled as rigid, while some others can be labeled as fluid (elastic). During the transformation process, the shape of rigid tissues will not change, while the shape of fluid tissues will vary according to their corresponding properties such as viscosity. Image transformation techniques using physical models have been successfully applied to deformable image registration. However, most of these techniques involve solving partial differential equations and are particularly computationally complex.

Another popular method to represent a deformable transformation model is to use mathematical basis functions. These transformation techniques use basis functions to define the correspondence between the original and the transformed image. The basis functions may be defined in either the Fourier or the wavelet domain, and the deformation field is modeled using trigonometric or wavelet basis functions, respectively. Ashburner and Friston [54] have reported a method based on this approach. The deformation between the two images may also be modeled in the spatial domain using polynomials.
Polynomial-based image transformation techniques use a global transformation function defined by a transformation matrix that contains the transformation coefficients and a polynomial vector that contains the components of the polynomial used to model the transformation. The simplest case of polynomial-based image transformation is the affine transformation, which uses first-degree polynomials. By increasing the degree of the polynomials, it is possible to model complex non-rigid transformations as well. However, this method is seldom employed, due to the difficulty of modeling small local transformations and because higher-degree polynomials suffer from several artifacts [46].

These drawbacks are addressed by spline-based representations. Splines are inherently continuous and consist of piecewise-polynomial functions. Splines are a generalization of the polynomial-based approach to image transformation, in the sense that a polynomial representation is a spline with just one segment. Using piecewise-polynomial functions allows modeling of local deformations accurately without using high-order polynomials. Two different spline families that have been used extensively in the literature to model 3D transformations are thin-plate splines and B-splines. Kim et al. [55] and Rohr et al. [56] have reported methods based on thin-plate splines to perform deformable image registration. However, one major drawback of thin-plate splines is that they have infinite support. This means that even small local changes are propagated throughout the entire image, an effect that is undesirable in medical image registration. In comparison, B-splines offer finite support. For this reason, B-splines are currently the preferred basis functions for modeling deformable transformations [57, 58]. A limitation of B-spline-based transformations is that they tend to fail at tracking rotation of local features.
Moreover, algorithms based on B-splines tend to be computationally intensive due to the additional complexity associated with B-spline interpolations. More recently, some algorithms based on hierarchical image subdivision approaches have been reported [59, 60]. These algorithms achieve deformable registration by registering image subvolumes using a locally linear transformation and then applying quaternion-based interpolation to obtain the transformation field. Algorithms based on such transformation models allow the modeling of internal rotations better than the spline-based approaches. These algorithms are computationally efficient and yet are capable of recovering local deformations. The deformable registration algorithm considered in this dissertation is also based on hierarchical volume (3D image) subdivision. This algorithm and the architecture for its accelerated implementation are described in Chapter 4.

2.3.2. Image Similarity Measures

An important component of image registration is the metric that quantitatively determines how similar two images are. This metric can then be used to judge how well a pair of images is aligned and also to guide the optimization procedure during image registration. In the case of intensity-based image registration, this metric, or image similarity measure, is computed using the voxel intensities of the images involved in registration. There are many reported intensity-based similarity measures. These can be broadly classified into measures using only image intensities (for example, mean of square difference of intensities), measures using spatial (or neighborhood) information (for example, pattern intensity or gradient-based measures), and measures based on information theory (mutual information).

The following sections briefly describe some widely used similarity measures. We use the following notation for this description. The images to be aligned are the reference image (RI) and the floating image (FI).
A transformation T is applied to the voxels of the reference image. An image similarity measure is calculated over the region of overlap (X_0) between the RI and the FI, and x represents the location of a voxel in the RI. N represents the number of RI voxels that belong to X_0. The notations p_{RI}, p_{FI}, and p_{RI,FI} represent the individual probability distribution function (PDF) of the RI, the individual PDF of the FI, and the mutual PDF of the RI and FI, respectively.

2.3.2.1. Sum of Squared Intensity Differences (SSD)

One of the simplest ways to achieve image alignment is to minimize the intensity difference between the RI and FI, which is what the sum of squared intensity differences (SSD) measure does. The SSD between the two images is defined as:

SSD(RI, FI) = \frac{1}{N} \sum_{x \in X_0} \left( RI(x) - FI(T(x)) \right)^2 .  (2.4)

As expected, this measure is minimized when the two images are well aligned. However, this measure is limited to images with the same intensity patterns, in other words, to mono-modality image registration. Furthermore, Holden et al. [48] have shown this similarity measure to be error-prone in the presence of noise.

2.3.2.2. Normalized Cross-Correlation (NCC)

If the assumption that registered images differ only by Gaussian noise is replaced with a less restrictive one, namely that there is a linear relationship between the two images, then the optimum similarity measure is the normalized cross-correlation. Cross-correlation in both the space and frequency domains has been used as a voxel similarity metric. Cross-correlation in the space domain is defined by:

NCC(RI, FI) = \frac{1}{N} \sum_{x \in X_0} \frac{\left( RI(x) - \overline{RI} \right) \left( FI(T(x)) - \overline{FI} \right)}{\sigma_{RI}\, \sigma_{FI}} ,  (2.5)

where \overline{RI} and \overline{FI} are the mean intensities of the RI and FI, respectively, whereas \sigma_{RI} and \sigma_{FI} represent the standard deviations of the RI and FI, respectively:

\sigma_{RI} = \sqrt{ \frac{1}{N} \sum_{x \in X_0} \left( RI(x) - \overline{RI} \right)^2 } ,  (2.6)

\sigma_{FI} = \sqrt{ \frac{1}{N} \sum_{x \in X_0} \left( FI(x) - \overline{FI} \right)^2 } .  (2.7)

Computation of this similarity measure can be time consuming, as it requires calculating the means and standard deviations as well as the cross-correlation coefficient over entire 3D images. Because of this high computational cost, spatial-domain correlation is usually performed between a whole image and a small portion of the other image. Cross-correlation is an effective voxel similarity measure for images with low noise, but its high computational requirements make it a poor choice for real-time applications. Furthermore, it may not yield optimal performance when applied to noisy images [46], such as ultrasound and low-dose CT.

2.3.2.3. Mutual Information (MI)

Mutual information is a popular image similarity metric based on information theory. The rationale behind this similarity measure is to consider image registration as the process of maximizing the amount of information common to the RI and FI, or minimizing the amount of information present in the combined images. When the images are perfectly aligned, the corresponding structures from both images overlap, minimizing the combined information. The use of mutual information for image registration was introduced by Collignon et al. [61] and Viola and Wells [62]. The MI is defined by:

MI(RI, FI) = h(RI) + h(FI) - h(RI, FI) ,  (2.8)

where the individual and mutual entropies are calculated as:

h(RI) = -\sum_{x} p_{RI}(x) \ln p_{RI}(x) ,  (2.9)

h(FI) = -\sum_{x} p_{FI}(x) \ln p_{FI}(x) ,  (2.10)

h(RI, FI) = -\sum_{x} \sum_{y} p_{RI,FI}(x, y) \ln p_{RI,FI}(x, y) .  (2.11)

A comprehensive survey of MI-based registration was presented by Pluim et al. [47]. Mutual information is a very effective similarity measure for multimodal image registration because it can handle nonlinear intensity relationships between data sets [63]. Holden et al. [48] have demonstrated that mutual information-based techniques are, in general, superior to other techniques for deformable image registration.
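The measures above (Eqs. 2.4, 2.5, and 2.8-2.11) can be prototyped directly in NumPy for a pre-aligned image pair; in this sketch the PDFs of Eqs. 2.9-2.11 are estimated from a joint (mutual) intensity histogram, and the 32-bin count is an illustrative choice rather than a value prescribed by the text:

```python
import numpy as np

def ssd(ri, fi):
    """Sum of squared intensity differences, Eq. 2.4 (full-overlap case)."""
    return np.mean((ri - fi) ** 2)

def ncc(ri, fi):
    """Normalized cross-correlation, Eq. 2.5, with the Eq. 2.6-2.7 deviations."""
    return np.mean((ri - ri.mean()) * (fi - fi.mean())) / (ri.std() * fi.std())

def mutual_information(ri, fi, bins=32):
    """Mutual information, Eqs. 2.8-2.11, from a joint intensity histogram."""
    joint, _, _ = np.histogram2d(ri.ravel(), fi.ravel(), bins=bins)
    p_joint = joint / joint.sum()          # mutual PDF p_{RI,FI}
    p_ri = p_joint.sum(axis=1)             # marginal PDF p_RI
    p_fi = p_joint.sum(axis=0)             # marginal PDF p_FI
    h_ri = -np.sum(p_ri[p_ri > 0] * np.log(p_ri[p_ri > 0]))
    h_fi = -np.sum(p_fi[p_fi > 0] * np.log(p_fi[p_fi > 0]))
    nz = p_joint > 0
    h_joint = -np.sum(p_joint[nz] * np.log(p_joint[nz]))
    return h_ri + h_fi - h_joint           # Eq. 2.8
```

In a registration loop, FI would first be resampled through the current transformation T; here the two arrays are assumed to be already sampled on the common overlap region X_0.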
A broadly used variant of mutual information is normalized mutual information. The advantage of this similarity measure over mutual information is its overlap independence. It was introduced by Studholme et al. [64] as:

NMI(RI, FI) = \frac{h(RI) + h(FI)}{h(RI, FI)} .  (2.12)

There are several other intensity-based similarity measures beyond the ones described here, including the ratio of image uniformity, pattern intensity, and the entropy of the difference image. Mutual information, in comparison, is versatile, inherently multimodal, and accurate, and hence has emerged as a popular choice for both rigid and deformable image registration. In particular, the deformable registration algorithm accelerated through custom hardware implementation in this dissertation is also based on MI. See Chapter 4 for additional details.

2.3.3. Optimization Algorithms

Optimization algorithms are used to navigate the search space of transformation parameters and to identify the combination of transformation parameters that best aligns a pair of images. It must be noted that most often more than one parameter must be searched, which requires multidimensional optimization algorithms. Another desired feature of an optimization algorithm is that it require few objective function evaluations. In the case of intensity-based image registration, the objective function to be optimized is the voxel similarity function. This calculation is usually compute intensive, and hence fast convergence is desirable. We briefly summarize common multidimensional optimization schemes employed in the context of image registration. The downhill simplex method, first introduced by Nelder and Mead [65], is an unconstrained nonlinear optimization technique. A simplex is a geometrical figure defined by N+1 points in an N-dimensional space.
The simplex method starts by placing a regular simplex in the solution space and then moves its vertices gradually toward the optimum point through an iterative process. The downhill simplex algorithm searches for the optimum value through a series of geometrical operations on the simplex, such as reflection, reflection and expansion, contraction, and multiple contractions. Shekhar et al. [49, 66] and Walimbe et al. [60, 67] have reported successful use of this optimization technique for voxel similarity-based image registration. The univariate optimization method tries to solve the multidimensional optimization problem by breaking it into multiple one-dimensional optimization problems. This is achieved by optimizing the variables one at a time and repeating this step until convergence. This method is simple but can suffer from poor convergence in the presence of steep valleys in the search space. Powell's method builds upon the univariate method with an important distinction: the search direction does not have to be parallel to any of the variable axes. Thus, it is possible to change multiple variables at the same time, which can achieve faster convergence and effectively eliminates the convergence problem of the univariate method. This algorithm has been widely used for optimizing intensity-based image registration [47, 68, 69]. Optimization based on genetic algorithms is a technique that mimics the genetic processes of biological organisms. Over many generations, natural populations evolve according to the Darwinian principles of natural selection and the "survival of the fittest." Common operations involved in this method are crossover, mutation, and fitness evaluation. By following this process, genetic algorithms are able to adapt starting solutions and ultimately find the optimal solution. These techniques are capable of efficiently searching a complex optimization space.
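Both gradient-free optimizers described above (downhill simplex and Powell's method) are available off the shelf; the toy problem below recovers a 1D shift by minimizing an SSD objective with each of them via SciPy. The sinusoidal signal standing in for a voxel-similarity objective is purely illustrative:

```python
import numpy as np
from scipy.optimize import minimize

x = np.linspace(0.0, 4.0 * np.pi, 500)
ref = np.sin(x)                      # "reference image" (a 1D signal here)

def objective(params):
    """SSD between the reference and a shifted copy; minimized at shift = 0."""
    shift = params[0]
    return np.mean((ref - np.sin(x - shift)) ** 2)

# Downhill simplex (Nelder-Mead) and Powell's method, both derivative-free
res_simplex = minimize(objective, x0=[0.5], method="Nelder-Mead")
res_powell = minimize(objective, x0=[0.5], method="Powell")
```

Both runs converge to a shift near zero from the initial guess of 0.5. In actual intensity-based registration the parameter vector would hold the full transformation (e.g., six rigid-body parameters) and the objective would be a 3D similarity measure such as negated MI.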
However, representation of solutions in a genetic algorithm framework can be challenging, which limits their effectiveness, especially in the context of deformable registration (because of the large number of parameters). Examples of genetic algorithm-based optimization for image registration can be found in [70, 71]. Optimization using simulated annealing involves minimization methods modeled on the way crystals form when a liquid is frozen by slowly reducing its temperature [65]. These algorithms work differently from the techniques described earlier in that they do not strictly follow the gradients of the similarity measure. Instead, they move randomly, depending on a "temperature" parameter. While the temperature is high, the algorithm allows greater variations in the variables to be optimized. As the temperature decreases, the algorithm progressively constrains the variation of the variables until a global optimum is reached. In general, simulated annealing techniques are more robust than the methods described earlier. However, they may require a large number of iterations to converge, especially in the presence of local minima. Some applications of simulated annealing techniques to image registration are described in [72, 73].

2.4. Image Preprocessing

As described in the previous section, intensity-based image registration (both rigid and deformable) uses similarity measures that are based on the voxel intensities of the images to be registered. As a consequence, these similarity measures are sensitive to the quality of the images involved. Images with a poor SNR can affect the calculation of a similarity measure and result in less accurate image registration. While some similarity measures such as MI and NMI are less sensitive to noise, Holden et al. [48] have demonstrated that most intensity-based similarity measures (and the resulting accuracy of image registration) are adversely affected in the presence of noise.
To address this, several techniques for preprocessing images prior to image registration have been described. While some techniques focus on identifying a region or structures of interest in the images and excluding structures that may negatively influence the registration results [48, 49, 74], most preprocessing techniques rely on spatial-domain filtering operations on the images. Some reported techniques have employed low-pass filtering to remove speckle noise in ultrasound images, thresholding or filtering to remove noise, and blurring to correct for differences in the intrinsic resolution of the images [49, 64, 75]. All these spatial filtering techniques have been shown to be effective in improving image quality and the accuracy of image registration. In the context of IGI, which is the primary application of the work presented in this dissertation, low-dose computed tomography (CT) and 3D ultrasound have emerged as the preferred intraprocedural volumetric imaging modalities. These modalities, although sufficiently fast for intraprocedural use, suffer from quantum noise and speckle noise, respectively. Furthermore, because of the presence of metallic tools such as needles and catheters and the associated photon scattering effects, intraprocedural CT images also suffer from metal artifacts. As a result, these images must be preprocessed and enhanced prior to registration with preprocedural images and subsequent visualization. Toward this end, anisotropic diffusion filtering and median filtering have been shown to be effective. Figure 2.4 shows an example application of these filtering operations.

[Figure 2.4: Example of preprocessing techniques employed prior to intensity-based image registration: anisotropic diffusion filtering applied to an ultrasound image and median filtering applied to a low-dose CT image.]

In particular, anisotropic diffusion filtering has been successfully applied for preprocessing of ultrasound, CT, and low-dose CT images [76-78].
Similarly, median filtering has been employed, in both the spatial and sinogram domains, to reduce or eliminate metal artifacts and to filter low-dose CT images [79, 80]. In this dissertation we therefore focus on these two filtering techniques.

2.4.1. Anisotropic Diffusion Filtering

Anisotropic diffusion filtering is an iterative process that progressively smoothes an image while preserving its significant edges. The nonlinear anisotropic diffusion algorithm for edge-preserving image smoothing was first proposed by Perona and Malik [81]. For a 3D image I with intensities I(v, t), where v is a vector in 3D space and t is a given point in time (for the purposes of modeling the diffusion process), the diffusion process is described by the following equation:

\frac{\partial I}{\partial t} = \mathrm{div}\left( c(v, t)\, \nabla I(v, t) \right) ,  (2.13)

where c is the diffusion coefficient and takes a value between zero and one. In general, the diffusion coefficient is defined as a function of the image gradient (i.e., c = f(\nabla I)). For noisy images, Whitaker and Pizer [82] showed that gradient estimates taken from the image itself tend to be unreliable and proposed instead the use of a Gaussian-filtered version of the image to calculate the gradient values. Their proposed Gaussian filter has a standard deviation \sigma(t) that decreases as the time t increases, thus resulting in a multiscale approach. Dorati et al. [83] demonstrated the usefulness of Whitaker and Pizer's approach for 3D ultrasound image preprocessing. The diffusion coefficient that uses the Gaussian-filtered image (denoted G_{\sigma(t)} * I) is then defined as:

c = f\left( \nabla \left( G_{\sigma(t)} * I(v, t) \right) \right) .  (2.14)

Several diffusion functions have been proposed in the literature. The two most widely used are:

c_1 = \exp\left( - \left( \frac{\left| \nabla \left( G_{\sigma(t)} * I(v, t) \right) \right|}{K} \right)^2 \right) ,  (2.15)

and

c_2 = \left( 1 + \left( \frac{\left| \nabla \left( G_{\sigma(t)} * I(v, t) \right) \right|}{K} \right)^{1 + \alpha} \right)^{-1} .  (2.16)

These diffusion functions depend on the gradient of the Gaussian-filtered image, while the parameter K adjusts the level at which edges are diffused or preserved, to achieve the desired filtering effect.

2.4.2. Median Filtering

Median filtering is a nonlinear technique commonly used to eliminate speckle noise from ultrasound images and impulse noise from other noisy images. The technique is called nonlinear because it cannot be represented as a direct convolution operation. It is based on rank-ordering the intensity values present in an image and is, in addition, edge preserving. Its advantage over most linear (convolution-based) smoothing operators is that it smooths areas within a particular object while preserving the object's edges. This feature is important, especially in the context of medical image registration, because preserving anatomical boundaries while reducing random noise in the interiors of structures generally improves the accuracy of image registration, segmentation, and visualization. However, such filters, because of their nonlinear nature, tend to be computationally more intensive than linear filtering operations. The 3D realizations of these preprocessing operations, despite their effectiveness, can take several seconds when implemented in software. Consequently, for seamless integration into the IGI workflow, these techniques must be accelerated so that their performance is comparable to the acquisition speed of intraprocedural images. This dissertation addresses this need through real-time implementation of the 3D realizations of these operations, as described in Chapter 3.

2.5. Optimization of Finite Precision Implementations

An emerging trend in image processing, and medical image processing in particular, is custom hardware implementation of computationally intensive algorithms in the quest to achieve real-time performance.
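As a software point of reference for such accelerated designs, both Section 2.4 filters can be prototyped in a few lines. The sketch below implements one 3D anisotropic diffusion iteration using the exponential conductance of Eq. 2.15 with nearest-neighbor differences as the gradient estimate (the Gaussian pre-filtering of Eq. 2.14 is omitted, wrap-around boundaries are used via np.roll, and K and the time step are illustrative choices), alongside a 3D median filter from SciPy:

```python
import numpy as np
from scipy.ndimage import median_filter

def diffusion_step(img, K=30.0, dt=1.0 / 7.0):
    """One iteration of 3D anisotropic diffusion (Perona-Malik style, Eq. 2.15).

    Uses the six nearest-neighbor differences as gradient estimates; choosing
    dt <= 1/6 keeps the six-neighbor 3D update stable.
    """
    img = img.astype(np.float64)
    out = img.copy()
    for axis in range(3):
        for shift in (1, -1):
            diff = np.roll(img, shift, axis=axis) - img  # neighbor difference
            c = np.exp(-(diff / K) ** 2)                 # edge-stopping conductance
            out += dt * c * diff                         # diffusion update
    return out

def median3d(img, size=3):
    """3D median filter over a size^3 neighborhood (rank-order based)."""
    return median_filter(img, size=size)
```

Repeated calls to diffusion_step realize the iterative process; a full implementation would also shrink sigma(t) per Whitaker and Pizer's multiscale scheme rather than skip the Gaussian stage entirely.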
The work presented in this dissertation has a similar spirit in the context of the advanced 3D image processing required during IGIs. For reasons of area and power efficiency and of performance, these implementations often employ limited-precision datapaths. In comparison, the original algorithms are often developed in software using the double-precision representation. Identifying effective wordlengths for these datapaths while accounting for tradeoffs between design complexity and accuracy is a critical and time-consuming aspect of the hardware design process. This problem of converting floating-point implementations into fixed-point (or other limited-precision) representations through identification of optimum wordlengths is an important problem in signal processing applications and has received considerable attention in the literature. Cantin et al. [84] and Todman et al. [85] provide comprehensive reviews of techniques to identify optimal wordlengths. We briefly summarize some important approaches here.

2.5.1. Optimal Wordlength Formulation

Consider a system with m internal variables and a wordlength w_i associated with each variable. Each wordlength can take values between a lower bound w_i^{min} and an upper bound w_i^{max}, such that w_i \in (w_i^{min}, w_i^{max}). Each wordlength is an integer variable, and the wordlength configuration for the entire system can then be represented by a wordlength vector W \in I^m, W = \{w_1, w_2, \ldots, w_m\}. Furthermore, W \in (W_{min}, W_{max}), where W_{min} = \{w_1^{min}, w_2^{min}, \ldots, w_m^{min}\} and W_{max} = \{w_1^{max}, w_2^{max}, \ldots, w_m^{max}\}. Consider a function H associated with the system that defines the hardware implementation cost of a wordlength configuration W. Also, consider that the performance of this limited-precision, quantized system is characterized by a function p(W) and that the system must achieve a certain performance P_{min}.
The wordlength optimization problem can then be stated as:

W^{*} = \arg\min_{W \in (W_{min}, W_{max})} H(W), \quad \text{such that } p(W) \geq P_{min} .  (2.17)

It must be noted that this formulation, for simplicity, considers only one objective function H with respect to a predefined performance criterion P_{min}. The more general multiobjective formulation is briefly described below and revisited in detail later in the context of the FPGA-based implementation of deformable image registration (see Chapter 5).

2.5.2. Simulation-Based Optimal Wordlength Search

An optimal wordlength configuration that meets a certain performance criterion can be identified by solving analytical expressions when the performance function p can be represented analytically. Some earlier reported approaches have adopted this technique [86-90]. However, if the performance function cannot be represented analytically, which is often the case for practical complex systems, simulation-based methods can be used to search for the optimal configuration. This involves searching the design space (defined by the wordlength vector ranges) and finding a solution that satisfies the design criteria. Some popular methods in this category are briefly described below; a detailed description can be found in [84, 91]. An exhaustive search attempts every possible combination of wordlengths between the predefined lower and upper bounds and evaluates the performance of each combination through simulation. The optimum wordlengths can then be selected from the simulation results. An exhaustive search is guaranteed to find the globally optimal configuration; however, the number of solutions explored and the associated execution time increase exponentially with the number of variables. Another method, proposed by Sung and Kum [92], searches for the first solution that satisfies a given performance requirement or error criterion.
This method starts with an initial guess for the system configuration based on univariate simulations. The wordlength of each variable, as provided by this initial guess, is then sequentially incremented by one until a configuration that meets the error criterion is found. Although this method is more efficient than an exhaustive search, it is not guaranteed to find the globally optimal configuration. A sequential search method [84, 93] that takes performance sensitivity into account to determine the direction of the search is another way to approach this problem. This method also starts with an initial guess based on univariate simulations; the subsequent search direction, however, is determined by the sensitivity of the performance to each variable. This sensitivity is estimated by calculating the gradient of the system performance with respect to all the variables, and the search progresses in the direction of the variable that offers the most improvement (that is, the wordlength of that variable is incremented). It is also possible to consider hardware cost sensitivity instead of performance sensitivity in this search method.

2.5.3. Multiobjective Optimization

One limitation of the optimization formulation described above is that search methods based on it find only a single solution that satisfies a design objective. Most real-world problems (including the wordlength optimization problem), however, have several objectives, generally conflicting with one another, that need to be achieved at the same time. For example, in the case of finite-precision implementations, hardware resource requirements and implementation accuracy are two such conflicting objectives. Because of the conflicting nature of the objectives involved, multiobjective optimization problems do not normally have a single optimal solution and even necessitate a new definition of optimality.
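The exhaustive and sensitivity-directed searches of this section, along with the non-dominated filtering used in multiobjective formulations, can be illustrated with a toy model. The cost model (total bits) and error model (a 2^-w contribution per variable) below are hypothetical stand-ins for real synthesis results and simulation-measured quantization error:

```python
import itertools

W_MIN, W_MAX, N_VARS = 4, 12, 3   # wordlength bounds and variable count (toy values)
ERROR_BUDGET = 0.02               # stand-in for the performance criterion P_min

def cost(w):
    """Hypothetical hardware-cost model H(W): total bits."""
    return sum(w)

def error(w):
    """Hypothetical quantization-error model: each variable contributes 2^-w_i."""
    return sum(2.0 ** -wi for wi in w)

# Exhaustive search: guaranteed optimal, but exponential in N_VARS
best = min(
    (w for w in itertools.product(range(W_MIN, W_MAX + 1), repeat=N_VARS)
     if error(w) <= ERROR_BUDGET),
    key=cost,
)

# Sensitivity-directed sequential search: repeatedly grow the wordlength
# whose increment reduces the error the most, until the budget is met
w = [W_MIN] * N_VARS
while error(w) > ERROR_BUDGET:
    gains = [error(w) - error(w[:i] + [w[i] + 1] + w[i + 1:]) for i in range(N_VARS)]
    w[gains.index(max(gains))] += 1

def pareto_front(points):
    """Non-dominated (cost, error) pairs in the sense of Pareto optimality."""
    return [p for p in points
            if not any(q != p and q[0] <= p[0] and q[1] <= p[1] for q in points)]
```

For this symmetric toy model both searches reach a 22-bit total; with realistic, asymmetric cost and error models the sequential search is no longer guaranteed to match the exhaustive optimum, which is exactly the tradeoff between the two strategies described above.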
The most commonly adopted notion of optimality in multiobjective optimization problems is Pareto optimality. A vector of decision variables x^* \in F is Pareto optimal if there does not exist another solution x \in F such that f_i(x) \leq f_i(x^*) for all i, and f_j(x) < f_j(x^*) for at least one j, where f_i represents an objective function defined for every x \in F. This definition of optimality almost always yields a set of solutions called the Pareto-optimal set. The vectors x^* corresponding to the solutions in the Pareto-optimal set are called non-dominated solutions. This concept is illustrated pictorially in Figure 2.5.

[Figure 2.5: Pareto front in the context of multiobjective optimization.]

Formulating wordlength optimization as a multiobjective problem has merit because it allows finding a set of Pareto-optimal configurations representing strategically chosen tradeoffs among the various objectives. This allows a designer to choose an efficient configuration that satisfies given design constraints and provides ease and flexibility in modifying the design configuration as the constraints change. For example, Leban and Tasic [94] used error, delay, and area as objectives. Han and Evans [91] performed optimization of area and error through linear aggregation, while Givargis et al. [95] considered the power-performance tradeoff for a system-on-chip architecture through a series of single-objective optimizations. There are also heuristic techniques that take into account tradeoffs between hardware cost and implementation error and enable automatic conversion from floating-point to fixed-point representations [96]. In this dissertation work, we develop a framework for multiobjective optimization of finite-precision implementations. This framework has been developed for optimization of the FPGA-based image registration and has been validated through post-synthesis evaluation.

2.6. Related Work

2.6.1.
Real-Time Image Preprocessing

Image preprocessing plays a crucial role in image-understanding systems, video processing, and the medical imaging domain. Over the last two decades, much work has been done on implementing image processing components in hardware. A detailed description of various single instruction multiple data (SIMD) and multiple instruction multiple data (MIMD) architectures can be found in [97, 98]. With the availability of a variety of computing platforms such as digital signal processors (DSPs), graphics processing units (GPUs), and FPGAs, some image processing algorithms have also been mapped to these platforms to achieve superior performance. Most reported techniques, however, focus on 1D or 2D realizations and do not adequately address the need for accelerating these operations in 3D. Moreover, because of the larger 3D neighborhoods (N^3 as opposed to N or N^2), data input requirements are increased, and the performance achieved for 1D or 2D realizations may not translate to the corresponding 3D implementations. This additional complexity, coupled with the sheer size of 3D images (typically 2-5 million voxels), makes achieving real-time performance extremely challenging. Consequently, only a few designs for high-speed implementation of 3D image preprocessing techniques have been reported in the literature. We focus on implementations of anisotropic diffusion and median filtering and summarize those efforts here. Rumpf et al. [10] implemented the 2D nonlinear diffusion process on graphics hardware. The primary focus of this work was to achieve acceleration through parallelism and better memory bandwidth. Gijbels et al. [8], on the other hand, have reported a VLSI architecture based on a linear array technique for implementation of the iterative diffusion process. A similar VLSI-based approach was also reported recently for 1D nonlinear signal processing [11].
Accelerated implementations of 3D anisotropic diffusion filtering using computing clusters have also been reported. Bruhn et al. [99, 100] have reported an approach using a 256-node Myrinet cluster, whereas Tabik et al. [101] have explored multiple parallel programming paradigms built on message-passing and shared-memory architectures. Both of these techniques have yielded near-linear speedups. Accelerated implementations of median filters based on searching, sorting, and bit-level methods have previously been reported in the literature. We particularly focus on bit-level methods because they are well suited to finding the median of large 3D neighborhoods in hardware. Bit-level methods for median filtering can be classified into bit-serial sorting, bit-serial searching, threshold decomposition, and majority voting-based methods. Bit-serial sorting is performed using sorting networks such as the odd-even exchange network and the reduced bubble sort network [102, 103]. Bit-serial searching [104], also called the radix method, involves a bit-by-bit search to find the median. The threshold decomposition method [105] provides a modular and parallel design, but its hardware requirements grow exponentially with the number of bits used to represent the images. Majority voting methods are based on determining the bit-wise majority starting from the most significant bits (MSBs). Lee and Jen [9, 106] have described a novel binary majority gate that can determine the majority of binary input signals using an inverter circuit. A compact majority voting circuit using an adder array to count the number of 1s and a threshold comparator to determine each individual bit of the median is described by Benkrid et al. [7]. Variations on this approach have been described in the literature [107-109]. Systolic array architectures for bit-level sorting networks have been shown to improve the concurrency of bit-serial sorting designs [102, 103, 110-113].
The median filter design presented in this work is a combination and 3D extension of the bit-serial searching and majority voting approaches. In this dissertation work, we introduce a novel FPGA-based architecture for 3D anisotropic diffusion filtering. In addition, this work develops an architecture for a 3D median filtering kernel that is faster than existing solutions and capable of supporting larger 3D kernel sizes. Our solution is compact, easily deployable, and capable of processing intraprocedural images faster than their acquisition speeds.

2.6.2. Acceleration of Image Registration

Intensity-based automatic image registration is a key component of modern medical imaging. Fast and accurate image registration can enhance many diagnostic and interventional applications. However, this task is also computationally intensive because of the dimensionality of the images involved and the memory-bound nature of the operation. It is this aspect that has limited the integration of intensity-based image registration (and deformable registration, in particular) into clinical applications. To address this, many researchers have independently attempted to accelerate intensity-based image registration. A classification of these acceleration attempts has been reported by Plishker et al. [114]. We briefly summarize these attempts here.

2.6.2.1. Multi-Processor and Supercomputer-Based Approaches

The image registration problem lends itself well to acceleration through parallel implementation. The inherent data-parallel nature of these algorithms (the same operations are performed on every voxel of an image) makes them readily amenable to parallelization. The majority of earlier reported attempts to accelerate intensity-based deformable registration have primarily employed a multiprocessor approach. Ourselin et al. [115] reported a parallel implementation of affine registration using a 10-processor cluster that provided a 6-fold speedup. Stefanescu et al.
[116] implemented the Demons algorithm [44] on a similar cluster of 15 2-GHz Pentium CPUs and achieved an 11-fold speedup for non-rigid image registration between a pair of magnetic resonance images. Similarly, Ino et al. [117] have reported a fast implementation of MI-based deformable registration using a 128-processor cluster. Another acceleration approach has been to use supercomputers, which offer a high degree of parallelism. Warfield et al. [118] performed deformable registration on a Sun supercomputer in 15 s. However, interactive segmentation of the brain surface in the intraprocedural MR images took several minutes. Moreover, this implementation was specific to brain MR images because of the high surface correspondence. Rohlfing et al. [57] have reported a 40-fold speedup for a spline-based deformable registration algorithm using a 64-processor shared-memory supercomputer (SGI Origin 3800). Although these solutions delivered high performance by virtue of parallelization, the speedup achieved per processor was less than unity. Moreover, these solutions may not be cost effective and, because of their size, are unlikely to be suitable for clinical deployment.

2.6.2.2. Graphics Processor (GPU)-Based Approaches

The recent emergence of powerful graphics processors (GPUs) has enabled a new direction for accelerating computationally intensive applications. Modern GPUs offer an array of processing elements that can provide customized data-parallel processing. Many high-level languages, such as Cg, Brook, and CUDA, are emerging to aid the task of programming GPUs. This has enabled the use of GPUs beyond the graphics domain for many other applications, such as image registration. Strzodka et al. [119] reported the first implementation of image registration using graphics hardware, accelerating a gradient flow-based image registration algorithm. However, it was limited to registration of 2D images and offered only limited speedup.
Kohn et al. [120] have reported another implementation of gradient flow-based image registration that supports 3D images. Although this implementation offered moderate speedup for rigid registration, the performance achieved for 3D deformable image registration was poor. Plishker et al. [121] have employed GPUs for applying transformations to images during rigid registration. This implementation achieved a 3-fold improvement in execution time over a CPU-based implementation. More recently, Vetter et al. [122] have reported acceleration of MI-based multimodal registration using graphics hardware. Although this implementation achieved accuracy comparable to that of a software implementation, the speedup achieved was only about 5-fold. In summary, these reported solutions demonstrate how this promising platform can be utilized for certain image registration techniques. However, the architecture of GPUs, along with their lack of an efficient scatter operation, is not optimally suited for operations such as accumulation of the mutual histogram (MH), which is a prerequisite for the calculation of MI. As a result, GPU-based solutions, despite being compact and low cost, may not provide substantial acceleration for the calculation of MI, which is the most versatile and robust image similarity measure.

2.6.2.3. Other Approaches

Emerging multi-core processors are able to accelerate medical imaging applications by exploiting the parallelism available in their algorithms. Ohara et al. [123] have implemented an MI-based 3D rigid registration algorithm on the Cell Broadband Engine (CBE) processor, which has nine processor cores on a chip and a 4-way SIMD unit for each core. By exploiting this highly parallel architecture and its high memory bandwidth, their implementation with two CBE processors can compute MI around 11 times faster than a sequential implementation. However, this implementation does not support deformable image registration.
General-purpose hardware languages and compilers for transforming high-level descriptions into hardware are becoming increasingly popular. Streams-C, Handel-C, and Mitrion-C are examples of such tools. These tools allow direct translation of code developed using high-level languages such as C, Java, or Matlab into efficient hardware implementations. Although these techniques have provided considerable speedup for applications involving matrix operations, linear algebra, and search, their performance in complex applications requiring architectural insights has been limited. For example, Jiang et al. [124] have reported a method for acceleration of splines-based deformable image registration using Handel-C. The converted design, when implemented using a Xilinx device, could achieve a speedup of only 3.2 when compared with an equivalent software implementation. In comparison with the techniques mentioned above, this dissertation work presents a novel FPGA-based architecture for high-speed implementation of MI-based deformable 3D image registration. This architecture is capable of accelerating MI calculation by a factor of 40 using a single computing element. Consequently, the execution time for deformable image registration is reduced from hours to a few minutes. Furthermore, this implementation is accurate, automatic, compact, and completely retrospective.

2.6.3. Optimization of Finite Precision Implementations

With the need for real-time performance in signal processing applications, an increasing trend is to accelerate computationally intensive algorithms using custom hardware implementations. The architectures presented in this dissertation, for accelerated implementation of image preprocessing and image registration, fall into the same category. A critical step in moving to a custom hardware implementation is converting floating-point implementations to fixed-point realizations for performance reasons.
This conversion process is an inherently multidimensional problem, as several conflicting objectives, such as area and error, must be simultaneously minimized. By systematically deriving efficient tradeoff configurations, one can not only reduce the design time [125] but also enable automated design synthesis [96, 126]. Furthermore, these tradeoff configurations allow designers to identify optimized, high-quality designs for reconfigurable computing applications. The work presented in this dissertation develops a framework for optimizing tradeoff relations between hardware cost and implementation error in the context of FPGA-based image registration. Earlier approaches to optimizing wordlengths used analytical techniques for range and error estimation [86-90]. Some of these have used the error propagation method (e.g., see [89]), whereas others have employed models of worst-case error [87, 90]. Although these approaches are fast and do not require simulation, formulating analytical models for complex objective functions, such as MI, is difficult. Statistical approaches have also been employed for optimizing wordlengths [127, 128]. These methods employ range and error monitoring to identify appropriate wordlengths and do not require range or error models. However, they often need long execution times and are less accurate in determining effective wordlengths. Some published methods search for optimum wordlengths using error or cost sensitivity information. These approaches are based on search algorithms such as "Local," "Preplanned," and "Max-1" search [84, 93]. However, for a given design scenario, these methods are limited to finding a single feasible solution, as opposed to a multiobjective tradeoff curve. In contrast, the techniques we present in this dissertation are capable of deriving efficient tradeoff curves across multiple objective functions.
Other heuristic techniques that take into account tradeoffs between hardware cost and implementation error and enable automatic conversion from floating-point to fixed-point representations are limited to software implementations only [96]. Also, some of the methods based on heuristics do not support different degrees of fractional precision for different internal variables [87]. In contrast, the framework presented in this dissertation allows multiple fractional precisions, supports a variety of search methods, and thereby captures more comprehensively the complexity of the underlying multiobjective optimization problem. Other approaches to this multiobjective optimization problem have employed weighted combinations of multiple objectives, reducing the problem to mono-objective optimization [91]. This approach, however, is prone to finding suboptimal solutions when the search space is nonconvex [129]. Some methods have also attempted to model the problem as a sequence of multiple mono-objective optimizations [95]. The underlying assumption in this approximation, however, is that the design parameters are completely independent, which is rarely the case in complex systems. Modeling the problem as an integer linear program has also been shown to be effective [86], but this approach is limited to cases in which the objective functions can be represented or approximated as linear functions of the design variables. Evolutionary algorithms (EAs) have been shown to be effective in solving various kinds of multiobjective optimization problems [130, 131] but have not been extensively applied to finding optimal wordlength configurations. An exception is the work of [94], which employs mono-objective EAs. In contrast, our work demonstrates the applicability of EA-based search for finding superior Pareto-optimized solutions in an efficient manner, even in the presence of a nonlinear objective function such as MI.
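To illustrate the Pareto-optimality concept underlying such tradeoff curves, consider the following small sketch. The (hardware cost, implementation error) pairs are hypothetical, and this is only an illustration of Pareto dominance, not the dissertation's actual optimization framework:

```python
def dominates(a, b):
    """Configuration a dominates b if it is no worse in every objective
    (cost, error) and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(configs):
    """Return the non-dominated subset of (cost, error) pairs."""
    return [c for c in configs
            if not any(dominates(o, c) for o in configs if o != c)]

# Hypothetical (hardware cost, implementation error) pairs for candidate
# wordlength configurations.
candidates = [(10, 0.50), (12, 0.20), (15, 0.25), (20, 0.05), (25, 0.05)]
front = pareto_front(candidates)
# (15, 0.25) is dominated by (12, 0.20); (25, 0.05) by (20, 0.05).
```

A mono-objective weighted sum would collapse this front to a single point, whereas a multiobjective search preserves the whole tradeoff curve for the designer to choose from.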
Moreover, our optimization framework supports multiple search algorithms and objective function models, and can easily be extended to a wide range of other signal processing applications. This optimization framework, which is developed and validated in the context of FPGA-based 3D image registration, is described in detail in Chapter 5.

Chapter 3: Real-time 3D Image Processing

This chapter presents an FPGA-based architecture for real-time implementation of 3D image preprocessing techniques commonly employed in IGI. First, we outline the filtering algorithms that are being accelerated in this work. Next, we present the architecture for their real-time implementation. Finally, we describe the realization of this architecture, analyze the effects of finite precision implementation, and compare the performance of this implementation with earlier reported efforts.

3.1. Motivation

Real-time and high-quality three-dimensional (3D) intraprocedural visualization is a critical need for IGIs. Recent advances in computer and transducer technology have made high-speed 3D imaging possible with high resolution and acquisition speed. Notably, low-dose computed tomography (CT) and 3D ultrasound have emerged as the preferred volumetric imaging modalities during many image-guided procedures [18, 132-135]. The advent of multislice CT allows high-resolution and high-frame-rate volumetric imaging of the operative field. In the continuous volumetric mode, multislice CT is capable of acquiring images with 256 × 256 × 64 dimensions and a resolution of 0.625 mm, 8 times per second. Similarly, advances in transducer technology have led to improvements in the field of 3D ultrasound imaging, which can now acquire images with 128 × 128 × 128 dimensions and a resolution of 1 mm, 20 times per second. These intraprocedural images, acquired during the procedure for navigation, represent the most current anatomical information but often suffer from poor signal-to-noise ratio.
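The acquisition figures quoted above translate directly into the sustained voxel throughput that any real-time preprocessing stage must match:

```python
# Intraprocedural acquisition rates implied by the figures above,
# in voxels per second.
ct_rate = 256 * 256 * 64 * 8       # multislice CT: ~33.6 Mvoxels/s
us_rate = 128 * 128 * 128 * 20     # 3D ultrasound: ~41.9 Mvoxels/s
print(ct_rate, us_rate)            # 33554432 41943040
```

Any filtering pipeline that falls below roughly 40 Mvoxels/s therefore introduces a processing backlog relative to acquisition.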
To achieve the desired accuracy for IGIs, these intraprocedural 3D images, therefore, must be preprocessed and enhanced before they can be used for advanced image processing operations such as segmentation, registration, and visualization. Toward this end, anisotropic diffusion filtering and median filtering have been shown to be effective in enhancing and improving the visual quality of these images. It is important to note that the interactive nature of IGIs necessitates equivalent image processing speed so that these procedures can be performed in a streamlined manner without any additional processing latency. The aforementioned filtering techniques are based on neighborhood (window) operations. For volumetric (3D) images, these neighborhoods are considerably larger (N^3), thus increasing the complexity of the filtering operations. This complexity, coupled with the sheer size of intraprocedural volumetric images, results in execution times of several seconds for software implementations running on general-purpose workstations (Table 3.1). This processing speed is only a fraction of the acquisition speed of the intraprocedural images and is clearly unacceptable for meeting the real-time requirements of IGIs. Previously reported techniques for accelerated implementation of these filtering operations primarily focused on one-dimensional (1D) or two-dimensional (2D) filters [7-11], with only a few implementations attempting to accelerate these operations in 3D.

Table 3.1: Software execution time of 3D anisotropic diffusion filtering and 3D median filtering of 8-bit images for common kernel sizes (N).

Filter kernel                     Kernel size (N)  Image size (voxels)  Execution time (s)  Voxel processing rate (MHz)
3D anisotropic diffusion filter   7                128 × 128 × 128      2.28                0.92
3D anisotropic diffusion filter   7                256 × 256 × 64       4.58                0.92
3D median filter                  3                128 × 128 × 128      0.85                2.46
3D median filter                  3                256 × 256 × 64       1.59                2.63
3D median filter                  5                128 × 128 × 128      3.01                0.7
3D median filter                  5                256 × 256 × 64       5.67                0.74
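The voxel processing rates in Table 3.1 follow directly from dividing the image size by the execution time, as this quick check of two rows shows:

```python
# Voxel processing rate (MHz) = image size in voxels / execution time,
# reproduced for two rows of Table 3.1.
voxels = 128 * 128 * 128                    # 2,097,152 voxels
rate_aniso = voxels / 2.28 / 1e6            # anisotropic diffusion, N = 7
rate_median = voxels / 3.01 / 1e6           # median filter, N = 5
print(round(rate_aniso, 2), round(rate_median, 2))   # 0.92 0.7
```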
This dissertation presents an FPGA-based architecture for real-time processing of intraprocedural 3D images. Earlier attempts to accelerate 3D anisotropic diffusion filtering were targeted toward multiprocessor clusters [100, 101]. Despite the near-linear speedup offered by these techniques, the need to employ up to 256 processors to achieve real-time performance makes them less suitable for clinical deployment. In this dissertation, we introduce a novel FPGA-based implementation of 3D anisotropic diffusion filtering. The developed solution is compact, easily deployable, and capable of processing the intraprocedural images faster than their acquisition speeds. Some researchers have recently reported high-speed implementations of 3D median filtering using graphics processing units [136] and FPGAs [137]. This work presents an FPGA-based 3D median filtering module that is faster than currently existing solutions and supports larger 3D kernel sizes (3, 5, 7). The designed architecture can achieve a processing rate close to 200 megavoxels per second for both 3D anisotropic diffusion and 3D median filtering, which is equivalent to about 50 processing iterations or operations per second for images of size 256 × 256 × 64. Consequently, this design is capable of meeting the real-time data processing needs of most IGIs.

3.2. Filtering Algorithms

This section briefly describes the 3D image preprocessing algorithms that are being accelerated in the current work. The architecture for their real-time implementation is presented in the subsequent section.

3.2.1. Anisotropic Diffusion Filtering

As described earlier, anisotropic diffusion filtering is an iterative process that progressively smoothes an image I(v, t) while preserving its significant edges. This process can be summarized by the following equation:

\[ \frac{\partial I}{\partial t} = \operatorname{div}\bigl( c(v,t)\, \nabla I(v,t) \bigr), \tag{3.1} \]

where c is the diffusion coefficient.
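One discrete iteration of this process can be sketched in NumPy as follows (2D case, 4-connected neighborhood, with diffusion coefficients computed from a pre-smoothed image as in the design described below). The exponential diffusion function and the circular border handling are our simplifying assumptions for illustration only; dt = 0.2 respects the 1/(n+1) stability bound for a 4-connected neighborhood:

```python
import numpy as np

def diffusion_step(I, I_G, dt=0.2, kappa=20.0):
    """One 2D anisotropic diffusion iteration: for each 4-connected
    neighbor, weight the intensity difference by a diffusion coefficient
    computed from the Gaussian-smoothed image I_G.
    Here c(g) = exp(-(g/kappa)^2); circular shifts stand in for proper
    border handling for brevity."""
    out = I.astype(np.float64).copy()
    for shift in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nb = np.roll(I, shift, axis=(0, 1)).astype(np.float64)
        nb_G = np.roll(I_G, shift, axis=(0, 1)).astype(np.float64)
        c = np.exp(-((nb_G - I_G) / kappa) ** 2)   # edge-stopping coefficient
        out += dt * c * (nb - I)
    return out
```

Because each neighbor difference is weighted by a coefficient that shrinks with the smoothed gradient magnitude, strong edges diffuse slowly while homogeneous regions are smoothed aggressively.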
Since the intraprocedural images typically have poor SNR, the current implementation employs a Gaussian-filtered version of the image to estimate the gradient values, as proposed by Whitaker and Pizer [82]. The corresponding discrete expression, which is implemented in our design (shown for a 2D case for simplicity), is:

\[
\begin{aligned}
I(x,y,t+\Delta t) = I(x,y,t) + \Delta t \,\bigl[\,
  & c\bigl(I_G(x{+}1,y,t)-I_G(x,y,t)\bigr)\,\bigl(I(x{+}1,y,t)-I(x,y,t)\bigr) \\
{}+{}& c\bigl(I_G(x{-}1,y,t)-I_G(x,y,t)\bigr)\,\bigl(I(x{-}1,y,t)-I(x,y,t)\bigr) \\
{}+{}& c\bigl(I_G(x,y{+}1,t)-I_G(x,y,t)\bigr)\,\bigl(I(x,y{+}1,t)-I(x,y,t)\bigr) \\
{}+{}& c\bigl(I_G(x,y{-}1,t)-I_G(x,y,t)\bigr)\,\bigl(I(x,y{-}1,t)-I(x,y,t)\bigr)\,\bigr]
\end{aligned}
\tag{3.2}
\]

where I_G is the Gaussian-filtered version of the image, c is the discrete realization of the chosen diffusion function, and the time step Δt controls the rate and stability of the diffusion process. Gerig et al. [138] calculated maximum values of Δt for different neighborhood structures. For a 3D realization, diffusion is calculated in 3D space with a 6-connected neighborhood; this configuration corresponds to a maximum Δt value of 1/7, which is what the presented design implements.

3.2.2. Median Filtering

The 3D median filter design presented in this dissertation is based on a combination of bit-serial searching and majority voting approaches. This section describes this median finding scheme by means of an example. The algorithm is executed in b steps (for b-bit images), where each step finds one bit of the resulting median value, starting from the most significant bit and proceeding to the least significant bit. Specifically, at the j-th step, the majority bit ("0" or "1") among the j-th significant bits of all the input elements in the neighborhood is calculated and represents the j-th bit of the median of the neighborhood (0 ≤ j ≤ b − 1).
At a given step, when a bit of an input element differs from the majority bit calculated at that step, that element is considered masked and its bit value is fixed for all subsequent steps. Bits already masked in a previous step are not altered in subsequent steps (i.e., if the j-th bit of an input value n, represented using b bits as n_{b−1} n_{b−2} … n_0, is masked, then the algorithm considers n_i = n_j for all i < j). The process of finding the median using this approach is illustrated in Figure 3.1. In this example, for simplicity, we consider a small input neighborhood consisting of only five elements with four bits per voxel (b = 4); therefore, only four processing steps are required. Processing starts at the MSB position of the data elements. The bits of the data elements being considered for calculating the majority bit at any step are indicated in gray in the figure. The masking operation that takes place at the end of every step is indicated by arrows. The masked bits are shown crossed out. One bit of the median is determined at every processing stage, starting from the MSB position, and the results from all the stages are combined to produce the final median value.

3.3. Architecture

We present an FPGA-based architecture that is capable of performing 3D anisotropic diffusion and 3D median filtering of intraprocedural images faster than their acquisition speed. A top-level block diagram of this architecture is shown in Figure 3.2. Input and output images are stored in two independent external memory banks, and the memory controller, input and output image buffers, and the filtering modules are implemented using an FPGA. The presented architecture supports two filtering modules, one for 3D anisotropic diffusion filtering and the other for 3D median filtering.
The filtering module can be selected and reconfigured statically, whereas the memory controller and the image buffers are designed to be common to all supported filtering modules. The role of the input and output memory banks can be switched at runtime, thus enabling execution of consecutive filtering operations (or iterations) without additional data transfers between the memory banks.

Figure 3.1: A median filtering example using the majority voting technique.

In order to achieve real-time performance, it is imperative to aim for a throughput of one processed (output) image voxel per clock cycle. Because both anisotropic diffusion and median filtering involve neighborhood operations, meeting this throughput requirement is challenging, given that an entire neighborhood (N^3 voxels, where N is the filter kernel size) must be accessed in order to compute one output voxel. Moreover, adjoining neighborhoods must be continuously fetched from the input memory bank as the next output voxels are computed sequentially. These neighborhoods are read by the memory controller from the input image memory bank and are stored in the input buffer in an N × N × N arrangement. The filtering module receives the neighborhood to be processed in a pipelined fashion: N × N new voxels every clock cycle. Once the filtering module pipeline is full (after N clock cycles), the filtering module computes one output voxel per clock cycle. The sequential input neighborhoods are continuously processed, and the resulting output voxels are stored in the output buffer. The memory controller then transfers these resultant voxels to the output image memory in bursts of one image row at a time.

Figure 3.2: Block diagram of the FPGA-based real-time 3D image preprocessing system.

The memory controller uses a brick-caching scheme, specifically devised to meet the high input data rate required by this task. This brick-caching scheme takes advantage of the fact that adjacent neighborhoods share N × N × (N − 1) voxels, so only N × N new voxels need to be supplied every clock cycle for continuous neighborhood processing. The implementation of this scheme is described in the following sections. For the remainder of this chapter, we use the following notation: image dimensions are represented as N_x × N_y × N_z. The parameter b indicates the number of bits used to represent the voxel intensity in the image, and N is the filter kernel size, with a corresponding neighborhood size of N^3. Input and output images are arranged in the memory banks in z-y-x order, with rows of the memory aligned with the z direction of the image. The output voxels are also calculated in z-y-x order.

3.3.1. Memory Controller and Brick-caching Scheme

Memory organization and neighborhood access techniques are often the limiting factors in 3D image processing systems [139-142]. However, most practical filtering techniques employ standard neighborhood operations that require block-sequential voxel access, as shown in Figure 3.3. The presented FPGA-based architecture uses a raster scan order distribution of voxels in the image memory, along with a brick-caching scheme, to take advantage of this block-sequential access pattern.

Figure 3.3: Typical voxel access pattern for neighborhood operations-based image processing.

For every output voxel calculation, an entire neighborhood of N × N × N voxels must be accessed. This neighborhood cannot be retrieved in a single burst access of a sequentially organized image memory. Moreover, in a pipelined implementation, data must be continuously fetched for successive neighborhood operations. To sustain the high data rate required to achieve real-time processing speeds, this architecture employs a brick-caching scheme in which the input buffer stores an N × (N + 1) array of image rows (i.e., it stores up to N × (N + 1) × N_z voxels). This input buffer is implemented using high-speed, dual-ported memory blocks internal to the FPGA.
The input buffer can be accessed in a single clock cycle, which enables fast updates and reads. Figure 3.4 shows the block diagram of the input image memory and the input buffer, which consists of an N × (N + 1) array of internal memory blocks, each holding N_z voxel intensity values. We use the following terminology: a brick is an N × N × N_z block of image intensity values stored in the internal buffer. A brick plane is an N × 1 × N_z section of a brick. A brick slice is an N × N × 1 section of a brick. A brick row (or simply row) is a 1 × 1 × N_z section of a brick. Each row corresponds to and contains one input image row of N_z voxels. A pictorial representation of this terminology is shown in Figure 3.5. Bricks are loaded into the buffer one brick row at a time into an available brick plane and are then fed to the filtering pipeline one brick slice at a time. The input buffer, therefore, can store a whole brick plus an extra brick plane. The brick-caching operation is described below. The memory controller fills the input buffer row by row. Each row contains N_z voxels, and thus the transfer of each row takes t_Row = t_lat + N_z / W clock cycles, where t_lat is the number of clock cycles necessary to start a burst memory transfer and W is the effective data bus width in terms of number of voxels (e.g., double-data-rate [DDR] dynamic random access memory [DRAM] offers twice the effective bus width of single-data-rate DRAM with a similar configuration). After the first row is cached in, the controller starts caching the row next to it in the x direction. A complete brick plane (N image rows, N × 1 × N_z voxels) can be loaded in t_Plane = N · t_Row clock cycles. Associated with every brick plane is a ready flag. This flag serves a dual purpose: when "1," it indicates the availability of data for that particular brick plane, and, when "0," it indicates that the brick plane is empty and available for caching image voxels.
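Plugging representative numbers into these timing formulas illustrates the cost of loading one row and one plane. The parameter values below (burst latency, bus width) are hypothetical, chosen only to exercise t_Row = t_lat + N_z/W and t_Plane = N · t_Row:

```python
# Hypothetical parameters: 256 x 256 x 64 image (N_z = 64), kernel N = 7,
# memory delivering W = 4 voxels per clock, burst latency t_lat = 10 cycles.
N, N_z, W, t_lat = 7, 64, 4, 10

t_row = t_lat + N_z // W      # cycles to burst-load one image row
t_plane = N * t_row           # cycles to load one brick plane (N rows)
print(t_row, t_plane)         # 26 182
```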
After one brick plane is loaded into the input buffer, the memory controller sets the corresponding ready flag and starts loading the next brick plane (along the y direction).

Figure 3.4: Block diagram showing the input image memory and the input buffer configuration.

Once a complete N × N × N_z brick is available in the input buffer, it is fed into the filtering module pipeline one brick slice (N × N voxels) at a time. The filtering module pipeline operates on one N × N × N neighborhood at a time and is fed a new brick slice every clock cycle. Loading an entire N × N × N_z brick into the filtering module pipeline thus takes N_z clock cycles. While this operation is in progress, the memory controller loads the next brick plane (along the y direction) into the buffer plane that is not being used for processing (indicated by the ready flag), which requires t_Plane clock cycles. After processing of all the neighborhoods in the N × N × N_z brick is complete, the processing window shifts along the y dimension of the image, and the processing of the new neighborhoods begins. Simultaneously, the ready flag corresponding to the brick plane that is no longer used is set to "0." This available brick plane in the input buffer is then used for caching the next image rows (along the y direction). In this fashion, all brick planes in the input buffer are cyclically used for brick caching during processing. These steps continue until the processing window reaches the end of the column (i.e., until y = N_y). At this point, the processing window moves along the x direction. To accomplish this, the data in the internal buffers are invalidated, and the complete N × N × N_z brick in the next column is cached, which requires a pipeline stall. After the initial brick at the new x coordinate is loaded, processing continues as described earlier. The processing of the entire 3D image is completed accordingly.

Figure 3.5: Pictorial representation of the notation used in the brick-caching scheme.
For continuous pipelined operation with minimum stalls, the memory controller must provide the next brick plane before the processing of the previous brick is completed. Therefore, the relationship expressed in the following equation must be met:

\[ N \,\bigl( t_{lat} + N_z / W \bigr)\, T_{mem} \;\le\; N_z \, T_{proc}, \tag{3.3} \]

where T_mem is the clock period of the external memory and T_proc is the clock period of the internal filtering pipeline. The left-hand side of Eq. (3.3) is the total time required to load a new brick plane; the right-hand side is the total time required to process a whole brick. Assuming that efficient burst accesses (supported by most modern dynamic memories) are being used, which implies t_lat << N_z / W, the following relationship must be maintained to minimize pipeline stalls:

\[ N \, T_{mem} \;\le\; W \, T_{proc}. \tag{3.4} \]

3.3.2. 3D Anisotropic Diffusion Filtering

This architecture supports 3D anisotropic diffusion filtering through a pipelined implementation of the 3D extension of the formulation shown in Eq. (3.2). As indicated by that formulation, we use a Gaussian-filtered version of the image for improved diffusion coefficient estimation. Our design implements this Gaussian filtering at runtime using an embedded 3D Gaussian filtering module. Figure 3.6 shows a top-level block diagram of the 3D anisotropic diffusion filtering module. This filtering pipeline operates on N × N × N voxel neighborhoods. On each clock cycle, the input data buffer feeds an N × N voxel neighborhood (brick) slice into the pipeline. The center voxel intensity value is passed to the delay element for accumulation at the end of the pipeline, per Eq. (3.2). The 3 × 3 voxel neighborhood located at the center of the incoming N × N slice is passed to the image gradient calculator, which calculates the image gradients with respect to each of the 6-connected neighbors of the center voxel.
The embedded Gaussian filtering module calculates, in parallel, the Gaussian-filtered values for each of the six-connected neighbors and passes them to the diffusion coefficient calculator, which computes the diffusion coefficients c_0 … c_5 corresponding to each of the input gradients. Taking advantage of the parallelism native to FPGAs, these operations are executed in parallel; as a result, this filtering module can calculate output at the rate of one voxel per clock cycle. The resulting voxel intensity values are fed into the output buffer, and the memory controller then stores them in the output memory bank.

3.3.2.1. Embedded Gaussian Filtering Module

Equation (3.5) gives the coefficients of a 3D Gaussian filter kernel, where σ is the standard deviation of the Gaussian function and d is the Euclidean distance between the desired coefficient location and the kernel center:

\[ G_d = \exp\!\left( -\frac{d^2}{2\sigma^2} \right). \tag{3.5} \]

For a chosen σ, the coefficient values depend exclusively on the Euclidean distances from the kernel center; thus, the Gaussian filter kernel exhibits symmetries with respect to its center (i.e., it is radially symmetric). Our architecture takes advantage of these symmetries to reduce the number of multipliers needed to implement the 3D Gaussian kernel from (N − 2)^3 to k(k + 1)(k + 2)/6, where N is the size of the anisotropic diffusion filtering kernel, (N − 2) is the corresponding size of the embedded Gaussian filtering kernel, and k = (N − 1)/2. For example, the 5 × 5 × 5 embedded Gaussian kernel that arises in anisotropic diffusion filtering with N = 7 can be implemented using only 10 multipliers (as opposed to 125), as we reported previously [143]. For this kernel, each individual slice (a 5 × 5 plane of the kernel) has six isodistance regions, and the whole 3D kernel has 10. During the filtering operation, all voxels that are equidistant from the kernel center are multiplied by the same Gaussian coefficient.
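The isodistance-region counts quoted above can be verified by enumerating the distinct squared distances over the kernel offsets:

```python
from itertools import product

# Unique squared distances from the center of a 5 x 5 x 5 kernel
# (offsets -2..2 per axis); each one needs its own Gaussian coefficient,
# and hence one multiplier.
offsets = range(-2, 3)
regions_3d = {x*x + y*y + z*z for x, y, z in product(offsets, repeat=3)}
regions_2d = {x*x + y*y for x, y in product(offsets, repeat=2)}
print(len(regions_3d), len(regions_2d))   # 10 6
```

This confirms the 10 unique coefficients for the full 5 × 5 × 5 kernel and the 6 isodistance regions per 5 × 5 slice.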
The intensities corresponding to the voxels in the same isodistance region can, therefore, be pre-added before being multiplied by the Gaussian coefficients. Because a 5 × 5 × 5 Gaussian kernel contains 10 isodistance regions, the minimum number of multipliers necessary to implement this filter kernel is, therefore, 10.

Figure 3.6: Top-level block diagram of 3D anisotropic diffusion filtering. This diagram indicates paths that are executed in parallel.

A block diagram of the Gaussian filter bank is shown in Figure 3.7. On each clock cycle, the input buffer feeds an N × N voxel neighborhood into the bank. This neighborhood is decomposed into five (N − 2) × (N − 2) neighborhoods by the input demux, and these neighborhoods are then passed to five embedded 3D Gaussian filters.

Figure 3.7: Block diagram of the embedded Gaussian filter bank (for N = 7, the corresponding Gaussian kernel size is 5).

Figure 3.8 shows a block diagram of an embedded 3D Gaussian filter. The pre-adder accumulates values corresponding to the isodistance groups in the incoming (N − 2) × (N − 2) neighborhood, thus compressing the neighborhood based on the intraplane isodistance criterion (e.g., each single 5 × 5 slice of a 5^3 Gaussian neighborhood has 6 isodistance regions). These pre-added values are then passed to (N − 2) pipeline buffers, which make the values corresponding to the entire (N − 2)^3 neighborhood available in parallel. The sorter-accumulator aggregates the values corresponding to the isodistance groups between the slices, compressing them further using the isodistance criterion for the entire neighborhood (e.g., to the 10 values that correspond to the 10 unique coefficients in a 5 × 5 × 5 Gaussian neighborhood). These values are passed to the multiplier array, where they are multiplied by their corresponding Gaussian coefficients. The adder tree then adds the resulting values and outputs the result for the current 3D neighborhood.
The results for subsequent neighborhoods are produced continuously as a result of the pipelined implementation of the operation. The Gaussian coefficients are precomputed for a given value of σ and are stored in internal registers using a fixed-point representation. The effect of this finite precision representation is analyzed in the following section.

Figure 3.8: Pipelined implementation of an individual Gaussian filter element (Gaussian kernel size = 5).

The results from the five individual Gaussian filters correspond to a cross-shaped region of a neighborhood slice. In order to operate on the 3D six-connected neighborhood, these results are passed to pipelined registers composed of two cross buffers and the center buffer. The cross buffers store all five values in a neighborhood slice, whereas the center buffer stores only the center value. As a result, the entire six-connected neighborhood is available across these buffers. The buffers then send the Gaussian-filtered, six-connected 3D voxel neighborhood to the subtractor array, which calculates the six corresponding gradient values and passes them to the diffusion coefficient calculator.

3.3.2.2. Diffusion Coefficient Calculation

As noted previously, gradients calculated after Gaussian filtering are used to estimate the diffusion coefficients. For a b-bit image, the absolute value of the gradient is limited to the range 0 to 2^b − 1. Taking advantage of this fact, the desired diffusion function is discretized in 2^b steps and implemented using a lookup table (LUT). The use of a LUT allows an efficient implementation of any diffusion function. It must be noted that, because the dynamic range of all diffusion functions is limited to [0, 1], there is no significant loss of precision from using a fixed-point representation. The effect of this finite precision representation is further analyzed in the following section.
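A software model of such a LUT is straightforward. The Perona-Malik exponential diffusion function, the 8-bit gradient range, and the 16-bit fixed-point format below are illustrative choices, not necessarily the design's actual configuration:

```python
import math

def diffusion_lut(b=8, kappa=30.0, frac_bits=16):
    """Tabulate c(g) = exp(-(g/kappa)^2) for all 2^b gradient magnitudes,
    quantized to unsigned fixed point with frac_bits fractional bits.
    Since c(g) lies in [0, 1], no integer range bits are needed."""
    scale = (1 << frac_bits) - 1
    return [round(math.exp(-(g / kappa) ** 2) * scale) for g in range(1 << b)]

lut = diffusion_lut()
# lut[0] is the maximum code (c = 1); large gradients map toward 0,
# which suppresses diffusion across strong edges.
```

Because any diffusion function reduces to a table of 2^b entries, swapping functions in hardware requires only reloading the LUT contents.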
3.3.2.3. Image Gradient and Result Calculation

Image gradient calculation is performed by an array of six parallel subtractors. These subtractors calculate the difference between the intensity of the voxel located at the center of the kernel and each of its six-connected neighbors. These values are then multiplied by their corresponding diffusion coefficients (supplied by the diffusion coefficient calculator) using an array of six parallel multipliers. The resulting filtered intensity is then obtained by adding the six results from the multipliers to the original center voxel intensity. After rounding and truncation, this result is sent to the output buffer and then saved into the output memory bank.

3.3.3. Median Filtering

The 3D median filtering design presented in this work is an extension of the majority finding-based implementation proposed by Benkrid et al. [7]. That design was reported for a 2D realization and computed only one bit of the median value per clock cycle. All bits of the median value were obtained using a feedback loop; hence, for b-bit images, this approach required b clock cycles to compute the resulting median value. The implementation presented in this dissertation extends that design to 3D and unrolls the feedback loop using multiple processing stages. Moreover, our implementation exploits the regularity of this median finding algorithm with a systolic array architecture that allows a pipelined implementation and, therefore, can achieve a throughput of one median value per clock cycle. Thus, our implementation can achieve a voxel processing speed b times higher than the previously reported architecture [7]. Our linear systolic array employs b identical processing stages for filtering a b-bit image. Figure 3.1 illustrates the execution of this algorithm for a small example and can be used to gain further insight into its hardware implementation.
Each processing stage of our systolic array implementation corresponds to one step of the algorithm execution. Starting from the MSB, each stage generates one bit of the resulting median value of the neighborhood being processed. We first describe the operation of an individual processing element and then explain the functioning of the entire linear systolic pipeline, which contains b processing elements.

3.3.3.1. Processing Element

The processing element is the atomic unit of the proposed linear systolic array design. A functional block diagram of the processing element at the jth stage is shown in Figure 3.9. The data inputs to this processing element are Data_Bits_j and Next_Data_Bits_j: respectively, the N³ bits considered for majority calculation and the (j+1)th most significant bits of the N³ neighborhood elements. It must be noted that, although the Next_Data_Bits_j are actual image intensity bits, the Data_Bits_j are provided by the (j−1)th processing stage and may have been masked in the earlier stages. The accompanying input Mask_Bits_j is a binary flag that indicates which bits in Data_Bits_j have been masked in the prior stages. A processing element performs two important tasks. First, it computes the majority bit within the N³ input data bits (Data_Bits_j); second, it performs the masking operation based on the calculated majority bit and outputs masked data bits (Data_Bits_j+1) and the corresponding binary flag (Mask_Bits_j+1) to be used in the next processing stage. The units that perform these two operations are described below.

Figure 3.9: A single stage (processing element) of the linear systolic median filtering kernel.

Majority Finding Unit (MFU)

The MFU consists of a bit-counting circuit that counts the number of 1s in the input bits that are considered for majority calculation (Data_Bits_j). This counting is performed using a bit adder tree customized for a chosen neighborhood size.
This count is then compared against a threshold, which is programmed to be half of the number of elements contained in the neighborhood. The binary result of this comparison is the jth most significant bit of the output median value. The highly compact, pipelined, and customized implementation of the MFU minimizes the combinational delay within the processing element.

Mask Selection Unit (MSU)

After the median bit has been calculated, the MSU performs the masking operation. It computes the mask bit for each bit of Data_Bits_j, based on whether that bit matches the majority bit. In addition, it considers and preserves the bits that were masked in the prior stages (Mask_Bits_j). Thus, the mask calculated at the jth stage (Mask_Bits_j+1) is a combination of the mask bits from the prior stages and the mask calculated at the current stage. This masking operation is implemented using an exclusive OR (XOR) operation and two multiplexing operations for each data bit. The calculated mask (Mask_Bits_j+1) is then used to selectively generate Data_Bits_j+1 from the input Next_Data_Bits_j, while ensuring that values corresponding to the masked bits are preserved. Data_Bits_j+1 is then used in the next processing stage to calculate the (j+1)th most significant bit of the median value.

3.3.3.2. Linear Systolic Design for Median Finding

The proposed linear systolic design is realized by cascading b processing elements for filtering b-bit images. On every clock cycle, a complete neighborhood containing N³ voxels, b bits each, is fed to this linear systolic array. However, processing stage (j+1) cannot perform its operation until stage j finishes its processing and provides Data_Bits_j+1 and Mask_Bits_j+1. Similarly, stage j produces its output (the jth most significant bit of the median) one clock cycle earlier than the corresponding output of stage (j+1).
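The cascade of MFU and MSU stages implements an MSB-first, bit-serial median. A software model of this procedure (a sketch of the algorithm, not of the hardware timing) is:

```python
def bitserial_median(vals, b=8):
    """MSB-first majority-voting median of an odd-length neighborhood.

    Models the cascaded MFU/MSU stages: stage j takes the majority of
    the current bit-plane as median bit j; elements whose bit disagrees
    with the majority are masked, and their differing bit is held
    (preserved) in all later stages.
    """
    n = len(vals)
    assert n % 2 == 1, "neighborhood size must be odd"
    frozen = [None] * n                  # per-element held bit once masked
    median = 0
    for j in range(b - 1, -1, -1):       # MSB -> LSB, one "stage" per bit
        bits = [frozen[i] if frozen[i] is not None else (vals[i] >> j) & 1
                for i in range(n)]
        m = 1 if sum(bits) > n // 2 else 0   # MFU: majority via popcount
        median = (median << 1) | m
        for i in range(n):                   # MSU: mask disagreeing elements
            if frozen[i] is None and bits[i] != m:
                frozen[i] = bits[i]
    return median
```

A masked element keeps contributing its differing bit, so an element already known to lie above (or below) the median continues to count on that side in every later stage; this is the "preserved" behavior the MSU provides in hardware.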
In order to compensate for these delays and processing latencies and to provide synchronized operation, additional line delay units and data shift registers must be inserted at the input and output of the systolic array. Figure 3.10 shows a diagram of this configuration with b processing elements and the required delay buffers. These delays are introduced for synchronization only; as long as sequential input neighborhoods are continuously supplied (a new set of N³ voxels every clock cycle), the systolic array design is capable of computing one median result per clock cycle. This result is then sent to the output buffer and subsequently saved into the output memory bank by the memory controller. For correct operation, the Mask_Bits_1 input of the first processing stage (stage 1, MSB) is grounded (set to "0"), indicating that no bits from the input data (Data_Bits_1) are masked. Also, in the final processing stage (stage b, LSB), Mask_Bits_b+1 and Data_Bits_b+1 do not need to be calculated, because there is no next stage. Consequently, the MSU is not needed in the final stage, which contains only an MFU to compute the last median bit. In general, for large 3D neighborhoods, the speed of the MFU is the limiting factor of the systolic array performance. In applications requiring high voxel throughput and large filtering kernels, the operation of the MFU can be pipelined. However, in those cases the depth of the delay elements used to synchronize the inputs and outputs of the different processing elements must be adjusted.

3.4. Implementation and Results

The architecture described above was implemented using an Altera Stratix II EP2S180F1508C4 FPGA (Altera Corp., San Jose, CA) with two external memory banks serving as input and output image memory. The memory banks used were 1-GB DDR2 small-outline dual-inline memory DRAM modules with a 64-bit data bus (i.e., W = 16 for b = 8) running at a 200-MHz clock speed.
The architecture was designed using VHSIC hardware description language (VHDL) and synthesized using Altera Quartus II 6.1. The memory controller was also implemented in VHDL and was built around the DDR2 DRAM controller megacore supplied with Altera Quartus II. Both filtering modules were custom designed in VHDL, as per the design description provided earlier. Functional verification and postsynthesis timing simulation of the entire system were performed using Modelsim SE 6.2 (Mentor Graphics, San Jose, CA). For this purpose, DDR2 DRAM simulation models provided by Micron (www.micron.com) were used. The presented design was realized to support 8-bit images (b = 8) and, consequently, all results in this section are presented for 8-bit images. The execution speed of the presented architecture was obtained from postsynthesis timing simulation of the design.

Figure 3.10: Linear systolic array architecture for the median filter kernel using the majority voting technique.

3.4.1. Effects of Finite Precision Representation

The real-time filtering performance offered by the presented design is critical for the time-sensitive nature of IGIs, but of equal importance is the accuracy of the filtering process. Most software implementations represent the arithmetic operations involved in the filtering algorithms using a double-precision floating-point format. This format offers high dynamic range and precision, which may or may not be required depending on the filtering technique to be implemented. Median filtering, for example, is performed exclusively using integer data (because digital images are represented using b-bit integer data), and, hence, there is no loss in precision from using a fixed-point implementation with sufficient dynamic range (i.e., using b bits for b-bit images). Our implementation of 3D median filtering uses b-bit integer representation for b-bit images and therefore provides results identical to those of a software implementation. Anisotropic diffusion filtering, however, involves operations on data in real format. Our implementation, for the sake of efficiency in area and execution speed, used fixed-point representation to implement these arithmetic operations. A general framework for analyzing optimized tradeoff relationships between hardware resources and implementation accuracy for finite-precision designs is presented later (see Chapter 5). For this implementation, however, given the simplicity of the arithmetic operations and the relatively minor impact of the fixed-point datapath on the total hardware resource requirement, we employed simulation-based wordlength search techniques. We analyzed the effect of the number of bits used for this fixed-point representation on the filtering accuracy, treating a software (C++) implementation employing double-precision floating-point representation as a reference. This analysis was performed with an 8-bit image with dimensions 256 × 256 × 64. There are two sources at which error resulting from fixed-point precision can affect the accuracy of the filtering operation: embedded Gaussian filtering and diffusion function calculation. To gain additional insight, we evaluated the accuracy of these individual sources as well as the accuracy of the anisotropic diffusion filtering as their combined effect.

Table 3.2: Average error in intensity per voxel for a Gaussian-filtered image resulting from fixed-point representation of the Gaussian kernel coefficients.

σ of the Gaussian kernel | 8 bits | 12 bits | 16 bits
0.3 | 0.20 ± 0.47 | 0.07 ± 0.26 | 0.004 ± 0.06
0.5 | 0.63 ± 0.68 | 0.02 ± 0.11 | 0.004 ± 0.07
0.7 | 0.42 ± 0.81 | 0.03 ± 0.16 | 0.003 ± 0.06
1.0 | 0.21 ± 0.47 | 0.001 ± 0.04 | 0.001 ± 0.03

Table 3.2 presents the average error in intensity per voxel after embedded Gaussian filtering (kernel size, N = 5), where the Gaussian kernel coefficients are represented using the fixed-point format with the designated number of bits. Because the Gaussian kernel was normalized, all coefficients were within the range [0, 1], and, hence, we used one bit to represent the integer part and the rest for the fractional part. We performed this analysis for typical choices of σ for a Gaussian kernel size of 5, which corresponds to an anisotropic diffusion filtering kernel size of 7. The average error for various choices of σ with 8-bit representation is less than one intensity value, and, as expected, the average error decreases with an increasing number of bits. It must be noted, however, that embedded Gaussian filtering is used only to estimate the diffusion coefficients, and, hence, small errors introduced in this operation may not have a significant impact on the final anisotropic diffusion filtered intensity value. Because this design supported 8-bit images, a 256-entry LUT was used to implement the diffusion function described in the background section. We implemented this function for reasonable choices of the parameter K, which controls the level of the gradient at which edges are diffused or preserved. The value of K depends on the image modality and the amount of edge preservation desired. For ultrasound and low-dose CT images, however, its value is typically less than 20% of the intensity range.

Table 3.3: Average error per sample of the diffusion function resulting from fixed-point representation of the diffusion coefficients employed in the presented architecture.

K | 8 bits | 12 bits | 16 bits
10 | 42×10⁻⁵ ± 97×10⁻⁵ | 3×10⁻⁵ ± 7×10⁻⁵ | <10⁻⁵
20 | 78×10⁻⁵ ± 114×10⁻⁵ | 6×10⁻⁵ ± 8×10⁻⁵ | <10⁻⁵
30 | 121×10⁻⁵ ± 125×10⁻⁵ | 9×10⁻⁵ ± 8×10⁻⁵ | <10⁻⁵
50 | 196×10⁻⁵ ± 116×10⁻⁵ | 13×10⁻⁵ ± 7×10⁻⁵ | <10⁻⁵
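The simulation-based wordlength analysis behind Tables 3.2 and 3.3 can be reproduced in miniature for the Gaussian coefficients (a sketch; the 1D 5-tap kernel with σ = 0.5 is an illustrative choice, while the one-integer-bit fixed-point format follows the text):

```python
import math

def gaussian_coeffs(sigma, n=5):
    """Normalized 1D Gaussian kernel coefficients (they sum to 1)."""
    half = n // 2
    w = [math.exp(-(x * x) / (2.0 * sigma * sigma))
         for x in range(-half, half + 1)]
    s = sum(w)
    return [v / s for v in w]

def quantize(c, bits):
    """Fixed point with 1 integer bit and (bits - 1) fractional bits."""
    scale = 1 << (bits - 1)
    return round(c * scale) / scale

# Worst-case coefficient quantization error for each candidate wordlength
errs = {bits: max(abs(c - quantize(c, bits)) for c in gaussian_coeffs(0.5))
        for bits in (8, 12, 16)}
```

The rounding error per coefficient is bounded by half an LSB, i.e., 2⁻ᵇⁱᵗˢ, which is consistent with the tabulated image-domain errors shrinking as bits are added.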
As the selected diffusion function takes values in the range [0, 1], we used one bit to represent the integer part and the rest for the fractional part. Table 3.3 presents the average error per sample of the diffusion function resulting from fixed-point representation with the designated number of bits for various choices of K. Although the average error increases with K, its mean and standard deviation are consistently less than 0.1% of the data range, even with an 8-bit representation. Finally, Table 3.4 reports the average error in intensity per voxel resulting from the combined effect of finite-precision implementation of both the Gaussian coefficients and the diffusion function. For this analysis, we used the same number of bits for fixed-point representation of both entities, with one bit for the integer part and the rest for the fractional component. The kernel size (N) of the anisotropic diffusion filter was chosen to be 7, with embedded Gaussian filtering with σ = 0.5, and the diffusion function described earlier was implemented with K = 20. To evaluate error accumulation over multiple iterations of anisotropic diffusion filtering, we performed this analysis for up to 5 iterations, which is typical for filtering of intraprocedural images. The average error in intensity increases with the number of iterations, but its mean and standard deviation are consistently less than 0.04% of the intensity range, even with 8-bit representation.

Table 3.4: Average error in intensity per voxel for an anisotropic diffusion filtered image resulting from fixed-point representation of the Gaussian coefficients and the diffusion function.

Number of filtering iterations | 8 bits | 12 bits | 16 bits
1 | 0.008 ± 0.092 | 0.001 ± 0.018 | <0.001
3 | 0.021 ± 0.144 | 0.001 ± 0.031 | <0.001
5 | 0.030 ± 0.171 | 0.002 ± 0.039 | <0.001

Overall, our precision analysis indicates that even when using 8-bit fixed-point representation to perform Gaussian filtering and diffusion function calculation, the average error in intensity is only a very small percentage of the intensity range. Such small errors in intensity may not be significant for advanced operations such as registration, segmentation, and visualization and are unlikely to affect the accuracy and precision of IGIs. Our implementation, therefore, uses 8-bit fixed-point representation for these operations.

3.4.2. Hardware Requirements

Table 3.5 lists the significant hardware requirements for the important modules in the proposed architecture, parameterized on the filter kernel size (N) and the number of bits used to represent the voxel intensity (b). The parameter k, introduced in the context of 3D anisotropic diffusion filtering, represents the number of unique isodistances in the Gaussian kernel and is related to the filter kernel size N (usually odd) as:

k = (N + 1)/2.    (3.6)

Table 3.5: Hardware requirements of the architecture for real-time 3D image preprocessing.

Hardware module | Multipliers (b × b bit) | Internal memory (bits) | N | ALUTs (% utilization) | f_max (MHz)
Input buffer and controller | — | N_z · N · (N + 1) · b | 7 | 1957 (1.5%) | 233
Output buffer and controller | — | N_z · 2b | 7 | 1743 (1.5%) | 233
Anisotropic diffusion filter | k(k + 1)(k + 2)/6 + 5 | 3b · 2^b | 7 | 3824 (3%) | 236
Median filter | — | — | 5 | 11308 (8%) | 224

The linear systolic array implementation of the 3D median filter requires logic resources only, and the resource requirements for important components of this filter kernel are listed separately in Table 3.6. These two tables indicate how the hardware requirements of our architecture scale with the parameters N and b.
As dictated by the resource limitations imposed by the target device (Altera Stratix II EP2S180F1508C4) and real-time speed requirements, our current implementation can support filter kernel sizes (N) of {5, 7} for anisotropic diffusion filtering and {3, 5} for median filtering. The corresponding kernel sizes supported by our architecture for the embedded Gaussian filtering within the anisotropic diffusion filter are {3, 5}. Table 3.5 also lists the absolute and percentage logic resources consumed by the important modules in the architecture and the maximum operating frequency (f_max) at which these modules can run for a specific instantiation (choice of N). The percentage logic resources are reported with reference to the target device, the Altera Stratix II EP2S180F1508C4. The images used for this performance and logic consumption analysis were 8-bit images (b = 8). The choices of N for this analysis represent common kernel sizes and are listed in the fourth column of the table.

Table 3.6: Hardware requirements for the components of the linear systolic implementation of the 3D median filtering.

Hardware component | Number required
Processing elements | b
Data registers | b · N³
Mask select registers | b · N³
Data pipeline registers | (b − 1) · N³
Line delay elements | (N³ + 1) · b(b − 1)/2

3.4.3. Filtering Performance

The 3D median filtering module and the 3D anisotropic diffusion filtering module were synthesized for kernel sizes of {3, 5} and 7, respectively, for filtering 8-bit images. The rest of the system, including the memory controller and the input and output buffers, was parametrically synthesized to support the desired filtering operation and kernel size. The entire system was clocked at 200 MHz, which also corresponds to the filtering pipeline frequency (i.e., T_proc = 5 ns). The image memories were also clocked at 200 MHz (T_mem = 5 ns).
For this configuration, Table 3.7 reports the execution time for 3D anisotropic diffusion filtering and 3D median filtering as obtained during postsynthesis timing simulation of the entire system. The image sizes used for this measurement correspond to typical dimensions of intraprocedural images.

Table 3.7: Execution time of 3D anisotropic diffusion filtering and 3D median filtering.

Filter kernel | Kernel size (N) | Image size (voxels) | Execution time (ms) | Voxel processing rate (MHz)
3D anisotropic diffusion filter | 7 | 128 × 128 × 128 | 10.90 | 192
3D anisotropic diffusion filter | 7 | 256 × 256 × 64 | 21.63 | 194
3D median filter | 3 | 128 × 128 × 128 | 10.75 | 195
3D median filter | 3 | 256 × 256 × 64 | 21.44 | 196
3D median filter | 5 | 128 × 128 × 128 | 10.82 | 194
3D median filter | 5 | 256 × 256 × 64 | 21.58 | 194

As indicated in Table 3.7, our implementations of 3D anisotropic diffusion filtering and 3D median filtering can easily achieve a processing rate of 46 frames per second (fps) for images of size 256 × 256 × 64 voxels, which is a typical size of an intraprocedural volumetric CT scan. The corresponding processing rate for an intraprocedural 3D ultrasound scan with typical dimensions of 128 × 128 × 128 voxels is around 92 fps. For iterative operations such as anisotropic diffusion filtering or sequential filtering operations, this processing rate translates to 18 fps with five iterations (or sequential operations) per frame, which is sufficient to meet the real-time needs of most IGIs. Table 3.8 and Table 3.9 compare the execution speed of the presented architecture for 3D anisotropic diffusion filtering and 3D median filtering, respectively, against a corresponding software implementation and previously reported high-speed implementations on different computing platforms. The execution time has been normalized by the image dimensions for all implementations, and the performance is presented in terms of voxel processing rate to facilitate a fair comparison independent of image dimensions.
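These rates follow directly from the one-voxel-per-clock-cycle pipeline; a quick consistency check (illustrative arithmetic only; the small gap between the ideal and measured times is pipeline fill and memory overhead):

```python
T_PROC = 5e-9          # one voxel per 5-ns cycle (200-MHz pipeline)

def ideal_time_ms(dims):
    """Lower bound on execution time for a one-voxel-per-cycle pipeline."""
    nx, ny, nz = dims
    return nx * ny * nz * T_PROC * 1e3

ct_ms = ideal_time_ms((256, 256, 64))   # ~20.97 ms vs. 21.63 ms measured
fps = 1e3 / 21.63                       # ~46 fps at the measured time
```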
The software implementation was developed in C++, and its performance was measured on an Intel Xeon 3.6-GHz workstation with 2 gigabytes of DDR2 400-MHz main memory. Although this architecture can support various kernel sizes for the filtering operations, for consistency the performance has been compared for a kernel size (N) common to all implementations: N = 3 for median filtering and N = 7 for anisotropic diffusion filtering.

Table 3.8: Performance comparison of the 3D anisotropic diffusion filtering kernel.

Implementation | Platform | Filter kernel | Voxel processing rate (MHz) | Speedup
Software (C++) | Xeon workstation | 3D | 0.92 | 208
Bruhn et al. [100] | 256-processor cluster | 3D | 105 | 1.83
Tabik et al. [101] | 16-processor cluster | 3D | 5.66 | 33.9
Dandekar et al. [144] | FPGA | 3D | 192 | —

As indicated by Table 3.8, our implementation of 3D anisotropic diffusion filtering provides more than two orders of magnitude speedup over the software implementation on a single workstation. Moreover, the performance of the current architecture represents an improvement over a corresponding implementation using a 256-processor computing cluster reported previously [100]. Our work presented a novel FPGA-based implementation of 3D anisotropic diffusion filtering. Salient features of this filtering module are an embedded Gaussian filtering implementation that minimizes the number of multipliers and a pipelined design that allows a throughput of one output voxel per clock cycle. This filtering module offers the flexibility to support several anisotropic diffusion techniques previously reported in the literature. For example, the multiscale approach proposed by Whitaker and Pizer [82] can be implemented by changing the embedded Gaussian filter coefficients at the end of each iteration, and a time-dependent diffusion function [145] can be implemented by reprogramming the values in the diffusion function LUT.
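The normalization used in Tables 3.8 and 3.9 amounts to the following (a sketch; the numbers are taken from Table 3.8):

```python
def voxel_rate_mhz(voxels, time_s):
    """Voxel processing rate, normalized by image size."""
    return voxels / time_s / 1e6

def speedup(fpga_rate, other_rate):
    """Speedup of the FPGA design relative to another implementation."""
    return fpga_rate / other_rate

# Values from Table 3.8 (anisotropic diffusion)
s_sw = speedup(192, 0.92)    # ~208x over the C++ workstation implementation
s_cluster = speedup(192, 105)  # ~1.8x over the 256-processor cluster
```

Because the rate is per voxel, the comparison holds regardless of the image dimensions used in each reported experiment.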
One limitation of this filtering module is the limit on the size of the embedded Gaussian filter kernel; implementing Gaussian kernels larger than 7 would result in prohibitively high hardware requirements. Such large kernels, however, are uncommon in most applications. Although this architecture performs some of the arithmetic functions using fixed-point representation, the variation in the output intensity values is only a small fraction of the intensity range, and these variations are unlikely to affect the accuracy and precision of IGIs. Table 3.9 compares the performance of the FPGA-based 3D median filtering operation described in this work with previously reported high-speed implementations. The present implementation provides more than an order of magnitude speedup over software- and GPU-based 3D implementations and a DSP-based 2D implementation. Jiang and Crookes [137] recently reported an FPGA-based 3D implementation that is capable of achieving a voxel processing rate of 50 MHz. That design, however, was based on a partial sorting technique and cannot be easily extended to kernel sizes beyond 3. Our implementation, in contrast, achieves a superior voxel processing rate and is sufficiently compact to allow implementation of kernel sizes up to 7, which is sufficient for most common image processing tasks. The logic resources required by the described systolic array-based median filter do scale up as kernel sizes get larger; but as modern FPGAs become denser and offer improved logic capacity, this requirement remains a small percentage of the total available resources (see Table 3.5).

Table 3.9: Performance comparison of the 3D median filtering kernel.

Implementation | Platform | Filter kernel | Voxel processing rate (MHz) | Speedup
Software (C++) | Xeon workstation | 3D | 2.63 | 74
Viola et al. [136] | GPU | 3D | 0.76 | 257
Gallegos-Funes and Ponomaryov [146] | DSP | 2D | 4.5 | 43
Jiang and Crookes [137] | FPGA | 3D | 50 | 3.9
Dandekar et al. [144] | FPGA | 3D | 195 | —

3.5. Summary

This chapter presented an FPGA-based architecture for real-time preprocessing of volumetric images acquired during IGIs. The developed architecture enables 3D anisotropic diffusion filtering and 3D median filtering of intraprocedural images at a rate of 50 fps, which is faster than the current acquisition speeds of most imaging modalities. The solution presented offers real-time performance and is compact and accurate, and, hence, suitable for integration into the IGI workflow. As IGI applications become increasingly popular, intraprocedural imaging modalities continue to offer wider coverage and higher imaging speed. Thus, there is a corresponding need for real-time processing of these images. The real-time performance of our design, along with its throughput of one voxel per cycle, can cater to these 4D (3D + time) image processing needs.

Chapter 4: Hardware-Accelerated Deformable Image Registration

Intensity-based deformable image registration plays a critical role in many diagnostic and interventional applications requiring image combination. Despite the advantages (such as accuracy, automation, and retrospective nature) of this approach, these algorithms find limited use in clinical applications due to their computational complexity. This chapter presents a novel FPGA-based architecture for accelerated implementation of MI-based deformable registration. This architecture is capable of reducing the execution time of MI-based deformable registration from hours to a few minutes. First, we describe the registration algorithm that is being accelerated. Next, we present the architecture for its high-speed implementation. Finally, we characterize the execution performance of this architecture and provide qualitative validation results. The optimization of this architecture for accuracy and hardware resources is presented in Chapter 5.
The quantitative validation and the clinical applicability of this architecture are presented later, in Chapter 6.

4.1. Motivation

Combining complementary information from intraprocedural and preprocedural images is a fundamental need in IGI applications. These images, however, are acquired at different times, using different imaging scanners and protocols, and as a result are usually misaligned. Therefore, they need to be registered (or aligned) for a meaningful combination and fusion of the information they contain. Deformable image registration techniques can compensate for both local deformation and large-scale tissue motion and are the ideal solution for achieving the aforementioned image registration. Some studies, in particular, have independently underlined the importance of deformable registration and/or soft tissue modeling for IGIs [18, 19]. However, despite their advantages, deformable registration algorithms are seldom used in current clinical practice. The large number of degrees of freedom that these algorithms employ makes them extremely computationally intensive. On a modern workstation, most deformable registration algorithms can take several hours, which is clearly unacceptable for IGIs requiring on-demand performance. As a result, most earlier reported techniques for aligning preprocedural and intraprocedural images employ a rigid-body approximation, which is often not valid because of underlying nonrigid tissue deformation. In addition, some of these techniques are not retrospective (i.e., they require some advance planning at the time of preprocedural imaging), which further limits their applicability. Mutual information (MI)-based deformable registration has been shown to be effective in multimodality image registration because of the robustness of the similarity measure [69]. Moreover, MI-based image registration is automatic and completely retrospective because it uses image intensities to achieve the alignment.
Walimbe and Shekhar [60, 67] have earlier reported an MI-based deformable registration algorithm that utilizes volume subdivision. Hierarchical volume subdivision-based image registration techniques are inherently faster than most other deformable registration techniques and are more amenable to hardware acceleration. This algorithm has been used and rigorously validated in the context of PET-CT registration [66]. This clinical validation demonstrated the registration accuracy of the aforementioned algorithm to be comparable to that of a group of clinical experts and its mean registration accuracy for the abdominal region to be superior to that of an earlier reported free-form deformation (FFD)-based technique [147]. This algorithm is theoretically general and has been shown to be effective for various applications employing multimodal deformable registration [66, 77, 148-151]. Although computationally efficient, a software implementation of this algorithm can take several hours, which is still too slow for direct integration into the IGI workflow. It is, therefore, necessary to accelerate this algorithm and reduce the processing time to the order of minutes, and ultimately seconds, for its assimilation into the clinical workflow. Although accelerated implementations of MI-based deformable registration algorithms using very large multiprocessor clusters have been proposed earlier [57, 115, 117, 118], their per-node performance does not compare favorably with our implementation. Furthermore, these solutions may not be cost effective and are unlikely to be suitable for clinical deployment. This chapter presents a novel field-programmable gate array (FPGA)-based accelerated implementation of the aforementioned deformable registration algorithm, specially geared toward improving target delineation during image-guided interventions. The reported solution provides a speedup of about 40× for MI calculation, thus reducing the deformable registration time from hours to minutes.
In Chapter 6, we demonstrate fast and successful registration of intraprocedural abdominal CT scans with preprocedural CT and PET scans using the developed architecture. We further demonstrate that the registration accuracy of the hardware implementation is comparable to that of a software implementation and is on the order of a few millimeters. This registration accuracy, coupled with the execution speed and compact implementation of the reported solution, makes it suitable for integration into the IGI workflow.

4.2. Algorithm for Deformable Image Registration

Hierarchical volume subdivision-based deformable image registration techniques are inherently faster than most intensity-based deformable registration techniques (e.g., FFD-based techniques) and are more amenable to acceleration through hardware implementation. 3D image registration using volume subdivision has been proposed earlier, but the earlier implementations were limited to a locally translation-based model. Walimbe and Shekhar [60, 67] enhanced this model by incorporating local rotations and reported a quaternion-based scheme for interpolating multiple 3D rigid-body transformations for deformable registration using the volume subdivision approach. For a pair of images, one treated as reference and the other as floating, this algorithm performs deformable registration using a series of hierarchical, locally rigid-body registrations. The six-parameter rigid registration at each level is optimized by maximizing the MI between the reference and floating images (RI and FI, respectively). This hierarchical registration scheme is shown in Figure 4.1. The initial optimal rigid alignment (at the root level) between RI and FI can be represented using a transformation matrix T^0_0 (where T^i_j represents the cumulative optimal transformation at level i for subvolume j).

Figure 4.1: Pictorial representation of hierarchical volume subdivision-based deformable image registration and associated notation.
Next, the algorithm uses a hierarchical octree-based subdivision scheme. At each subdivision level i, the RI is divided into 8^i subvolumes, numbered from 0 to 8^i − 1. Each of these subvolumes is then individually registered with the FI, under transformation range constraints derived from the transformation of its parent subvolume at the previous level, T^(i−1)_parent(j). The notation parent(j) refers to the subvolume at the previous subdivision level i − 1 that contains the current subvolume j. For example, at the root level (i = 0), there is a single subvolume (the entire image), numbered j = 0. After one level of subdivision (i = 1), there are eight subvolumes, numbered from j = 0 to j = 7. At level i = 1, parent(3) refers to the subvolume numbered 0 at level i = 0, because it contains subvolume j = 3 at the current level (i = 1) of subdivision (see Figure 4.1). The optimal alignment of subvolume j within the FI is also determined by maximizing MI under a six-parameter rigid-body transformation model. Volume subdivision and subvolume registration continue as long as the voxel count of an individual subvolume remains above a predefined limit (usually 16³) needed to yield a statistically significant similarity measure. Thus, this algorithm achieves hierarchical refinement of the localized matching between RI and FI. The final cumulative nonrigid alignment between the image pair is computed by quaternion-based direct interpolation of the individual subvolume transformations at the final subdivision level.

4.2.1. Calculating MI for a Subvolume

Registration of a subvolume during the hierarchical refinement process is based on maximization of MI, which is a statistical measure. With progressive subdivision, the subvolumes at every level become increasingly smaller. The mutual histogram (MH) corresponding to an individual subvolume becomes sparse, thus rendering MI unreliable.
The aforementioned algorithm addresses this issue by using the MH of the entire image (all the subvolumes) to calculate MI during the registration of a subvolume. The contribution of the current subvolume k at level i to the MH is computed under the current candidate transformation T_k^(*i) (where T^* denotes a candidate transformation during the optimization process). The contribution to the MH from the rest of the subvolumes remains constant during this registration process and is derived from their parent subvolumes. Thus, MI is computed over the entire image, with local variations corresponding to the subvolume under optimization. Equations (4.1)-(4.3) summarize this process. The function Accumulate_j(T) adds to the MH the contributions of the voxels in a given subvolume j, using the mapping provided by the given transformation T. A detailed description of this deformable registration algorithm can be found in [60].

MH_Total^(i,k) = MH_Subvolume^(i,k) + MH_Rest^(i,k)    (4.1)

MH_Subvolume^k = Accumulate_(j=k)(T_k^(*i))    (4.2)

MH_Rest^k = Σ_(j≠k) Accumulate_j(T_parent(j)^(i-1))    (4.3)

4.3. Acceleration Approach

The aforementioned algorithm uses MI as a measure of image similarity. MI is an intensity-based similarity measure, and its calculation requires processing all the voxels in the RI. Registration through maximization of MI attempts to find the transformation that best aligns an FI with an RI. This MI-based registration typically requires thousands of iterations (MI evaluations), depending on the image complexity and the degree of initial misalignment between the images. Repeated MI computation, which requires accessing both images (RI and FI), is memory access intensive, and in particular, the memory access pattern in the FI is completely governed by the transformation applied. This operation, therefore, does not benefit from the cache-based memory architectures present in most modern PCs (the caches are too small to fit 3D images).
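As a concrete reference for the quantity being computed, MI can be obtained from the mutual histogram and its marginals. This is a minimal numpy sketch in double precision, not the fixed-point hardware arithmetic described later:

```python
import numpy as np

def mutual_information(mh):
    """MI from a mutual histogram mh[r, f]: H(RI) + H(FI) - H(RI, FI).
    Marginal sums of the MH give the individual histograms."""
    p_joint = mh / mh.sum()
    p_ref = p_joint.sum(axis=1)          # reference-image marginal
    p_flt = p_joint.sum(axis=0)          # floating-image marginal

    def entropy(p):
        p = p[p > 0]                     # ln(p) is undefined at p = 0
        return -np.sum(p * np.log(p))

    return entropy(p_ref) + entropy(p_flt) - entropy(p_joint.ravel())

# Perfectly correlated two-intensity images: MI equals the marginal
# entropy, ln 2; statistically independent intensities give MI = 0.
mh_aligned = np.array([[10.0, 0.0], [0.0, 10.0]])
```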
Because memory speed has not evolved at the same rate as microprocessor speed, the introduction of faster microprocessors is not expected to significantly speed up image registration. Thus, the factor limiting the performance of software implementations is the calculation of MI for different candidate transformations. Castro-Pareja et al. [139] have shown that, for typical medical images, accumulating the MH and calculating MI can take up to 99% of the total image registration time in software. Our efforts toward accelerating this algorithm, consequently, are targeted at an optimized and pipelined implementation of MH accumulation and MI calculation.

Figure 4.2: Pictorial representation of the acceleration approach.

This approach is pictorially represented in Figure 4.2. In general, the execution time T required by an n-stage pipelined implementation of an operation is given as:

T = t * (n + N/m),    (4.4)

where t is the latency of each stage, N is the amount of data to be processed, and m is the number of data units processed in parallel at each stage. In the case of MI calculation, N represents the number of RI voxels to be processed. For typical volumetric images (which are usually larger than 128^3 voxels) it can be assumed that N >> n*m; the main parameters that control the MI calculation time are then t and m. Supporting m > 1 requires a superscalar architecture and multiple processing pipelines with individual image memory access, which is not practical. Therefore, to provide maximum pipeline performance, our architecture focuses on reducing t, the latency of each pipeline stage. The lower bound on t is the period of the system clock, and achieving this bound means that all the pipeline stages (including image memory access) complete their operations in one clock cycle. The following section describes an architecture geared toward meeting this goal for accelerated calculation of MI.
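The execution-time model of equation (4.4) can be sketched directly; the 10 ns stage latency below is a hypothetical value chosen only for illustration:

```python
def pipeline_time(t, n, N, m=1):
    """Equation (4.4): execution time of an n-stage pipeline with stage
    latency t, processing N data units, m of them in parallel per stage."""
    return t * (n + N / m)

# For a 128^3-voxel RI, N >> n*m, so T is dominated by t*N: reducing the
# stage latency t directly reduces the MI calculation time.
N = 128 ** 3
T = pipeline_time(t=10e-9, n=20, N=N)    # hypothetical 10 ns stage latency
```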
4.4. Architecture

MI-based image registration can be thought of as an optimization problem of finding the best alignment between two images. During the execution of the algorithm, the optimization process runs on the host workstation. The host provides a candidate transformation, while the presented FPGA-based solution applies it to the images and performs the corresponding MI computation. The computed MI value is then used by the host to update the candidate transformation and eventually find the optimal deformable alignment between the RI and the FI. This workflow is indicated in Figure 4.2.

Figure 4.3: Top-level block diagram of the architecture for accelerated implementation of deformable image registration.

The top-level block diagram of the FPGA-based architecture for accelerated implementation of volume subdivision-based image registration is shown in Figure 4.3. The important modules in this architecture are described below.

4.4.1. Voxel Counter

Calculation of MI requires processing every voxel in the RI. This is achieved by sequentially cycling through all the voxels in the RI. In addition, because the implemented algorithm is based on volume subdivision, RI voxels within a 3D neighborhood corresponding to an individual subvolume must also be processed sequentially. This is implemented as follows: the host programs the FPGA-based MI calculator with subvolume start and end addresses. For the entire image (as required for rigid registration), the start address is (0, 0, 0), whereas the end address is equal to the dimensions of the RI. When a subvolume needs to be processed (for example, during subvolume optimization after image subdivision), the start and end addresses programmed by the host correspond to those of the subvolume. The voxel counter sequentially computes the addresses corresponding to all voxels within a given neighborhood range (a subvolume, for example) in z-y-x order.
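This traversal can be sketched as three nested counters. Which dimension varies fastest is an assumption of this sketch (the text specifies only z-y-x ordering):

```python
def voxel_counter(start, end):
    """Yield RI voxel addresses from start (inclusive) to end (exclusive)
    in z-y-x order; this sketch assumes x is the fastest-varying counter."""
    (x0, y0, z0), (x1, y1, z1) = start, end
    for z in range(z0, z1):              # slowest counter
        for y in range(y0, y1):
            for x in range(x0, x1):      # fastest counter
                yield (x, y, z)

# Entire image: start = (0, 0, 0), end = RI dimensions; a subvolume is
# processed by programming its own start/end addresses instead.
addresses = list(voxel_counter((0, 0, 0), (2, 2, 2)))
```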
This is implemented using three synchronized counters, one for each dimension. A functional diagram of this module is presented in Figure 4.4. Through pipelined implementation, this module is capable of generating the address of one RI voxel per clock cycle.

Figure 4.4: Functional block diagram of voxel counter.

This module operates in two modes: in the reference image processing mode, the generated address is used to fetch RI voxels from the external image memory, whereas in the floating image processing mode, the generated RI address is transformed to the floating image space for further processing.

4.4.2. Coordinate Transformation

The initial step in MI calculation involves applying a candidate transformation T_j^(*i) to each voxel coordinate v_r in a subvolume j of the RI to find the corresponding voxel coordinates in the FI (represented by v_f). This is mathematically represented as shown in (4.5). Because the algorithm is linear at every subvolume, this is implemented using the six-parameter rigid transformation model.

v_f = T_j^(*i) * v_r    (4.5)

An interesting aspect of this coordinate transformation and subsequent operations is that they involve operations in both real (millimeter) space and voxel address (image index) space. For example, the transformation (translations and rotations) is defined in real space, whereas the voxels to be fetched from the RI and FI are identified in voxel address space. Thus, there is a need to convert between these address spaces during the calculation of MI. This conversion can be performed by using the voxel size information associated with each image. To circumvent the need for performing this conversion in our FPGA-based architecture, we appropriately scale the transformation matrix (in millimeter space) and represent it using a mathematically equivalent matrix in voxel address space. This conversion is performed in software by the host just prior to MI calculation.
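This host-side scaling step can be sketched with numpy as follows; the center-relative form and symbol names follow equation (4.6) below, and the function names are illustrative:

```python
import numpy as np

def to_voxel_space(R, T, vr, vf, fi_center):
    """Fold a millimeter-space rotation R (3x3) and translation T (3,) into
    a voxel-address-space matrix and offset (host-side scaling; names are
    illustrative). vr, vf: voxel sizes (mm) of the RI and FI."""
    vr, vf = np.asarray(vr, float), np.asarray(vf, float)
    M = np.asarray(R, float) * vr[None, :] / vf[:, None]  # M[a,b] = r_ab*vr_b/vf_a
    offset = np.asarray(T, float) / vf + np.asarray(fi_center, float)
    return M, offset

def map_voxel(M, offset, v, ri_center):
    """Map an RI voxel address (i, j, k) to FI voxel coordinates."""
    return M @ (np.asarray(v, float) - np.asarray(ri_center, float)) + offset

# Identity rotation, zero translation, RI voxels twice as large as FI voxels:
M, off = to_voxel_space(np.eye(3), [0, 0, 0], vr=[2, 2, 2], vf=[1, 1, 1],
                        fi_center=[0, 0, 0])
```

With these settings, RI address (1, 0, 0) maps to FI coordinate (2, 0, 0), as expected for a 2:1 voxel-size ratio.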
This matrix is then fed to the FPGA-based MI calculator, which performs all its operations in voxel address space. This converted transformation can be represented as:

[p]   [r_xx*vr_x/vf_x  r_xy*vr_y/vf_x  r_xz*vr_z/vf_x]   [i - RI_cx]   [T_x/vf_x]   [FI_cx]
[q] = [r_yx*vr_x/vf_y  r_yy*vr_y/vf_y  r_yz*vr_z/vf_y] * [j - RI_cy] + [T_y/vf_y] + [FI_cy]    (4.6)
[r]   [r_zx*vr_x/vf_z  r_zy*vr_y/vf_z  r_zz*vr_z/vf_z]   [k - RI_cz]   [T_z/vf_z]   [FI_cz]

In this equation, the tuple (i, j, k) represents the address of a voxel in RI space, calculated by the voxel counter, and (p, q, r) is the corresponding address in FI space. The r_ab terms represent the components of the rotation matrix, whereas the T_a terms represent the components of the translation vector. vr_a and vf_a represent the voxel sizes (in millimeters) of the reference image and the floating image, respectively. (RI_cx, RI_cy, RI_cz) and (FI_cx, FI_cy, FI_cz) represent the centers of the reference image and floating image in voxel address space, respectively. This transformation model is represented using a 3 × 3 rotation matrix and a 3 × 1 translation vector. The host calculates these transformation components by appropriate scaling (as described above) of the current candidate transformation provided by the optimization routine and sends them to the MI calculator. Fixed-point representation is used to store the individual elements of these transformation components.

Figure 4.5: Functional block diagram of coordinate transformation unit.

The coordinate transformation is accomplished by simple arithmetic computations, such as additions and multiplications, as illustrated in Figure 4.5. The pipelined implementation of this module allows transformation of one RI voxel address per clock cycle.

4.4.3. Partial Volume Interpolation

The coordinates mapped into the FI space (v_f) do not normally coincide with a grid point (integer location), thus requiring interpolation. This scenario is illustrated in Figure 4.6 (a).
In this figure, for simplicity, we present a two-dimensional case, but a similar scenario presents itself in three dimensions as well. The calculation of the interpolation weights associated with the neighborhood in which the mapped coordinate lands (again, for a two-dimensional case) is illustrated in Figure 4.6 (b). In comparison, a three-dimensional case has eight interpolation weights. Nearest-neighbor, linear, and partial volume (PV) interpolation schemes have been traditionally used for this purpose.

Figure 4.6: Fundamentals of interpolation schemes.

The mathematical formulation for calculating the interpolation weights using these techniques is presented in Figure 4.6 (c). The primary difference between PV and linear interpolation is that PV interpolation does not consolidate the weights and associated intensity values through weighted summing. Instead, it preserves the intensity-interpolation weight pairs and updates the mutual histogram (MH) at the corresponding eight locations (for a 3D case). The PV interpolation scheme increases the memory access requirement of the MH accumulation operation, but it provides smooth changes in the MH values (and an associated smoother optimization curve) for small changes in transformation [69]. Consequently, our architecture implements PV interpolation as its interpolation scheme. In general, v_f has both fractional and integer components and lands within an FI neighborhood of size 2 × 2 × 2. The interpolation weights required for PV interpolation are calculated from the fractional components of v_f as follows:

W0 = (1 - xf)(1 - yf)(1 - zf)
W1 = xf(1 - yf)(1 - zf)
W2 = (1 - xf) yf (1 - zf)
W3 = xf yf (1 - zf)
W4 = (1 - xf)(1 - yf) zf
W5 = xf(1 - yf) zf
W6 = (1 - xf) yf zf
W7 = xf yf zf    (4.7)

In this equation, the tuple (xf, yf, zf) corresponds to the fractional components of v_f in the x, y, and z dimensions, respectively.
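Equation (4.7) translates directly into code. Because the eight weights always sum to one, each processed RI voxel deposits unit mass into the MH:

```python
def pv_weights(xf, yf, zf):
    """Eight PV (trilinear) interpolation weights of equation (4.7), computed
    from the fractional components of the mapped coordinate v_f."""
    return [
        (1 - xf) * (1 - yf) * (1 - zf),  # W0
        xf * (1 - yf) * (1 - zf),        # W1
        (1 - xf) * yf * (1 - zf),        # W2
        xf * yf * (1 - zf),              # W3
        (1 - xf) * (1 - yf) * zf,        # W4
        xf * (1 - yf) * zf,              # W5
        (1 - xf) * yf * zf,              # W6
        xf * yf * zf,                    # W7
    ]

w = pv_weights(0.25, 0.5, 0.75)          # sums to 1 for any (xf, yf, zf)
```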
This interpolation scheme is implemented in this architecture by realizing the above formulation using simple arithmetic computations, such as additions and multiplications, as illustrated in Figure 4.7. The pipelined implementation of these operations allows processing of one transformed RI voxel per clock cycle. Fixed-point arithmetic is used to perform these operations. The corresponding floating voxel intensities are fetched by the image controller in parallel using the integer components of v_f. The image controller also fetches the voxel intensity corresponding to v_r. The MH must then be updated for each pair of reference and floating voxel intensities (eight in all), using the corresponding weights computed by the PV interpolator.

Figure 4.7: Functional block diagram of partial volume interpolation unit.

4.4.4. Image Memory Access

The dimensions of typical 3D medical images are in the range of 128 × 128 × 128 to 512 × 512 × 512 voxels. Because intensity-based image registration typically operates on images with 8-bit intensity values per voxel, this corresponds to image sizes ranging from 2 to 128 Megabytes. In comparison, the internal, high-speed memory provided by modern FPGAs is limited to 16 Megabits (equivalent to 2 Megabytes). Consequently, the images to be registered cannot be stored using the memory internal to the FPGA, and external storage must be employed for this purpose. Because of the size of the medical images, use of static random access memory (SRAM) modules (which are fast, but offer poor density and capacity) is not very efficient. Dynamic random access memory (DRAM) modules, in comparison, offer the necessary density to store the large images that are typically encountered in medical imaging.
Although these memories do not allow random accesses throughout the entire memory module with a uniform latency, they support burst-mode accesses that allow efficient random accesses within a single row of the memory module. Our architecture takes advantage of this feature and uses external DRAM memories to store the images to be registered (in other words, the RI and FI).

Figure 4.8: Voxel access patterns of the reference and floating images encountered during image registration.

During the calculation of MI for a given voxel location in the RI, the reference image memory must be accessed only once, whereas a complete 2 × 2 × 2 neighborhood (as required for PV interpolation) must be fetched from the floating image memory. Thus, there is a disparity between the memory access needs for these two images. The following subsections briefly describe the access requirements for both images and present the strategies we adopted to meet those requirements.

4.4.4.1. Reference Image

An example of the memory access pattern for the RI during image registration is illustrated in Figure 4.8. Of the two images, the RI has the more relaxed access requirements, because it is accessed in a sequential manner (in z-y-x order) and only one voxel must be fetched from the RI for a given RI voxel address. This kind of access benefits from burst accesses and memory caching techniques. Furthermore, the implemented algorithm is based on volume subdivision and requires traversing the RI on a subvolume basis. This means that the voxels that belong to a subvolume need to be accessed consecutively. An interesting feature of modern DRAMs is that they are logically partitioned into rows and columns. As a result, these memories incur an additional latency of several clock cycles when switching between different rows of the memory. Accessing different columns within a given row, however, is efficiently supported through burst accesses and can be performed within one clock cycle.
Our architecture takes advantage of this feature and stores the RI in such a way that each subvolume is aligned with a row of the DRAM memory. This memory organization is illustrated in Figure 4.9. This allows efficient access to the voxels that belong to a subvolume. In the presented architecture, both the RI and FI are stored in separate logical partitions of the same DRAM module. During MI calculation, however, both these memories must be accessed simultaneously over a single bus. This can lead to memory access conflicts and result in poor performance. To address this issue, our architecture uses the internal memory of the FPGA to cache a sub-block of RI voxels. Thus, during the processing of that block of RI voxels, the image controller has parallel access to both RI and FI voxels. The RI voxels are fetched from the internal FPGA memory (with an access time of one clock cycle), whereas the FI voxels are fetched directly from the external memory.

Figure 4.9: Organization of the reference image memory.

4.4.4.2. Floating Image

An example of the memory access pattern for the FI during image registration is illustrated in Figure 4.8. The FI has much higher memory access requirements than the RI, for two primary reasons:

1. For every RI voxel address to be processed, eight FI voxels (corresponding to a 2 × 2 × 2 neighborhood) must be fetched. This neighborhood is then used for performing PV interpolation and MH accumulation.

2. Although the RI is accessed in a sequential manner, the corresponding access pattern in the FI is completely governed by the transformation (T_j^(*i)) that is currently being applied. For example, in Figure 4.8 the access pattern in the FI is not aligned with the natural dimensions (x, y, or z) of the image.

The first aspect of this memory access requirement is similar to that encountered in the context of volume rendering.
In the graphics domain, this problem has been solved by developing techniques that allow parallel access to the entire 2 × 2 × 2 voxel neighborhood. One way to provide this parallel neighborhood access is through the use of cubic addressing [140]. Cubic addressing employs eight memory modules that are accessed in parallel at different addresses, through arithmetic manipulation of a corresponding single "neighborhood address". These parallel memory modules are thus capable of providing the entire 3D neighborhood in one clock cycle. This approach, however, requires custom addressing of the memory modules. Castro-Pareja et al. [139] have reported a method to implement cubic addressing using standard DRAMs by storing multiple copies of the image. This approach does not require special addressing and takes advantage of the burst-mode accesses provided by DRAMs. The presented architecture implements a similar scheme by storing eight copies of the FI.

Figure 4.10: Organization of the floating image memory.

This scheme is illustrated in Figure 4.10. The FI voxels are arranged sequentially such that performing a size-2 burst fetches two adjacent 2 × 2 neighborhood planes, thus making the entire neighborhood available simultaneously. Although this approach increases the image storage requirement, it enables accessing the entire 3D neighborhood in one clock cycle. Further, the density of DRAM chips allows storing multiple copies of the FI. The second aspect of the FI memory access requirement is addressed through the following strategy: we group the neighborhoods that belong to an image block (within the FI) and assign those neighborhoods to a single row in the DRAM module. Thus, as long as the transformed address location (v_f) stays within an FI subvolume, the corresponding neighborhoods can be fetched within one clock cycle.
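The parity idea underlying cubic addressing [140] can be sketched as follows. This illustrates why eight banks suffice for single-cycle neighborhood access; it is not the exact eight-copy layout of Figure 4.10:

```python
def bank(x, y, z):
    """Memory module that holds voxel (x, y, z) under parity-based cubic
    addressing: one of eight banks selected by the coordinate parities."""
    return (x & 1) | ((y & 1) << 1) | ((z & 1) << 2)

def neighborhood_banks(x, y, z):
    """Banks touched by the 2 x 2 x 2 neighborhood anchored at (x, y, z)."""
    return {bank(x + dx, y + dy, z + dz)
            for dz in (0, 1) for dy in (0, 1) for dx in (0, 1)}

# Every 2x2x2 neighborhood maps onto all eight banks exactly once, so the
# eight voxels can be read in parallel in a single cycle.
all_parallel = all(neighborhood_banks(x, y, z) == set(range(8))
                   for x in range(4) for y in range(4) for z in range(4))
```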
Because the RI is traversed one subvolume at a time, it is highly likely that the transformed location of that subvolume is confined to one or a few FI subvolumes. Our architecture takes advantage of this spatial locality of reference.

4.4.5. Updating Mutual Histogram

For a given RI voxel RV, there are eight intensity pairs (RV : FV_0, ..., FV_7) and corresponding interpolation weights. Because the MH must be updated (read-modify-write) at these eight locations, this amounts to 16 accesses to the MH memory for each RI voxel. This means that the MH memory has more stringent access requirements than the floating image memory. Moreover, the locations that need to be accessed within the MH for a given RI voxel may not have any relationship with each other (such as the neighborhood relation in the context of the FI memory) and can be completely random. Fortunately, the size of the MH memory is typically much smaller (less than 128 × 128) than that of the images (256 × 256 × 256). As a result, the internal (on-chip), high-speed memory can be used for storing the MH. Our first strategy to meet this high memory access requirement is to use high-speed, dual-ported memory internal to the FPGA to store the MH. Because this memory is dual-ported, it can be read and written simultaneously (using different ports), an operation that is integral to MH accumulation.

Figure 4.11: Pipelined implementation of MH accumulation using dual port memory.

An overview of the pipelined MH accumulation design, which takes advantage of this feature, is presented in Figure 4.11. Furthermore, because this memory can work at high speeds, the MH accumulation pipeline can run at twice the clock rate of the voxel processing pipeline. Thus, a pipelined implementation of MH accumulation can process (read-modify-write) two voxels per clock cycle of the main pipeline. Our second strategy is to use multiple parallel MH accumulation pipelines to support the high memory access requirement.
We employ four copies of the MH and associated parallel pipelines that independently and concurrently update the partial MHs. The partial MHs, stored in each MH copy, are eventually combined when the MH is read during the entropy calculation. A similar strategy is used for storing the individual histogram for the FI. These two strategies, combined, provide sufficient memory access speed to meet the requirements of MH accumulation. While the MH is being computed, the individual histogram accumulator unit computes the histograms for the RI and the FI. These individual histograms are also stored using internal, dual-ported memories. The valid voxel counter module keeps track of the number of valid voxels accumulated in the MH and calculates its reciprocal value. The resulting value is then used by the entropy calculation unit for calculating the individual and joint probabilities.

4.4.5.1. Data Hazards and Preaccumulation Buffers

The presented architecture implements the operation of updating the MH in a pipelined fashion. This means that for each MH accumulation pipeline (there are four such pipelines), one interpolation weight must be accumulated per clock cycle. The MH address that must be updated with this interpolation weight is determined by the RI-FI intensity pair. Furthermore, the realization of this operation using dual-ported memories (as described above) has a latency of three clock cycles (one clock cycle each for the read, modify, and write operations). As a result, read-after-write (RAW) hazards could arise if sequential MH update transactions that attempt to access identical locations (in other words, transactions with the same RI-FI intensity combination) are separated by fewer than three clock cycles. To detect and eliminate these RAW hazards, the presented architecture employs a preaccumulate buffer in each pipeline. The functional block diagram of this buffer is shown in Figure 4.12.
This buffer compares the address of the current transaction with those of the three prior transactions (three being the latency of the operation). If conflicting transactions are detected, the buffer invalidates the current transaction and adds the interpolation weight of the current transaction to that of the first conflicting transaction. In summary, this is equivalent to aggregating the weights from all the conflicting transactions, so that all the transactions that constitute a RAW hazard are converted into a single update to the MH. This scheme entirely eliminates any possible RAW hazards.

Figure 4.12: Preaccumulate buffers to eliminate RAW hazards in MH accumulation pipeline.

4.4.5.2. Calculating MH_Rest

During the optimization process of finding the best alignment for a given subvolume k at level i, MH_Total^k is computed as shown in equation (4.1). For this calculation, it is necessary to compute the contribution of the remaining subvolumes to MH_Total^k using the registration information at the previous level of subdivision (T_parent(j)^(i-1), j ≠ k). This process must be repeated for every subvolume at the current level i, and the contents of MH_Rest^k will be different every time, depending on the subvolume under consideration. Computing MH_Rest^k from scratch for every iteration of each subvolume would not be efficient. To avoid this repeated calculation, we introduce MH buffers (MH_Prior, MH_Rest, and MH_Local), which store the previous-level MH and partial MHs during various stages of the algorithm. A flow diagram depicting the interplay between these MH buffers during the various steps of calculating MH_Rest is shown in Figure 4.13, and a detailed description is provided here.

Figure 4.13: A flow diagram of steps involved in calculating MH_Rest.

At a given level i, MH_Prior contains the MH for the entire image, which is computed using all the subvolumes at the earlier level i-1 and the corresponding transformations T_j^(i-1).
At the beginning of the registration of each subvolume k at the current level i, MH_Local is cleared (all its entries are set to 0). Next, the transformation of its parent from the previous level, T_parent(k)^(i-1), is applied to the current subvolume k, and the resultant MH is accumulated in MH_Local. MH_Local now contains the contribution of the subvolume k to MH_Prior. MH_Local is then subtracted from MH_Prior, and the resultant MH is stored in the buffer MH_Rest. This step is mathematically equivalent to computing MH_Rest^k (as in equation (4.3)) for the subvolume k. For every subsequent optimization iteration involving this subvolume, MH_Local is first cleared. The candidate transformation T_k^(*i) is then applied to the current subvolume (k) only, and the resultant histogram (the contribution of this subvolume) is accumulated in MH_Local. It is then added to MH_Rest^k to form the MH for the entire image (MH_Total^k in equation (4.1)) for the current optimization step. This final histogram is then used for computing the image similarity measure (MI) corresponding to the current transformation T_k^(*i). This process is repeated for all the subvolumes at the current level. At the end of each level (after optimizing all the subvolumes at that level), the MH for the entire image is computed using the updated transformations T_k^i (for all k) and is stored in MH_Prior, which will subsequently be used at the next level of subdivision, i+1.

4.4.6. Entropy Calculation

Acceleration of MH accumulation contributes most of the performance improvement during the calculation of MI. However, the final step in MI calculation, entropy calculation, must also be implemented in hardware to eliminate the overhead of copying the MH back to the host. The mutual and individual entropies are computed using the probability distributions represented by the mutual and individual histograms, respectively.
To calculate entropy, it is necessary to evaluate the function f(p) = p * ln(p) for all the probabilities. As the probability p takes values in [0, 1], the corresponding range of the function ln(p) is (-∞, 0]; in addition, ln(p) is undefined for p = 0. In comparison, the corresponding range of f(p) is [-e^-1, 0]. Thus, f(p) has a finite dynamic range and is defined for all values of p. Several methods for calculating logarithmic functions in hardware have been reported earlier [152-154]; of particular interest is the multiple lookup table (LUT)-based approach introduced by Castro-Pareja et al. [155]. This approach minimizes the error in representing f(p) for a given number and size of LUTs and, hence, is accurate and efficient. Moreover, this approach preserves the shape of the MI curve and the locations of the extrema and thus does not significantly affect the outcome of the optimization process. Following this approach, the presented design implements f(p) using a multiple LUT-based piecewise polynomial approximation. We provide a brief overview of this method here and identify the multiple LUT configuration adopted by our architecture.

4.4.6.1. Multiple LUT-Based Approach for Entropy Calculation

A LUT-based approach represents the target function (such as f(p)) using a piecewise-polynomial approximation with n segments. In particular, approximation using Chebyshev polynomials is employed because: 1) the Chebyshev approximation is simple to calculate for continuous functions such as f(p), and 2) it is very close to the true minimax approximation, the most accurate polynomial approximation. The polynomials used can be of arbitrarily high order. Higher-order polynomials typically offer a better approximation, but require more arithmetic computation (and associated hardware resources) for function evaluation. For this reason, only first-order polynomials are considered.
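The first-order, multiple-LUT idea can be sketched in Python. This sketch uses endpoint (rather than Chebyshev) linear interpolation and the segment ranges of Configuration #2 from Table 4.1, so its error is indicative only:

```python
import math

# Segment ranges and precisions (Delta p) of Configuration #2, Table 4.1
SEGMENTS = [(0.0, 2**-13, 2**-24), (2**-13, 2**-6, 2**-17),
            (2**-6, 2**-2, 2**-13), (2**-2, 1.0, 2**-11)]

def f(p):
    """f(p) = p * ln(p), extended by continuity with f(0) = 0."""
    return p * math.log(p) if p > 0.0 else 0.0

def f_lut(p):
    """First-order multiple-LUT evaluation of f(p): linear interpolation
    between table points spaced Delta p apart within the matching LUT."""
    for p_min, p_max, step in SEGMENTS:
        if p <= p_max:
            i = int((p - p_min) / step)          # entry index within this LUT
            x0 = p_min + i * step
            y0, y1 = f(x0), f(x0 + step)         # stored table values
            return y0 + (y1 - y0) * (p - x0) / step
    return f(p)

# Worst-case error over a dense sweep stays below the 1e-6 target mentioned
# below for the single-LUT design (the Chebyshev fit of [155] does better).
err = max(abs(f_lut(k / 20000.0) - f(k / 20000.0)) for k in range(1, 20001))
```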
In the case of a single-LUT-based approach, all the segments are stored in a single large LUT. Although this approach is simple and allows easy decoding and addressing during function evaluation, it suffers from several drawbacks. First, the precision of the LUT (Δp) is constant throughout the entire table. As a result, segments of the approximated function that require finer precision cannot be accurately represented. Second, the size of the LUT needed to achieve a desired approximation error is usually quite large and sometimes not practical. For example, to keep the error in entropy calculation less than 10^-6, a single LUT with 16K entries would be required. The multiple LUT-based approach introduced by Castro-Pareja et al. [155] addresses these limitations. In this approach, each segment is stored in a different LUT. In addition, each LUT has a different precision (Δp) and uses different polynomial coefficients to represent the individual segments of the function f(p). Thus, this approach allows a superior approximation of the target function, keeping the approximation error below the allowable maximum. Furthermore, this method provides a realization that is optimized for approximation accuracy and hardware resources [155]. Using this approach, a first-order polynomial, four-LUT configuration was designed. We realized two such configurations, each with a different LUT precision (Δp). These configurations are listed in Table 4.1.

Table 4.1: Configurations of the LUT-based entropy calculation module that were considered in the presented architecture.

              LUT Range          Configuration #1      Configuration #2
LUT No.   p_min     p_max        Δp        Entries     Δp        Entries
1         0         2^-13        2^-23     1024        2^-24     2048
2         2^-13     2^-6         2^-16     1024        2^-17     2048
3         2^-6      2^-2         2^-12     1024        2^-13     2048
4         2^-2      1            2^-10     1024        2^-11     2048

Figure 4.14: Error in entropy calculation corresponding to the two configurations of the multiple LUT-based implementation.

The choice of these configurations
was based on meeting the error constraints on the entropy calculation and the availability of resources in the target FPGA device. We then evaluated the approximation error offered by these two realizations by comparing the entropy values calculated using these designs against those calculated using a double-precision software implementation. Figure 4.14 shows a plot of the error magnitude vs. p for both realizations. The maximum error was on the order of 10^-8 for both configurations, which is sufficient for MI calculation in the context of image registration [155]. For our architecture we selected the realization with 8K entries (Configuration #2), as it offers superior error performance at a relatively moderate increase in LUT size. As described earlier, using lower-order polynomials has the advantage of reducing the LUT data width and the number of arithmetic operations required for function evaluation.

4.4.7. Operational Workflow

The sections above described the design and functional behavior of the important modules in the architecture we developed for accelerated calculation of MI. Our ultimate goal, however, is to accelerate the deformable image registration algorithm described earlier. This algorithm uses the high-speed implementation to efficiently calculate the image similarity measure (MI), the most computationally complex step within the algorithm. The deformable registration algorithm is based on volume subdivision, and depending on the operational stage of the algorithm, image similarity is calculated in a slightly different manner. For example, the steps to calculate MI during rigid registration are slightly different from those taken during optimization of a subvolume after volume subdivision. This section identifies the different operational stages of the algorithm and describes how the architecture described above is used to calculate MI during those stages. A summary of this description is provided in Table 4.2.

4.4.7.1.
Rigid Registration

Rigid registration is the first step in the aforementioned deformable registration algorithm. This step attempts to recover any gross misalignment between the RI and FI. During this step, the algorithm repeatedly calculates MI for various candidate transformations between the RI and FI. To implement this operation using the described architecture, we first clear all the mutual histograms internal to the architecture (MH_Prior, MH_Rest, and MH_Local). Next, the host provides the current candidate transformation along with the subvolume address range (equal to the RI dimensions, since MI is calculated for the entire image) to the architecture. After this operation, the MH corresponding to the given transformation is calculated and stored in MH_Local. Later on, the individual and joint entropies are calculated using MH_Local.

Table 4.2: Operational workflow for performing volume subdivision-based deformable image registration using the presented architecture.

Operation | Workflow
Rigid registration | Clear all MHs → Accumulate (entire RI) → Calculate entropy using MH_Local
Calculating MH_Prior (first subvolume) | Clear all MHs → Accumulate (current subvolume)
Calculating MH_Prior (other subvolumes) | Accumulate (current subvolume)
Calculating MH_Prior (copying MH) | Copy MH_Local to MH_Prior
Calculating MH_Rest | Clear MH_Local → Accumulate (current subvolume) → Subtract MH_Local from MH_Prior and store into MH_Rest
Subvolume registration | Clear MH_Local → Accumulate (current subvolume) → Calculate entropy using MH_Local + MH_Rest

4.4.7.2. Calculating MH_Prior

Calculating MH_Prior requires applying transformations to all the subvolumes at a given level of subdivision and calculating the cumulative MH corresponding to all subvolumes with their associated transformations. To implement this operation using the described architecture, we first clear MH_Local, which is used for subvolume MH accumulation.
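The entropy step that closes each workflow row can be illustrated with a small software model. The sketch below is illustrative only (the hardware evaluates the entropy terms with fixed-point LUT-based arithmetic, and the function name is ours); it computes the marginal and joint entropies from a mutual histogram and combines them into MI = H(R) + H(F) − H(R, F):

```python
import math

def mi_from_mh(mh):
    """Compute MI = H(R) + H(F) - H(R,F) from a mutual (joint) histogram,
    where mh[r][f] counts voxel pairs with RI intensity r and FI intensity f."""
    total = sum(sum(row) for row in mh)
    h_r = h_f = h_joint = 0.0
    for row in mh:
        p_r = sum(row) / total          # RI marginal probability
        if p_r > 0:
            h_r -= p_r * math.log2(p_r)
        for count in row:
            p = count / total           # joint probability
            if p > 0:
                h_joint -= p * math.log2(p)
    for j in range(len(mh[0])):
        p_f = sum(row[j] for row in mh) / total  # FI marginal probability
        if p_f > 0:
            h_f -= p_f * math.log2(p_f)
    return h_r + h_f - h_joint

# Perfectly correlated 1-bit intensities: MI equals the full 1 bit
print(mi_from_mh([[10, 0], [0, 10]]))   # -> 1.0
```

The same routine applied to a uniform histogram (statistically independent intensities) returns zero, the other extreme of the similarity measure.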
Next, the host provides the address range for the first subvolume and the transformation associated with that subvolume from the previous level. The MH corresponding to the given transformation is calculated and stored in MH_Local. This step is repeated for all the subvolumes at a given level of subdivision, with the difference that MH_Local is cleared only for the first subvolume. This allows the calculation of the cumulative MH for all the subvolumes. After all the subvolumes are processed, MH_Local is copied to MH_Prior (an internal MH buffer). No entropy calculation is required during this step. This step is repeated after each level of image subdivision.

4.4.7.3. Calculating MH_Rest

Calculation of MH_Rest is a prerequisite for optimizing subvolumes after volume subdivision. As described earlier, we utilize MH_Prior to efficiently calculate MH_Rest for a given subvolume. To implement this operation using our architecture, we first clear MH_Local, which is used for subvolume MH accumulation. Next, the host provides the address range for the current subvolume and the transformation associated with that subvolume from the previous level. The MH corresponding to the given transformation is calculated and stored in MH_Local. Following this operation, we subtract MH_Local from MH_Prior and store the resulting MH into MH_Rest (another internal MH buffer). This step is repeated prior to registration of every subvolume at a given level of subdivision.

4.4.7.4. Subvolume Registration

After image subdivision, the resulting subvolumes are individually registered with the FI. However, the image registration algorithm also considers the contribution of the rest of the image for improved statistical reliability. Our architecture supports this operation by taking advantage of the MH_Rest computed in the previous step. First, we clear MH_Local, which is used for subvolume MH accumulation.
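The MH_Rest bookkeeping described in the preceding steps can be sketched with plain Python lists standing in for the on-chip MH buffers (illustrative only; the hardware operates on fixed-point histogram memories):

```python
def accumulate(mh, voxel_pairs):
    """Accumulate (r, f) intensity pairs into a mutual histogram in place."""
    for r, f in voxel_pairs:
        mh[r][f] += 1

BINS = 4
# MH_Prior: cumulative histogram over all subvolumes (here, two of them)
mh_prior = [[0] * BINS for _ in range(BINS)]
accumulate(mh_prior, [(0, 0), (1, 1), (2, 2)])   # subvolume A
accumulate(mh_prior, [(3, 3), (1, 1)])           # subvolume B

# MH_Local: histogram of just the subvolume about to be registered (A)
mh_local = [[0] * BINS for _ in range(BINS)]
accumulate(mh_local, [(0, 0), (1, 1), (2, 2)])

# MH_Rest = MH_Prior - MH_Local: the contribution of the rest of the image
mh_rest = [[mh_prior[i][j] - mh_local[i][j] for j in range(BINS)]
           for i in range(BINS)]
print(mh_rest[1][1], mh_rest[3][3])   # only subvolume B's entries remain
```

During subvolume registration, entropies are then evaluated over the elementwise sum MH_Local + MH_Rest, so each candidate transformation is scored against the statistics of the whole image.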
Next, the host provides the address range for the current subvolume and the current candidate transformation for that subvolume. The MH corresponding to the given candidate transformation is calculated and stored in MH_Local. Later on, the individual and joint entropies are calculated using (MH_Local + MH_Rest), thus taking into account the contribution of the rest of the image. This step is repeated for each iteration during the registration of a subvolume.

4.5. Implementation and Results

The architecture presented above was implemented using an Altera Stratix II EP2S180F1508C4 FPGA (Altera Corp., San Jose, CA) on a PCI prototyping board (DN7000K10PCI) manufactured by the Dini Group (La Jolla, CA). The board featured 1 GB of double-data-rate (DDR2) DRAM in a small-outline dual-inline memory module (SoDIMM) external to the FPGA. This memory was used to store the RI and the FI. The board provided a 64-bit bus interface between the memory and the FPGA running at a 200 MHz clock speed. The architecture was designed using the VHSIC hardware description language (VHDL) and synthesized using Altera Quartus II 6.1. The memory controller was also implemented in VHDL and was built around the DDR2 DRAM controller megacore supplied with Altera Quartus II. All the modules in the architecture were custom designed using VHDL, per the design description provided earlier. Functional verification and postsynthesis timing simulation for the entire system were performed using ModelSim SE 6.2 (Mentor Graphics, San Jose, CA). For this purpose, DDR2 DRAM simulation models provided by Micron (www.micron.com) were used. The resource availability of the target FPGA, primarily internal memory, which is used for MH accumulation, limited the synthesis of the reported design to a 7-bit mutual histogram; consequently, all results in this section are presented for this configuration. In comparison, the software implementation uses an 8-bit mutual histogram.
In spite of this difference, we demonstrate that the registration accuracy of both implementations is comparable (see Chapter 6), which further confirms the findings of Studholme et al. [64]. To verify the functional correctness and evaluate the performance of this implementation, we considered image registration of intraprocedural noncontrast CT (iCT) with preprocedural contrast-enhanced CT (preCT) and positron emission tomography (PET) images. The registration between these image-pair combinations is clinically relevant and routinely encountered in the context of CT-guided interventions. We considered five abdominal iCT-preCT and five abdominal iCT-PET image pairs for this evaluation. The image size for iCT and preCT was 256 × 256 × 200–350 voxels with a voxel size of 1.4–1.7 mm × 1.4–1.7 mm × 1.5 mm, whereas the typical image size for PET was 128 × 128 × 154–202 voxels with a voxel size of 5.15 mm × 5.15 mm × 5.15 mm. The iCT, preCT, and PET images were converted to 8 bits and 7 bits for the software and FPGA-based implementations, respectively. This conversion was performed using adaptive reduction in intensity levels [156]. The converted iCT and preCT images were then preprocessed using 3D anisotropic diffusion filtering. No preprocessing steps were used for the PET images. The real-time implementation of anisotropic diffusion filtering that we presented earlier can facilitate this preprocessing without adding any significant latency to the IGI workflow [144]. For the hardware implementation, the RI was stored in a sequential format organized by subvolumes, while eight interleaved copies of the FI were arranged in memory to facilitate cubic addressing. This arrangement allows simultaneous access to an entire 3D neighborhood within the FI, as described earlier. The execution speed of the reported architecture was obtained from postsynthesis timing measurements using the entire system.
The design achieved a maximum internal frequency of 200 MHz, with a theoretical maximum RI voxel processing rate of 100 MHz. The coordinate transformation, PV interpolation, and MH accumulation operations were implemented using fixed-point representation. Entropy calculation was implemented using the 4-LUT, first-order polynomial configuration described earlier. The precision employed for this fixed-point datapath can affect both the implementation accuracy and the hardware resource requirements of this architecture. An optimization framework to systematically explore this effect, capable of identifying the optimized tradeoff relationship between hardware resources and accuracy, is presented in Chapter 5, along with the optimization of our architecture using this framework. We further evaluated the accuracy of deformable image registration offered by that optimized architectural configuration; those results are presented in Chapter 6. In this section, however, we focus on the execution performance of our architecture and demonstrate its functional correctness through qualitative validation of image registration. For this purpose we used a designer-identified architectural configuration that was not optimized for the area-implementation error tradeoff. Consequently, all the results presented in this section correspond to that design configuration.

4.5.1. Execution Speed

The presented architecture is targeted toward accelerating the calculation of MI for a hierarchical volume subdivision-based deformable registration algorithm. During the execution of this algorithm, MI must be repeatedly calculated under a candidate transformation for every subvolume at every level of subdivision. Moreover, as described earlier, MH_Rest must be calculated once per subvolume.
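The effect of fixed-point wordlength on accuracy, which the optimization framework of Chapter 5 explores systematically, can be previewed with a toy truncating quantizer (a hypothetical helper; the architecture's actual datapath formats and rounding modes differ):

```python
def to_fixed(x, frac_bits):
    """Quantize x by truncation onto a fixed-point grid with frac_bits
    fractional bits, mimicking a narrower hardware datapath."""
    scale = 1 << frac_bits
    return int(x * scale) / scale

value = 1 / 3
for frac_bits in (8, 12, 16):
    err = abs(to_fixed(value, frac_bits) - value)
    print(f"{frac_bits} fractional bits -> quantization error {err:.2e}")
```

Each added fractional bit halves the worst-case truncation error but widens every adder, multiplier, and memory word in the pipeline, which is precisely the accuracy-versus-area tension optimized in the next chapter.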
To analyze the speedup offered by the presented FPGA-based solution for calculation of MI, we compare its calculation time with that of a software (C++) implementation running on an Intel Xeon 3.6 GHz workstation with 2 GB of RAM. Table 4.3 details this performance comparison for an iCT-preCT image pair with dimensions of 256 × 256 × 256.

Table 4.3: Comparison of mutual information calculation time for subvolumes at various levels in the volume subdivision-based deformable registration algorithm.

Subdivision level | Subvolume size | Software implementation (ms) | FPGA-based implementation (ms) | Speedup
0 (a) | 256 × 256 × 256 | 9410 | 225.42 | 41.74
1 | 128 × 128 × 128 | 1209 | 30.19 | 40.05
2 | 64 × 64 × 64 | 166 | 4.16 | 39.90
3 | 32 × 32 × 32 | 18 | 0.78 | 23.08
4 | 16 × 16 × 16 | 10 | 0.46 | 21.74
(a) This corresponds to rigid registration between the reference and floating images.

The last column of the table shows the speedup offered by the reported solution over the software implementation at a given level of subdivision. The time to calculate MI primarily depends on the size of the subvolume and is independent of the imaging modality and voxel size. This calculation time, however, may vary slightly based on the actual value of the transformation used to calculate MI. These variations are caused by differences in access patterns to the FI memory under different transformation values. To compensate for this effect and report an average MI calculation time at a given level of subdivision, we measured the MI calculation times for a subvolume using 100 randomly generated transformations (within a range of ±30 voxels for translations and ±20° for rotations). The same set of transformations was used for both the software and hardware implementations. The MI calculation time reported in Table 4.3 is averaged over all the transformations. Hardware timings reported in Table 4.3 also include the communication time, required for writing the transformation matrix and reading back
the calculated entropy values, between the host and the MI calculator. Furthermore, consistent with the scenario during the execution of the registration algorithm, the time to compute MH_Rest (once per subvolume) is also included in the hardware and software MI calculation times. Our architecture offers a speedup of about 40 for calculating MI for subvolumes down to size 64³, whereas for smaller subvolumes the speedup achieved is around 20. This drop in achieved speedup is explained by the overheads incurred during computation of MI. Calculating MI requires accumulation of the MH, which in turn requires initial clearing of the MH memory. Because the current implementation employs an MH of size 128 × 128 (to support 7-bit images), the process of clearing the MH, which involves writing zeros to all MH entries, can consume more than 16,000 clock cycles. For smaller subvolumes, the time required to clear the MH memory becomes comparable to or larger than that required to process the subvolume (images are processed at a rate of approximately one voxel per clock cycle). In addition, the communication time between the host and the MI calculator, required for exchanging the transformation matrix and the calculated MI value, becomes comparable to the computation time. These two factors limit the net speedup achieved for smaller subvolumes.

Table 4.4 compares the total execution time for deformable registration using a software (C++) implementation running on an Intel Xeon 3.6 GHz workstation with 2 GB of RAM against the presented FPGA-based architecture.

Table 4.4: Execution time of deformable image registration.

Image pair used for registration | Software implementation (s) | FPGA-based implementation (s) | Speedup
iCT-preCT | 11520 | 371 | 31.05
iCT-PET | 11146 | 339 | 32.88

This execution time was measured for deformable registration between the five iCT-preCT and five iCT-PET image pairs described earlier.
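The fixed cost of clearing the 128 × 128 MH before each accumulation largely explains the speedup drop for small subvolumes. A back-of-the-envelope model (simplified: exactly one voxel per cycle, host communication time ignored) makes this concrete:

```python
MH_CLEAR_CYCLES = 128 * 128   # one zero-write per 7-bit MH entry = 16,384 cycles

for side in (256, 128, 64, 32, 16):
    voxels = side ** 3        # cycles to accumulate the subvolume (~1 voxel/cycle)
    overhead = MH_CLEAR_CYCLES / (MH_CLEAR_CYCLES + voxels)
    print(f"{side}^3 subvolume: MH clearing is {100 * overhead:.1f}% of cycles")
```

Under this model, clearing is a few percent of the cycles for a 64³ subvolume but 80% for a 16³ subvolume, consistent with the measured speedup falling from about 40 to about 20 in Table 4.3.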
The maximum number of global and local iterations was set to 200 and 100, respectively. The same optimization algorithm was used for both the software and hardware implementations, and volume subdivision continued as long as the subvolume size was larger than 16³. For each image modality pair, the execution time reported is the average of the execution times of the five cases. The execution time of intensity-based image registration is directly proportional to the size of the RI. The iCT image was used as the RI for both image modality pairs; hence, the execution time for deformable registration is similar for the two pairs, despite the fact that PET images are smaller than preCT images (128 × 128 × 154–202 and 256 × 256 × 200–350 voxels, respectively). For both image modality pairs, our architecture provided a speedup of about 30 over an equivalent software implementation and achieved an execution time of around 6 minutes. This speedup is a direct outcome of the acceleration of MI calculation using the presented architecture.

4.5.2. Performance Comparison

Intensity-based image registration has been identified as a computationally intensive problem. Given the impact that acceleration of image registration can have on a variety of diagnostic and interventional applications, there have been significant efforts toward realizing high-speed implementations of image registration algorithms. Further, because this problem is extremely compute intensive and multifaceted, researchers have applied a range of computing platforms and acceleration strategies at different levels of this problem to improve its execution performance. In Chapter 2, we provided a survey of acceleration approaches related to the work presented in this dissertation. In this section we compare the performance improvement provided by the FPGA-based implementation developed in this dissertation work against that provided by earlier reported solutions.
For this comparison we focused on acceleration approaches involving intensity-based image registration, since these approaches can be retrospective and automatic, and have the potential to be integrated into the clinical workflow. The algorithms considered included FFD-based and gradient flow-based approaches, as well as those based on the demons algorithm. Moreover, these implementations used different similarity measures (MI, NMI, SSD, etc.), transformation models (spline-based, rigid, volume subdivision-based), and acceleration strategies as dictated by the computational platform (GPUs, clusters, FPGAs, etc.) used for their implementation. Thus, there is substantial variation in the implementation details (for both the original and the accelerated implementations) of each of these realizations, even though they address the same fundamental problem of accelerating intensity-based registration. To account for these variations and provide a fair comparison, we consider the net speedup offered by each accelerated implementation over its corresponding unaccelerated version. This comparison can be used to judge the efficiency of various strategies for accelerating intensity-based image registration. To compare the raw computational performance of these realizations, we normalize their execution times by the dimensions of the reference image used for registration and calculate the effective voxel processing rate (throughput) offered. In some cases, where sufficient implementation details (such as the number of iterations or the number of voxels processed) were not provided, the voxel processing rate could not be calculated. Table 4.5 presents the results of this analysis and compares the performance of the presented architecture for high-speed implementation of deformable image registration against a corresponding software implementation and previously reported acceleration approaches.
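The normalization described above can be sketched as follows. The helper name and the pass count below are our illustrative assumptions (not values reported by any of the cited works); the sketch only shows how execution time converts into an effective voxel throughput:

```python
def voxel_throughput_mhz(ri_dims, passes, exec_time_s):
    """Effective voxel processing rate: total voxels visited (reference-image
    size times the number of passes over it) divided by execution time."""
    nx, ny, nz = ri_dims
    return nx * ny * nz * passes / exec_time_s / 1e6

# A 256^3 reference image revisited ~1650 times (hypothetical pass count)
# during a 371 s registration run yields a throughput in the tens of MHz:
print(round(voxel_throughput_mhz((256, 256, 256), 1650, 371), 1))  # -> 74.6
```

Because the pass count depends on iteration counts and subdivision depth, this figure can only be computed for implementations that report those details, which is why some entries in Table 4.5 lack a throughput value.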
The software implementation refers to the implementation of the volume subdivision-based algorithm. This implementation was developed using C++, and its performance was measured on an Intel Xeon 3.6 GHz workstation with 2 GB of DDR2 400 MHz main memory. The presented FPGA-based MI calculation provides more than an order of magnitude speedup and a superior voxel processing rate when compared with this software implementation. The majority of earlier attempts to accelerate intensity-based image registration employed a multiprocessor or supercomputer approach (see [57, 116, 117], for example). Although these solutions delivered high performance and speedup by virtue of parallelization and high-speed network interconnects, the speedup achieved per processor was less than unity. Also, in some cases [116] the voxel processing rate achieved was substantially less than that offered by the presented implementation. In addition, these solutions may not be cost effective and, because of their size, are unlikely to be suitable for clinical deployment. Some solutions based on graphics processors [120, 122] attempt to provide performance improvement by exploiting data parallelism native to that platform. Although this platform yielded reasonable speedups for rigid registration, the performance gain and voxel throughput for deformable registration, and in particular for that based on MI, were poor. Ohara et al. [123] attempted to accelerate MI-based rigid registration using a pair of Cell processors, exploiting the inherent parallelism offered by multiple cores (8 per processor) and high memory bandwidth. Using these 16 computing cores, this implementation achieved a speedup of only 11 for rigid registration. The voxel throughput of this implementation (6 MHz), however, was slightly better than that of other multiprocessor implementations.
This can be attributed to the high memory bandwidth and additional designer optimizations to take advantage of spatial locality. Jiang et al. [124] employed a high-level description language (Handel-C) and mapped B-spline computation (a common component in FFD-based image registration) to Xilinx FPGA devices. This implementation leveraged the elimination of nested loops and a pipelined implementation to achieve performance improvement. Despite the semi-automated nature of this approach and the high clock rate of the synthesized design, the speedup and voxel throughput achieved were limited to 3.2 and 4 MHz, respectively. In comparison with these acceleration approaches, the FPGA-based implementation developed in this dissertation addresses the fundamental memory access bottleneck in intensity-based image registration. Furthermore, voxel-level parallelism is achieved through pipelined implementation, and this implementation can offer a voxel processing rate close to one voxel per clock cycle. Thus, this solution is not only compact, but offers high speedup (around 30) and superior voxel throughput (> 70 MHz) using a single processing element.

Table 4.5: Performance comparison of the presented FPGA-based implementation of intensity-based deformable image registration with an equivalent software implementation and prior approaches for acceleration of intensity-based registration.

Implementation | Algorithm | Platform | Speedup | Voxel processing rate (MHz) (if available)
Stefanescu et al. [116] | Demons algorithm | 15-processor cluster | 11 | 1.8
Rohlfing et al. [57] | FFD-based deformable, using MI | 64-processor shared-memory supercomputer | 40 | n/a
Ino et al. [117] | FFD-based deformable, using MI | 128-processor cluster, Myrinet connectivity | 90 | n/a
Kohn et al. [120] | Gradient flow, rigid/deformable | GPU (6800) (using GLSL) | 12 | 2.15
Vetter et al. [122] | MI-based deformable | GPU (7800) (using GLSL) | 6 | 3.21
Ohara et al. [123] | MI-based rigid | 2 Cell Broadband Engine processors | 11 | 5.75
Jiang et al. [124] | B-spline interpolation | FPGA (using Handel-C) | 3 | 4.19
Software (C++) | MI-based deformable | Xeon workstation | n/a | 1.75
Dandekar et al. [157] | MI-based deformable | FPGA | 33 | 74.43

4.5.3. Qualitative Evaluation of Deformable Image Registration

Acceleration of deformable registration, as offered by the presented architecture, is critical given the time-sensitive nature of IGIs; however, the accuracy of the registration process is of equal importance. As described earlier, we present the results of qualitative validation in this section, which demonstrate the functional correctness of our architecture. The results of the quantitative validation are presented later, in Chapter 6. The qualitative validation of deformable registration was performed using five iCT-preCT and five iCT-PET image pairs. For all cases, visually correct alignment was achieved after deformable image registration. Furthermore, the results of the registration performed using our architecture were qualitatively similar to those obtained using the software implementation.

Figure 4.15: Qualitative validation of deformable registration between iCT and preCT images performed using the presented FPGA-based solution.

Figure 4.15 shows an example of deformable registration between a representative pair of iCT and preCT images. This registration was performed using the presented FPGA-based solution. The initial seed alignment between these images was obtained using a single-click operation to compensate for the different scanner coordinate systems and ensure reasonable overlap of the common regions between the two images. Subfigures (a) and (b) show coronal slices from the iCT and preCT images, respectively. Subfigure (c) shows the overlay of these two images using a checkerboard pattern, in which blocks from the two images are displayed alternately. The structural misalignment between the iCT and preCT images is evident from mismatches at the boundaries of these blocks.
Subfigure (e) shows a coronal slice from the preCT image registered to the iCT image, and subfigure (f) shows the overlay of this image with the original iCT image. Better structural alignment after deformable registration is evident from improved matching at the boundaries of the blocks of the checkerboard overlay. Similarly, Figure 4.16 shows an example of deformable registration between a representative pair of iCT and PET images. This registration was also performed using the reported FPGA-based solution. The initial alignment between these two images was obtained as in the previous case.

Figure 4.16: Qualitative validation of deformable registration between iCT and PET images performed using the presented FPGA-based solution.

Subfigures (a) and (b) show coronal slices from the iCT and PET images, respectively. Subfigure (c) shows the fusion of these two images. Subfigure (d) shows the fusion of the registered PET image with the original iCT image. Improved alignment after deformable registration is evident from better matching of structures in the fusion image.

4.6. Summary

In this chapter we presented a novel FPGA-based architecture for high-speed implementation of MI-based deformable image registration. This architecture achieves voxel-level parallelism through a pipelined implementation and employs several strategies to address the fundamental bottleneck in intensity-based image registration, namely memory access management. As a result of these enhancements, the presented architecture is capable of achieving a high voxel processing rate and a speedup of about 30, and consequently reduces the execution time of deformable registration from hours to only a few minutes. The results of the qualitative validation indicate that this performance improvement does not significantly compromise the accuracy of deformable registration.
As qualitatively presented in this chapter and demonstrated quantitatively later (see Chapter 6), this implementation enables improved target delineation during image-guided interventions through deformable registration with preprocedural images. The robustness, speed, and accuracy offered by this architecture, in conjunction with its compact implementation, make it ideally suited for integration into the IGI workflow. Accurate, robust, and real-time deformable image registration between intra- and preprocedural images is an unmet need, critical to the success of image-guided procedures. The work presented in this chapter represents an important first step toward meeting this goal. With further algorithmic and hardware improvements geared toward enhancing its accuracy and performance, this approach has the potential to elevate the precision of current procedures and expand the scope of IGI to moving and deformable organs.

Chapter 5: Framework for Optimization of Finite Precision Implementations

Computationally intensive algorithmic components are routinely accelerated by mapping them to custom or reconfigurable hardware platforms. The work presented in the earlier chapters of this dissertation has a similar flavor. An important consideration for these implementations, which often employ finite-precision datapaths, is the tradeoff between hardware resources and implementation accuracy. This chapter presents a methodology for systematically exploring this tradeoff and identifying design configurations that efficiently balance these two conflicting objectives. First, we formulate this problem as a multiobjective optimization problem. Next, we present a multiobjective optimization framework developed in the context of the FPGA-based architecture for image registration presented in the previous chapter. Finally, we demonstrate the applicability of the proposed approach through simulation and post-synthesis validation results.

5.1.
Motivation

An emerging trend in real-time signal processing systems is to accelerate computationally intensive algorithmic components (for example, MI calculation as described in the previous chapter) by mapping them to custom or reconfigurable hardware platforms, such as ASICs and FPGAs. Most of these algorithms are initially developed in software using floating-point representation and later migrated to hardware using finite precision (e.g., fixed-point representation) to achieve improved computational performance and reduced hardware cost. These implementations are often parameterized, so that a wide range of finite-precision representations can be supported [158] by choosing an appropriate wordlength for each internal variable. As a consequence, the accuracy and hardware resource requirements of such a system are functions of the wordlengths used to represent the internal variables. Determining an optimal wordlength configuration has been shown to be NP-hard [159] and can take up to 50% of the design time for complex systems [125]. Moreover, a single optimal solution may not exist, especially in the presence of multiple conflicting objectives, as is the case in the current work. In addition, a new configuration generally needs to be derived when the design constraints or application requirements are altered. An exhaustive search of the entire design space is guaranteed to find Pareto-optimal configurations. The execution time for such an exhaustive search, however, increases exponentially with the number of design parameters, making it infeasible for most practical systems. Some earlier reported techniques have attempted to solve this problem through the use of analytical modeling [86-90], gradient-based search [93], linear programming strategies [86], or linear aggregation of objective functions.
These approaches, although successfully demonstrated, may have limited applicability for complex designs with nonlinear objective functions and a non-convex search space [129]. For example, in the case of the FPGA-based implementation of image registration, analytically representing the error induced in MI calculation would be nontrivial; in addition, MI is a nonlinear objective function. Techniques based on evolutionary methods have been shown to be effective at searching large search spaces with complex objective functions in an efficient manner [130, 131]. Furthermore, these techniques are inherently capable of performing multipoint searches. As a result, techniques based on evolutionary algorithms (EAs) have been employed in the context of multiobjective optimization (see SPEA2 [160], NSGA-II [161]). We formulate the problem of finding optimal wordlength configurations as a multiobjective optimization, where different objectives (for example, accuracy and area) generally conflict with one another. Although this approach increases the complexity of the search, it can find a set of Pareto-optimized configurations representing strategically chosen tradeoffs among the various objectives. This novel multiobjective optimization strategy is developed and validated in the context of the FPGA-based implementation of image registration. The tradeoff between FPGA resources (area and memory) and implementation accuracy is systematically explored, and Pareto-optimized solutions are identified such that a designer could use them to satisfy given design constraints or meet the image registration accuracy required by an application. This analysis is performed by treating the wordlengths of the internal variables of the architecture presented in the previous chapter as design variables.
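The Pareto-optimized tradeoff notion used throughout this chapter can be made concrete with a small dominance filter. This is an illustrative sketch (the objective values are invented), with both objectives, such as FPGA area and MI calculation error, treated as quantities to minimize:

```python
def dominates(a, b):
    """a dominates b if it is no worse in every objective and strictly
    better in at least one (all objectives are minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(configs):
    """Return the non-dominated (Pareto-optimized) subset of configurations."""
    return [c for c in configs
            if not any(dominates(other, c) for other in configs if other != c)]

# (area in LUTs, MI calculation error): two conflicting objectives
configs = [(1000, 0.08), (1500, 0.02), (2000, 0.01), (1800, 0.03)]
print(pareto_front(configs))
# (1800, 0.03) is dominated by (1500, 0.02): more area AND more error
```

Exhaustively filtering every feasible wordlength configuration this way is exactly the exponential search the chapter seeks to avoid; evolutionary techniques approximate the same front while evaluating only a fraction of the design space.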
We also compare several search methods for finding Pareto-optimized solutions and demonstrate in our context the applicability of search based on evolutionary techniques for efficiently identifying superior multiobjective tradeoff curves. This optimization strategy can easily be adapted to a wide range of signal processing applications, including applications for image and video processing. 129 5.2. Multiobjective Optimization The architecture presented in the previous chapter is designed to accelerate the calculation of MI for performing intensity-based image registration. We have demonstrated this architecture to be capable of offering execution performance superior to that of a software implementation [157]. The accuracy of MI calculation (and by extension, that of image registration) offered by this implementation, however, is a function of the wordlengths chosen for the internal variables of the design. Similarly, these wordlengths also control the hardware implementation cost of the design. For medical imaging applications, the ability of an implementation to achieve the desired level of accuracy is of paramount importance. It is, therefore, necessary to understand the tradeoff between accuracy and hardware implementation cost for a design and to identify wordlength configurations that provide effective tradeoffs between these conflicting criteria. This multiobjective optimization allows a designer to systematically maximize accuracy for a given hardware cost limitation (imposed by a target device, for example) or minimize hardware resources to meet the accuracy requirements of a medical application. The following section provides a formal definition of this problem and the subsequent section describes a framework developed for multiobjective optimization of FPGA-based medical image registration. 5.2.1. Problem Statement Consider a system Q that is parameterized by N parameters (1,2,,) i ni N= ? 
, where each parameter can take on a single value from a corresponding set of valid values v_i. Let the design configuration space corresponding to this system be S, defined as the set consisting of all N-tuples generated by the Cartesian product of the sets v_i:

    S = v_1 × v_2 × v_3 × ... × v_N.    (5.1)

The size of this design configuration space is then equal to the cardinality of the set S, or in other words, the product of the cardinalities of the sets v_i:

    |S| = |v_1| × |v_2| × |v_3| × ... × |v_N|.    (5.2)

For most systems, not all configurations that belong to S may be valid or practical. We therefore define a subset Φ ⊆ S that contains all the feasible system configurations. Now consider m objective functions f_1, f_2, ..., f_m defined for system Q, such that each function associates a real value with every feasible configuration c ∈ Φ. The problem of multiobjective optimization is then to find a set of solutions that simultaneously optimizes the m objective functions according to an appropriate criterion. The most commonly adopted notion of optimality in multiobjective optimization is that of Pareto optimality. According to this notion, a solution c* is Pareto optimal if there does not exist another solution c ∈ Φ such that f_i(c) ≤ f_i(c*) for all i, and f_j(c) < f_j(c*) for at least one j. The solution c* is also called a non-dominated solution, because no other solution dominates (that is, is superior to) c* as per the Pareto-optimality criterion. The set of Pareto-optimal solutions, therefore, comprises all non-dominated solutions. Given a multiobjective optimization problem and a heuristic technique for this problem that attempts to derive Pareto-optimal or near-Pareto-optimal solutions, we refer to solutions derived by the heuristic as "Pareto-optimized" solutions.

5.2.2.
Parameterized Architectural Design

Implementations of signal processing algorithms using microprocessor- or DSP-based approaches are characterized by a fixed datapath width. This width is determined by the hard-wired datapath of the underlying processor architecture. Reconfigurable implementation based on FPGAs, in contrast, allows the size of the datapath to be customized to achieve better tradeoffs among accuracy, area, and power. The use of such custom data representations for optimizing designs is one of the main strengths of reconfigurable computing [85]. To take advantage of the described multiobjective optimization strategy, the architecture being optimized must be able to support the various design configurations identified by the optimization scheme. It has even been contended that the most efficient hardware implementation of an algorithm is one that supports a variety of finite-precision representations of different sizes for its internal variables [158]. In this spirit, many commercial and research efforts have employed a parameterized design style for intellectual property (IP) cores [162-166]. This parameterization capability not only facilitates reuse of design cores, but also allows them to be reconfigured to meet design requirements. During the design of the aforementioned architecture for accelerated computation of MI, we adopted a similar design style that allows configuration of the wordlengths of the internal variables. Hardware design languages such as VHDL and Verilog natively support hierarchical parameterization of a design through the use of generics and parameters, respectively. This design style takes advantage of these language features and is employed for the design of all the modules described earlier. We highlight the main features of this design style using illustrative examples and design scenarios. Consider a design module with two input variables that computes an output variable through arithmetic manipulation of the input variables.
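As a software analogue of this design style, the sketch below models a two-input module whose wordlengths are constructor arguments, much as VHDL generics parameterize a hardware module. The class, its names, and the choice to keep only the low-order output bits are illustrative assumptions, not the actual hardware design.

```python
# Hypothetical Python model of the parameterized design style: wordlengths
# (cf. IP1_WIDTH, IP2_WIDTH, OP_WIDTH) are passed in like VHDL generics.

class ParameterizedMultiplier:
    """Models a module that multiplies two unsigned fixed-width inputs."""

    def __init__(self, ip1_width, ip2_width, op_width):
        self.ip1_width = ip1_width
        self.ip2_width = ip2_width
        self.op_width = op_width  # output confined to this many bits

    def compute(self, a, b):
        # Mask inputs to their declared wordlengths.
        a &= (1 << self.ip1_width) - 1
        b &= (1 << self.ip2_width) - 1
        # The full product needs at most ip1_width + ip2_width bits; keep
        # only op_width least-significant bits, as a narrow bus would.
        return (a * b) & ((1 << self.op_width) - 1)

# Two instances of the same "module" with different wordlength configurations.
narrow = ParameterizedMultiplier(ip1_width=4, ip2_width=4, op_width=8)
wide = ParameterizedMultiplier(ip1_width=8, ip2_width=8, op_width=16)
```

Instantiating the same class with different widths mirrors how a single parameterized HDL core is reused across design configurations.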
The wordlength of the input variables (denoted by IP1_WIDTH, IP2_WIDTH) and that of the output variable (denoted by OP_WIDTH) are the design parameters for this module. The module can then be parameterized for these design variables as illustrated in Figure 5.1a. In a pipelined implementation of an operation, a module may have multiple internal pipeline stages and corresponding intermediate variables. Wordlengths chosen for these intermediate variables can also impact the accuracy and hardware requirements of a design. In our implementation scheme, we do not employ any rounding or truncation for the intermediate variables, but deduce their wordlengths from the wordlengths of the input operands and the arithmetic operation to be implemented. For example, multiplication of two 8-bit variables will, at most, require a 16-bit-wide intermediate output variable. A parameterized implementation of this scenario is illustrated in Figure 5.1c. Sometimes, it is also necessary to instantiate a vendor-provided or third-party IP core, such as a FIFO module or an arithmetic unit, within a design module. In such cases, we simply pass the wordlength parameters down the design hierarchy to configure the IP core appropriately and thereby maintain the parameterized design style (see Figure 5.1b for an example). When signals cross module boundaries, the output wordlength and format (position of the binary point) of the source module should match the input wordlength and format of the destination module. This is usually achieved through the use of a rounding strategy and right- or left-shifting of the signals. Adopting a "rounding toward nearest" strategy to achieve wordlength matching is expected to introduce the smallest error, but requires additional logic resources.

Figure 5.1: Examples of parameterized architectural design style.

In our design, we therefore implement truncation (or the "rounding toward zero"
strategy), while the signal shifting is achieved through zero-padding. Both of these operations are parameterized and take into account the wordlengths and the formats at the module boundaries (see Figure 5.1c for an example). Thus, this parameterized design style enables the architecture to support multiple wordlength configurations for its internal variables. The parameters of this architecture, and in particular the fractional wordlengths of the internal variables, are treated as design variables in this multiobjective optimization framework. These design variables are identified in the following section.

5.2.3. Multiobjective Optimization Framework

Figure 5.2 illustrates the framework that we have developed for multiobjective optimization of the architecture for high-speed implementation of image registration described in the previous chapter.

Figure 5.2: Framework for multiobjective optimization of FPGA-based image registration.

There are two basic components of this framework. The first component is the search algorithm, which explores the design space and generates feasible candidate solutions; the second is the objective function evaluation module, which evaluates candidate solutions. The solutions and associated objective values are fed back to the search algorithm so that they can be used to refine the search. These two components are loosely coupled so that different search algorithms can be easily incorporated into the framework. Moreover, the objective function evaluation module is parallelized using a message passing interface (MPI) on a 32-processor cluster. With this parallel implementation, multiple solutions can be evaluated in parallel, thereby increasing search performance. These components are described in detail in the following sections.

5.2.3.1. Design Parameters

As described in the earlier section, the architecture performs MI calculation using a fixed-point datapath.
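A minimal sketch of the fixed-point representation used by such a datapath follows; it reflects the range and precision relations formalized in Equation 5.3 (function names are illustrative, not from the actual design).

```python
# Minimal model of a fixed-point value split into an integer wordlength
# (IWL, sign bit included) and a fractional wordlength (FWL).

def fixed_point_range(iwl, signed=True):
    """Representable range [lo, hi) for a given IWL."""
    if signed:
        return (-2 ** (iwl - 1), 2 ** (iwl - 1))
    return (0, 2 ** iwl)

def fixed_point_precision(fwl):
    """Smallest representable step for a given FWL."""
    return 2.0 ** (-fwl)

# A 7-bit signed IWL covers integer translations in [-64, 63] voxels,
# matching the translation-parameter example discussed in this chapter.
lo, hi = fixed_point_range(7)
```

The IWL fixes the range while the FWL fixes the precision, which is why the two are treated so differently in the optimization: ranges are known from the application, precisions are not.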
As a result, the accuracy of MI calculation depends on the precision (wordlength) offered by this datapath. The design parameters in this datapath define the design space and are identified and listed, along with the corresponding design modules (see Figure 4.3), in Table 5.1. A fixed-point representation consists of an integer part and a fractional part. The numbers of bits assigned to these two parts are called the integer wordlength (IWL) and fractional wordlength (FWL), respectively. The numbers of bits allocated to these parts control the range and precision of the fixed-point representation:

    Range (R)     = [-2^(IWL-1), 2^(IWL-1))  for signed numbers
                  = [0, 2^IWL)               for unsigned numbers
    Precision     = 2^(-FWL)                                          (5.3)

For this architecture, the IWL required for each design parameter can be deduced from range information specific to the image registration application. For example, in order to support translations in the range of [-64, 63] voxels, 7 bits of IWL (with 1 bit assigned as a sign bit) are required for the translation parameter. Similarly, since the current architecture supports images with dimensions up to 512, 9 bits (log2(512)) of IWL are required for the floating image address. We used similar range information to choose the IWL for all the parameters, and these values are reported in Table 5.1. The precision required for each parameter, which is determined by its FWL, is not known a priori. We, therefore, determine this by performing multiobjective optimization using the FWL of each parameter as a design variable. In our experiments, we used the design range of [1, 32] bits for the FWLs of all the parameters. The optimization framework can support different wordlength ranges for different parameters, which can be used to account for additional design

Table 5.1: Design variables for FPGA-based architecture.
Integer wordlengths are determined from application-specific range information, and fractional wordlengths are used as parameters in the multiobjective optimization framework.

    Architectural Module              Design Variable          IWL (bits)   FWL range (bits)
    Voxel coordinate transformation   Translation vector       7            [1, 32]
    Voxel coordinate transformation   Rotation matrix          4            [1, 32]
    Partial volume interpolation      Floating image address   9            [1, 32]
    Mutual histogram accumulation     Mutual histogram bin     25           [1, 32]

constraints, such as certain kinds of constraints imposed by third-party intellectual property. The entropy calculation module is implemented using a multiple-LUT-based approach and also employs fixed-point arithmetic. However, this module has already been optimized for accuracy and hardware resources, as described in [155]. The optimization strategy employed in [155] uses an analytical approach that is specific to entropy calculation and is distinct from the strategy presented in this work. This module, therefore, does not participate in the multiobjective optimization framework presented in this work, and we simply use the optimized configuration identified earlier. This further demonstrates the flexibility of our optimization framework to accommodate arbitrary designer- or externally-optimized modules.

5.2.3.2. Search Algorithms

An exhaustive search that explores the entire design space is guaranteed to find all Pareto-optimal solutions. However, such a search can lead to unreasonable execution time, especially when the objective function evaluation is computationally intensive. For example, with four design variables, each taking one of 32 possible values, the design space consists of 32^4 solutions. If the objective function evaluation takes 1 minute per trial (which is quite realistic for multiple MI calculations using large images), the exhaustive search will take 2 years. Consequently, we have considered alternative search methods, as described below.
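The 2-year figure quoted above is easy to verify with back-of-the-envelope arithmetic:

```python
# Cost of an exhaustive search: four design variables, 32 candidate FWL
# values each, and roughly one minute per objective-function evaluation.
num_variables = 4
values_per_variable = 32
minutes_per_trial = 1

total_trials = values_per_variable ** num_variables   # 32^4 = 1,048,576
total_minutes = total_trials * minutes_per_trial
total_years = total_minutes / (60 * 24 * 365)         # about 1.99 years
```

At roughly two years of single-node compute, exhaustive enumeration is clearly impractical, which motivates the partial, random, and EA-based alternatives described next.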
The first method is a partial search, which explores only a portion of the entire design space. For every design variable, the number of possible values it can take is reduced by half by choosing every alternate value. A complete search is then performed in this reduced search space. This method, although not exhaustive, can effectively sample the breadth of the design space. The second method is a random search, which involves randomly generating a fixed number of feasible solutions. For both of these methods, Pareto-optimized solutions are subsequently identified from the set of solutions explored. The third method is a search using evolutionary techniques. EAs have been shown to be effective at efficiently exploring large search spaces [130, 131]. In particular, we have employed SPEA2 [160], which is very effective at sampling from along an entire Pareto-optimal front and distributing the solutions generated relatively evenly over the optimal tradeoff surface. Moreover, SPEA2 incorporates a fine-grained fitness assignment strategy and an enhanced archive truncation method, which further assist in finding Pareto-optimal solutions. The flow of operations in this search algorithm is shown in Figure 5.2. For the EA-based search algorithm, the representation of the system configuration is mapped onto a "chromosome" whose "genes" define the wordlength parameters of the system. Each gene, corresponding to the wordlength of a design variable i, is represented using an integer allele that can take values from the set v_i, described earlier. Thus, every gene is confined to wordlength values that are predefined and feasible for a given design variable. The genetic operators for crossover and mutation are also designed to adhere to this constraint and always produce values from the set v_i for a gene i within a chromosome.
This representation scheme is both symmetric and repair-free and, hence, is favored by schema theory [167]; it is also computationally efficient, as described in [168].

5.2.3.3. Objective Function Models and their Fidelity

The search for Pareto-optimized configurations requires evaluating candidate solutions and determining Pareto-dominance relationships between them. This can be achieved by calculating objective functions for all the candidate solutions and by relative ordering of the solutions with respect to the values of their corresponding objective functions. We consider the error in MI calculation and the hardware implementation cost to be the conflicting objectives that must be minimized for our FPGA implementation problem. We model the FPGA implementation cost using two components: the first is the amount of logic resources (number of LUTs) required by the design, and the second is the internal memory consumed by the design. We treat these as independent objectives in order to explore the synergistic effects between these complementary resources. Because of the size of the design space and limitations due to execution time, it is not practical to synthesize and evaluate each solution. We, therefore, employ models for calculating objective functions to evaluate the solutions. The quality of the Pareto-optimized solutions will then depend on the fidelity of these objective function models. The error in MI calculation can be computed by comparing the MI value reported by the limited-precision FPGA implementation against that calculated by a double-precision software implementation. For this purpose, we have utilized a bit-true emulator of the hardware. This emulator was developed in C++ and uses fixed-point arithmetic to accurately represent the behavior of the limited-precision hardware. It supports multiple wordlengths for internal variables and is capable of accurately calculating the MI value corresponding to any feasible configuration.
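A minimal Python analogue of the bit-true-emulation idea is sketched below: intermediate values are truncated to a chosen fractional wordlength and the limited-precision result is compared against a double-precision reference. This illustrates the principle only; the actual C++ emulator models the full MI datapath. The stand-in computation and all names are hypothetical.

```python
# Sketch of bit-true emulation: quantize to FWL fractional bits via
# truncation ("rounding toward zero" for non-negative values), then
# measure the error against the full-precision result.

def quantize(value, fwl):
    """Truncate a non-negative value to fwl fractional bits."""
    scale = 1 << fwl
    return int(value * scale) / scale

def average_error(reference_fn, quantized_fn, samples):
    """Average absolute error between full- and limited-precision results."""
    errors = [abs(reference_fn(x) - quantized_fn(x)) for x in samples]
    return sum(errors) / len(errors)

# Example: a stand-in computation (x -> x*x) evaluated at FWL = 4.
samples = [0.1 * k for k in range(1, 11)]
err = average_error(lambda x: x * x,
                    lambda x: quantize(quantize(x, 4) ** 2, 4),
                    samples)
```

Sweeping the FWL argument in such an emulator yields the accuracy objective value for each candidate wordlength configuration without synthesizing it.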
We have verified the emulator's equivalence with the hardware implementation for a range of configurations and image transformations. This emulator was used to compute the MI calculation error. The MI calculation error was averaged over three distinct image pairs (with different image modality combinations) and 50 randomly generated image transformations. The same sets of image pairs and image transformations were used for evaluating all feasible configurations. The memory required for a configuration is primarily needed for the intermediate FIFOs, which are used to buffer internal variables, and for the MH memory. For example, a 64-word-deep FIFO used to buffer a signal with a wordlength of b will require 64×b bits of memory. In our architecture, the depths of the FIFOs and the dimensions of the MH are constant, whereas their corresponding widths are determined by the wordlengths of the design parameters. Using these insights, we have developed an architecture-specific analytical expression that accurately represents the cumulative amount of memory required for all internal FIFOs and the MH. We used this expression to calculate the memory requirement of a configuration. For estimating the area requirements of a configuration, we adopt the area models reported in [86, 169]. These are high-level models of common functional units such as adders, multipliers, and delays. These models are derived from knowledge of the internal architecture of these components. Area cost for interconnects and routing is not taken into account in this analysis. These models have been verified for the Xilinx Virtex series of FPGAs and are equally applicable to alternative FPGA families and to ASIC implementations. These models have also been previously used in the context of wordlength optimization [86, 91, 169]. We further evaluated the fidelity of these area models using a representative module, PV Interpolator, from the aforementioned architecture.
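A memory model in the spirit described above can be sketched as follows. The FIFO depth and the mutual-histogram (MH) dimensions here are assumed placeholder values, not the architecture's actual constants, and the expression is an illustration of the modeling approach rather than the design-specific formula.

```python
# Sketch of an analytical memory model: depths and MH dimensions are fixed
# by the architecture; only the widths track the chosen wordlengths.

FIFO_DEPTH = 64          # words per internal FIFO (assumed)
MH_BINS = 256 * 256      # mutual histogram dimensions (assumed)

def memory_bits(fifo_widths, mh_bin_width):
    """Total on-chip memory, in bits, for the internal FIFOs and the MH."""
    fifo_bits = sum(FIFO_DEPTH * w for w in fifo_widths)
    mh_bits = MH_BINS * mh_bin_width
    return fifo_bits + mh_bits

# A 64-word FIFO carrying a 16-bit signal costs 64 * 16 = 1024 bits; the MH
# term dominates and scales linearly with the bin wordlength.
total = memory_bits(fifo_widths=[16, 16, 25], mh_bin_width=25)
```

Because the expression is linear in the wordlengths, the memory objective is cheap to evaluate for every candidate configuration during the search.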
The PV Interpolator module receives the fractional components of the floating image address and computes the corresponding interpolation weights. We varied the FWL of the floating image address from 1 to 32 bits and synthesized the module using the Altera Stratix II and Xilinx Virtex 5 as target devices. For a meaningful comparison, the settings for the analysis, synthesis, and optimization algorithms (for example, settings to favor area or speed) for the design tools (Altera Quartus II and Xilinx ISE) were chosen to be comparable. After complete synthesis, routing, and placement, we recorded the area (number of LUTs) consumed by the synthesized design. This process was automated using the Tcl scripting feature provided by the design tools and the parameterized design style described earlier. We then compared the consumed area against that predicted by the adopted area models for all FWL configurations. The results of this experiment are presented in Figure 5.3.

Figure 5.3: Comparison of the area values predicted by the adopted area models with those obtained after physical synthesis.

These results indicate that the area estimates (number of LUTs) predicted by the model are comparable to those obtained through physical synthesis for both target devices. For quantitative evaluation, the fidelity of the area models was calculated as:

    Fidelity = (2 / (N(N-1))) Σ_{i=1}^{N} Σ_{j=i+1}^{N} F_ij,    (5.4)

where

    F_ij = 1  if sign(S_i - S_j) = sign(M_i - M_j)
         = 0  otherwise.    (5.5)

In this equation, M_i represents the value predicted by the area models and S_i represents the value obtained after physical synthesis. The fidelity of the area models, when evaluated with respect to the synthesis results obtained for both the Altera and Xilinx devices, was 1, which corresponds to maximum ("perfect") fidelity. This indicates that the relative ordering of FWLs with respect to their area requirements is consistent between the model and the synthesized designs.
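The fidelity metric of Equations 5.4 and 5.5 can be transcribed directly: count the fraction of pairs (i, j) for which the model values M and the synthesis values S agree in their relative ordering.

```python
# Transcription of the fidelity metric (Equations 5.4 and 5.5).

def sign(x):
    return (x > 0) - (x < 0)

def fidelity(synthesis_values, model_values):
    n = len(synthesis_values)
    agreements = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if sign(synthesis_values[i] - synthesis_values[j])
        == sign(model_values[i] - model_values[j])
    )
    return 2.0 * agreements / (n * (n - 1))
```

A fidelity of 1.0 means the model preserves every pairwise ordering of the synthesized areas, which is exactly the property needed for it to rank candidate configurations correctly.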
These results further validate the applicability of the aforementioned area models for multiobjective optimization.

5.3. Experiments and Results

We performed multiobjective optimization of the aforementioned architecture using the search algorithms outlined in the previous section. To account for the effects of random number generation, the EA-based search and the random search were repeated five times each, and the average behavior from these repeated trials is reported. The number of solutions explored by each search algorithm in a single run is reported in Table 5.2.

Table 5.2: Number of solutions explored by the search methods.

    Search Method     Number of solutions explored
    Partial search    65,536
    Random search     6,000
    EA-based search   6,000

The execution time of each search algorithm was roughly proportional to the number of solutions explored, and the objective function evaluation for each solution took approximately 1 minute using a single computing node. As expected, the partial search algorithm explored the largest number of solutions. The parameters used for the EA-based search are listed in Table 5.3. The crossover and mutation operators were chosen to be one-point crossover and flip mutation, respectively. For a fair comparison, the number of solutions explored by the random search algorithm was set equal to that explored by the EA-based algorithm. The solution sets obtained by each search method were then further reduced to corresponding non-dominated solution sets using the concept of Pareto optimality. As described earlier, the objectives considered for this evaluation were the MI calculation error and the memory and area requirements of the solutions. Figure 5.4 shows the Pareto-optimized solution set obtained for each search method. Qualitatively, the Pareto front identified by the EA-based search is denser, more widely distributed, and demonstrates better diversity than those of the other search methods.

Table 5.3: Parameters used for the EA-based search.
    Parameter               Value
    Population size         200
    Number of generations   30
    Crossover probability   1.0
    Mutation probability    0.06

Figure 5.4: Pareto-optimized solutions identified by the various search methods: (a) partial search; (b) EA-based search; (c) random search.

Figure 5.5 compares the Pareto fronts obtained by the partial search and the EA-based search by overlaying them and illustrates that the EA-based search can identify better Pareto-optimized solutions, which indicates the superior quality of solutions obtained by this search method. Moreover, it must be noted that the EA-based search ran over 10 times faster than the partial search.

Figure 5.5: Qualitative comparison of solutions found by partial search and EA-based search: (a) area vs. MI calculation error; (b) memory vs. MI calculation error.

5.3.1. Metrics for Comparison of Pareto-optimized Solution Sets

Quantitative comparison of the Pareto-optimized solution sets is essential in order to compare more precisely the effectiveness of the various search methods. As with most real-world complex problems, the Pareto-optimal solution set is unknown for this application. We, therefore, employ the following two metrics to perform quantitative comparisons between different solution sets. We use the ratio of non-dominated individuals (RNI) to judge the quality of a given solution set, and the diversity of a solution set is measured using the cover rate. These performance measures are similar to those reported in [170] and are described below. The RNI is a metric that measures how close a solution set is to the Pareto-optimal solution set. Consider two solution sets P_1 and P_2 that each contain only non-dominated solutions. Let the union of these two sets be P_U. Furthermore, let P_ND be the set of all non-dominated solutions in P_U (P_ND ⊆ P_U). The RNI for the solution set P_i is then calculated as:

    RNI_i = |P_i ∩ P_ND| / |P_ND|,    (5.6)

where |·| denotes the cardinality of a set.
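The RNI computation of Equation 5.6, together with the Pareto-dominance test it relies on, can be sketched as follows (objectives are minimized; solutions are tuples of objective values):

```python
# Sketch of the RNI metric (Equation 5.6) and Pareto dominance.

def dominates(a, b):
    """True if a is at least as good as b in every objective and strictly
    better in at least one (for minimization)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def non_dominated(solutions):
    return [s for s in solutions
            if not any(dominates(t, s) for t in solutions if t != s)]

def rni(p_i, p_nd):
    """Ratio of non-dominated individuals for solution set p_i."""
    members = [s for s in p_i if s in p_nd]
    return len(members) / len(p_nd)

# Two toy solution sets over (error, area) objectives.
p1 = [(1.0, 4.0), (2.0, 2.0)]
p2 = [(3.0, 1.0), (3.0, 3.0)]
p_nd = non_dominated(p1 + p2)   # (3.0, 3.0) is dominated by (2.0, 2.0)
```

In this toy example, p1 contributes two of the three surviving non-dominated solutions and p2 only one, so p1 would score the higher RNI.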
The closer the RNI is to 100%, the more superior the solution set is and the closer it is to the Pareto-optimal front. We computed this metric for all the search algorithms previously described, and the results are presented in Figure 5.6. Our EA-based search offers a better RNI and, hence, superior-quality solutions compared with those achieved with either the partial or the random search.

Figure 5.6: Quantitative comparison of search methods using the ratio of non-dominated individuals (RNI).

The cover rate estimates the spread and distribution (or diversity) of a solution set in the objective space. Consider the region between the minimum and maximum of an objective function as being divided into an arbitrary number of partitions. The cover rate is then the ratio of the number of partitions covered by a solution set (that is, partitions in which there exists at least one solution with an objective value falling within the partition) to the total number of partitions. The cover rate C_k of a solution set for an objective function f_k can then be calculated as:

    C_k = N_k / N,    (5.7)

where N_k is the number of covered partitions and N is the total number of partitions. If there are multiple objective functions (m, for example), then the net cover rate can be obtained by averaging the cover rates over the individual objective functions:

    C = (1/m) Σ_{k=1}^{m} C_k.    (5.8)

The maximum cover rate is 1, and the minimum value is 0. The closer the cover rate of a solution set is to 1, the better coverage and the more even (more diverse) distribution it has. Because the Pareto-optimal front is unknown for our targeted application, the minimum and maximum values for each objective function were selected from the solutions identified by all the search methods. We used 20 partitions/decade for the MI calculation error (represented using a logarithmic scale), 1 partition for every 50 LUTs for the area requirement, and 1 partition for every 50 Kbits of memory requirement.
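The cover-rate metric of Equations 5.7 and 5.8 can be sketched as follows: partition each objective's observed range, count the fraction of partitions the solution set touches, and average over the objectives (uniform partitions are assumed here for simplicity).

```python
# Sketch of the cover-rate metric (Equations 5.7 and 5.8).

def cover_rate_1d(values, lo, hi, num_partitions):
    """C_k = N_k / N for one objective, using uniform partitions."""
    width = (hi - lo) / num_partitions
    covered = set()
    for v in values:
        index = min(int((v - lo) / width), num_partitions - 1)
        covered.add(index)
    return len(covered) / num_partitions

def cover_rate(solutions, bounds, num_partitions):
    """Net cover rate: average of C_k over all m objectives."""
    m = len(bounds)
    return sum(
        cover_rate_1d([s[k] for s in solutions], lo, hi, num_partitions)
        for k, (lo, hi) in enumerate(bounds)
    ) / m

# Two objectives with observed range [0, 1] each, 4 partitions per objective.
solutions = [(0.1, 0.1), (0.4, 0.9), (0.9, 0.6)]
rate = cover_rate(solutions, bounds=[(0.0, 1.0), (0.0, 1.0)],
                  num_partitions=4)
```

Here each objective covers three of its four partitions, giving a net cover rate of 0.75; a denser, more widely spread front would push this toward 1.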
The cover rate for all the search algorithms described earlier was calculated using the method outlined above, and the results are illustrated in Figure 5.7.

Figure 5.7: Quantitative comparison of search methods using the cover rate.

The EA-based search offers a better cover rate, which translates to a better range and diversity of solutions when compared with either the partial or the random search. In summary, our EA-based search outperforms the random search and is capable of offering more diverse and superior-quality solutions when compared with the partial search, using only 10% of the execution time.

5.3.2. Accuracy of Image Registration

An important performance measure for any image registration algorithm, especially in the context of medical imaging, is its accuracy. We did not choose registration accuracy as an objective function because of its dependence on the data (image pairs), the degree of misalignment between images, and the behavior of the optimization algorithm that is used for image registration. These factors, along with its execution time, may in our experience render registration accuracy an unsuitable objective function, especially if there is non-monotonic behavior with respect to the wordlengths of the design variables. Instead, we evaluated the effect of error in MI calculation on the image registration accuracy for a set of image pairs. This analysis was performed using three computed tomography image pairs for the Pareto-optimized solutions identified by all of the search algorithms that we experimented with. Image registration was performed using the limited-precision configuration corresponding to each solution using the aforementioned bit-true emulator. The result of registration was then compared with that obtained using the double-precision software implementation.

Figure 5.8: Relationship between MI calculation error and resulting image registration error.

Registration accuracy was calculated by comparing deformations at the vertices of a
cuboid (with size equal to half the image dimensions) located at the center of the image. The results of this analysis are illustrated in Figure 5.8. As expected, there is a good correlation between the MI calculation error and the accuracy of image registration. This demonstrates that the optimized tradeoff curves between MI calculation error and hardware cost, as identified by our reported analysis, can be used to represent the relationships between registration accuracy and hardware cost with high fidelity. This analysis also provides better insight into the sensitivity of image registration accuracy to the various design parameters.

5.3.3. Post-synthesis Validation

We performed further validation of the presented multiobjective optimization strategy through physical design synthesis. We identified three solutions from the Pareto-optimized set obtained using the EA-based search and synthesized the aforementioned architecture with the configurations corresponding to these solutions. These three configurations, which offer a gradual tradeoff between hardware resource requirements and error in MI calculation, are listed in the first column of Table 5.4. The wordlengths associated with each configuration correspond to the FWLs of the design variables identified in Table 5.1. The design was synthesized for these configurations, and the resulting realizations were implemented using an Altera Stratix II EP2S180F1508C4 FPGA (Altera Corporation, San Jose, CA) on a PCI prototyping board (DN7000K10PCI) manufactured by the Dini Group (La Jolla, CA). We then evaluated the performance of the synthesized designs and compared it with that predicted by the objective function models. The results of this analysis are summarized in Table 5.4 and are described below. The error in MI calculation was computed by comparing the MI value reported by the limited-precision FPGA implementation against that calculated by a double-precision software implementation.
The MI calculation error was averaged over three distinct image pairs and 50 randomly generated image transformations for each pair. These image pairs and the associated transformations were identical to those employed in the objective function calculation. In this case, the average MI calculation error obtained by all the design configurations was identical to that predicted by the objective function model. This is expected because of the bit-true nature of the emulator used to predict the MI calculation error. We repeated this calculation with a different set of three image pairs and 50 randomly generated new transformations associated with each image pair. The MI calculation error corresponding to this setup is reported in the second column of Table 5.4. The small difference when compared to the error predicted by the models is explained by the different sets of images and transformations used. The area and memory requirements corresponding to each configuration after synthesis are reported in columns three and four of Table 5.4, respectively. For comparison, we have also included the values predicted by the corresponding objective function models in parentheses. It must be noted that for all three configurations, the relative ordering based on Pareto-dominance relationships with respect to each objective function is identical for both post-synthesis and model-predicted values.

Table 5.4: Validation of the objective function models using post-synthesis results. The wordlengths in a design configuration correspond to the FWLs of the design variables identified earlier. Each entry gives the post-synthesis value, with the model-predicted value in parentheses.

    Design Configuration   MI Calculation Error      Area (No. of LUTs)   Memory (Mbits)   Image Registration Error (mm)
    {5, 6, 4, 9}           2.4×10^-3 (2.1×10^-3)     6527 (5899)          2.23 (2.23)      3.82
    {8, 9, 7, 12}          5.3×10^-4 (5.2×10^-4)     7612 (6754)          2.45 (2.45)      1.57
    {9, 12, 10, 17}        7.7×10^-5 (7.8×10^-5)     10356 (8073)         2.81 (2.81)      0.45
We also evaluated the accuracy of image registration performed using the implementation corresponding to each design configuration. For this analysis, we considered five computed tomography image pairs. The image registration results for a representative image pair are illustrated in Figure 5.9. Subfigures (a) and (b) show two distinct poses of the same subject, and (c) shows the fusion of (a) and (b) using a checkerboard pattern. The misalignment between the images is evident at the edges of the squares within the checkerboard pattern. Subfigures (d)-(f) show fusion images after registration using the identified design configurations. These configurations offer progressively decreasing image registration error (3.82 mm, 1.57 mm, and 0.45 mm, respectively) and result in correspondingly improved image alignment. The arrows indicate representative regions with misalignment that are better aligned after registration. The registration error was calculated by comparing the obtained registration results with those obtained using the double-precision software implementation. The average registration error for each configuration is reported in the last column of Table 5.4. There is a good correlation between the MI calculation error and the registration error, reinforcing the results presented in the previous section. This post-synthesis validation further demonstrates the efficacy of the presented optimization approach for reconfigurable implementation of image registration. It also demonstrates how the approach enables a designer to systematically choose an efficient system configuration to meet the registration accuracy requirements for a reconfigurable implementation. As described previously, we used the error in calculation of MI as one of the objective functions.
The Pareto-optimized solutions identified by the search schemes employed in this multiobjective optimization framework can then be used to select design configurations that offer the best (lowest) error in the calculation of MI for various hardware resource requirements. Through additional experiments, both simulation-based and post-synthesis validation-based, we have further demonstrated that there is a good correlation between the error in MI calculation and the image registration error.

Figure 5.9: Results of image registration performed using the high-speed, FPGA-based implementation for design configurations offering various registration errors.

In general, reducing the error in the calculation of MI will translate into improved image registration accuracy. This is expected, since MI is used as the similarity measure for performing image registration. Although the image registration algorithm accelerated in this work is deformable in nature, it achieves deformable alignment through a series of hierarchical, locally rigid registrations that use MI as a similarity measure. We therefore contend that the accuracy of the deformable registration is directly dependent on and highly correlated with the accuracy of MI calculation. Consequently, a design configuration offering better accuracy (lower error) in MI calculation will lead to superior image registration accuracy. This is also supported by the data presented in Table 5.4. Because the accuracy of image registration can play a crucial role in IGI applications (and other medical applications in general), we selected the system configuration that offered the lowest error in MI calculation (and, by extension, the lowest image registration error) from Table 5.4. This system configuration ({9, 12, 10, 17}) was then used for validation of deformable registration in the context of the novel IGI applications described in the following chapter.

5.4.
Summary

One of the main strengths of reconfigurable architectures over general-purpose processor-based implementations is their ability to use more streamlined representations for internal variables. This ability can often lead to superior performance, as demonstrated by our architectures for 3D image preprocessing and deformable image registration presented in the previous chapters, and by other researchers in the context of a myriad of other applications. Furthermore, this approach can result in optimized utilization of FPGA resources by employing just enough precision for each internal variable to satisfy the design requirements of a given application. Given this advantage, it is highly desirable to automate the derivation of optimized design configurations that offer varying degrees of tradeoff between implementation accuracy and hardware resources. Toward that end, this work has presented a framework for multiobjective optimization of finite-precision, reconfigurable implementations. This framework considers multiple conflicting objectives, such as hardware resource consumption and implementation accuracy, and systematically explores tradeoff relationships among the targeted objectives. This work has also demonstrated the applicability of EA-based techniques for efficiently identifying Pareto-optimized tradeoff relations in the presence of complex and nonlinear objective functions. The evaluation that we have performed in the context of the architecture for FPGA-based deformable image registration demonstrates that such an analysis can be used to enhance automated hardware design processes and to efficiently identify a system configuration that meets given design constraints. This approach may also be applied in the context of reconfigurable computing for identifying suitable design configurations that can be switched among at runtime.
Furthermore, the multiobjective optimization approach that we have presented is quite general and can be extended to a multitude of other signal processing applications.

Chapter 6: Clinical Applications

Earlier chapters of this dissertation have identified, described, and optimized the core components of an advanced image-guidance system for IGI applications. These components, namely real-time image processing and high-speed deformable image registration, enable the use of 3D image processing in the IGI workflow. This workflow is illustrated pictorially in Figure 6.1. For this example, we considered CT to be the intraprocedural imaging modality (labeled iCT) and PET to be the preprocedural imaging modality (labeled PET). It must be noted, however, that this workflow is only representative and can be extended to incorporate multiple imaging modality combinations (such as CT and contrast-enhanced CT, or MRI, etc.) for various IGI applications.

Figure 6.1: Integration of deformable registration into the IGI workflow.

This chapter validates the high-speed implementation of deformable image registration and demonstrates the feasibility of using these components to improve existing procedures and enable the development of novel image-guided applications. We consider two such clinical applications. First, we propose a strategy for intraprocedural radiation dose reduction in the context of CT-guided procedures. Second, we demonstrate the feasibility of incorporating PET into liver radiofrequency ablation, a common procedure for treating liver tumors.

6.1. Radiation Dose Reduction

6.1.1. Motivation

3D visualization is a critical need of image-guided interventions. To perform true 3D visualization, a volumetric image of the operative field is essential; this type of data is native to modern CT and MRI, but not to the 2D ultrasound or fluoroscopic imaging conventionally used for image-guided interventions [18, 20, 21].
Continuous real-time 3D imaging in an interventional suite is the first step toward equipping interventionists with enhanced visualization capabilities. However, continuous 3D imaging has been technologically difficult until recently. MRI remains slow, and although real-time 3D ultrasonography has recently been released, its image quality remains suboptimal compared with that of CT and MRI. Recently introduced multislice CT does not suffer from these limitations and can, in fact, image the operative field at high resolution and a very high frame rate (subsecond). As a result, many interventional procedures, such as liver biopsies and cryo- and radiofrequency ablations, are now routinely carried out under volumetric CT guidance [171-173]. With time, its speed, image resolution, and coverage will only improve, making it even more desirable for live intraprocedural imaging. Several imaging equipment manufacturers, Toshiba and Philips, for example, have announced the availability of 256-slice scanners [25, 174] providing sufficient coverage (8-12 cm) for most image-guided interventions. Radiation exposure to the patient and the interventionist, however, continues to be a concern with the use of continuous or on-demand CT. It is, therefore, necessary to acquire the intraprocedural images at a lower dose, such that the net radiation dose is within safe limits. Some previous studies have reported the use of low-dose CT [175-177]; however, those techniques were employed primarily for diagnostic and computer-aided detection applications and did not involve image registration.

6.1.2. Dose Reduction Strategy

Our primary radiation dose reduction strategy is to acquire a standard-dose preprocedural CT image and subsequently scan the dynamic operative field using low-dose CT. Using the high-speed deformable registration technique described earlier, the preprocedural CT image can then be registered to the low-dose intraprocedural CT images.
Registered preprocedural CT will then show the dynamic intraprocedural anatomy and will substitute for the low-dose CT images. These diagnostic-quality images can then be 3D rendered and used for intraprocedural guidance and navigation. The capability of viewing hidden vasculature using preprocedural contrast-enhanced CT, together with the additional capability of virtually inserting intraprocedural tools into the 3D renderings, will add a new dimension to CT-guided interventions. This proposed dose reduction strategy necessitates determining the threshold radiation dose for intraprocedural CT that still permits accurate image registration with preprocedural CT. The following sections describe a preliminary study to evaluate the registration accuracy with low-dose CT.

6.1.3. Evaluation of Registration Accuracy with Low-Dose CT

Quantifying the accuracy of deformable registration, in general, is a very difficult task because of the lack of a well-known gold standard. It is, however, necessary to judge the registration accuracy in the proposed application to determine the optimal radiation dose that does not sacrifice the precision of an image-guided procedure. Because our images are to be registered in the present case, our validation strategy has been to test how well the deformable registration algorithm recovers a user-introduced, known non-rigid misalignment. Our overall strategy can be described in the following main steps: 1) generate images representing the same anatomy at varying radiation doses; 2) introduce the same known deformation in the low-dose CT images; 3) preprocess the deformed low-dose CT images to improve SNR prior to registration; 4) register the preprocedural standard-dose image with the deformed intraprocedural (simulated) low-dose images using deformable image registration; and, finally, 5) compare the transformation field obtained after image registration with the original, user-introduced deformation field to calculate the registration accuracy at the various doses. The important steps in this workflow are illustrated in Figure 6.2 and are described below.

Figure 6.2: Important steps for evaluating registration accuracy with low-dose CT.

6.1.3.1. Generating Low-Dose CT Images

Low-dose images corresponding to a standard-dose abdominal scan were generated using the syngo-based Somaris/5 simulator from Siemens. This simulator models the noise and attenuation effects at lower radiation doses and can generate low-dose-equivalent images from tomographic projection data corresponding to an input standard-dose image. The performance and accuracy of this simulator have been reported previously [178]. This approach ensures that the scans at all radiation doses represent exactly the same anatomy. Example low-dose images generated using this simulator are shown in Figure 6.3, in which subfigure (a) shows a coronal slice of an input image at the standard dose (200 mAs), and subfigures (b)-(d) show the corresponding coronal slice of the low-dose images generated by the dose simulator at various doses.

Figure 6.3: Low-dose CT images generated by the dose simulator: a) 200 mAs, b) 50 mAs, c) 20 mAs, d) 10 mAs.

6.1.3.2. Creating Anatomically Realistic Deformations

The human body, and abdominal organs and tissues in particular, undergoes non-rigid deformation during day-to-day activities, respiratory and cardiac cycles, etc. These deformations manifest as misalignment between preprocedural and intraprocedural scans. Further misalignment is introduced by differences in patient position during imaging as well as by different scan parameter settings. To create a realistic deformation that incorporates all these effects, it is necessary to estimate the deformation from scans of the same anatomy taken on different days, thus ensuring sufficient temporal separation.
We, therefore, consider CT images of the same subject acquired at different times as a suitable image pair. We then identify several significant anatomical landmarks in both images. Deformation vectors between homologous anatomical landmarks in two such images represent the local misalignment at those landmarks. Using these vectors, the deformation field for the entire anatomy can then be approximated using thin-plate spline (TPS)-based interpolation [179]. TPS is an interpolation scheme based on radial basis functions and is capable of providing smooth 2D or 3D interpolation for non-uniformly spaced sample points. This technique has been applied extensively in various medical imaging applications [180, 181].

6.1.3.3. Image Preprocessing

Ultra-low-dose CT scans acquired during the procedure show exactly the same anatomy as a standard-dose scan would, but are characterized by a high level of quantum noise. These scans may not be acceptable for diagnostic purposes, but they contain sufficient information about the current anatomical state. From the perspective of MI-based deformable image registration, using these low-dose images as-is will cause dispersion of the mutual histogram (due to noise in an otherwise uniform structure), leading to poor image registration. Anisotropic diffusion filtering has been shown to be an effective processing step prior to advanced image processing [76, 83, 182, 183]. Figure 6.4 compares the performance of anisotropic diffusion filtering against other standard preprocessing techniques. A higher value of mutual information between the original and the filtered image after preprocessing, which represents increased structural similarity, indicates superior filtering performance for a given pair of images. We, therefore, enhance the low-dose CT images using anisotropic diffusion filtering prior to image registration. Anisotropic diffusion filtering, however, is an iterative process that can take up to minutes on a modern CPU.
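As an illustration of this filtering step, a minimal 1D Perona-Malik-style anisotropic diffusion sketch is shown below; the actual implementation operates on 3D volumes, and the parameter values here (conductance kappa, time step, iteration count) are arbitrary choices for the example, not those of our architecture:

```python
def anisotropic_diffusion_1d(signal, iterations=10, kappa=3.0, dt=0.2):
    """Perona-Malik-style diffusion: smooth small (noise) gradients while
    preserving large (edge) gradients via the conductance g."""
    u = list(signal)
    for _ in range(iterations):
        padded = [u[0]] + u + [u[-1]]  # replicate boundary values
        new = []
        for i in range(1, len(padded) - 1):
            east = padded[i + 1] - padded[i]
            west = padded[i - 1] - padded[i]
            # Conductance falls off for large gradients, so edges diffuse slowly.
            g_e = 1.0 / (1.0 + (east / kappa) ** 2)
            g_w = 1.0 / (1.0 + (west / kappa) ** 2)
            new.append(padded[i] + dt * (g_e * east + g_w * west))
        u = new
    return u

# Noisy step edge: +/-0.5 noise on a 0 -> 10 step.
noisy = [0.5, -0.5, 0.5, -0.5, 0.5, 10.5, 9.5, 10.5, 9.5, 10.5]
smooth = anisotropic_diffusion_1d(noisy)
# Noise in the flat regions is strongly attenuated, while the edge survives.
print(abs(smooth[1] - smooth[2]), smooth[7] - smooth[2])
```

In image terms, this suppresses the quantum noise that would disperse the mutual histogram, without blurring the anatomical boundaries that registration relies on.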
The FPGA-based real-time implementation of this operation, described earlier, can be employed to accelerate the execution of this step.

Figure 6.4: Comparison of techniques for preprocessing low-dose CT images. MI with the original image: standard dose (200 mAs), 3.49; ultra-low dose (10 mAs), 1.11; median filtered, 1.43; Gaussian filtered, 1.23; anisotropic diffusion filtered, 1.62.

6.1.3.4. Applying Deformation and Image Registration

The deformation field generated using TPS-based interpolation is applied to the low-dose images at all the simulated doses. This involves resampling the low-dose image onto a regular grid using the anatomical mapping provided by the TPS-generated deformation field. Image registration between these deformed low-dose images and the original standard-dose image yields a registration field, which attempts to estimate and recover the induced anatomically realistic deformation. The mean-squared difference between the deformation vectors at every voxel of the registration field and of the induced deformation field provides a measure of the registration accuracy at a particular radiation dose.

6.1.4. Experiments

An abdominal scan acquired under clinical settings at the standard dose (200 mAs) was selected for this study. The dose-correction feature of the scanner, which automatically modulates the radiation dose based on the anatomy being scanned, was turned off for this scan to keep the dose consistent across all slices. The acquired CT image measured 256×256×300 voxels, with voxel dimensions of 1.56 mm × 1.56 mm × 1.5 mm and a field of view restricted mostly to the lower thorax and abdomen. Four additional CT scans of the same subject, acquired at different times within a span of one to sixty days, were used for creating anatomically realistic deformations. These prior scans had dimensions of 256×256×(280-315) voxels and a resolution of 1.48 mm × 1.48 mm × 1.5 mm.
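The field-comparison measure of Section 6.1.3.4 can be sketched in the mean-magnitude form used when reporting the results, including the low-entropy masking described there (fields stored as flat lists of 3-vectors; the function name, mask threshold, and toy values are illustrative only):

```python
import math

def registration_error(rf, df, intensities, threshold=0.0):
    """Mean magnitude of the vector difference between the recovered
    registration field (rf) and the induced deformation field (df),
    averaged over voxels whose intensity exceeds a mask threshold."""
    errors = [
        math.dist(r, d)                       # Euclidean distance, 3D vectors
        for r, d, v in zip(rf, df, intensities)
        if v > threshold                      # mask out background voxels
    ]
    return sum(errors) / len(errors)

# Toy 4-voxel example: perfect recovery at the masked-in voxels gives zero error.
df = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 1.0), (5.0, 5.0, 5.0)]
rf = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 0.0)]
intensity = [100.0, 80.0, 120.0, 0.0]  # last voxel is background, masked out
print(registration_error(rf, df, intensity))  # 0.0
```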
Based on a predetermined, well-described list of 32 anatomical landmarks, a clinical expert identified and marked a set of homologous points in all the CT scans (one standard-dose scan and four older scans). From the expert-defined landmarks, TPS-based starting deformation fields (DF1, DF2, DF3, and DF4), one corresponding to each prior scan, were generated. Treating the standard-dose scan as a reference, we generated eleven low-dose scans (at 10, 15, 20, 25, 30, 40, 50, 70, 85, 100, and 150 mAs) using the Somaris/5 simulator; 10 mAs was the lowest setting possible in the current version of the simulator. Each of these scans (including the standard-dose scan) was deformed using the realistic deformation fields DF1, DF2, DF3, and DF4 described above. We registered these deformed and preprocessed low-dose images (reference images) with the original standard-dose image (floating image) using the deformable registration algorithm and its accelerated implementation described earlier. Alignment of the floating and reference images by deformable image registration yields a registration field (RFi) that maps each reference image voxel into the floating image coordinate space. Comparison of this registration field (RFi) with the originally introduced deformation field (DFj) was used to judge the registration accuracy for each dose.

6.1.5. Results

The result of deformable registration using the hardware implementation of the registration algorithm for a representative image pair generated using one of the deformations (DF1) is shown in Figure 6.5. Columns a) and b) show the two starting poses, and column c) shows the starting pose difference that the image registration must recover. The difference images after image registration at various CT doses are shown in columns d) to f). The top row shows coronal slices of the image, and the corresponding axial slices are shown in the bottom row. The software implementation produced qualitatively similar registration results.
Visually correct registration of the standard-dose image with the deformed images at various low doses (evident from the reduced features in the difference images) demonstrates the feasibility of deformable registration at low CT doses. The process of deformable registration attempts to recover any misalignment between the reference and floating images. A perfect registration would completely recover this misalignment and yield a non-rigid transformation field identical to the deformation field representing the original misalignment. A comparison between these two fields can therefore serve as a performance index for the registration accuracy. In this experiment, the introduced deformation field (DF) is known at every voxel, and the volume subdivision-based deformable registration algorithm generates a transformation field (RF) that provides the transformation at every voxel of the scan at a given dose. The average magnitude of the vector differences between these two fields at all doses for the software implementation is reported in Figure 6.6. This average was calculated over the region of the image that contains a sufficient part of the subject, and hence enough information to yield meaningful registration. Regions of the image that contain no information (very low entropy), such as the black areas surrounding the subject, were masked out using a simple threshold operation. The results show a maximum error of about 10% at the doses of 10 mAs and 20 mAs. Moreover, the maximum registration error at the lowest dose (10 mAs) is less than 2.5 mm, which is within acceptable limits for IGI applications. As expected, the average error improves steadily with dose.

Figure 6.5: Qualitative comparison of registration accuracy with low-dose CT. Columns: a) standard CT pose 1; b) standard CT pose 2; c) starting difference; d)-f) difference after registration at 200, 20, and 10 mAs, respectively.
We repeated the experiment with the high-speed implementation of the deformable registration algorithm described earlier and compared the registration field produced by the hardware implementation with the user-introduced deformation. The same optimization algorithm (downhill simplex) and an equal number of optimization iterations were used for the two implementations. Figure 6.6 also compares the average registration accuracy achieved using the software and hardware implementations; the blue curve corresponds to the software implementation and the red curve to the hardware implementation.

Figure 6.6: Average registration error with respect to dose using the software and FPGA-based implementations.

This comparison of registration accuracies indicates that the hardware implementation is capable of achieving comparable registration accuracy. The average execution time for the two implementations is reported in Table 6.1. The hardware implementation achieved a speedup of about 30 while offering comparable registration accuracy. Moreover, both implementations offer acceptable registration accuracy for most IGI applications and demonstrate the applicability of deformable image registration for lowering the radiation dose during intraprocedural imaging.

6.1.6. Summary

In this study, we have demonstrated successful registration of standard-dose abdominal CT images with lower-dose images of the same anatomy. Even at 10 mAs, the smallest dose achievable using the simulator, the registration accuracy was comparable to that achieved at the standard dose. Our results demonstrate a ten- to twentyfold reduction in radiation dose with the use of low-dose CT. Reduction of the radiation dose to safe levels is highly significant in that it enables navigating interventions using more powerful, multislice CT.
Moreover, through the use of the hardware-accelerated implementation, the deformable image registration that allows this reduction in radiation dose can be performed in a matter of minutes. This enables assimilation of this novel dose reduction strategy into the IGI workflow.

Table 6.1: Execution time for deformable image registration using low-dose CT.

  Software implementation:   11,976 s
  FPGA-based implementation: 392 s
  Speedup:                   30.55

The introduction of true 3D visualization made possible by safe, volumetric CT imaging of the intraprocedural anatomy may enable the development of novel image-guided interventional applications such as CT-guided minimally invasive surgery [149, 184].

6.2. Incorporation of PET into CT-Guided Liver Radiofrequency Ablation

I would like to thank Mr. Peng Lei from Prof. Shekhar's research group for his help with the quantitative validation aspect of the deformable registration presented in this section.

6.2.1. Motivation

The liver is a common site of both primary and metastatic cancers. Resection of malignant liver tumors has been shown to increase the 5-year survival rate of patients with liver cancer [185]. However, the majority (85%-90%) of hepatic tumors are considered unresectable at diagnosis [186], either because of their anatomic location, size, or number, or because of inadequate viable liver tissue and morbidity. Radiofrequency ablation (RFA), a minimally invasive procedure, is often the treatment of choice for patients who are not suitable candidates for resection [187]. RFA is a treatment technique that uses high-frequency alternating electrical current to destroy tissue cells by heating [188]. For small lesions (<5 cm in diameter), RFA has been reported to achieve 4-year survival rates comparable with those achieved with resection [189, 190]. RFA has conventionally been performed under fluoroscopic or ultrasound guidance.
With the advent of multislice CT, many of these procedures are being carried out under intraprocedural volumetric CT guidance. The volumetric CT scan provides better 3D orientation and a detailed anatomic structural map. However, some lesions (particularly small untreated masses or recurrent or residual tumors in a large treated mass) are not clearly visible (see Figure 2.1) because intraprocedural CT scanning is usually not contrast enhanced. Even in contrast-enhanced diagnostic CT images, hepatic lesions with abnormal metabolic activity are sometimes overlooked [191]. Consequently, local recurrence after liver RFA as a result of these missed tumors remains one of the major factors in relapse [192], which is in the range of 3%-39% [193]. Thus, precise targeting of lesions remains a challenging task when guided solely by volumetric CT, with or without contrast enhancement. Because malignancies are better characterized by increased metabolic activity than by CT number variations, PET (as a functional imaging modality) has higher sensitivity and specificity than CT for tumor localization. Active lesions show up clearly as regions of high uptake in a conventional PET scan. However, compared with CT, PET is devoid of the structural tissue details that are also important for intraprocedural targeting and needle placement. Moreover, PET is also challenged by slow scanning speed, radiation exposure risks, logistic challenges, etc. Currently, PET remains primarily a preprocedural imaging modality, used to identify a treatment site rather than to provide intraprocedural guidance during an ablative procedure. To combine the strengths and overcome the disadvantages of both PET and CT as intraprocedural imaging modalities, registration between PET and CT is essential. In a clinical study by Veit et al. [186], PET and CT fusion images were reported to greatly improve recognition and localization of liver masses.
Our work, therefore, is focused on demonstrating fast, automatic, and accurate deformable image registration between preprocedural PET and intraprocedural CT to improve localization of liver malignancies during RFA.

6.2.2. Registration of PET and CT

As described earlier, combining PET and CT images is crucial during liver RFA. PET, however, cannot be repeated intraprocedurally because of time and logistic challenges, as well as radiation risks [194]. Consequently, the interventionist must mentally map preprocedural PET data onto intraprocedural CT images to localize the lesion into which the RFA needle will be advanced, a subjective task dependent on operator expertise. Combined PET/CT scanners, which are based on mechanically achieved rigid registration, have emerged in recent years and have provided a significant improvement over separate CT and PET scanning. But in abdominal procedures such as liver RFA (unlike thoracic or brain surgery), non-rigid misalignment resulting from tissue deformation and respiratory motion can be significant, so rigid registration approaches such as combined PET/CT scanning and fiducial marker-based registration may misrepresent the actual transformation of the liver. Combined PET/CT scanners have been shown to be unable to fully compensate for involuntary non-rigid motions, such as those from respiration [185], and the resulting images have been found to have significant misregistration in areas close to the diaphragm. Moreover, as mentioned previously, PET has challenges including slow scanning speed and radiation exposure risks. A combined PET/CT scanner is therefore not an appropriate choice as an intraprocedural imaging modality in this application or in any other image-guided procedure. Intensity-based, retrospective deformable image registration and fusion together form the only alternative, which is what we propose here.
To be clinically practical, the registration algorithm must be automatic, which excludes manual segmentation-based registration. Moreover, as mentioned before, the registration must be achieved sufficiently fast so as not to obstruct the clinical workflow. The deformable registration algorithm described in this dissertation has previously been validated for intensity-based registration of whole-body PET and CT images [66] and of PET/CT images for liver RFA [78, 148]. In this work, we validate this algorithm and its FPGA-based high-speed implementation in the context of the aforementioned clinical application.

6.2.3. Experiments

This study was geared toward demonstrating the feasibility of image registration between preprocedural PET and intraprocedural CT in the context of liver RFA using the high-speed image registration solution described earlier. This retrospective study involved 20 CT-PET image pairs from patients who had undergone percutaneous abdominal RFA under CT guidance. The preprocedural PET scans had a size of 150 × 150 × (187-240) voxels and a resolution of 4.0 mm × 4.0 mm × 4.0 mm. Noncontrast-enhanced intraprocedural abdominal CT scans were acquired for guiding needle placement during the RFA procedures. The intraprocedural CT scans had a size of 512 × 512 × (35-62) voxels and a resolution of (0.78-1.17) mm × (0.78-1.17) mm × 5 mm.

6.2.3.1. Image Preprocessing

The intraprocedural CT images were acquired during the RFA with the RFA needle in the field of view; as a result, some metal artifacts were present in the CT images. We preprocessed the CT images using 3D median filtering and 3D anisotropic diffusion filtering to minimize these artifacts. To achieve a trade-off between maintaining CT resolution and obtaining nearly isotropic voxels, we resampled the intraprocedural CT images for all cases to dimensions of 256 × 256 × 128.
Resampling reduced the spatial resolution of the CT images in the x and y dimensions; however, the resulting images still had better spatial resolution than the preprocedural PET images (in general, the lower-resolution image limits the accuracy of intensity-based image registration). No preprocessing steps were applied to the preprocedural PET images.

6.2.3.2. Validation of Image Registration

We evaluated the accuracy of the deformable registration between PET and CT by comparing the alignment of several anatomic landmarks (3D locations within the images) as predicted by the algorithm (both software and FPGA-based implementations) against a reference. Because of the lack of a gold standard for validation of deformable registration algorithms, we took the ability of clinical experts to locate landmarks in both intraprocedural CT and preprocedural PET images as a suitable benchmark performance. We then contend that comparing the variability in landmark matching between algorithm- and expert-defined registrations with the variability among the three expert-defined correspondences between the landmarks is a reasonable way to evaluate the registration accuracy of the algorithm and to assess whether its performance is comparable to that of the experts. Our validation scheme is represented graphically in Figure 6.7. Three clinical experts, experienced in interpreting CT and PET images, were involved in the validation procedure. Each expert was asked to identify and mark anatomic landmarks identifiable in both PET and CT images. Because the location of a specific landmark as marked by an expert can vary slightly from expert to expert, a set of "test landmarks" was created for each case separately. This was achieved by defining the location of each landmark as the centroid of the expert-defined locations for that landmark in intraprocedural CT (represented by CT_TEST in Figure 6.7).
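The construction of a test landmark as the centroid of the expert-defined locations can be sketched as follows (the coordinates below are hypothetical):

```python
def centroid(points):
    """Componentwise mean of a list of 3D points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

# Hypothetical picks of one landmark by three experts in intraprocedural CT (mm).
expert_picks = [(101.0, 58.0, 33.0), (103.0, 60.0, 33.0), (102.0, 59.0, 36.0)]
ct_test = centroid(expert_picks)
print(ct_test)  # (102.0, 59.0, 34.0)
```

The same helper yields the average expert-defined transformed PET location by taking the centroid of the three expert-mapped points.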
The expert-defined transformation fields were then used to determine distinct sets of homologous preprocedural PET landmarks (PET_E1, PET_E2, and PET_E3, respectively), each representing the transformed locations of the test landmarks (CT_TEST) according to the manual registration performed by each expert. The average expert-defined transformed location (PET_EXPERT) was calculated as the centroid of PET_E1, PET_E2, and PET_E3. The algorithm-determined transformation field was used to determine a set of landmarks in the PET image (PET_ALGO) representing the transformed locations of the test landmarks after automatic deformable registration. These locations were calculated for both the software and the FPGA-based high-speed implementation of the algorithm. For each case, the mean error between PET_EXPERT and PET_ALGO was then evaluated to quantify the registration accuracy for that case.

Figure 6.7: Graphic illustration of the quantitative validation approach used in the context of deformable registration between intraprocedural CT and preprocedural PET.

To further evaluate the algorithm's performance in the context of inter-expert variability, we allocated the four sets of PET landmark points to separate groups of three sets each: reference group (PET_E1, PET_E2, and PET_E3); test group 1 (PET_ALGO, PET_E2, and PET_E3); test group 2 (PET_E1, PET_ALGO, and PET_E3); and test group 3 (PET_E1, PET_E2, and PET_ALGO). For each group, the mean difference (Euclidean distance) in the transformed location of corresponding landmarks was obtained for all pairwise combinations of sets of PET points within that group. The mean difference for each group was determined by averaging over all the cases. The variability of locations was then calculated for each group.
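The group variability computation reduces to a mean pairwise Euclidean distance between corresponding transformed landmarks; a minimal sketch (the landmark coordinates are hypothetical):

```python
import math
from itertools import combinations

def group_variability(group):
    """Mean pairwise Euclidean distance between corresponding landmarks
    across all pairs of landmark sets in the group."""
    dists = [
        math.dist(p, q)
        for set_a, set_b in combinations(group, 2)  # all set pairs in the group
        for p, q in zip(set_a, set_b)               # corresponding landmarks
    ]
    return sum(dists) / len(dists)

# Three hypothetical landmark sets (e.g. PET_E1, PET_E2, PET_E3), two landmarks each.
pet_e1 = [(10.0, 10.0, 10.0), (50.0, 40.0, 30.0)]
pet_e2 = [(11.0, 10.0, 10.0), (50.0, 41.0, 30.0)]
pet_e3 = [(10.0, 11.0, 10.0), (50.0, 40.0, 31.0)]
reference_group = [pet_e1, pet_e2, pet_e3]
print(round(group_variability(reference_group), 3))  # 1.138
```

Substituting PET_ALGO for one expert set gives the corresponding test-group variability, which is then compared against the reference group.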
If the group variability of test groups 1, 2, and 3 is statistically similar to that of the reference group, we can conclude that the algorithm and the experts agree on the PET location of a specific landmark in CT. If the group variability is statistically different, the algorithm's solution differs significantly from that of the experts. As mentioned earlier, we performed this analysis for both the software and the FPGA-accelerated implementation of the deformable registration algorithm.

6.2.4. Results

All the cases were successfully registered when evaluated both qualitatively and quantitatively. Both the software and hardware implementations yielded qualitatively similar results. The average execution time for these two implementations is reported in Table 6.2. The hardware implementation provides a speedup of around 30 and offers comparable registration accuracy, as presented later. Qualitative evaluation included visual assessment of the improvement in image alignment by the clinical experts. No visually apparent gross misregistration was found in any case. An example of registration using the FPGA-based high-speed implementation of intensity-based deformable registration is presented in Figure 6.8. For this example, axial, coronal, and sagittal views through the intraprocedural CT and preprocedural PET, as well as fused PET-CT images before and after deformable image registration, are shown. The first row shows the original intraprocedural CT image. The second row shows the unregistered preprocedural PET image. The third row presents the fusion of intraprocedural CT and preprocedural PET prior to image registration. Misalignment of anatomical structures in this fusion image is indicated by overlaid arrows. The final row illustrates the fusion of intraprocedural CT with the registered (through FPGA-based deformable image registration) PET image.
In all three views, deformable image registration between the PET and CT images provided superior image alignment when compared with the unregistered fusion images. Deformable registration recovered the misalignment between anatomical structures and lesions. For example, in Figure 6.8, the edges of anatomical structures in PET and CT agree quite well.

Figure 6.8: Registration of intraprocedural CT and preprocedural PET images using the FPGA-based implementation of deformable image registration.

Table 6.2: Execution time for deformable image registration using intraprocedural CT and preprocedural PET images.

    Software Implementation:     5217 s
    FPGA-based Implementation:   164 s
    Speedup:                     31.81

We performed the quantitative validation of deformable registration using the validation strategy described earlier. Table 6.3 presents the results of this validation for the software implementation. The interexpert variability (i.e., group variability) in the identification of each of the four landmarks was averaged over all 20 image pairs after software-based registration between intraprocedural CT and preprocedural PET. The t-tests showed no statistically significant difference between the reference group and any test group for any of the four landmarks (the reference group value lies within the 95% confidence intervals of all test groups), indicating that the algorithm's solutions were statistically similar to those of the experts. For quantitative validation of the FPGA-based implementation, we repeated the above validation procedure, replacing the software implementation with the FPGA-based implementation. Table 6.4 presents the results of this quantitative validation. The TRE, derived from the mean of ||PET_EXPERT - PET_ALGO|| over all cases and landmarks, was 6.7 mm and 7.0 mm for the software and hardware-accelerated implementations, respectively. This result is comparable with the accuracy reported earlier for deformable registration using whole-body 3D PET-CT images [66].
This indicates that the accuracy of the deformable image registration algorithm is approximately independent of the anatomy.

Table 6.3: Interexpert variability in landmark identification across 20 image pairs. PET_ALGO corresponds to the software implementation of the algorithm. Entries are the mean difference in landmark location (mm), with the 95% confidence interval (mm) and P value for each test group.

    Landmark                  Reference group             Group 1                      Group 2                      Group 3
                              (PET_E1, PET_E2, PET_E3)    (PET_ALGO, PET_E2, PET_E3)   (PET_E1, PET_ALGO, PET_E3)   (PET_E1, PET_E2, PET_ALGO)
    Dome of Liver             6.7                         6.8 (6.2, 7.4) P = 0.76      7.2 (6.6, 7.7) P = 0.11      7.0 (6.4, 7.6) P = 0.34
    Inferior Tip of Liver     6.5                         6.4 (5.9, 6.9) P = 0.76      6.2 (5.7, 6.7) P = 0.22      6.5 (6.0, 7.0) P = 0.95
    Right Kidney Upper Pole   6.1                         6.0 (5.4, 6.6) P = 0.74      6.2 (5.5, 6.8) P = 0.76      6.3 (5.7, 7.0) P = 0.52
    Right Kidney Lower Pole   5.8                         5.6 (5.1, 6.1) P = 0.43      5.7 (5.2, 6.2) P = 0.64      6.0 (5.5, 6.5) P = 0.39

The TRE obtained for the FPGA-based, high-speed implementation of deformable image registration is slightly inferior to that obtained using the software implementation. However, the t-tests indicate no statistically significant difference between the reference group and any test group for any of the four landmarks (the reference group value lies within the 95% confidence intervals of all test groups). This further indicates that the registration solutions provided by the FPGA-based implementation were statistically similar to those of the experts. Thus, the hardware solution developed in this dissertation not only reduces the execution time of deformable registration to a few minutes, but also offers accuracy that is statistically similar to that of a group of clinical experts. This fast and accurate implementation can, therefore, be used to incorporate preprocedural PET images during CT-guided liver RFA.

6.2.5. Summary

In this study, we have demonstrated successful registration of intraprocedural abdominal CT images acquired during liver RFA with preprocedural PET images.
Table 6.4: Interexpert variability in landmark identification across 20 image pairs. PET_ALGO corresponds to the FPGA-based implementation of the algorithm. Entries are the mean difference in landmark location (mm), with the 95% confidence interval (mm) and P value for each test group.

    Landmark                  Reference group             Group 1                      Group 2                      Group 3
                              (PET_E1, PET_E2, PET_E3)    (PET_ALGO, PET_E2, PET_E3)   (PET_E1, PET_ALGO, PET_E3)   (PET_E1, PET_E2, PET_ALGO)
    Dome of Liver             6.7                         7.1 (6.5, 7.8) P = 0.16      7.3 (6.6, 8.0) P = 0.08      7.2 (6.5, 7.9) P = 0.13
    Inferior Tip of Liver     6.5                         6.8 (6.3, 7.4) P = 0.32      6.5 (5.9, 7.0) P = 0.88      6.8 (6.2, 7.4) P = 0.38
    Right Kidney Upper Pole   6.1                         6.3 (5.6, 7.0) P = 0.50      6.5 (5.7, 7.2) P = 0.32      6.4 (5.7, 7.1) P = 0.36
    Right Kidney Lower Pole   5.8                         5.7 (5.2, 6.2) P = 0.70      5.9 (5.4, 6.4) P = 0.67      6.2 (5.8, 6.7) P = 0.06

This can enable incorporation of PET into RFA procedures through algorithmic image registration. The hardware-accelerated implementation of image registration, developed as part of this dissertation work, can perform this registration task in a matter of minutes and yet offers comparable accuracy. This represents a first significant step toward the integration of deformable registration into RFA procedures. With further technological advances, such as integration with the imaging equipment, this approach can be made routine in clinical settings. This would lead to precise ablation of the area of malignant activity, as indicated on PET, and avoid unnecessary ablation of healthy tissues. For large and multiple lesions, this technique may shorten procedure time and minimize post-procedural morbidity. Precise targeting of lesions could allow definitive and complete treatment, potentially reducing relapse rates and the number of repeat procedures.

Chapter 7: Conclusions and Future Work

7.1. Conclusion

Minimally invasive IGIs are time and cost efficient, minimize unintended damage to healthy tissue, and lead to faster patient recovery.
Consequently, these procedures are becoming increasingly popular. With the availability of high-speed volumetric imaging devices, there is an increasing thrust toward using 3D images for navigation and target delineation during IGIs. However, processing and analysis of these volumetric images, while meeting the on-demand performance requirements of IGI applications, remains a challenging task. To address this problem, this dissertation has presented core components of an advanced image-guidance system that allow high-speed processing and analysis of these images. The execution performance, along with the accuracy and the compact and reconfigurable nature of these components, enables their integration into clinical applications.

Image preprocessing and enhancement is an important prerequisite step in the IGI workflow prior to advanced image analysis and visualization. Chapter 3 presented a novel FPGA-based architecture for real-time preprocessing of volumetric images acquired during IGIs. We presented a brick-caching scheme that allows efficient, low-latency access to sequential 3D neighborhoods, a scenario common to most filtering operations. We introduced an FPGA-based implementation of 3D anisotropic diffusion filtering. This design takes advantage of the symmetries present in the Gaussian kernel and implements this filtering kernel with a reduced number of multipliers. We further presented a linear systolic array-based architecture for accelerated implementation of 3D median filtering. The developed architecture enables 3D anisotropic diffusion filtering and 3D median filtering of intraprocedural images at a rate of 50 fps, which is faster than the current acquisition speeds of most intraprocedural imaging modalities. The solution presented offers real-time performance, is compact and accurate, and is, hence, suitable for integration into the IGI workflow.
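The multiplier reduction from kernel symmetry can be illustrated in software: for a symmetric kernel, mirrored samples are pre-added so that each coefficient pair needs only one multiply. This is a sketch of the arithmetic idea in 1D, not the FPGA datapath itself; the 5-tap kernel coefficients are illustrative binomial (Gaussian-like) weights.

```python
import numpy as np

# Symmetric 5-tap Gaussian-like kernel (illustrative coefficients).
kernel = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0

def filter_naive(x, k):
    """Direct filtering: one multiply per tap (5 per output sample)."""
    n = len(k)
    return np.array([np.dot(x[i:i + n], k) for i in range(len(x) - n + 1)])

def filter_symmetric(x, k):
    """Exploit kernel symmetry: pre-add mirrored samples so only
    ceil(n/2) multipliers are needed per output (3 instead of 5)."""
    n = len(k)
    half = (n + 1) // 2
    out = []
    for i in range(len(x) - n + 1):
        w = x[i:i + n]
        acc = 0.0
        for j in range(half):
            # Adder combines the mirrored pair; one multiply per pair.
            s = w[j] + w[n - 1 - j] if j != n - 1 - j else w[j]
            acc += k[j] * s
        out.append(acc)
    return np.array(out)

x = np.arange(10, dtype=float)
assert np.allclose(filter_naive(x, kernel), filter_symmetric(x, kernel))
```

In hardware, each saved multiply is a saved multiplier block, which is why the symmetric formulation shrinks the diffusion kernel's footprint.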
Furthermore, additional filtering kernels that are based on neighborhood operations (for example, general-purpose convolution) can be easily incorporated into the same framework. As IGI applications become increasingly popular, intraprocedural imaging modalities continue to offer wider coverage and higher imaging speed. Thus, there is a corresponding need for real-time processing of these images. The real-time performance of our design, along with its throughput of one voxel per cycle, can cater to these 4D (3D + time) image processing needs.

Image registration between intra- and preprocedural images is a fundamental need in the IGI workflow. To facilitate that, Chapter 4 presented a novel FPGA-based architecture for high-speed implementation of MI-based deformable image registration. This architecture achieved voxel-level parallelism through a pipelined implementation and employed several strategies to address the fundamental bottleneck in intensity-based image registration, namely memory access management. As a result of these enhancements, the presented architecture is capable of achieving a high voxel processing rate and a speedup of about 30, and consequently reduces the execution time of deformable registration from hours to only a few minutes. The results of the qualitative and quantitative validation demonstrate that this performance improvement does not significantly compromise the accuracy of deformable registration. Further clinical validation performed in the context of novel IGI applications illustrated the potential of this implementation to enable improved target delineation during image-guided interventions through deformable registration with preprocedural images. The robustness, speed, and accuracy offered by this architecture, in conjunction with its compact implementation, make it ideally suited for integration into the IGI workflow.
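The quantity at the heart of the accelerated architecture, mutual information from a joint intensity histogram, can be sketched compactly in software. This is an illustrative reference computation with synthetic volumes, not the architecture's fixed-point implementation; the bin count (32) and volume sizes are arbitrary choices for the example.

```python
import numpy as np

def mutual_information(ri, fi, bins=32):
    """Mutual information of a reference image (ri) and a floating
    image (fi), computed from their joint intensity histogram."""
    mh, _, _ = np.histogram2d(ri.ravel(), fi.ravel(), bins=bins)
    p = mh / mh.sum()                      # joint probability
    px = p.sum(axis=1, keepdims=True)      # marginal of ri
    py = p.sum(axis=0, keepdims=True)      # marginal of fi
    nz = p > 0                             # avoid log(0) terms
    return float((p[nz] * np.log(p[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(1)
img = rng.integers(0, 256, (16, 16, 16)).astype(float)
noisy = img + rng.normal(0, 20, img.shape)           # well-aligned analogue
shuffled = rng.permutation(img.ravel()).reshape(img.shape)  # unrelated

# A volume shares far more information with a noisy copy of itself
# than with a shuffled (statistically independent) volume, which is
# why maximizing MI drives the images into alignment.
assert mutual_information(img, noisy) > mutual_information(img, shuffled)
```

An MI-based optimizer repeatedly evaluates this measure for candidate transformations, which is exactly the inner loop the FPGA pipeline accelerates.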
Accurate, robust, and real-time deformable image registration between intra- and preprocedural images has been an unmet need, critical to the success of image-guided procedures. The work presented in this dissertation constitutes an important first step toward meeting this goal. With further algorithmic and hardware improvements, geared toward enhancing its accuracy and performance, this approach has the potential to elevate the precision of current procedures and expand the scope of IGI to moving and deformable organs.

The work presented in this dissertation achieved superior performance through custom design on a reconfigurable computing platform, by addressing the fundamental bottlenecks in the targeted computationally intensive applications. One of the primary strengths of reconfigurable architectures over general-purpose processor-based implementations is their ability to utilize more streamlined representations for internal variables. This approach can result in optimized utilization of FPGA resources, by employing just enough precision for each internal variable to satisfy the design requirements of a given application. Given this advantage, it is highly desirable to automate the derivation of optimized design configurations that offer varying degrees of tradeoff between implementation accuracy and hardware resources. Toward that end, this dissertation has presented a framework for multiobjective optimization of finite-precision, reconfigurable implementations. This framework considered multiple conflicting objectives, such as hardware resource consumption and implementation accuracy, and systematically explored tradeoff relationships among the targeted objectives. This dissertation has also demonstrated the applicability of EA-based techniques for efficiently identifying Pareto-optimized tradeoff relations in the presence of complex and nonlinear objective functions.
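The Pareto-optimality notion underlying this tradeoff exploration can be made concrete with a small sketch. A configuration is kept only if no other configuration is at least as good in both objectives; the (quantization error, LUT count) pairs below are hypothetical, stand-ins for candidate finite-precision designs.

```python
def pareto_front(configs):
    """Return the non-dominated configurations. Each config is a
    tuple (error, area); both objectives are minimized, mirroring
    the accuracy-vs-resource tradeoff explored by the framework."""
    front = []
    for c in configs:
        dominated = any(
            o[0] <= c[0] and o[1] <= c[1] and o != c for o in configs)
        if not dominated:
            front.append(c)
    return front

# Hypothetical (implementation error, LUT count) pairs for candidate
# finite-precision design configurations.
candidates = [(0.10, 1200), (0.05, 2000), (0.20, 900),
              (0.15, 1500), (0.05, 2500)]
front = pareto_front(candidates)
# (0.15, 1500) is dominated by (0.10, 1200); (0.05, 2500) by (0.05, 2000).
print(front)
```

An EA-based search maintains and evolves exactly such a non-dominated set, rather than collapsing the objectives into a single weighted cost.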
The evaluation that this work has conducted in the context of the FPGA-based architecture for deformable image registration demonstrated that such an analysis can be used to enhance automated hardware design processes and to efficiently identify a system configuration that meets given design constraints. This approach may also be applied in the context of reconfigurable computing for identifying suitable design configurations that can be switched among at runtime. Furthermore, the multiobjective optimization approach presented in this dissertation is quite general and can be extended to a multitude of other signal processing applications.

We have extensively validated the applicability of our approach in the context of CT-guided interventions. In the first clinical application, we demonstrated successful registration of standard-dose abdominal CT images with lower-dose images of the same anatomy. Our results demonstrated a ten- to twentyfold reduction in intraprocedural radiation dose through the use of low-dose CT. Reduction of radiation dose to safe levels is highly significant in that it enables navigating interventions using more powerful, multislice CT. Moreover, through the use of the hardware-accelerated implementation, deformable image registration (which enables this reduction in radiation dose) can be performed in a matter of minutes. This enables assimilation of this novel dose-reduction strategy into the IGI workflow. In the second IGI application, we further demonstrated successful registration of intraprocedural abdominal CT images acquired during liver RFA with preprocedural PET images. This can enable incorporation of PET into RFA procedures through algorithmic image registration. We demonstrated that the hardware-accelerated implementation of image registration, developed in this dissertation work, can perform this registration task in a matter of minutes and yet offers comparable accuracy.
Our approach represents a first significant step toward the integration of 3D image processing and deformable image registration into image-guided procedures. This approach can not only improve target delineation and reduce radiation dose but can also trigger the development of novel image-guided procedures. With further technological advances, such as integration with the imaging and visualization equipment, and additional enhancements in speed and accuracy, this approach can be made routine in clinical settings. It has the potential to elevate the precision of current IGI procedures, expand the scope of IGI to moving and deformable organs, and thereby provide a new level of sophistication and accuracy during IGIs.

7.2. Future Work

The FPGA-based architecture for real-time implementation of image preprocessing operations is capable of providing a high voxel throughput and a volumetric processing rate higher than the acquisition speed of most current-generation imaging modalities. This architecture presented implementations of 3D anisotropic diffusion filtering and 3D median filtering, the preprocessing steps most commonly used in the context of IGI. However, the core components of this architecture, in particular the memory controller and the brick-caching scheme, are general enough to allow real-time realization of a range of image processing kernels based on neighborhood operations. For example, a kernel for general-purpose 3D convolution reported by Venugopal et al. [195] can easily be incorporated into the same image processing framework. Similarly, certain morphological and contrast-enhancement operations that are based on neighborhood operations can also be supported by the same framework while providing equivalent voxel throughput. The current implementation of this framework supports sequential image processing operations through static reconfiguration of the filtering kernels.
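The reason these kernels all fit the same framework is that they share one access pattern: visit every voxel, gather its 3×3×3 neighborhood, and reduce it with some operation. A minimal software sketch of that shared pattern (not the brick-cached hardware) makes the point; the volume and the three example operations are illustrative.

```python
import numpy as np

def neighborhood_filter(vol, op, r=1):
    """Apply a generic reduction over each (2r+1)^3 neighborhood --
    the access pattern shared by convolution, median, and
    morphological kernels in the same framework."""
    out = np.empty(vol.shape, dtype=float)
    padded = np.pad(vol, r, mode='edge')   # replicate border voxels
    for z in range(vol.shape[0]):
        for y in range(vol.shape[1]):
            for x in range(vol.shape[2]):
                win = padded[z:z + 2*r + 1,
                             y:y + 2*r + 1,
                             x:x + 2*r + 1]
                out[z, y, x] = op(win)     # only this reduction varies
    return out

vol = np.arange(27, dtype=float).reshape(3, 3, 3)
smoothed = neighborhood_filter(vol, np.mean)   # box (convolution) kernel
filtered = neighborhood_filter(vol, np.median) # 3D median filter
eroded = neighborhood_filter(vol, np.min)      # morphological erosion
```

Only the per-window reduction changes between kernels, which is why swapping the filtering operation leaves the memory controller and caching scheme untouched.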
With the advent of modern FPGAs that are capable of runtime reconfiguration, it is possible to support adaptive image preprocessing by changing the preprocessing steps or tuning the filtering kernel parameters at run-time. This will enable the preprocessing system to adapt to the requirements of subsequent advanced image processing applications (for example, registration, volume rendering, or segmentation).

The architecture presented for accelerated calculation of MI is capable of achieving deformable registration between a pair of images of size 256 × 256 × 256 in about 6 minutes, while providing accuracy comparable to a software implementation. This represents a significant first step toward enabling integration of deformable registration into the IGI workflow. Further acceleration of the aforementioned registration algorithm, to satisfy the interactive requirements of IGIs, can be achieved through additional strategies. First, the current architecture uses the same external memory module to store both the RI and the FI. Storing these images in two separate memory modules will allow their independent parallel access. This will eliminate the need to prefetch the RI voxels and thus provide speedup. In addition, using high-speed static random access memory (SRAM) modules for storing the randomly accessed FI is likely to provide further speedup by providing faster access to the FI with minimal latencies. Second, as shown by Studholme et al. [64], varying the MH size between 32 × 32 and 256 × 256 does not significantly affect the accuracy of MI-based registration. Based on this observation, the size of the MH within the FPGA-based implementation can be adaptively reduced with every level of subdivision. This will reduce the overhead of clearing the MH for smaller subvolumes, thereby lending additional speedup.
Third, the overhead of the communication time required for exchanging the transformation matrix and the calculated MI value between the host and the MI calculator can be minimized by reducing the communication latency. This will provide additional speedup, especially at finer levels of image subdivision, where the computation time becomes comparable to the communication time. Most modern FPGAs support an embedded hard or soft processor core that can be utilized to implement the optimization algorithm. Thus, both components of the registration routine would be located on the same platform, thereby reducing the communication latency. Finally, as described earlier, the registration algorithm optimizes the individual subvolumes at a given level of subdivision sequentially, but independently of each other. Thus, using multiple FPGA modules in parallel, it is possible to optimize these subvolumes simultaneously. This multi-FPGA implementation will likely provide near-linear speedup. This architecture can also be incorporated into a broader, heterogeneous computing framework such as the one described by Plishker et al. [121]. All these strategies, in combination, can further reduce the execution time of deformable image registration and ultimately achieve near-real-time performance for its seamless integration into IGI applications.

The framework we presented for optimization of finite-precision implementations considers multiple conflicting objectives, such as hardware resource consumption and implementation accuracy, and systematically explores tradeoff relationships among the targeted objectives. Although this framework was developed and validated in the context of FPGA-based image registration, the presented optimization approach is quite general and can be extended to many signal processing applications beyond the medical image processing domain.
However, to demonstrate this capability, further validation of this optimization strategy must be performed in the context of multiple signal processing applications with known benchmarks. The hardware area models we adopted while developing this framework estimate FPGA area consumption using the number of look-up tables required. Modern FPGA families, however, provide a large number of special functional units that may be utilized for performing arithmetic operations such as multiplication and addition. Although we demonstrated the high fidelity of the adopted area models, enhancing these models to take into account the utilization of specialized functional units will better represent the area requirements of a design configuration and, in general, provide more accurate area estimation.

In addition to the approaches mentioned above, the framework for optimization of finite-precision implementations can be enhanced in the following ways. First, the search methods could be refined to explore the search space more efficiently. For example, the multiobjective optimization could be preceded by univariable simulations that help reduce the size of the search space. The parameters used for the EA-based search, such as the representation scheme, population size, and crossover and mutation operators, can be optimized as well. Second, the framework could be tuned for automated selection of search methods that are ideally suited for a given problem. For example, for small design search spaces, a technique based on exhaustive search could be utilized. Furthermore, EA-based search schemes could be enhanced by exploring selection schemes based on techniques such as NSGA-II, epsilon dominance, and quality indicators. Finally, this framework and/or the Pareto-optimized solutions identified by it can be incorporated into automated design optimization flows.
This will enable selection of optimized design configurations that balance the tradeoff between implementation accuracy and hardware cost at run-time. Further, additional objective functions, such as power requirements and operating frequency, could also be incorporated into this framework to comprehensively capture the effects of various design configurations.

Bibliography

[1] W. Charboneau, "Image-guided Therapy is New Pillar in Patient Care," Radiological Society of North America, New Horizons Lecture, 2006.
[2] C. Fujioka, J. Horiguchi, M. Ishifuro, H. Kakizawa, M. Kiguchi, N. Matsuura, M. Hieda, T. Tachikake, F. Alam, T. Furukawa, and K. Ito, "A feasibility study: Evaluation of radiofrequency ablation therapy to hepatocellular carcinoma using image registration of preoperative and postoperative CT," Academic Radiology, vol. 13(8), pp. 986-994, Aug 2006.
[3] D. E. Heron, R. S. Andrade, and R. P. Smith, "Advances in image-guided radiation therapy--the role of PET-CT," Medical Dosimetry, vol. 31(1), p. 3, 2006.
[4] P. Veit, C. Kuehle, T. Beyer, H. Kuehl, A. Bockisch, and G. Antoch, "Accuracy of combined PET/CT in image-guided interventions of liver lesions: an ex-vivo study," World Journal of Gastroenterology, vol. 12(15), pp. 2388-2393, 2006.
[5] V. Vilgrain, "Tumour detection in the liver: role of multidetector-row CT," European Radiology, p. D85, 2005.
[6] J. T. Yap, J. P. J. Carney, N. C. Hall, and D. W. Townsend, "Image-guided cancer therapy using PET/CT," Cancer Journal, vol. 10(4), pp. 221-233, 2004.
[7] K. Benkrid, D. Crookes, and A. Benkrid, "Design and implementation of a novel algorithm for general purpose median filtering on FPGAs," Proc. of the IEEE International Symposium on Circuits and Systems, ISCAS, vol. 4, pp. 425-428, 2002.
[8] T. Gijbels, P. Six, L. Van Gool, F. Catthoor, H. De Man, A. Oosterlinck, J. Rabaey, P. M. Chau, and J. Eldon, "A VLSI-architecture for parallel non-linear diffusion with applications in vision," Proc.
IEEE Workshop on VLSI Signal Processing, pp. 398-407, 1994.
[9] C. L. Lee and C. W. Jen, "A bit-level scalable median filter using simple majority circuit," Proc. of IEEE International Symposium on VLSI Technology, Systems and Applications, pp. 174-177, 1989.
[10] M. Rumpf and R. Strzodka, "Nonlinear Diffusion in Graphics Hardware," Proceedings of IEEE TCVG Symposium on Visualization, pp. 75-84, 2001.
[11] K. Wiehler, J. Heers, C. Schnorr, H. S. Stiehl, and R. R. Grigat, "A one-dimensional analog VLSI implementation for nonlinear real-time signal preprocessing," Real-Time Imaging, vol. 7(1), pp. 127-142, 2001.
[12] C. Kremser, C. Plangger, R. Bosecke, A. Pallua, F. Aichner, and S. R. Felber, "Image registration of MR and CT images using a frameless fiducial marker system," Magnetic Resonance Imaging, vol. 15(5), pp. 579-585, 1997.
[13] J. Weese, G. P. Penney, P. Desmedt, T. M. Buzug, D. L. G. Hill, and D. J. Hawkes, "Voxel-based 2-D/3-D registration of fluoroscopy images and CT scans for image-guided surgery," IEEE Transactions on Information Technology in Biomedicine, vol. 1(4), pp. 284-293, 1997.
[14] R. C. Susil, J. H. Anderson, and R. H. Taylor, "A single image registration method for CT guided interventions," in Medical Image Computing and Computer-Assisted Intervention - MICCAI, Berlin, Germany, 1999, pp. 798-808.
[15] G. P. Penney, J. M. Blackall, M. S. Hamady, T. Sabharwal, A. Adam, and D. J. Hawkes, "Registration of freehand 3D ultrasound and magnetic resonance liver images," Medical Image Analysis, vol. 8(1), pp. 81-91, 2004.
[16] B. J. Wood, H. Zhang, A. Durrani, N. Glossop, S. Ranjan, D. Lindisch, E. Levy, F. Banovac, J. Borgert, S. Krueger, J. Kruecker, A. Viswanathan, and K. Cleary, "Navigation with electromagnetic tracking for interventional radiology procedures: A feasibility study," Journal of Vascular and Interventional Radiology, vol. 16(4), pp. 493-505, 2005.
[17] G. Dorfman, J. Emond, B. Hamilton, J. Haller, T. Smith, and L.
Clarke, "Report of the NIH/NSF Group on Image-Guided Interventions," 2004.
[18] D. J. Hawkes, J. McClelland, C. Chan, D. L. G. Hill, K. Rhode, G. P. Penney, D. Barratt, P. J. Edwards, and J. M. Blackall, "Tissue deformation and shape models in image-guided interventions: A discussion paper," Medical Image Analysis, vol. 9(2), p. 163, 2005.
[19] T. J. Carter, M. Sermesant, D. M. Cash, D. C. Barratt, C. Tanner, and D. J. Hawkes, "Application of soft tissue modelling to image-guided surgery," Medical Engineering & Physics, vol. 27(10), pp. 893-909, 2005.
[20] J. de Mey, W. Vincken, B. Op de Beeck, M. Osteaux, M. De Maeseneer, M. Noppen, M. Meysman, and M. Vanhoey, "Real time CT-fluoroscopy: Diagnostic and therapeutic applications," European Journal of Radiology, vol. 34(1), p. 32, 2000.
[21] B. D. Fornage, H. M. Kuerer, G. V. Babiera, A. N. Mirza, F. C. Ames, N. Sneige, M. I. Ross, B. S. Edeiken, and L. A. Newman, "Small (< or = 2-cm) breast cancer treated with US-guided radiofrequency ablation: feasibility study," Radiology, vol. 231(Apr), p. 215, 2004.
[22] R. L. Galloway, "The process and development of image-guided procedures," Annual Review of Biomedical Engineering, vol. 3, pp. 83-108, 2001.
[23] T. M. Peters, "Image-guided surgery: From X-rays to virtual reality," Computer Methods in Biomechanics and Biomedical Engineering, vol. 4(1), p. 27, 2000.
[24] T. M. Peters, "Image-guidance for surgical procedures," Physics in Medicine & Biology, vol. 51(14), p. R505, 2006.
[25] Philips Medical Systems, "256-slice Brilliance iCT Scanner," http://www.medical.philips.com/main/news/content/file_1650.html.
[26] GE Healthcare, "Signa OpenSpeed 0.7T," http://www.gehealthcare.com/usen/mr/open_speed/index.html.
[27] Terarecon Inc., "Aquarius iNtuition," www.terarecon.com.
[28] Vital Images Inc., "Vitrea Software," www.vitalimages.com.
[29] M. Das, F. Sauer, U. J. Schoepf, A. Khamene, S. K. Vogt, S. Schaller, R. Kikinis, E. vanSonnenberg, and S. G.
Silverman, "Augmented reality visualization for CT-guided interventions: system description, feasibility, and initial evaluation in an abdominal phantom," Radiology, vol. 240(1), p. 230, 2006.
[30] F. Sauer, "Image registration: Enabling technology for image guided surgery and therapy," in IEEE International Conference on Engineering in Medicine and Biology, 2005, pp. 7242-7245.
[31] W. Hendee, "Special Report: Biomedical Imaging Research Opportunities Workshop III. A Summary of Findings and Recommendations," Medical Physics, vol. 33(2), pp. 274-277, 2006.
[32] J. B. A. Maintz and M. A. Viergever, "A survey of medical image registration," Medical Image Analysis, vol. 2(1), pp. 1-36, 1998.
[33] D. L. G. Hill, "Medical image registration," Physics in Medicine & Biology, vol. 46(1), p. 1, 2001.
[34] L. Lemieux, N. D. Kitchen, S. W. Hughes, and D. G. Thomas, "Voxel-based localization in frame-based and frameless stereotaxy and its accuracy," Medical Physics, vol. 21(8), p. 1301, 1994.
[35] T. Peters, B. Davey, P. Munger, R. Comeau, A. Evans, and A. Olivier, "Three-dimensional multimodal image-guidance for neurosurgery," IEEE Transactions on Medical Imaging, vol. 15(2), pp. 121-128, 1996.
[36] C. R. Maurer, Jr., J. M. Fitzpatrick, M. Y. Wang, R. L. Galloway, Jr., R. J. Maciunas, and G. S. Allen, "Registration of head volume images using implantable fiducial markers," IEEE Transactions on Medical Imaging, vol. 16(4), pp. 447-462, 1997.
[37] R. D. Bucholz, K. R. Smith, J. M. Henderson, L. L. McDurmont, and D. W. Schulze, "Intraoperative localization using a three-dimensional optical digitizer," in Clinical Applications of Modern Imaging Technology, Los Angeles, CA, USA, 1993, pp. 312-322.
[38] S. A. Tebo, D. A. Leopold, D. M. Long, S. J. Zinreich, and D. W. Kennedy, "An optical 3D digitizer for frameless stereotactic surgery," IEEE Computer Graphics and Applications, vol. 16(1), pp. 55-64, 1996.
[39] S. Fang, R. Raghavan, and J. T.
Richtsmeier, "Volume morphing methods for landmark-based 3D image deformation," in SPIE Medical Imaging 1996: Image Processing, 1996, pp. 404-415.
[40] Y. Ge, J. M. Fitzpatrick, J. R. Votaw, S. Gadamsetty, R. J. Maciunas, R. M. Kessler, and R. A. Margolin, "Retrospective registration of PET and MR brain images: an algorithm and its stereotactic validation," Journal of Computer Assisted Tomography, vol. 18(5), p. 800, 1994.
[41] D. N. Levin, C. A. Pelizzari, G. T. Chen, C. T. Chen, and M. D. Cooper, "Retrospective geometric correlation of MR, CT, and PET images," Radiology, vol. 169(3), p. 817, 1988.
[42] H. Rusinek, W. Tsui, A. V. Levy, M. E. Noz, and M. J. de Leon, "Principal Axes and Surface Fitting Methods for Three-Dimensional Image Registration," Journal of Nuclear Medicine, vol. 34(11), pp. 2019-2024, 1993.
[43] T. G. Turkington, J. M. Hoffman, R. J. Jaszczak, J. R. MacFall, C. C. Harris, C. D. Kilts, C. A. Pelizzari, and R. E. Coleman, "Accuracy of surface fit registration for PET and MR brain images using full and incomplete brain surfaces," Journal of Computer Assisted Tomography, vol. 19(1), p. 117, 1995.
[44] J. P. Thirion, "Image matching as a diffusion process: an analogy with Maxwell's demons," Medical Image Analysis, vol. 2(3), p. 243, 1998.
[45] P. M. Thompson, "A surface-based technique for warping 3-dimensional images of the brain," IEEE Transactions on Medical Imaging, vol. 15(4), p. 1, 1996.
[46] J. V. Hajnal, D. J. Hawkes, and D. L. G. Hill, Medical Image Registration. Boca Raton: CRC Press, 2001.
[47] J. P. W. Pluim, J. B. A. Maintz, and M. A. Viergever, "Mutual-information-based registration of medical images: a survey," IEEE Transactions on Medical Imaging, vol. 22(8), pp. 986-1004, 2003.
[48] M. Holden, D. L. G. Hill, E. R. E. Denton, J. M. Jarosz, T. C. Cox, T. Rohlfing, J. Goodey, and D. J. Hawkes, "Voxel similarity measures for 3-D serial MR brain image registration," IEEE Transactions on Medical Imaging, vol. 19(2), pp.
94-102, 2000. [49] R. Shekhar and V. Zagrodsky, "Mutual information-based rigid and nonrigid registration of ultrasound volumes," IEEE Transactions on Medical Imaging, vol. 21(1), pp. 9-22, 2002. [50] C. Davatzikos, "Nonlinear registration of brain images using deformable models," in Proceedings of the Workshop on Mathematical Methods in Biomedical Image Analysis, 1996, pp. 94-103. [51] M. Bro-Nielsen and C. Gramkow, "Fast Fluid Registration of Medical Images," in Proceedings of the 4th International Conference on Visualization in Biomedical Computing: Springer-Verlag, 1996. [52] A. Hagemann, K. Rohr, H. S. Stiehl, U. Spetzger, and J. M. Gilsbach, "Biomechanical modeling of the human head for physically based, nonrigid image registration," IEEE Transactions on Medical Imaging, vol. 18(10), pp. 875-884, 1999. [53] S. K. Kyriacou, C. Davatzikos, S. J. Zinreich, and R. N. Bryan, "Nonlinear elastic registration of brain images with tumor pathology using a biomechanical model," IEEE Transactions on Medical Imaging, vol. 18(7), p. 580, 1999. [54] J. Ashburner and K. Friston, "Nonlinear spatial normalization using basis functions," Human Brain Mapping, vol. 7(4), p. 254, 1999. [55] B. Kim, J. Boes, K. A. Frey, and C. R. Meyer, "Mutual Information for Automated Multimodal Image Warping," in Proceedings of the 4th International Conference on Visualization in Biomedical Computing: Springer-Verlag, 1996. [56] K. Rohr, H. S. Stiehl, R. Sprengel, W. Beil, T. Buzug, J. Weese, and M. H. Kuhn, "Point-Based Elastic Registration of Medical Image Data Using Approximating Thin-Plate Splines," in Proceedings of the 4th International Conference on Visualization in Biomedical Computing: Springer-Verlag, 1996. [57] T. Rohlfing and C. R. Maurer, Jr., "Nonrigid image registration in shared-memory multiprocessor environments with application to brains, breasts, and bees," IEEE Transactions on Information Technology in Biomedicine, vol. 7(1), pp. 16-25, 2003. [58] D. Rueckert, L. I.
Sonoda, C. Hayes, D. L. G. Hill, M. O. Leach, and D. J. Hawkes, "Nonrigid registration using free-form deformations: application to breast MR images," IEEE Transactions on Medical Imaging, vol. 18(8), pp. 712-721, 1999. [59] J. F. Krucker, G. L. LeCarpentier, J. B. Fowlkes, and P. L. Carson, "Rapid elastic image registration for 3-D ultrasound," IEEE Transactions on Medical Imaging, vol. 21(11), pp. 1384-1394, 2002. [60] V. Walimbe and R. Shekhar, "Automatic elastic image registration by interpolation of 3D rotations and translations from discrete rigid-body transformations," Medical Image Analysis, vol. 10(6), p. 899, 2006. [61] A. Collignon, F. Maes, D. Delaere, D. Vandermeulen, P. Suetens, and G. Marchal, "Automated multi-modality image registration based on information theory," in Information Processing in Medical Imaging, 1995, pp. 263-274. [62] W. M. Wells, P. Viola, H. Atsumi, S. Nakajima, and R. Kikinis, "Multi-modal volume registration by maximization of mutual information," Medical Image Analysis, vol. 1(1), pp. 35-51, 1996. [63] M. Capek, L. Mroz, and R. Wegenkittl, "Robust and fast medical registration of 3D-multi-modality data sets," Mediterranean Conference on Medical and Biological Engineering and Computing, p. 515, 2001. [64] C. Studholme, D. L. G. Hill, and D. J. Hawkes, "An overlap invariant entropy measure of 3D medical image alignment," Pattern Recognition, vol. 32(1), pp. 71-86, 1999. [65] W. H. Press, Numerical recipes in C++: the art of scientific computing, 2nd ed. Cambridge, UK; New York: Cambridge University Press, 2002. [66] R. Shekhar, V. Walimbe, S. Raja, V. Zagrodsky, M. Kanvinde, G. Y. Wu, and B. Bybel, "Automated 3-dimensional elastic registration of whole-body PET and CT from separate or combined scanners," Journal of Nuclear Medicine, vol. 46(9), pp. 1488-1496, Sep 2005. [67] V. Walimbe, V. Zagrodsky, S. Raja, B. Bybel, M. Kanvinde, and R.
Shekhar, "Elastic registration of three-dimensional whole body CT and PET images by quaternion-based interpolation of multiple piecewise linear rigid-body registrations," in Proceedings of SPIE Medical Imaging, 2004, pp. 119-128. [68] S. Clippe, D. Sarrut, C. Malet, S. Miguet, C. Ginestet, and C. Carrie, "Patient setup error measurement using 3D intensity-based image registration techniques," International Journal of Radiation Oncology, Biology, and Physics, vol. 56(1), pp. 259-265, 2003. [69] F. Maes, A. Collignon, D. Vandermeulen, G. Marchal, and P. Suetens, "Multimodality image registration by maximization of mutual information," IEEE Transactions on Medical Imaging, vol. 16(2), pp. 187-198, 1997. [70] Z. Hongying, Z. Xiaozhou, S. Jizhou, and J. Zhang, "A novel medical image registration method based on mutual information and genetic algorithm," in International Conference on Computer Graphics, Imaging and Vision: New Trends, 2005, pp. 221-226. [71] J. M. Rouet, J. J. Jacq, and C. Roux, "Genetic algorithms for a robust 3-D MR-CT registration," IEEE Transactions on Information Technology in Biomedicine, vol. 4(2), p. 126, 2000. [72] C. Nikou, F. Heitz, J. Armspach, I. Namer, and D. Grucker, "Registration of MR/MR and MR/SPECT Brain Images by Fast Stochastic Optimization of Robust Voxel Similarity Measures," NeuroImage, vol. 8(1), pp. 30-43, 1998. [73] N. Ritter, R. Owens, J. Cooper, R. H. Eikelboom, and P. P. Van Saarloos, "Registration of stereo and temporal images of the retina," IEEE Transactions on Medical Imaging, vol. 18(5), pp. 404-418, 1999. [74] A. Carrillo, J. L. Duerk, J. S. Lewin, and D. L. Wilson, "Semiautomatic 3-D image registration as applied to interventional MRI liver cancer treatment," IEEE Transactions on Medical Imaging, vol. 19(3), pp. 175-185, 2000. [75] M. Holden, J. A. Schnabel, and D. L. G. Hill, "Quantifying Small Changes in Brain Ventricular Volume Using Non-rigid Registration," in Proceedings of the 4th International Conference on Medical Image Computing and Computer-Assisted Intervention: Springer-Verlag, 2001. [76] K. Z. Abd-Elmoniem, A. M. Youssef, and Y. M. Kadah, "Real-time speckle reduction and coherence enhancement in ultrasound imaging via nonlinear anisotropic diffusion," IEEE Transactions on Biomedical Engineering, vol. 49(9), pp. 997-1014, 2002. [77] O. Dandekar, K. Siddiqui, V. Walimbe, and R. Shekhar, "Image registration accuracy with low-dose CT: How low can we go?," in IEEE International Symposium on Biomedical Imaging, 2006, pp. 502-505. [78] P. Lei, O. Dandekar, D. Widlus, P. Malloy, and R. Shekhar, "Incorporation of PET into CT-Guided Liver Radiofrequency Ablation," Radiology (under revision), 2008. [79] M. Kachelriess, O. Watzke, and W. A. Kalender, "Generalized multi-dimensional adaptive filtering for conventional and spiral single-slice, multi-slice, and cone-beam CT," Medical Physics, vol. 28(4), pp. 475-490, 2001. [80] L. Keselbrener, Y. Shimoni, and S. Akselrod, "Nonlinear filters applied on computerized axial tomography: Theory and phantom images," Medical Physics, vol. 19(4), pp. 1057-1064, 1992. [81] P. Perona and J. Malik, "Scale-space and edge detection using anisotropic diffusion," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12(7), pp. 629-639, July 1990. [82] R. T. Whitaker and S. M. Pizer, "A multi-scale approach to nonuniform diffusion," CVGIP: Image Understanding, vol. 57(1), pp. 99-110, 1993. [83] A. Dorati, C. Lamberti, A. Sarti, P. Baraldi, and R. Pini, "Pre-processing for 3D echocardiography," Computers in Cardiology, pp. 565-568, 1995. [84] M. A. Cantin, Y. Savaria, and P. Lavoie, "A comparison of automatic word length optimization procedures," in IEEE International Symposium on Circuits and Systems, 2002, pp. 612-615. [85] T. J. Todman, G. A. Constantinides, S. J. E.
Wilton, O. Mencer, W. Luk, and P. Y. K. Cheung, "Reconfigurable computing: architectures and design methods," IEE Proceedings - Computers and Digital Techniques, vol. 152(2), pp. 193-207, 2005. [86] G. A. Constantinides, P. Y. K. Cheung, and W. Luk, "Wordlength optimization for linear digital signal processing," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 22(10), pp. 1432-1442, 2003. [87] A. Nayak, M. Haldar, A. Choudhary, and P. Banerjee, "Precision and error analysis of MATLAB applications during automated hardware synthesis for FPGAs," in Proceedings of Design, Automation and Test in Europe, 2001, pp. 722-728. [88] A. V. Oppenheim, R. W. Schafer, and J. R. Buck, Discrete-time signal processing, 2nd ed. Upper Saddle River, N.J.: Prentice Hall, 1999. [89] M. Stephenson, J. Babb, and S. Amarasinghe, "Bitwidth analysis with application to silicon compilation," ACM SIGPLAN Conference on Programming Language Design and Implementation, vol. 35(5), pp. 108-120, 2000. [90] S. A. Wadekar and A. C. Parker, "Accuracy sensitive word-length selection for algorithm optimization," in Proceedings of International Conference on Computer Design: VLSI in Computers and Processors, ICCD '98, 1998, pp. 54-61. [91] K. Han and B. Evans, "Optimum wordlength search using sensitivity information," EURASIP Journal on Applied Signal Processing, vol. 2006, Article ID 92849, pp. 1-14, 2006. [92] W. Sung and K. Kum, "Simulation-based word-length optimization method for fixed-point digital signal processing systems," IEEE Transactions on Signal Processing, vol. 43(12), pp. 3087-3090, 1995. [93] H. Choi and W. P. Burleson, "Search-based wordlength optimization for VLSI/DSP synthesis," in Proceedings of IEEE Workshop on VLSI Signal Processing, VII, 1994, pp. 198-207. [94] M. Leban and J. F. Tasic, "Word-length optimization of LMS adaptive FIR filters," in 10th Mediterranean Electrotechnical Conference, 2000, pp. 774-777. [95] T. Givargis, F.
Vahid, and J. Henkel, "System-level exploration for Pareto-optimal configurations in parameterized system-on-a-chip," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 10(4), pp. 416-422, 2002. [96] K. Kum, J. Kang, and W. Sung, "AUTOSCALER for C: an optimizing floating-point to integer C program converter for fixed-point digital signal processors," IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 47(9), pp. 840-848, 2000. [97] P. M. Dew, R. A. Earnshaw, and T. R. Heywood, Parallel processing for computer vision and display. Wokingham, England; Reading, Mass.: Addison-Wesley, 1989. [98] V. K. Prasanna Kumar, Parallel architectures and algorithms for image understanding. Boston: Academic Press, 1991. [99] A. Bruhn, T. Jakob, M. Fischer, T. Kohlberger, J. Weickert, U. Bruning, and C. Schnorr, "Designing 3-D nonlinear diffusion filters for high performance cluster computing," Proc. of the 24th DAGM Symposium on Pattern Recognition, vol. 2449, pp. 290-297, 2002. [100] A. Bruhn, T. Jakob, M. Fischer, T. Kohlberger, J. Weickert, U. Bruning, and C. Schnorr, "High performance cluster computing with 3-D nonlinear diffusion filters," Real-Time Imaging, vol. 10(1), pp. 41-51, 2004. [101] S. Tabik, E. M. Garzon, I. Garcia, and J. J. Fernandez, "Evaluation of parallel paradigms on Anisotropic Nonlinear Diffusion," in 12th International Euro-Par Conference, 2006, pp. 1159-1168. [102] M. Karaman and L. Onural, "New radix-2-based algorithm for fast median filtering," Electronics Letters, vol. 25(11), pp. 723-724, 1989. [103] K. Oflazer, "Design and implementation of a single-chip 1-D median filter," IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 31(5), pp. 1164-1168, 1983. [104] E. Ataman and E. Alparslan, "Applications of median filtering algorithm to images," Electronics Division, Marmara Research Institute, Gebze, Turkey, 1978. [105] J. P. Fitch, E. J. Coyle, and N. C. J.
Gallagher, "Median filtering by threshold decomposition," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32(6), pp. 1183-1188, 1984. [106] C. L. Lee and C. W. Jen, "Bit-sliced median filter design based on majority gate," IEE Proceedings, Part G: Circuits, Devices and Systems, vol. 139(1), pp. 63-71, 1992. [107] I. Hatirnaz, F. K. Gurkaynak, and Y. Leblebici, "A compact modular architecture for the realization of high-speed binary sorting engines based on rank ordering," Proc. of the IEEE International Symposium on Circuits and Systems, ISCAS, vol. 4, pp. 685-688, 2000. [108] A. A. Hiasat, M. M. Al-Ibrahim, and K. M. Gharaibeh, "Design and implementation of a new efficient median filtering algorithm," IEE Proceedings on Vision, Image and Signal Processing, vol. 146(5), pp. 273-278, 1999. [109] B. K. Kar, K. M. Yusuf, and D. K. Pradhan, "Bit-serial generalized median filters," Proc. of the IEEE International Symposium on Circuits and Systems, ISCAS, vol. 3, pp. 85-88, 1994. [110] C. Chakrabarti, "High sample rate array architectures for median filters," IEEE Transactions on Signal Processing, vol. 42(3), pp. 707-712, 1994. [111] L. W. Chang and J. H. Lin, "A bit-level systolic array for median filter," IEEE Transactions on Signal Processing, vol. 40(8), pp. 2079-2083, 1992. [112] K. Chen, "An integrated bit-serial 9-point median chip," Proc. of the European Conference on Circuit Theory and Design, pp. 339-343, 1989. [113] R. Roncella, R. Saletti, and P. Terreni, "70-MHz 2-um CMOS bit-level systolic array median filter," IEEE Journal of Solid-State Circuits, vol. 28(5), pp. 530-536, 1993. [114] W. Plishker, O. Dandekar, S. Bhattacharyya, and R. Shekhar, "A taxonomy for medical image registration acceleration techniques," in IEEE Life Science Systems and Applications Workshop, 2007, pp. 160-163. [115] S. Ourselin, R. Stefanescu, and X.
Pennec, "Robust registration of multi-modal images: Towards real-time clinical applications," in Medical Image Computing and Computer-Assisted Intervention – MICCAI, 2002, pp. 140-147. [116] R. Stefanescu, X. Pennec, and N. Ayache, "Parallel non-rigid registration on a cluster of workstations," Proceedings of HealthGrid, 2003. [117] F. Ino, K. Ooyama, and K. Hagihara, "A data distributed parallel algorithm for nonrigid image registration," Parallel Computing, vol. 31(1), p. 19, 2005. [118] S. K. Warfield, F. Talos, A. Tei, A. Bharatha, A. Nabavi, M. Ferrant, P. M. Black, F. Jolesz, and R. Kikinis, "Real-time registration of volumetric brain MRI by biomechanical simulation of deformation during image guided neurosurgery," Computing and Visualization in Science, vol. 5(1), pp. 3-11, 2002. [119] R. Strzodka, M. Droske, and M. Rumpf, "Fast Image Registration in DX9 Graphics Hardware," Journal of Medical Informatics and Technologies, vol. 6(1), pp. 43-49, 2003. [120] A. Köhn, J. Drexl, F. Ritter, M. König, and H. Peitgen, "GPU Accelerated Image Registration in Two and Three Dimensions," in Bildverarbeitung für die Medizin, 2006, pp. 261-265. [121] W. Plishker, O. Dandekar, S. Bhattacharyya, and R. Shekhar, "Towards a heterogeneous medical image registration acceleration platform," Proceedings of the IEEE Biomedical Circuits and Systems Conference, pp. 231-234, 2007. [122] C. Vetter, C. Guetter, C. Xu, and R. Westermann, "Non-rigid multi-modal registration on the GPU," in SPIE Medical Imaging 2007: Image Processing, 2007, p. 651228. [123] M. Ohara, H. Yeo, F. Savino, G. Iyengar, L. Gong, H. Inoue, H. Komatsu, V. Sheinin, and S. Daijavad, "Accelerating mutual-information-based linear registration on the cell broadband engine processor," in IEEE International Conference on Multimedia and Expo, 2007, pp. 272-275. [124] J. Jiang, W. Luk, and D.
Rueckert, "FPGA-based computation of free-form deformations in medical image registration," in Proceedings of IEEE International Conference on Field-Programmable Technology (FPT), 2003, pp. 234-241. [125] H. Keding, M. Willems, M. Coors, and H. Meyr, "FRIDGE: a fixed-point design and simulation environment," in Proceedings of Design, Automation and Test in Europe, 1998, pp. 429-435. [126] G. A. Constantinides, P. Y. K. Cheung, and W. Luk, "Optimum and heuristic synthesis of multiple word-length architectures," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 13(1), pp. 39-57, 2005. [127] S. Kim, K. Kum, and W. Sung, "Fixed-point optimization utility for C and C++ based digital signal processing programs," IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 45(11), pp. 1455-1464, 1998. [128] K. Kum and W. Sung, "Combined word-length optimization and high-level synthesis of digital signal processing systems," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 20(8), pp. 921-930, 2001. [129] I. Das and J. E. Dennis, "A closer look at drawbacks of minimizing weighted sums of objectives for Pareto set generation in multicriteria optimization problems," Structural and Multidisciplinary Optimization, vol. 14(1), pp. 63-69, 1997. [130] M. S. Bright and T. Arslan, "Synthesis of low-power DSP systems using a genetic algorithm," IEEE Transactions on Evolutionary Computation, vol. 5(1), pp. 27-40, 2001. [131] C. L. Valenzuela and P. Y. Wang, "VLSI placement and area optimization using a genetic algorithm to breed normalized postfix expressions," IEEE Transactions on Evolutionary Computation, vol. 6(4), pp. 390-401, 2002. [132] G. Antoch, J. F. Debatin, J. Stattaus, H. Kuehl, and F. M. Vogt, "Value of CT volume imaging for optimal placement of radiofrequency ablation probes in liver lesions," Journal of Vascular and Interventional Radiology, vol. 13(11), p. 1155, 2002.
[133] D. E. Dupuy and S. N. Goldberg, "Image-guided radiofrequency tumor ablation: challenges and opportunities--part II," Journal of Vascular and Interventional Radiology, vol. 12(10), pp. 1135-1148, 2001. [134] S. N. Goldberg and D. E. Dupuy, "Image-guided radiofrequency tumor ablation: challenges and opportunities--part I," Journal of Vascular and Interventional Radiology, vol. 12(9), pp. 1021-1032, 2001. [135] J. R. Haaga, "Interventional CT: 30 years' experience," European Radiology, vol. 15, p. D116, 2005. [136] I. Viola, A. Kanitsar, and M. E. Groller, "Hardware-based nonlinear filtering and segmentation using high-level shading languages," IEEE Visualization, p. 309, 2003. [137] M. Jiang and D. Crookes, "High-performance 3D median filter architecture for medical image despeckling," Electronics Letters, vol. 42(24), p. 1379, 2006. [138] G. Gerig, O. Kubler, R. Kikinis, and F. A. Jolesz, "Nonlinear Anisotropic Filtering of MRI Data," IEEE Transactions on Medical Imaging, vol. 11(2), pp. 221-232, 1992. [139] C. R. Castro-Pareja, J. M. Jagadeesh, and R. Shekhar, "FAIR: A Hardware Architecture for Real-Time 3-D Image Registration," IEEE Transactions on Information Technology in Biomedicine, vol. 7(4), pp. 426-434, 2003. [140] M. Doggett and M. Meissner, "A memory addressing and access design for real time volume rendering," IEEE International Symposium on Circuits and Systems, ISCAS, vol. 4, pp. 344-347, 1999. [141] H. Pfister, "Architectures for real-time volume rendering," Future Generation Computer Systems, vol. 15(1), pp. 1-9, Feb. 1999. [142] H. Pfister and A. Kaufman, "Cube-4: A scalable architecture for real-time volume rendering," Proc. of the 1996 symposium on volume visualization, pp. 47-54, 1996. [143] C. R. Castro-Pareja, O. S. Dandekar, and R. Shekhar, "FPGA-based real-time anisotropic diffusion filtering of 3D ultrasound images," in Real-Time Imaging, 2005, pp. 123-131. [144] O. Dandekar, C. Castro-Pareja, and R.
Shekhar, "FPGA-based real-time 3D image preprocessing for image-guided medical interventions," Journal of Real-Time Image Processing, vol. 1(4), pp. 285-301, 2007. [145] X. Li and T. Chen, "Nonlinear diffusion with multiple edginess thresholds," Pattern Recognition, vol. 27(8), pp. 1029-1037, 1994. [146] F. J. Gallegos-Funes and V. I. Ponomaryov, "Real-time image filtering scheme based on robust estimators in presence of impulsive noise," Real-Time Imaging, vol. 10(2), p. 69, 2004. [147] D. Mattes, D. R. Haynor, H. Vesselle, T. K. Lewellen, and W. Eubank, "PET-CT image registration in the chest using free-form deformations," IEEE Transactions on Medical Imaging, vol. 22(1), pp. 120-128, 2003. [148] P. Lei, O. Dandekar, F. Mahmoud, D. Widlus, P. Malloy, and R. Shekhar, "PET guidance for liver radiofrequency ablation: an evaluation," SPIE Medical Imaging 2007: Visualization and Image-Guided Procedures, p. 650918, 2007. [149] R. Shekhar, O. Dandekar, S. Kavic, I. George, R. Mezrich, and A. Park, "Development of continuous CT-guided minimally invasive surgery," in SPIE Medical Imaging 2007: Visualization and Image-Guided Procedures, 2007, pp. 65090D-8. [150] V. Walimbe, O. Dandekar, F. Mahmoud, and R. Shekhar, "Automated 3D elastic registration for improving tumor localization in whole-body PET-CT from combined scanner," in IEEE Annual International Conference on Engineering in Medicine and Biology, 2006, pp. 2799-2802. [151] J. Wu, O. Dandekar, V. Walimbe, W. D'Souza, and R. Shekhar, "Automatic prostate localization using elastic registration of planning CT and daily 3D ultrasound images," in SPIE Medical Imaging 2007: Visualization and Image-Guided Procedures, 2007, p. 650913. [152] H. Hassler and N. Takagi, "Function evaluation by table look-up and addition," in IEEE Symposium on Computer Arithmetic, IEEE, 1995, p. 10. [153] D. M. Mandelbaum and S. G.
Mandelbaum, "A fast, efficient parallel-acting method of generating functions defined by power series, including logarithm, exponential, and sine, cosine," IEEE Transactions on Parallel and Distributed Systems, vol. 7(1), p. 33, 1996. [154] S. L. SanGregory, C. Brothers, D. Gallagher, and R. Siferd, "A fast, low-power logarithm approximation with CMOS VLSI implementation," in IEEE 42nd Midwest Symposium on Circuits and Systems, 2000, p. 91. [155] C. R. Castro-Pareja and R. Shekhar, "Hardware acceleration of mutual information-based 3D image registration," Journal of Imaging Science and Technology, vol. 49(2), pp. 105-113, 2005. [156] C. R. Castro-Pareja and R. Shekhar, "Adaptive reduction of intensity levels in 3D images for mutual information-based registration," in SPIE Medical Imaging 2005: Image Processing, San Diego, CA, USA, 2005, p. 1201. [157] O. Dandekar and R. Shekhar, "FPGA-Accelerated Deformable Image Registration for Improved Target-Delineation During CT-Guided Interventions," IEEE Transactions on Biomedical Circuits and Systems, vol. 1(2), pp. 116-127, 2007. [158] G. A. Constantinides, P. Y. K. Cheung, and W. Luk, "The Multiple Wordlength Paradigm," in The 9th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, FCCM '01, 2001, pp. 51-60. [159] G. A. Constantinides and G. J. Woeginger, "The complexity of multiple wordlength assignment," Applied Mathematics Letters, vol. 15(2), pp. 137-140, 2002. [160] E. Zitzler, M. Laumanns, and L. Thiele, "SPEA2: Improving the Strength Pareto Evolutionary Algorithm for Multiobjective Optimization," in EUROGEN 2001 - Evolutionary Methods for Design, Optimisation and Control with Applications to Industrial Problems, 2001, pp. 95-100. [161] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, "A fast and elitist multiobjective genetic algorithm: NSGA-II," IEEE Transactions on Evolutionary Computation, vol. 6(2), pp. 182-197, 2002.
[162] Altera Corp., "Altera IP Megacore Library," http://www.altera.com/literature/lit-ip.jsp. [163] W. Luk, S. Guo, N. Shirazi, and N. Zhuang, "A framework for developing parametrised FPGA libraries," in Field-Programmable Logic: Smart Applications, New Paradigms and Compilers, 1996, pp. 24-33. [164] W. Luk and S. McKeever, "Pebble: A Language For Parametrised and Reconfigurable Hardware Design," in Field-Programmable Logic and Applications: From FPGAs to Computing Paradigm, 1998, p. 9. [165] Xilinx Inc., "Xilinx Core Generator," http://www.xilinx.com/ise/products/coregen_overview.pdf. [166] J. Zhao, W. Chen, and S. Wei, "Parameterized IP core design," in Proceedings of 4th International Conference on ASIC, 2001, pp. 744-747. [167] T. Back, U. Hammel, and H. P. Schwefel, "Evolutionary computation: comments on the history and current state," IEEE Transactions on Evolutionary Computation, vol. 1(1), pp. 3-17, 1997. [168] V. Kianzad and S. S. Bhattacharyya, "Efficient techniques for clustering and scheduling onto embedded multiprocessors," IEEE Transactions on Parallel and Distributed Systems, vol. 17(7), pp. 667-680, 2006. [169] G. A. Constantinides, P. Y. K. Cheung, and W. Luk, "Optimum wordlength allocation," in Proceedings of 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 2002, pp. 219-228. [170] E. Zitzler and L. Thiele, "Multiobjective evolutionary algorithms: a comparative case study and the strength Pareto approach," IEEE Transactions on Evolutionary Computation, vol. 3(4), pp. 257-271, 1999. [171] A. J. Bilchik, D. M. Rose, D. P. Allegra, P. J. Bostick, E. Hsueh, and D. L. Morton, "Radiofrequency ablation: A minimally invasive technique with multiple applications," Cancer Journal from Scientific American, vol. 5(6), p. 356, 1999. [172] J. H. Morgan, G. M. Royer, P. Hackett, T. C. Gamblin, B. L. McCampbell, A. Conforti, and P. S. Dale, "Radio-frequency ablation of large, nonresectable hepatic tumors," The American Surgeon, vol.
70(12), p. 1035, 2004. [173] D. M. Rose, D. P. Allegra, P. J. Bostick, L. J. Foshag, and A. J. Bilchik, "Radiofrequency ablation: A novel primary and adjunctive ablative technique for hepatic malignancies," American Surgeon, vol. 65(11), p. 1009, 1999. [174] Toshiba, "The Next Revolution: 256-Slice CT," http://madeforlife.toshiba.com/ct256/TMS082_CT_WP_256.pdf. [175] R. Bellotti, F. De Carlo, G. Gargano, S. Tangaro, D. Cascio, E. Catanzariti, P. Cerello, S. C. Cheran, P. Delogu, I. De Mitri, C. Fulcheri, D. Grosso, A. Retico, S. Squarcia, E. Tommasi, and B. Golosio, "A CAD system for nodule detection in low-dose lung CTs based on region growing and a new active contour model," Medical Physics, vol. 34(12), pp. 4901-4910, 2007. [176] D. S. Gierada, T. K. Pilgram, M. Ford, R. M. Fagerstrom, T. R. Church, H. Nath, K. Garg, and D. C. Strollo, "Lung Cancer: Interobserver Agreement on Interpretation of Pulmonary Findings at Low-Dose CT Screening," Radiology, vol. 246(1), pp. 265-272, 2007. [177] D. Tack, P. Bohy, I. Perlot, V. De Maertelaer, O. Alkeilani, S. Sourtzis, and P. A. Gevenois, "Suspected Acute Colon Diverticulitis: Imaging with Low-Dose Unenhanced Multi-Detector Row CT," Radiology, vol. 237(1), pp. 189-196, 2005. [178] D. Sennst, M. Kachelriess, C. Leidecker, B. Schmidt, O. Watzke, and W. A. Kalender, "An Extensible Software-based Platform for Reconstruction and Evaluation of CT Images," Radiographics, vol. 24(2), pp. 601-613, 2004. [179] R. Franke, "Smooth interpolation of scattered data by local thin plate splines," Computers & Mathematics with Applications, vol. 8(4), p. 273, 1982. [180] D. G. Gobbi and T. M. Peters, "Generalized 3D nonlinear transformations for medical imaging: an object-oriented implementation in VTK," Computerized Medical Imaging and Graphics, vol. 27(4), pp. 255-265, 2003. [181] C. R. Meyer, J. L. Boes, B. Kim, P. H. Bland, K. R. Zasadny, P. V. Kison, K. Koral, K. A. Frey, and R. L.
Wahl, "Demonstration of accuracy and clinical versatility of mutual information for automatic multimodality image fusion using affine and thin-plate spline warped geometric deformations," Medical Image Analysis, vol. 1(3), pp. 195-206, 1997. [182] S. Balocco, O. Basset, C. Cachard, and P. Delachartre, "Spatial anisotropic diffusion and local time correlation applied to segmentation of vessels in ultrasound image sequences," Proc. of the IEEE Symposium on Ultrasonics, vol. 2, pp. 1549-1552, 2003. [183] F. Xu, Y. Yu, S. T. Acton, and J. A. Hossack, "Detection of myocardial boundaries from ultrasound imagery using active contours," Proc. of the IEEE Symposium on Ultrasonics, vol. 2, pp. 1537-1540, 2003. [184] R. Shekhar, O. Dandekar, S. Kavic, I. George, R. Mezrich, and A. Park, "Development of continuous CT-guided minimally invasive surgery," Multimedia Meets Virtual Reality (MMVR), 2007. [185] C. D. Knox, C. D. Anderson, L. W. Lamps, R. B. Adkins, and C. W. Pinson, "Long-term survival after resection for primary hepatic carcinoid tumor," Annals of Surgical Oncology, vol. 10(10), pp. 1171-1175, Dec 2003. [186] P. Veit, G. Antoch, H. Stergar, A. Bockisch, M. Forsting, and H. Kuehl, "Detection of residual tumor after radiofrequency ablation of liver metastasis with dual-modality PET/CT: initial results," European Radiology, vol. 16(1), pp. 80-87, Jan 2006. [187] A. R. Gillams, "Liver ablation therapy," British Journal of Radiology, vol. 77(921), pp. 713-723, Sep 2004. [188] J. P. McGahan and G. D. Dodd, 3rd, "Radiofrequency ablation of the liver: current status," American Journal of Roentgenology, vol. 176(1), pp. 3-16, 2001. [189] M. S. Chen, J. Q. Li, Y. Zheng, R. P. Guo, H. H. Liang, Y. Q. Zhang, X. J. Lin, and W. Y. Lau, "A prospective randomized trial comparing percutaneous local ablative therapy and partial hepatectomy for small hepatocellular carcinoma," Annals of Surgery, vol. 243(3), pp. 321-328, 2006. [190] L. Solbiati, T. Livraghi, S. N. Goldberg, T.
Ierace, F. Meloni, M. Dellanoce, L. Cova, E. F. Halpern, and G. S. Gazelle, "Percutaneous radio-frequency ablation of hepatic metastases from colorectal cancer: long-term results in 117 patients," Radiology, vol. 221(1), pp. 159-166, 2001. [191] C. D. Roman, W. H. Martin, and D. Delbeke, "Incremental value of fusion imaging with integrated PET-CT in oncology," Clinical Nuclear Medicine, vol. 30(7), pp. 470-477, 2005. [192] K. K. Ng, C. M. Lam, R. T. Poon, V. Ai, W. K. Tso, and S. T. Fan, "Thermal ablative therapy for malignant liver tumors: a critical appraisal," Journal of Gastroenterology and Hepatology, vol. 18(6), pp. 616-629, 2003. [193] H. Higgins and D. L. Berger, "RFA for liver tumors: Does it really work?," Oncologist, vol. 11(7), pp. 801-808, 2006. [194] S. B. Solomon, "Incorporating CT, MR imaging, and positron emission tomography into minimally invasive therapies," Journal of Vascular and Interventional Radiology, vol. 16(4), pp. 445-447, 2005. [195] S. Venugopal, C. R. Castro-Pareja, and O. S. Dandekar, "An FPGA-based 3D image processor with median and convolution filters for real-time applications," in Real-Time Imaging, 2005, pp. 174-182.