Computed Tomography (CT) is an X-ray imaging modality that generates 3D volumetric images from a set of 2D X-ray images, or projections. This unique capability has led to its adoption in many fields including medicine, security, non-destructive testing, and metrology.
CT scanners acquire a large set of 2D projections by moving an X-ray source and a detector around an object on a circular trajectory. The 2D projections are then reconstructed into a 3D volume using reconstruction algorithms. The datasets generated are inherently very large, and the reconstructions are computationally challenging. Moreover, in recent years more complex reconstruction algorithms have been introduced, increasing the demand for computational speed significantly. Furthermore, it is desirable that reconstructions be performed in or near real time.
To meet this computational challenge, Graphics Processing Units (GPUs) have been employed. GPUs are highly parallel processors that use thousands of cores, exceeding the computational power of CPUs by more than an order of magnitude. Programming GPUs is not trivial: it requires a deep understanding of the internal architecture of these devices, and it requires parallelizing algorithms that are typically derived in a serial fashion.
The quest for the fastest reconstruction is ongoing. In fact, the University of Erlangen in Germany hosts a benchmarking website for the fastest and most accurate GPU implementations of a back projection algorithm (https://www5.cs.fau.de/research/projects/rabbitct/). We could not resist participating and currently hold a record time!
How does CT reconstruction work?
The goal of CT reconstruction is to produce a 3D volume from a set of 2D detector images. We reconstruct into a 3D volume located between the detector and the source focal spot. To describe our 3D volume we divide it up into small cubes (voxels); we typically use 512 x 512 x 512 voxels.
To understand the reconstruction process it is important to understand the imaging process. The detected image depends on the X-ray absorption properties of the object. The X-ray source emits X-rays that pass through the object and are then captured by a detector that is spatially sensitive in two dimensions and consists of a large number of pixels. For simplicity, we assume here that the detector has 512 x 512 pixels.
To further conceptualize the image formation process, we assume that the X-ray source emits beamlets of photons, each directed at a particular pixel of the detector. These beamlets are commonly called rays. Each ray describes the path of the photons through the object. Along this path, the photons are attenuated according to the X-ray absorption properties of the object. The signal counted in each detector pixel thus reflects the total attenuation along the path of the ray.
The image formation is also referred to as the forward problem or forward projection. The forward projection can be described mathematically. To do so, the object can be abstracted as a voxelized volume, with each voxel containing the X-ray absorption coefficient of the object at that particular location. In simple words, forward projection is the following process:
Draw a ray between the X-ray source focal spot and a detector pixel; walk along the ray, and whenever you intersect a voxel, collect (integrate) the absorption coefficient stored in that voxel; store the final number in the detector pixel. Repeat for all pixels.
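The process above can be sketched in a few lines of code. This is a hypothetical, heavily simplified illustration (a 2D slice, a point source, a 1D detector, nearest-neighbor sampling, and no step-length scaling of the line integral), not our GPU implementation; the function and variable names are our own.

```python
# Minimal forward-projection sketch (illustrative only): for each
# detector pixel, walk along the source-to-pixel ray and accumulate
# the absorption values of the voxels (here: 2D pixels) it crosses.

def forward_project(volume, source, det_pixels, n_steps=200):
    ny, nx = len(volume), len(volume[0])
    sinogram = []
    for px, py in det_pixels:            # one ray per detector pixel
        sx, sy = source
        total = 0.0
        for k in range(n_steps):         # walk along the ray
            t = (k + 0.5) / n_steps
            ix = int(round(sx + t * (px - sx)))
            iy = int(round(sy + t * (py - sy)))
            if 0 <= ix < nx and 0 <= iy < ny:
                total += volume[iy][ix]  # collect absorption coefficient
        sinogram.append(total)           # store result in detector pixel
    return sinogram
```

A real cone-beam implementation works in 3D, uses interpolated sampling, and scales the sum by the step length, but the loop structure is the same.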
The reconstruction is exactly the inverse of the forward projection and therefore is referred to as back projection:
Draw a ray between the focal spot and a detector pixel; walk along the ray, and whenever you intersect a voxel, add the value of the detector pixel to the voxel. Repeat for all pixels.
As mentioned earlier, the detector and the source rotate around the object and produce many images. The back projection process is repeated for all projections, simply adding over and over into the same voxel volume. Once this is completed for all projections, the imaging volume is reconstructed.
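The back projection loop over all projections can be sketched in the same simplified 2D setting as before. Again, this is an illustration under assumed names and geometry, not our implementation; real code weights the update by the sampling geometry and uses interpolation.

```python
# Minimal back-projection sketch (illustrative only): the mirror image
# of forward projection. Instead of reading voxels along each ray, we
# add the measured detector value into every voxel the ray crosses,
# repeating for all views and accumulating into the same volume.

def back_project(views, shape, n_steps=200):
    ny, nx = shape
    volume = [[0.0] * nx for _ in range(ny)]
    # each view: (source position, detector pixel positions, measured values)
    for source, det_pixels, measured in views:
        sx, sy = source
        for (px, py), value in zip(det_pixels, measured):
            for k in range(n_steps):     # walk along the ray
                t = (k + 0.5) / n_steps
                ix = int(round(sx + t * (px - sx)))
                iy = int(round(sy + t * (py - sy)))
                if 0 <= ix < nx and 0 <= iy < ny:
                    volume[iy][ix] += value  # smear pixel value into voxel
    return volume
```

On a GPU, the inner updates are distributed over thousands of threads, which is why memory access patterns dominate the optimization effort.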
As a rule of thumb, to reconstruct a volume of 512 x 512 x 512 voxels, the same number of rays is needed. Each projection from our detector produces 512 x 512 rays, so we need 512 projection images to satisfy this criterion.
We made a few simplifications in the description above, but it captures the core of the reconstruction process. It also illustrates the computational scale. Using 2-byte numbers, storing the imaging volume requires about 256 Mbytes, and the same is true for the set of detector images. Larger datasets, however, can yield higher spatial resolution, and it is not uncommon to use 1024 x 1024 x 1024 voxels reconstructed from 1024 detector images of 1024 x 1024 pixels each. This leads to an 8-times-larger dataset, or 2 Gbytes of memory. We have pushed the envelope even further and increased by another factor of 8 to reconstruct 2048-voxel datasets.
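The sizes quoted above are easy to verify. A quick sketch, assuming 2-byte (16-bit) numbers for voxels and detector pixels and binary megabytes/gigabytes:

```python
# Back-of-the-envelope dataset sizes, assuming 2 bytes per voxel/pixel.

def volume_bytes(n, bytes_per_voxel=2):
    """Bytes needed to store an n x n x n voxel volume."""
    return n ** 3 * bytes_per_voxel

MB = 1024 ** 2
GB = 1024 ** 3

print(volume_bytes(512) / MB)    # 256.0  -> ~256 MB for a 512^3 volume
print(volume_bytes(1024) / GB)   # 2.0    -> 2 GB for 1024^3, 8x larger
print(volume_bytes(2048) / GB)   # 16.0   -> 16 GB for 2048^3, another 8x

# Rule-of-thumb ray count: 512 projections of 512 x 512 pixels give
# exactly one ray per voxel of a 512^3 volume.
assert 512 * 512 * 512 == 512 ** 3
```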
The problem: How can we speed up the CT reconstruction?
The solution: Efficiently utilize the parallel computing capabilities of Graphics Processing Units (GPUs) to perform the back projection algorithm. This results in a dramatic increase in performance without a loss of accuracy.
GPUs versus CPUs:
The advantages of GPUs over CPUs can be summarized as follows:
- Significantly superior 32-bit floating point computation
  - ~7,000 giga floating point operations per second (GFLOPS) for an Nvidia Titan GPU vs. ~300 GFLOPS for an Intel CPU
- Significantly superior memory bandwidth
  - 336 GB/s for an Nvidia Titan GPU vs. ~50 GB/s for an Intel CPU
- Increased 32-bit floating point accuracy
  - GPUs use FMA (fused multiply-add) operations for increased accuracy
- High reliability due to extensive testing
  - PC gaming
  - DOE supercomputing clusters use Nvidia GPUs
- Relatively low cost and easy replacement
- Robust software ecosystem for Nvidia's CUDA language, with a growing number of open-source projects and libraries
Therefore, the implementation of a highly parallelized problem such as CT reconstruction lends itself to the use of GPUs.
Triple Ring Technologies' approach to algorithm design and implementation:
As mentioned earlier, implementing algorithms on GPUs is highly specialized work. Here we summarize our overall design rules for tackling complex algorithm implementations:
- Maximize FLOPS (floating point operations per second) for 32-bit floating point computation without loss of accuracy
  - use built-in 32-bit FMA operations when possible
- Maximize global memory efficiency
  - enforce coalesced memory reads and updates
  - pre-process input projection data into the optimal format for the bi-linear interpolation load pattern
- Determine the optimal work performed by each distinct GPU device
- Determine the optimal work done per worker thread in the launch grid
  - enforce a balance between computation, memory reads, and memory updates for each thread
- Coordinate memory operations between CPU and GPUs for optimal performance
  - use overlapping memory transfers during concurrent computation
Moreover, to accomplish this we have developed proprietary technology that delivers higher accuracy and higher speed than other implementations.
To demonstrate the capabilities of our 2048^3-voxel reconstruction algorithm, we generated and reconstructed a projection dataset of a pocket watch. The images above show a cross section through the resulting reconstruction volume. This work was presented at this year's Silicon Valley GPU Technology Conference, where it was selected as a Top 20 poster.
In summary, we have developed proprietary reconstruction algorithms ranging from 256 x 256 x 256 voxels to 2048 x 2048 x 2048 voxels, with excellent runtimes and accuracy, as shown in the table below.
| Output volume size (32-bit) | Back projection running time | GUPS (giga-updates per second) |
| --- | --- | --- |
Our algorithm is top-ranked on the RabbitCT benchmarking website and outperforms all existing algorithms in terms of accuracy.
We have developed an ultrafast reconstruction toolkit for CT that can be used for improved clinical workflows or faster CT-based security screening at ports, airports, and military installations. Our toolkit can be further augmented by custom architectures using a multi-GPU approach.
Fast processing times also enable implementation of advanced algorithms such as sophisticated artifact reduction routines for improved image quality and iterative reconstruction for X-ray dose reduction.
Oleg Konings, Brian Wilfley, Tobias Funk, Daniel Badali, Scott Hsieh, Triple Ring Technologies Applied Science Group