Full Text

Turn on search term navigation

1. Introduction

Ultrasound strain elastography (USE) [1] can provide new information than that contained in B-Mode ultrasound images, which display only the amplitudes of envelope detected and decimated echo signals. USE has been successfully used for breast lesion differentiation [2,3,4] because it is capable of visualizing elevated tissue hardness. The strain/modulus nonlinearity has been also explored by others in order to better characterize breast lesions [5,6,7]. Furthermore, a number of studies in the literature has been devoted to understanding signal quality [8], image resolution [9,10] and contrast [11,12] in USE. In this study, we solely focus on ultrasound-based motion tracking, because it plays a critically important role in USE.

Among all motion tracking algorithms [13], the block-matching algorithm is one of the most used correlation-based methods and has been adopted by several prototype USE systems [14,15,16,17,18,19,20,21] due to its simplicity. Furthermore, the block-matching algorithm can also be easily extended to include motion regularization or multi-scale tracking to deal with large tissue deformation. In the framework of block matching algorithm, similarity or correlation evaluation influences both motion tracking accuracy and computational efficiency [22]. The commonly used similarity/correlation evaluation metrics include sum absolute error, sum square error and (SSD), entropy, and normalized cross-correlation (NCC). NCC has been proved to be one of the best similarity matching criteria in block matching methods [22] and therefore, has been widely used in various ultrasound-based motion tracking algorithms.

One drawback of using the NCC as a correlation metric is that calculating NCC following the standard protocol is computationally intensive, thereby limiting its use in many ultrasound techniques requiring real-time performance. Lewis proposed to use a pre-computed table in conjunction with a fast Fourier Transform (FFT) to calculate NCC values. His method is often referred as to Sum-Table (ST) method now [23]. Luo et al. further modified Lewis’ approach [22] by replacing the requirement of using FFT with another ST. Both methods [22,23] have demonstrated that they can achieve almost the same accuracy together with significantly faster calculation efficiency in a serial computing environment in which i.e., calculations are conducted by a central processing unit (CPU).

Several major vendors (e.g., General Electric, Siemens, Philips, Hitachi, Toshiba, Samsung Medison, etc.) have released their commercially-available USE packages. However, those clinical USE systems all operate in 2D. However, continued efforts [16,19,24,25,26,27,28] are under the way to expand USE into 3-D because 3-D tracking can significantly improve quality of strain elastograms [16,19].

Now FDA-approved 3-D automated breast ultrasound (ABUS) systems (e.g., GE InveniaTM; Siemens Acuson S2000 ABVS) have been released. It is our vision that with the availability of GPU-acceleration, bringing 3-D breast USE into the clinical workflow becomes possible. In the last decade, GPU-based parallel computing has been utilized to accelerate ultrasound-based motion tracking in several important applications including USE [21,29,30,31], shear wave elastography [32] and color Doppler imaging [33]. In particular, the work by Peng et al. [21] demonstrated that high quality strain data can be obtained for a reasonably large volume (e.g., 2.5×2.5×2.5^cm3 ) in 20 s. Consequently, investigating whether or not the concept of ST can be used to further improve 3-D motion tracking is a logical next step.

To this end, our primary goal is to investigate the feasibility of using ST in conjunction with GPUs to further accelerate 3-D motion tracking. Thus, two above-mentioned ST methods (i.e., Lewis’ and Luo’s methods) were analyzed and implemented for parallel (GPU) and serial (CPU) computing settings. The standard protocol of NCC calculation without the use of ST was used as the baseline in both computing settings for a systematic comparison.

2. Materials and Methods 2.1. A Description of GPU-Accelerated Block-Matching Algorithm in USE

As shown in Figure 1, in order to estimate one displacement vector (dx,dy,dz) for an arbitrary location (x,y,z) from a pair of pre- and post-deformation ultrasound echo volumes, the block-matching algorithm can be largely divided into three steps: (1) selecting a pair of reference and target ultrasound signals from two successive ultrasound radio frequency (RF) fields based on a predetermined search range and a tracking kernel; (2) calculating an NCC value between the pair of selected reference and target echo signals, generating a 3-D resultant NCC map for a given search range, and finding a peak from the NCC map; and (3) fitting NCC values around the peak NCC map to a 3-D quadratic surface [34] to find the estimated displacement vector. During the process, the tracking-kernel size and search range must be determined prior to the start of motion tracking. The location (x,y,z) here is referred to a location in the pre-deformation ultrasound each volume (i.e., the reference volume). Thus, the (dx,dy,dz) represents the displacement vector by which the medium moved to the post-deformation ultrasound echo volume (i.e., the target volume) from the reference volume. In the block-matching algorithm, time-domain ultrasound echo signals are used to calculate correlation.

As can be seen from the schematic diagram in Figure 1, Lewis’ and Luo’s methods both need to create sum tables prior to the calculation of NCC values in Step 1. Those sum-tables are used as “lookup-tables” to replace the standard NCC calculations in order to reduce the computing time. The details of Step 1 for three different ways of calculating NCC values are discussed below. Steps 2 and 3 are exactly the same for all three methods. Specifically, Step 2 is to find the maximum on the NCC map. Thus, an integer displacement vector can be determined for the location (x,y,z) . Finally, in Step 3, the subs-sample displacement vector can be obtained by fitting a 3×3×3 matrix containing NCC values surrounding the maximum NCC value into a quadratic surface. Through Steps 2 and 3, the final axial, lateral and elevational displacements with sub-sampling precision can be obtained.

2.1.1. A Standard Protocol for Calculating NCC

Given one reference signal f and one target signal g, the NCC function over a search range (_τx,_τy,_τz) can be calculated as follows:

_RNCCu,v,w,_τx,_τy,_τz=_{∑m=uu+_Wx−1} _{∑n=vv+_Wy−1} _{∑k=ww+_Wz−1}{fm,n,k•gm−_τx,n−_τy,k−_τz}_{∑m=uu+_Wx−1} _{∑n=vv+_Wy−1} _{∑k=ww+_Wz−1}{^f2m,n,k}•_{∑m=uu+_Wx−1} _{∑n=vv+_Wy−1} _{∑k=ww+_Wz−1}{^g2m−_τx,n−_τy,k−_τz}

where the dimensions of the reference and target windows (tracking kernels) are [u,u+_Wx−1] , [v,v+_Wy−1] , [w,w+_Wz−1] in the lateral, axial and elevational directions, respectively. Similarly, (_Wx,_Wy,_Wz) are the tracking kernel dimensions in lateral, axial and elevational directions, respectively. In Equation (1), the origin of the reference window (tracking kernel) is (u,v,w) and [_τx,_τy,_τz] is the search range defined below,

(_τ1≤_τx≤_τ2,_τ3≤_τy≤_τ4,_τ5≤_τz≤_τ6)

In the block-matching algorithm, [_τ1,_τ2],[_τ3,_τ4] and [_τ5,_τ6] are pre-determined by the algorithm. For each 3-D shift (_τx,_τy,_τz) , Equation (1) can be used to obtain one NCC value. Thus, looping through the entire search range yields a 3-D NCC map.

2.1.2. Lewis’ Sum-Table Method

When the block matching algorithm is used, significant overlaps among adjacent tracking kernels exist, resulting in a lot of redundant computation. Thus, Lewis proposed to use pre-computed tables (also known as Sum-tables) to partially “memorize" some correlation between f and g [23]. Together with Fourier Transform (FT), Lewis’ method can conceptually reduce the computational demands. More specifically, the numerator in Equation (1) is a convolution of the tracking kernel within the reference signal f with the corresponding tracking kernel of the reversed target signal g(−m,−n,−k) and can be computed by Fourier Transform (FT). The process is defined as follows:

_∑m,n,kfm,n,k•gm−_τx,n−_τy,k−_τz=^F−1{F(f)^F*(g)}

where F stands for an FT operator. The complex conjugate accomplishes reversal of the template via the FT’s time reversal property [23]. Equation( 3) was implemented via fast Fourier Transform (FFT) and thus, f and g were zero-padded to a common power of two.

Now referring to the calculation of the denominator in Equation (1), two sum tables were used: one for the reference signal f and the other for the target signal g. Let _sf2(u,v,w) and _sg2(u,v,w) denotes the created sum-tables for f and g, respectively. More details about setting up these two sum-table can be found in Appendix A. Once these two tables become available, the calculation of denominator can be conducted through a very efficient manner as follows,

_{∑m=uu+_Wx−1} _{∑n=vv+_Wy−1} _{∑k=ww+_Wz−1}^f2m,n,k=_sf2u+_Wx−1,v_+Wy−1,w_+Wz−1−_sf2u+_Wx−1,v_+Wy−1,w−1−_sf2u+_Wx−1,v−1,w_+Wz−1−_sf2u−1,v_+Wy−1,w_+Wz−1+_sf2u−1,v−1,w_+Wz−1+_sf2u−1,v_+Wy−1,w−1+_sf2u+_Wx−1,v−1,w−1−_sf2u−1,v−1,w−1

_{∑m=uu+_Wx−1} _{∑n=vv+_Wy−1} _{∑k=ww+_Wz−1}^g2m+_τx,n+_τy,k+_τz=_sg2u+_Wx−1+_τx,v_+Wy−1+_τy,w+_Wz−1+_τz−_sg2u+_Wx−1+_τx,v_+Wy−1+_τy,w−1+_τz−_sg2u+_Wx−1+_τx,v−1+_τy,w+_Wz−1+_τz−_sg2u−1+_τx,v_+Wy−1+_τy,w+_Wz−1+_τz+_sg2u−1+_τx,v−1+_τy,w+_Wz−1+_τz+_sg2u−1+_τx,v_+Wy−1+_τy,w−1+_τz+_sg2u+_Wx−1+_τx,v−1+_τy,w−1+_τz−_sg2u−1+_τx,v−1+_τy,w−1+_τz

2.1.3. Luo-Konofagou Sum-Table Method

In Luo and Konofagou method [35], both the numerator and denominator of Equation (1) are calculated using sum-tables. In addition to Equations (4) and (5), the numerator can be calculated as follow using a set of sum-tables _sf,g(u,v,w) to look up pre-computed numbers,

_{∑m=uu+_Wx−1} _{∑n=vv+_Wy−1} _{∑k=ww+_Wz−1}fm,n,k•gm+_τx,n+_τy,k+_τz=_sf,gu+_Wx−1,v_+Wy−1,_{w−1+_Wz,τx},_τy,_τz−_sf,gu+_Wx−1,v_+Wy−1,_w−1,τx,_τy,_τz−_sf,gu+_Wx−1,v−1,_{w−1+_Wz,τx},_τy,_τz−_sf,gu−1,v_+Wy−1,_{w−1+_Wz,τx},_τy,_τz+_sf,gu−1,v−1,_{w−1+_Wz,τx},_τy,_τz+_sf,gu−1,v_+Wy−1,_w−1,τx,_τy,_τz+_sf,gu+_Wx−1,v−1,_w−1,τx,_τy,_τz−_sf,gu−1,v−1,_w−1,τx,_τy,_τz

According to Equations (4)–(6), the calculation of NCC can be simplified to addition and subtraction operations. More formal analyses of algorithmic complexity and memory requirements can be found in Appendices Appendix B and Appendix C.

2.2. GPU Implementation of Three NCC Calculation Methods

2.2.1. A Brief Description of GPU Computing

CUDA (Compute Unified Device Architecture) is a common parallel computing architecture launched by Nvidia (Nvidia Inc., Santa Clara, CA, USA). This architecture enables GPUs to solve complex computing problems in parallel. In CUDA, a KERNEL function is defined as a function that performs multi-threaded parallel computation. Similarly, a DEVICE function is defined as a single-threaded function called by a KERNEL function on GPU. According to the memory structure defined in CUDA, off-chip memory (global memory and texture memory) has a higher access delay than on-chip memory (register, shared memory and constant memory). Consequently, the on-chip memory should be given priority during programming in order to improve memory access efficiency. It is worth noting that the lower-case “kernel” is used for tracking kernels in this study, while the upper-case “KERNEL” is referred as to a GPU KERNEL function.

2.2.2. Block-Matching Using GPU Parallel Computing

The block-matching algorithm implemented in this study is the classic block-matching algorithm. Because of memory limitation, full 3-D block-matching tracking was first performed in a slice-by-slice manner. Thus, a full displacement vector field for a single slice consists of M×N (Axial × Lateral) displacement estimation locations; each displacement estimation location obtains one 3-D displacement vector after the block-matching process. Because tracking displacements using the classic block-matching algorithm for each location is independent (i.e., no data dependency and requirements regarding communication), in the CUDA programming structure, the M×N thread can be launched through the KERNEL function (see Section 2.2.1). As shown in Figure 1, in order to construct 3-D NCC map for each displacement estimation location, one NCC value needs to be calculated at a specified search location (_τx,_τy,_τz) . In the Step 1 (see Figure 1), given a search range (A[Axial],B[Lateral],C[Elevational]) , the KERNEL function can start (M×N)×(A×B×C) CUDA threads to obtain one 3-D NCC map to complete Step 1. Here, A=(_τ2−_τ1+1) , B=(_τ4−_τ3+1) and C=(_τ6−_τ5+1) .

In the subsequent Steps 2 and 3 (see Figure 1), the number of CUDA threads in the KERNEL functions are consistent with the total number of displacement estimation locations M×N . Basically, each CUDA thread invokes a DEVICE function to search the peak NCC value of the corresponding 3-D NCC map (Step 2). Then, the same DEVICE function uses a quadratic fitting to obtain sub-sample estimates (Step 3) [36].

In the process of implementation, a few notable strategies were used. First, one-dimensional thread structure was adopted to ensure the consistency of memory access. In other words, we attempted to ensure that adjacent threads read adjacent memory regions and try to satisfy the requirement of coalesced memory access. Second, RF data were stored in TEXTURE memory. Nvidia GPU TEXTURE memory technology can avoid the delay caused by cross-line reading. Third, some important variables (such as axial and lateral search range, tracking kernel size) are stored in GPU constant memory for rapid accesses.

2.2.3. Implementing A Standard NCC Calculation on CUDA

Under the above-mentioned parallel implementation framework, improving the efficiency for memory access was the most important consideration. This is because the calculation of a 3-D NCC map will read each RF sample multiple times. In order to avoid this frequent data reading from TEXTURE memory, we load each small part of RF echo data into on-chip memory during CUDA programming to improve memory access efficiency. The detailed GPU implementation of this method is described in our early work [21].

2.2.4. Implementing Lewis’ Method on CUDA

In this study, the FFT function library cuFFT available in CUDA (Nvidia Inc., CA, USA) was used to calculate the numerator of Equation (1). The cuFFT library provides a simple interface for computing FFTs on Nvidia GPUs. Also, the cuFFT library has been highly optimized and systematically tested for the GPU parallel computing environment. In contrast, FFTW (http://www.fftw.org/) is a state-of-art fast FFT toolbox on CPU and was applied to implement Lewis’s method on CPU.

The calculation of the denominator in Equation (1) needs two sum-tables: one for the reference RF signal and the other one for the target RF signal. In this study, a parallel scan algorithm proposed by Sengupta et al. [37] was adopted to perform rapid prefix sum (i.e., cumulative sum) computation for each direction. The scan algorithm is based on an idea of balanced tree proposed by Blelloch [38]. Equations can be found in Appendix A.

More specifically, the prefix sum builds a balanced binary tree on the input data and scans it from the top to the root for calculating the prefix sum. Figure 2 illustrates the process of setting up a 3-D sum-table. The calculation of a 3-D sum-table can be divided into the following three stages: (1) constructing a prefix sum array along with the X-axis direction; (2) constructing the prefix sum array along with the Y-axis direction based on the result of (1); and (3) using the prefix scan algorithm to build the prefix sum array along with the Z-axis direction. A for-loop operation involving the above-mentioned three stages is sufficiently fast to establish the final 3-D sum-table. Consequently, the prefix scan algorithm was only invoked once for each sum-table and twice during the entire process. If the size of the 3-D RF signal is (Rows×Columns×Slices) , the corresponding KERNEL function will launch (Rows×Columns×Slices) CUDA threads in order to create a sum-table.

After setting up 3-D sum-tables, the calculations of numerator and denominator (see Equation (1)) are accomplished by FFT and through two sum-tables, respectively.

2.2.5. Implementing Luo-Konofagou Method

Please recall that in the Luo-Konofagou method, computing the denominator in Equation (1) is the same as Lewis’s method (see Section 2.2.4). Luo-Konofagou method replaced the FFT operation to sum-table method for computing the numerator. During this process, numerator’s calculation needs (_τ2−_τ1+1)×(_τ4−_τ3+1)×(_τ6−_τ5+1) (i.e., A×B×C ) sum-tables to represent the standard cross-correlation (CC) between the reference and target RF signals at a specified search location (_τx,_τy,_τz) . Thus, we need to launch (_τ2−_τ1+1)×(_τ4−_τ3+1)×(_τ6−_τ5+1)×(Rows×Columns×Slices) CUDA threads. It is easy to see that the number of threads and required memory become a burden to manage if the search range is very large.

2.3. Experimental Design and Data Analysis

2.3.1. A Tissue-Mimicking Phantom Experiment

Volumetric ultrasound data acquired from a 100 mm × 100 mm × 70 mm oil-in-gelatin phantom were also used to test the above-mentioned three different NCC calculation methods. The inclusion in the ultrasound data was 5 times stiffer than the background. As shown in Figure 3, a 9-MHz CMUT transducer connected to a Siemens Antares (Siemens Health Care, Inc. Ultrasound Division, Mountain View, CA, USA) was used to acquire radio frequency (RF) echo data. A robotic arm (see Figure 3) was used to move the CMUT transducer downward to compress the phantom. The volume-to-volume deformation was approximately 1.5%. Then, ultrasound echo data before and after the compression were first acquired using ultrasound research interface (URI, Siemens Health Care, Inc. Ultrasound Division, Mountain View, CA, USA) installed on the Siemens scanner. Each echo data represented a 40 mm (axial) × 37 mm (lateral) × 30 mm (elevation) volume. The RF sample size, line spacing and distance between two parallel image planes were 0.0193-mm, 0.119-mm and 0.214-mm, respectively. More details can be found elsewhere [19].

2.3.2. Data Analysis

The GPU hardware used in this study is Nvidia GeForce GTX TITAN X (Nvidia Corp., Santa Clara, CA, USA). The GPU card comes along with 80 Stream Multiprocessors, along with a total of 5120 CUDA computing cores along with 12 GB of Memory. The GPU card is installed on a desktop workstation with a Ubuntu operating system (version 16.04), Intel(R) Core(TM) i7-8700 CPU @ 3.20 GHz, and 16 GB of host memory. ANSI C and CUDA 9.0 were adopted for implementing all CPU and GPU algorithms. All testing was done under the MATLAB platform (Version 2016b, Mathworks Inc., Natick, MA, USA) and all implemented algorithms were invoked in the MATLAB environment through the MEX interface.

In total, six different implementations were done in this study: standard NCC in CPU, standard NCC in GPU, Lewis’ method in CPU and Lewis’ method in GPU, Luo-Konofagou method in CPU and Luo-konofagou method in GPU. Hereafter, those six implementations are referred to as NCC-CPU, Lou-Konofagou-CPU, Lewis-CPU, NCC-GPU, Luo-Konofagou-GPU, and Lewis-GPU, respectively. In this paper, the performance is mainly compared from the following two aspects: (1) Evaluating whether or not a different implementation yields substantial errors and (2) comparing computational efficiency of those three methods, given motion tracking parameters and the computing environment (i.e., CPU or GPU).

3. Results 3.1. Comparisons of Accuracy Among Different Implementations

Displacement estimates obtained from using the NCC-CPU of the block-matching algorithm were compared to other five above-mentioned implementations: Lou-konofagou-CPU, Lewis-CPU, NCC-GPU, Lou-Konofagou-GPU, Lewis-GPU. In Figure 4, the standard-NCC-CPU implementation yielded exactly same displacements as compared to the Lou-Konofagou-CPU implementation (see the second row). The differences between the standard-NCC-CPU and Lewis-CPU implementations were neglectable (< ¹⁰⁻¹⁵mm ), as shown in the third row of Figure 4. However, when comparing three GPU implementations (NCC-GPU, Lou-Konofagou-GPU and Lewis-GPU) to the standard-NCC-CPU implementation, small differences exist (see the fourth to sixth rows in Figure 4). The differences are quantitatively analyzed from 30 (elevational) slices of ultrasound displacement data and the result is shown in Figure 5.

It is clear from Figure 5 that all three GPU implementations produced slightly different displacement results. However, the difference was small ( ¹⁰⁻⁶ – ¹⁰⁻⁴ mm). Such a small difference did not result in visible differences on respective axial strain images (see the fourth column in Figure 4).

Computational Efficiency

The influence of the tracking kernel size was investigated for six implementations and the results are shown in Figure 6. The search range was set to (5[lateral]×18[axial]×3[elevational]) for tracking approximately 1.5% tissue deformation, and the size of tracking kernel varied as stated in those figures. For reference, 18, 61 and 69 axial samples are equivalent to 0.36 mm, 1.16 mm, and 1.35 mm, respectively, while 7, 9 and 11 beamlines are 0.83 mm, 1.07 mm and 1.31 mm, respectively. Three elevation planes equal to 0.64 mm in space. We found that the Lou-Konofagou method can substantially reduce the computing time under the CPU environment (i.e., >95% reduction between Lou-Konofagou-CPU and NCC-CPU). However, under the GPU environment, the reductions become small but still noticeable (between 17–46%). It is interesting to note that, under the GPU environment, Lewis’s method reduces the computing time at a lower rate (between 7–23%), as compared to Lou-Konofagou method.

When the tracking kernel was fixed at 9[1.07mm;lateral]×69[1.35mm;axial]×3[0.64mm;elevational]) . The computational efficiency was also examined when the search range had been varied. It is common to change the search range in USE to accommodate different frame-to-frame (2D) or volume-to-volume (3-D) strain levels occurring in vivo. As shown in Figure 7, the required time to complete the motion tracking as the increase of search range. However, the trend of time reduction between the standard NCC method and the Lou-Konofagou method remained the same. In contrast, with the increase of the search range, time needed for Lewis’ method to complete the required tracking remains relatively steady (Figure 7).

As shown in Table 1, the standard deviation values among 30 slices were low as compared to the mean value. This indicated that the time required to calculate NCC values was stable regardless of the implementation used.

4. Discussion and Summary

In this paper, the motion tracking accuracy and computational efficiency of the three methods in the CPU serial computing environment and GPU parallel computing environment are compared and systematically analyzed. Under the CPU environment, the Lou-Konofagou method can substantially improve computational efficiency (by 95% or more). In contrast, using Luo-Konofagou method, the 3-D tracking can still be accelerated under the GPU parallel environment. However, the rate of improvements only ranged between 17% and 46% (see Figure 6). The rate of improvements obtained by the Lewis’ method was considerably less (7–23%; see Figure 6).

In order to further improve the GPU-based 3-D motion tracking, one strategy is to enhance the memory access efficiency. That requires us to make full use of on-chip memory. Recall that, according to Nvidia’s GPU specifications, the latency of on-chip memory is significantly better than that of off-chip memory. However, on-chip memory capacity is limited on GPUs. With the development of GPU hardware technology, the on-chip memory capacity may continue to increase. At the same time, the current GPU implementation can also be further optimized. In multi-instruction and multi-data (MIMD) mode, when one instruction is waiting for loading data, another segment of data can be used to perform some computing tasks at the same time. Therefore, the MIMD mode may further improve the computational efficiency of GPU implementation. We expect that with the improvements of GPU hardware, real-time 3-D ultrasound motion tracking may become a reality in the near future.

As the tissue deformation increases, the required search range becomes inevitably large. Consequently, the creation of those sum-tables requires a longer time (see Appendix B). It is also found that the Lou-Konofagou method demands a high usage of memory (see Appendix C). Also, when the search range increases, the number of required sum-tables becomes bigger accordingly. In this sense, in order to deal with large tissue deformation, the Luo-Konofagou method can be used in conjunction with a multiple-compression tracking strategy [39,40]. This is because the multi-compression method can effectively reduce the required search range at the expense of performing motion tracking multiple times. Alternatively, the Luo-Konofagou method can be used together with a 3-D region-growing motion tracking method [20]. The advantage of using a region-growing motion tracking method is that the search region would be fairly small. Both directions will be explored in our future work.

Our preliminary results demonstrated that the Lewis method accelerated the 3-D motion tracking at a slow rate (see Figure 6 and Figure 7) in both CPU and GPU computing environments. This is largely because the FFT requires a large number of calculations in order to estimate the numerator in Equation (1) regardless of the computing environment.

Implementation Method	Computing Time (milliseconds)
Standard-NCC-CPU	5760.0 ± 9.3
Luo-Konofagou-CPU	168.5 ± 2.2
Lewis-CPU	2120.0 ± 5.5
Standard-NCC-GPU	278.1 ± 0.9
Luo-Konofagou-GPU	159.9 ± 0.4
Lewis-GPU	225.2 ± 3.2

Author Contributions

All authors discussed the original idea. B.P. and J.J. constructed the overall study design with support from S.L. and Z.X.; B.P., S.L., Z.X. performed programming and conducted data analysis; B.P and J.J. drafted this manuscript; all authors provided critical feedback and reviewed the manuscript.

Funding

The project is funded in part by research grants from Scientific Innovation Program of Sichuan Province (Major Engineering Project: 2018RZ0093), Nanchong Scientific Council (Strategic Cooperation Program between University and City: NC17SY4020), and US National Institutes of Health (R15-EB026197).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Setting up 3-D Sum-Tables for 3-D Ultrasound Echo Data

The sum tables in this study are defined as follows:

_sf2=∑m=0u∑n=0v∑k=0w^f2(m,n,k)_sg2=∑m=0u∑n=0v∑k=0w^g2(m,n,k)_sf,g(u,v,w,_τx,_τy,_τz)=∑m=0u∑n=0v∑k=0wf(m,n,k)·g(m+_τx,n+_τy,k+_τz)

Above definitions are similar to those defined by Luo and Konofaguo in 2D [22]. The 3-D sum tables can be constructed efficiently using the following three steps. First, the summation is performed in the u (lateral) direction for each v and w. Then, the summation was performed in the v (axial) direction for each u and w. Finally, the summation was performed in the w (elevational) direction for each u and v. One illustrative example is shown in Figure 2. The constructing process of _sf2 , _sg2 and _sf,g2 are given by (A1)-(A3) below.

_{sf,temp_u2}(u,v,w)=_fu,v,w2+_{sf,temp_u2}(u-1,v,w)_{sf,temp_v2}(u,v,w)=_{sf,temp_u2}(u,v,w)+_{sg,temp_v2}(u,v-1,w)_sf2(u,v,w)=_{sf,temp_v2}(u,v,w)+_sf2(u,v,w-1)

_{sg,temp_u2}(u,v,w)=_gu,v,w2+_{sg,temp_u2}(u-1,v,w)_{sg,temp_v2}(u,v,w)=_{sg,temp_u2}(u,v,w)+_{sg,temp_v2}(u,v-1,w)_sg2(u,v,w)=_{sg,temp_v2}(u,v,w)+_sg2(u,v,w-1)

_{sf,g,temp_u}(u,v,w,_τx,_τy,_τz)=f(u,v,w)·g(u+_τx,v+_τy,w+_τz)+_{sf,g,temp_u}(u-1,v,w,_τx,_τy,_τz)_{sf,g,temp_v}(u,v,w,_τx,_τy,_τz)=_{sf,g,temp_u}(u,v,w,_τx,_τy,_τz)+_{sf,g,temp_v}(u,v-1,w,_τx,_τy,_τz)_sf,g(u,v,w,_τx,_τy,_τz)=_{sf,g,temp_v}(u,v,w,_τx,_τy,_τz)+_sf,g(u,v,w-1,_τx,_τy,_τz)

Appendix B. Analysis of Algorithmic Complexity for Sum-Table Methods

The complexity of calculating one NCC for 3-D block-matching is analyzed in the appendix for the sake of completeness. The analysis of complexity for the 3-D case is an extension to the work done by Briechle and Hanebeck in 2D [41]. Under the framework of a 3-D block-matching, one-direction search ranges are _Sx,_Sy and _Sz for lateral, axial and elevation directions, respectively. Of note, the one-direction search range above was defined as the maximal translation of the center of the tracking kernel along one given direction (e.g., upward or downward). Given a 3-D tracking kernel with a size of _Wx [lateral], _Wy [axial] and _Wz [elevational] samples, we need to get access to a segment of RF data whose size is at least _Mx[lateral]×_My[axial]×_Mz[elevational] RF samples in order to complete the calculations for one NCC values (see Equation (1)), where _Mx=2_Sx+_Wx+1 , _My=2_Sy+_Wy+1 and _Mz=2_Sz+_Wz+1 .

According to the given data, Table A1 shows the complexity for computing a single NCC value for three methods: Standard NCC, Lewis' method and Luo-Konofagou method. According to Table A1, the algorithmic complexity of the FFT operation depends on the size of the tracking kernel and the corresponding search area. When the search range is small, Lewis' method has considerably lower algorithmic complexity as compared to the standard NCC method. When the search area becomes larger, the benefit of using Lewis' method diminishes. In contrast, the algorithmic complexity of Lou-Konofagou method is not dependent on the tracking kernel size and search range.

Table

Table A1.A summary of algorithmic complexity for computing a single NCC value under the block-matching algorithm. The costs involving the construction of sum-tables is not included and analyzed below in Table A2.

[ Table omitted. See PDF. ]

The size of the 3-D reference signal f are _fx , _fy , and _fz for lateral, axial and elevation directions, respectively. The size of target signal g along the axial and lateral directions is the same as that of the reference signal f. However, the size of the 3-D target signal g for elevation direction is _gz , which could differ from _fz . The required number of operations in terms of setting up the sum tables is shown in Table A2. Recall that the standard NCC method has no need for using the sum-tables.

Table

Table A2.A Summary of Algorithmic Complexity for Setting up Sum-tables needed for Lewis' and Luo-Konofagou methods.

[ Table omitted. See PDF. ]

Appendix C. Analysis of Memory Requirements for Sum-Table Methods

Let the size of raw RF volume data is 1024(axial)×128(lateral)×50(elevational) . The memory usage for setting up sum-table is mainly associated with 3-D search range and the size of evelational of 3-D tracking kernel. Assuming that a 3-D tracking kernel is (69(axial)×9(lateral)×3(elevational) and the search range is 18(axial)×5(lateral)×3(elevational) , we need to access to both the reference RF volume data f ( 1024×128×3 ) and the target RF data g ( 1024×128×5 ). According to Equations (4)-(6), the RF volume data required to set up the sum-tables of f and g are 1024×128×(3+1) and 1024×128×(5+1) . Therefore, the memory consumption for the corresponding sum-table of f and g are 1024×128×(3+1)×sizeof(float) = 2 MB, and 1024×128×(5+1)×sizeof(float) = 3 MB, respectively.

Since Luo-Konofagou method needs to set up a set of sum-tables at different shift locations in target signal g for calculation of numerator, the memory required for this part is (18×5×3)×(1024×128×4)×sizeof(float) = 540 MB. Table A3 lists the memory usage for Luo-Konofagou method and Lewis's method given different search ranges.

Table

Table A3.A Summary of memory usage for Setting up Sum-tables needed for Lewis' and Luo-Konofagou methods.

[ Table omitted. See PDF. ]

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Word count: 6111

Show less

© 2019. This work is licensed under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Translate

The location (x,y,z) here is referred to a location in the pre-deformation ultrasound each volume (i.e., the reference volume). [...]the (dx,dy,dz) represents the displacement vector by which the medium moved to the post-deformation ultrasound echo volume (i.e., the target volume) from the reference volume. [...]in Step 3, the subs-sample displacement vector can be obtained by fitting a 3×3×3 matrix containing NCC values surrounding the maximum NCC value into a quadratic surface. For each 3-D shift (τx,τy,τz) , Equation (1) can be used to obtain one NCC value. [...]looping through the entire search range yields a 3-D NCC map. 2.1.2. According to the memory structure defined in CUDA, off-chip memory (global memory and texture memory)

Details

Title

Accelerating 3-D GPU-based Motion Tracking for Ultrasound Strain Elastography Using Sum-Tables: Analysis and Initial Results

Author

Peng, Bo; Luo, Shasha; Xu, Zhengqiu; Jiang, Jingfeng

Publication year

2019

Publication date

2019

Publisher

MDPI AG

e-ISSN

20763417

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.3390/app9101991

ProQuest document ID

2331445119

Accelerating 3-D GPU-based Motion Tracking for Ultrasound Strain Elastography Using Sum-Tables: Analysis and Initial Results

Jump to:

Full Text

Abstract

Details

Suggested sources