GPU-accelerated image processing in ImageJ using CLIJ
Note: This page is outdated. We changed the benchmarking to use JMH. See here for the new benchmarking report. This page is just kept for allowing reproducing our results from the first BioRxiv preprint version of the paper.
In order to measure performance differences between ImageJ and CLIJ operations, we conducted benchmarking experiments.
First, we set up a list of operations to benchmark in more detail. The list contains typical operations such as morpholigical filters, binary image operations, thresholding and projections. The operations were encapsulated in respective Benchmarking modules and put in a Java package; Each of these modules has two test methods: One that processes images in the ImageJ-way and another that executes an as-similar-as-possible operation using CLIJ. Both methods measure the pure processing time excluding time spent for data transfer between CPU and GPU for example. An exceptional module called Transfer3DFrom is used to quantify the transfer time exclusively. In order to measure pure processing times of the operations, image data transfor from/to GPU was excluded. Furthermore, GPU-code compilation time was excluded by ignoring the first out of 10 repetitions. During this so called warm-up phase, compiled GPU-code is cached and can be reused in subsequent iterations.
For benchmarking operations, we used images with random pixel values of pixel type 8-bit with a size of 16 MB if not specified otherwise. The 2D-operations were tested on images of size 4096 x 4096. The 3D-operations received 256 x 256 x 256 large image stacks for processing.
Benchmarking was executed on
The measured median transfer times +- standard deviations show clear differences between notebook (push: 0.5 +- 0.4 GB/s, pull 0.4 +- 0.5 GB/s) and workstation (push: 1.3 +- 0.6 GB/s, pull: 1.3 +- 0.4 GB/s ) suggesting that the data transfer is faster on the workstation. These numbers can be retraced by executing the analyse_transfer_time.py
These plots were done with the plotCompareMachinesTransferImageSize.py script.
We chose six operations to plot processing time with respect to image size in detail.
Time benchmark measurement raw data are available online; These plots were done with the plotCompareMachinesImageSize.py script.
Some of the operations show different behaviour with small images compared to large images. There is apparently not a linear or polynomial relationship between image size and execution time. We assume that the cache of the CPU allows ImageJ operations running on the CPU to finish faster with small images compared to large images. This is reasonable as these caches usually hold several megabytes.
Some operations have parameters influencing processing time. We chose two representative examples: the Gaussian Blur filter whose processing time might depend on its sigma parameter and the Minimum filter whose processing time might depend on the entered radius parameter.
Time benchmark measurement raw data are available online; These plots were done with the plotCompareMachinesKernelSize.py script.
The Gaussian Blur filter in ImageJ is optimized for speed. Apparently, it is an implementation which is independent from kernel size. Thus, with very large kernels, it can perform faster than it GPU-based counter part. The Minimum filter shows an apparently polynomial increasing time with increasing filter radius.
We also generated an overview of speedup factors for all tested operations. The speedup was calculated relative to the ImageJ operation executed on the laptop CPU.
Time benchmark measurement raw data are available online; This plot was generated with the plotCompareMachinesAllOperations.py script.
This table shows some operations to perform slower (speedup < 1) on the workstation CPU in comparison to the notebook CPU. We see the reason in the single thread performance resulting from the clock rate of the CPU. While the Intel i7-8650U is a high end mobile CPU with up to 4.2 GHz clock rate, the Intel Xeon Silver 4110 has a maximum clock rate of 3 GHz. On the other hand, there are operations shown which run faster on the workstation CPU. We assume that the algorithms implemented exploit multi-threading for higher performance yield. The workstation CPU has 8 physical cores while the laptops CPU has just 4.
If you want to contribute to this project, feel free to clone the repository, execute the benchmarks and file a pull request with your measurements! Also, please specify your CPU and GPU hardware when filing the PR.