Deep learning in veterinary medicine: a CNN-based approach to detect pulmonary abnormalities from lateral thoracic radiographs in cats

Study population

TR images were extracted from the image database of the veterinary campus of VetAgro Sup, France, covering the period from September 2012 to January 2020. The database comprised 72,567 records. Each record has a Medical Imaging Report (MIR) containing details about the animal (e.g. species, age, breed), its condition (e.g. reason for consultation, clinical signs, follow-up), the radiographic procedure (e.g. X-ray machine, projections) and the medical imaging findings. All MIRs formed a secondary tabular database, in which each MIR was used to label its corresponding image from the first database. Each MIR was reviewed by at least one board-certified (ECVDI) veterinary radiologist.

Feline TR image selection

For the same cat, sets of TR images could have been acquired on different dates, and the medical imaging findings could have changed; a case was therefore defined as a (cat; day) pair. For each animal, one to three views were acquired among the following projections: left lateral, right lateral and ventro-dorsal. The acquisition of TR images was supervised by veterinary technicians from the diagnostic imaging unit of the veterinary campus of VetAgro Sup, in accordance with their animal welfare guidelines. No cats were directly involved in this study. Two software packages were used for acquisition: ImagePilot (KONICA MINOLTA) and Console Advance (FUJIFILM CORPORATION). All MIRs and all TRs were validated by the veterinary radiologist expert. An initial query over all MIRs selected those with or without a description of RPP(s) and led to the creation of two sets of cases: TR images with RPP(s) and TR images without RPP. Based on these two sets, the corresponding DICOM files were extracted and then converted into JPEG files with med2image25. Each JPEG image and its corresponding MIR were reviewed by a veterinarian to check for misclassification and to keep only normal TRs or TRs with RPP(s), excluding images with extra material (e.g. infusion line, bandage, lead shot) or extra-pulmonary disorders (e.g. diaphragmatic hernia, severe pleural effusion, pneumothorax).

Feline TR image pre-processing

Initial TR image file sizes ranged from 138 to 545 KB and their matrix sizes ranged from 1692 × 1350 to 2964 × 2364 pixels, with a width-to-height ratio of 1.2 to 1.3, depending on the size and detector density of the radiographic detector plate used during acquisition. To match the input size and pixel intensity range expected by the CNN while conserving this ratio, all TR images were resized to 256 × 192 pixels and normalized during pre-processing. TR images were then labelled “Original” if no further modification was done, or “Segmented” if the intra-thoracic area was manually segmented with the image viewer Preview (v.10.1, macOS Mojave v.10.14.6). The intra-thoracic area was defined as the part of the TR image delimited dorsally by the ventral side of the spine, ventrally by the dorsal side of the sternum, cranially by the first ribs and caudally by the diaphragm. This segmentation was made by a single veterinarian. Moreover, the effect of “ECM”, based on Contrast Limited Adaptive Histogram Equalization, was tested for “Original” and “Segmented” TR images26. These two additional image pre-processings were called “Original + ECM” and “Segmented + ECM” respectively. The manual segmentation was performed after the use of ECM. Thus, four different image pre-processings were assessed in this work. The workflow used is presented in Fig. 4.
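The normalization step can be sketched as follows; this is a minimal plain-Python illustration, in which min-max scaling to [0, 1] is our assumption of how the normalization was implemented (the text does not specify the exact scheme), and the function name is ours:

```python
def normalize(pixels):
    """Min-max normalize a 2-D list of pixel values to the [0, 1] range."""
    flat = [v for row in pixels for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1  # avoid division by zero on a flat image
    return [[(v - lo) / span for v in row] for row in pixels]

# Tiny 2 x 2 example image with 8-bit intensities
image = [[0, 128], [255, 64]]
normed = normalize(image)  # all values now lie in [0, 1]
```

In the actual pipeline this would run after resizing each radiograph to 256 × 192 pixels, so that every image enters the CNN with the same shape and intensity range.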

Model and architecture

ResNet5027, a widely used deep learning classification architecture, recently showed the highest performance among five CNNs compared for the detection of coronavirus pneumonia on CXR images28. ResNet50V229 is a modified version of ResNet50 that performed better on ImageNet30, one of the largest image databases, composed of millions of images across hundreds of categories. The model in this study was therefore built on ResNet50V2. All layers above the last convolutional layer were replaced by a 4-layer block including two fully connected layers, with a single neuron on the last layer for binary classification purposes. Fig. 5 details the proposed model. A binary cross-entropy loss function was used and a final sigmoid output function predicted the class. Thus, the model takes a TR image as input and returns a prediction probability in the range 0 to 1. If the returned probability is less than 0.5, the predicted label is “Normal”; otherwise the predicted label is “Abnormal”.
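The output head described above combines a sigmoid activation, a binary cross-entropy loss and a 0.5 decision threshold; a minimal sketch of those three pieces (function names are ours, independent of the Keras implementation):

```python
import math

def sigmoid(z):
    """Map a raw network output to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def bce(y_true, p):
    """Binary cross-entropy for a single prediction probability p."""
    eps = 1e-7  # clamp to avoid log(0)
    p = min(max(p, eps), 1 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

def predicted_label(p):
    """Apply the 0.5 decision threshold from the text."""
    return "Normal" if p < 0.5 else "Abnormal"
```

A probability of exactly 0.5 falls on the “Abnormal” side here, matching the text's “less than 0.5 means Normal” rule.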

The Keras library on top of TensorFlow (version 2.3.0, Google) was used to implement the model. All algorithms were run on a Tesla P100 16G (NVIDIA) GPU from the computational platform of the Center Blaise Pascal (ENS, Lyon, France). Training such a deep classification network is challenging, so we used transfer learning and fine-tuning, followed by data augmentation, for the final model’s training.

Transfer learning from ImageNet and human CXR images

To get the most out of the database of TR images with a very deep architecture such as ResNet50V2, pre-training was first performed on a natural image database and fine-tuning was then run on a human radiography database. This approach is known as transfer learning. Transfer learning refers to storing knowledge learned from solving one problem and applying it to a related problem. In the context of classifying TR images with a CNN, this corresponds to reusing the weights of a classification network trained on another database (i.e. natural color images) as initial weights for the subsequent training on a different database (i.e. X-ray images)31. This strategy often produces better results after training (or fine-tuning) on a new database than using random initial weights, even for medical applications (an example using magnetic resonance imaging to evaluate positron emission tomography scans can be found in32). For instance, a CNN model trained on ImageNet has been fine-tuned for pneumonia and tuberculosis localization on radiographs33. Thus, transfer learning was used from ImageNet to the “Large database of Labeled Optical Coherence Tomography and Chest X-Ray”34, which contains hundreds of human CXR images with or without signs of pneumonia.

Training the model on feline TR images

Two sets were randomly generated from the 500 TR images: 455 TR images for training (using a training set and a validation set) and 45 TR images for testing (the test set). The TR images used for training and testing represented 90% and 10% respectively of the total number of TR images. Among the 455 TR images used for training, 80% were allocated to the training set and 20% to the validation set. These ratios (90%/10% and 80%/20%) were inspired by a similar study that used a limb radiograph database for the detection of hip fractures on plain pelvic radiographs35.
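The split sizes follow directly from these counts; a small plain-Python sketch (helper name ours) reproducing the 45-image hold-out followed by the 80%/20% partition of the remaining 455 images:

```python
import random

def split_cases(n_total=500, n_test=45, val_frac=0.20, seed=0):
    """Shuffle case indices, hold out the test set, then split the
    remainder 80/20 into training and validation sets."""
    idx = list(range(n_total))
    random.Random(seed).shuffle(idx)
    test, trainval = idx[:n_test], idx[n_test:]
    n_val = round(len(trainval) * val_frac)
    val, train = trainval[:n_val], trainval[n_val:]
    return train, val, test

train, val, test = split_cases()
# 455 images for training (364 train / 91 validation) and 45 for the test set
```

Reshuffling with a different seed and re-splitting the 455 training images is exactly the operation repeated 200 times later in the text.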

A data augmentation strategy was applied only to the training set, using the following transformations: random rotation (±15°), random width and height shift (at most 0.05 of the total width or height), random shear (0.5) and random zoom (0.8 to 1.2). Although flipping could have extended the training set, it was not used because TR images are conventionally oriented with the cat’s head on the left part and the cat’s back on the top part of the image36.
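These transformations map naturally onto Keras’ ImageDataGenerator; the configuration sketch below is our assumption of how such a setup could look (the text does not name the exact augmentation API used):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation applied to the training set only; parameter values
# follow the transformations listed in the text.
train_augmenter = ImageDataGenerator(
    rotation_range=15,        # random rotation of +/- 15 degrees
    width_shift_range=0.05,   # random horizontal shift (fraction of width)
    height_shift_range=0.05,  # random vertical shift (fraction of height)
    shear_range=0.5,          # random shear
    zoom_range=[0.8, 1.2],    # random zoom between 0.8x and 1.2x
    horizontal_flip=False,    # flipping deliberately disabled (see text)
    vertical_flip=False,
)
```

Keeping both flip flags off preserves the conventional orientation of the radiographs that the text relies on.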

Inspired by a methodological CNN-based study with ultrasound images in dogs11, an exponential decay learning rate schedule with an initial learning rate of 0.001 was used. The batch size was set to 40, and a dropout of 0.5 and the Adam optimizer37 were used. The initial number of epochs was set arbitrarily to 500, although an early-stopping function was implemented to stop the training once the loss on the validation set had stopped decreasing for 25 consecutive epochs, in order to avoid overfitting and reduce the training time38.
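An exponential decay schedule with initial rate 0.001 can be written out in a few lines; the decay rate and step interval below are illustrative assumptions, since the text does not report them:

```python
def exponential_decay(step, initial_lr=0.001, decay_rate=0.96, decay_steps=100):
    """Learning rate after `step` optimisation steps:
    lr = initial_lr * decay_rate ** (step / decay_steps)."""
    return initial_lr * decay_rate ** (step / decay_steps)

# The rate starts at 0.001 and shrinks smoothly as training progresses.
```

The same formula underlies the standard Keras exponential-decay schedule, so this sketch also describes what the framework computes internally.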

The model, first fine-tuned with human CXR images, was then fine-tuned on the 455 TR images with each of the four pre-processings (“Original”, “Original + ECM”, “Segmented”, “Segmented + ECM”). To rigorously compare the four pre-processings, it was essential to run many different training sessions, in order to overcome variations of training due to the distribution of the 455 TR images between the training set and the validation set24. For that reason, our model was fine-tuned with 200 random distributions of the 455 TR images between the training set and the validation set. Thus, 200 different fine-tuned models were obtained and saved for each of the four pre-processings. This approach enabled a robust statistical analysis and justified the choice of the best pre-processing with a quantitative assessment over the 200 randomly shuffled validation sets. In Supplementary Fig. S3, we justify the choice of the number 200 through an analysis of three metrics (Accuracy, Sensitivity, Specificity) obtained on the test set as a function of the number of fine-tuned models.

Quantitative assessment and ensemble methods approach

The quantitative assessment of the 200 fine-tuned models was performed using five metrics: Sensitivity (Se), also called “Recall”; Specificity (Sp); Accuracy (Acc); Positive Predictive Value (PPV), also called “Precision”; and the F1 score. The Negative Predictive Value was not calculated because a TR exam is most often performed when abnormalities are suspected, so the PPV was preferred. The metrics were calculated as: Se = TP/(TP + FN), Sp = TN/(TN + FP), Acc = (TP + TN)/(TP + TN + FP + FN), PPV = TP/(TP + FP) and F1-score = TP/(TP + ½(FN + FP)), where TP, TN, FP and FN are respectively the numbers of True Positive, True Negative, False Positive and False Negative classifications.
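These definitions translate directly into code; a minimal sketch (function name ours) computing the five metrics from the confusion-matrix counts, with illustrative numbers:

```python
def metrics(tp, tn, fp, fn):
    """Compute the five evaluation metrics from confusion-matrix counts."""
    return {
        "Se":  tp / (tp + fn),                   # Sensitivity (Recall)
        "Sp":  tn / (tn + fp),                   # Specificity
        "Acc": (tp + tn) / (tp + tn + fp + fn),  # Accuracy
        "PPV": tp / (tp + fp),                   # Precision
        "F1":  tp / (tp + 0.5 * (fn + fp)),      # F1 score
    }

# Illustrative counts, not results from the study
m = metrics(tp=40, tn=35, fp=5, fn=10)
```

Note that the F1 form used here, TP/(TP + ½(FN + FP)), is algebraically identical to the more familiar harmonic mean of precision and recall.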

These five metrics were calculated for each of the 200 validation sets for the four pre-processings. A statistical analysis of the metrics’ distributions determined which pre-processing achieved the best performance over the 200 randomly shuffled validation sets. To complete the quantitative assessment, the metrics were also computed on the 45 TR images of the test set; these 45 TR images had never been used for training. For each pre-processing, the 200 fine-tuned models worked as independent classifiers of a TR image, so 200 predictions were provided for the same TR image. To take full advantage of these 200 fine-tuned models, a voting ensemble method was applied to make the final prediction. The final prediction was obtained by the unweighted averaging method, the most common ensemble method for neural networks24: it consists of taking the average of the predictions of the 200 fine-tuned models. In this way, the variance of the predictions from these 200 fine-tuned models was reduced and the final prediction was less dependent on the split between the training set and the validation set24. The final prediction was compared with the medical imaging findings described in the MIR. For one TR, the final prediction was obtained in about 40 s.
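The unweighted-averaging vote can be sketched in a few lines; `model_probs` stands for the 200 per-model probabilities for one TR image (names and example values are ours):

```python
def ensemble_predict(model_probs, threshold=0.5):
    """Unweighted averaging of the per-model probabilities, followed by
    the same 0.5 decision threshold used for a single model."""
    mean_p = sum(model_probs) / len(model_probs)
    return mean_p, ("Normal" if mean_p < threshold else "Abnormal")

# 200 individual classifier outputs for one image (illustrative values)
probs = [0.45, 0.55, 0.60] * 66 + [0.70, 0.40]
mean_p, label = ensemble_predict(probs)
```

Averaging before thresholding lets near-unanimous disagreements cancel out, which is precisely how the ensemble reduces the variance tied to any single train/validation split.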

Qualitative assessment with an averaged Grad-CAM

To facilitate the interpretability of the final prediction, an averaged Grad-CAM was applied20. We refer to an activation map as a function defined over all the points of the input data (the TR image) that shows where the feature of interest lies. In practice, it takes the TR image as input and indicates which pixels contribute the most information to the model’s final prediction. The averaged Grad-CAM produced an activation map of the input TR image. This activation map represented the areas of the TR image that supported the final prediction. The intensity of activation was represented by a continuum of colors from cool (blue) to warm (red) hues. To the human eye, it produces a new picture in which points of interest are highlighted.

In medical terms, these activations corresponded to “Abnormal” areas, i.e. areas with signs of RPP. The warmer the color, the more strongly the activated area was recognized as abnormal. Conversely, an activated area in cool colors was recognized as only slightly abnormal. Thus, with the averaged Grad-CAM, the veterinarian is able to double-check the final prediction with a critical and analytical eye.
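The averaging of per-model activation maps can be sketched as follows; we assume each Grad-CAM map is a 2-D grid of activation intensities of equal shape, and that the averaged map is rescaled to [0, 1] before color mapping (both are our assumptions about the implementation, with names ours):

```python
def average_maps(maps):
    """Pixel-wise mean of several Grad-CAM activation maps of equal shape."""
    n = len(maps)
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(m[r][c] for m in maps) / n for c in range(cols)]
            for r in range(rows)]

def rescale(heatmap):
    """Rescale the averaged map to [0, 1] so it can be rendered as a
    cool-to-warm color overlay on the radiograph."""
    flat = [v for row in heatmap for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1
    return [[(v - lo) / span for v in row] for row in heatmap]
```

After rescaling, values near 1 would be rendered in warm (red) hues and values near 0 in cool (blue) hues, matching the color convention described above.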

Statistics and data analysis

For each metric, the means achieved with the four pre-processings were compared with an ANOVA, followed by pairwise t-tests with the Bonferroni-Holm correction for multiple comparisons. In addition, the 95% Confidence Interval (CI) of the median of each distribution was represented graphically with notched box plots.
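The Bonferroni-Holm step-down adjustment can be sketched in plain Python; this is a minimal illustration of the correction itself (function name ours), not the authors’ statistical software:

```python
def holm_bonferroni(p_values):
    """Holm step-down adjusted p-values: sort ascending, multiply the
    i-th smallest raw p-value by (m - i), enforce monotonicity along
    the sorted order, and cap at 1."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted
```

Compared with the plain Bonferroni correction (multiplying every p-value by m), the Holm procedure is uniformly at least as powerful while still controlling the family-wise error rate.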