GENERATING SYNTHETIC 2019-NCOV SAMPLES WITH WGAN TO INCREASE THE PRECISION OF AN ENSEMBLE CLASSIFIER
Abstract
This research presents an alternative data augmentation technique based on the Wasserstein GAN (WGAN) to improve precision in the detection of positive 2019-nCoV samples, and compares it with traditional data augmentation techniques, using a dataset composed of exam results from individuals tested for COVID-19 at a hospital in Brazil. Given the class imbalance, we trained a gradient-boosted decision tree (GBDT) classifier on the preprocessed data in 13 different training scenarios, using SMOTE, ADASYN, Random Over Sampling, and WGAN to augment the positive samples, as well as a baseline with no augmentation. All oversampling scenarios are set up so that a mixture of real and synthetic samples is presented to the classifier. The GBDT classifier was trained in every scenario with stratified k-fold cross-validation (k=10), and its hyperparameters were optimized for the highest possible F1-measure with the random search algorithm for 20,000 epochs. A hold-out test set was prepared by randomly removing an equal number of real positive and negative samples from the training set and shuffling them. The GBDT classifier trained with 50% WGAN synthetic positive samples achieved a precision of 0.967 on the test set, far outperforming all other scenarios, including similar mixture scenarios using other oversampling strategies. These results indicate that data augmentation with WGAN to oversample the minority class might be an alternative to traditional oversampling techniques, and may even improve a classifier's precision.
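The bookkeeping behind the evaluation protocol can be illustrated with a minimal standard-library sketch. It is not the authors' code: the function names (`balanced_holdout`, `n_synthetic_needed`, `precision`), the toy label counts, and the reading of "50% synthetic" as the share of synthetic samples among all positive training samples are assumptions made for illustration only. The WGAN generator and the GBDT classifier themselves are not reproduced here.

```python
import random


def balanced_holdout(labels, n_per_class, seed=0):
    """Split off a shuffled test set holding n_per_class real samples per label.

    Returns (train_idx, test_idx) as lists of indices into `labels`.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    test_idx = rng.sample(pos, n_per_class) + rng.sample(neg, n_per_class)
    rng.shuffle(test_idx)
    held = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held]
    return train_idx, test_idx


def n_synthetic_needed(n_real_pos, synthetic_fraction):
    """Synthetic positives needed so they make up `synthetic_fraction`
    of all positive training samples (assumed meaning of the mixture ratio)."""
    if not 0 <= synthetic_fraction < 1:
        raise ValueError("synthetic_fraction must be in [0, 1)")
    return round(n_real_pos * synthetic_fraction / (1 - synthetic_fraction))


def precision(y_true, y_pred):
    """Precision = TP / (TP + FP); defined as 0.0 when nothing is predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if tp + fp else 0.0


# Toy imbalanced dataset (hypothetical counts): 20 positives, 180 negatives.
labels = [1] * 20 + [0] * 180
train_idx, test_idx = balanced_holdout(labels, n_per_class=5)

# Real positives left for training, and synthetic ones needed for a 50% mixture.
real_pos_train = sum(labels[i] for i in train_idx)
synthetic_count = n_synthetic_needed(real_pos_train, 0.5)
```

Under this reading, a 50% mixture simply doubles the positive class in the training set, while the hold-out set contains only real samples, so the reported precision is measured on real data alone.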