IMPACT OF NOISE ON THE PERFORMANCE OF MACHINE LEARNING MODELS: A CASE STUDY WITH THE IRIS DATASET

Authors

Abstract

This paper evaluates the robustness of four classical classification algorithms, K-Nearest Neighbors (KNN), Decision Tree, Random Forest, and Support Vector Machine (SVM), under noise in the Iris dataset, a benchmark widely used in previous research. Five noise levels (0% to 40%) were applied to the dataset's numerical attributes, simulating realistic variation in input data. The models were implemented in Python using the scikit-learn, numpy, pandas, matplotlib, seaborn, and statsmodels libraries. Evaluation was performed through cross-validation, using accuracy, F1-score, and recall as metrics, and the results were subjected to ANOVA and Tukey statistical tests to check for significant differences between the algorithms' performances. All models achieved high performance on clean data, with KNN and SVM standing out. As noise increased, Decision Tree showed the greatest sensitivity, while Random Forest and SVM remained more stable. However, the statistical tests identified no significant differences between the models at any noise level (p > 0.62), suggesting that the observed practical variations are not statistically significant. The paper reinforces the usefulness of the Iris dataset as a benchmark and highlights the importance of statistical analysis when assessing the robustness of classification algorithms on noisy data.
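The pipeline described in the abstract can be sketched as below. This is a minimal illustration, not the authors' code: it assumes the noise percentages mean zero-mean Gaussian noise with standard deviation equal to that fraction of each attribute's standard deviation (the abstract does not specify the noise model), uses 5-fold cross-validated accuracy only, and runs the ANOVA step with scipy's `f_oneway` (Tukey's test could follow via statsmodels' `pairwise_tukeyhsd`).

```python
import numpy as np
from scipy.stats import f_oneway
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def add_noise(X, level, rng):
    # Assumed noise model: Gaussian noise scaled by each feature's std dev.
    return X + rng.normal(0.0, level * X.std(axis=0), X.shape)


X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)

models = {
    "KNN": KNeighborsClassifier(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
}

for level in (0.0, 0.1, 0.2, 0.3, 0.4):
    Xn = add_noise(X, level, rng)
    # 5-fold cross-validated accuracy per model at this noise level.
    scores = {
        name: cross_val_score(m, Xn, y, cv=5, scoring="accuracy")
        for name, m in models.items()
    }
    # One-way ANOVA over the four models' fold scores.
    _, p = f_oneway(*scores.values())
    means = {name: round(s.mean(), 3) for name, s in scores.items()}
    print(f"noise={level:.0%}  {means}  ANOVA p={p:.3f}")
```

With clean data (0% noise) all four classifiers score near ceiling on Iris, so small mean differences between them are expected to be statistically indistinguishable, which is consistent with the paper's p > 0.62 finding.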

Published

2026-04-27