Optimization of the REED web application for automatic outlier detection in predictive medicine
Files
- Lambot_49301400_2017.pdf (Adobe PDF, 1.66 MB, open access)
Details
- Abstract
This master's thesis proposes a solution to improve the outlier detection method currently used by REED, the online tool developed by DNAlytics that performs a preliminary, fully automated analysis of uploaded medical datasets to determine whether the data is interesting enough to justify the cost of a full, more in-depth manual analysis. An outlier is a sample whose behavior indicates that it may not come from the same distribution as the normal observations; outliers affect any analysis performed on a dataset, so they usually need to be identified and handled appropriately. Detecting outliers in medical datasets is not trivial because of the non-normal behavior of some features, the very high dimensionality, and the small number of samples usually available. This work explores the techniques found in the literature and tries to find a solution that is both robust and more efficient than the one currently implemented.

Experiments were performed on public datasets (treating the minority class as outliers) and on simulated datasets (treating the last percentile of the distribution as outliers). Although ensemble methods usually do not perform as well as the best base detectors, they show a stabilizing effect, which suggests that ensembles are a good choice for an unsupervised setting requiring robustness, such as REED. Moreover, the improvement offered by bagging seems very promising, as expected given the strong impact of the curse of dimensionality.

Unfortunately, most of our experiments rely on synthetic datasets because of the lack of good unsupervised assessment methods. The literature offers no real leads on how to tackle the absence of ground truth, which prevents us from drawing strong conclusions from the unsupervised experiments.
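To illustrate the feature-bagging idea the abstract refers to, the sketch below builds an ensemble of simple kNN-distance outlier detectors, each trained on a random subset of features, and averages their normalized scores. This is a minimal illustration, not the thesis's actual implementation: the base detector, subset-size rule, and synthetic data (Gaussian inliers plus a few shifted points) are assumptions chosen for clarity.

```python
import numpy as np

def knn_outlier_scores(X, k=5):
    """Score each sample by the distance to its k-th nearest neighbour."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbour
    return np.sort(d, axis=1)[:, k - 1]

def feature_bagging_scores(X, k=5, n_estimators=10, seed=0):
    """Average z-normalized kNN scores over random feature subsets."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    scores = np.zeros(X.shape[0])
    for _ in range(n_estimators):
        # subset size between d/2 and d-1 features (a common bagging choice)
        size = rng.integers(n_features // 2, n_features)
        cols = rng.choice(n_features, size=size, replace=False)
        s = knn_outlier_scores(X[:, cols], k)
        scores += (s - s.mean()) / s.std()   # normalize before combining
    return scores / n_estimators

# Synthetic data: 200 inliers in 50 dimensions plus 5 shifted outliers
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))
X[:5] += 3.0                              # plant outliers at indices 0..4
scores = feature_bagging_scores(X, k=5)
top5 = set(np.argsort(scores)[-5:])       # highest scores = most anomalous
```

Averaging z-normalized scores (rather than raw distances) keeps detectors trained on subsets of different sizes commensurable, which is one reason such ensembles behave as the stabilizing factor described above.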