Resolution of the big-data problem related to a dimension reduction algorithm based on multi-scale similarities in stochastic neighbor embedding

Supervisors: Verleysen, Michel ; Lee, John
Faculty: Ecole polytechnique de Louvain
Degree label: Master [120] in Computer Science and Engineering
Abstract: Data visualization is a long-standing need, which makes dimensionality reduction an important part of machine learning. One of the best algorithms for data visualization is multi-scale stochastic neighbor embedding (Ms. SNE). However, its time complexity of O(N^2 log N) makes it unsuitable for large databases. To address this Big Data problem, this work proposes an accelerated version of Ms. SNE. It uses metric trees to approximate the data cloud by clusters, reducing the time complexity to O(N log^2 N). This line of research is new and the resulting solution is not yet perfect, but the results show that the approximations added to the original algorithm allow the code to run on larger databases with minimal loss of precision.
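
As a rough illustration of the clustering idea mentioned in the abstract, and not the thesis's actual implementation, the sketch below builds a simple metric tree (nested balls, each with a center and radius) and uses it to approximate a sum of Gaussian similarities between one query point and all data points. A ball that is small relative to its distance from the query is replaced by its center, weighted by its point count. All names here (`Node`, `build_tree`, `kernel_sum`, `leaf_size`, `theta`, `sigma`) are hypothetical, the acceptance rule `radius <= theta * distance` is an assumed Barnes-Hut-style criterion, and the bandwidths in the loop are arbitrary values chosen only to show several scales.

```python
import math
import random


class Node:
    """One ball of the tree: center, radius, point count, optional children."""

    def __init__(self, points):
        dim = len(points[0])
        self.count = len(points)
        self.center = [sum(p[d] for p in points) / self.count for d in range(dim)]
        self.radius = max(math.dist(self.center, p) for p in points)
        self.points = points      # kept only in leaves (cleared after a split)
        self.children = []


def build_tree(points, leaf_size=8):
    """Recursively split the points into two balls around far-apart pivots."""
    node = Node(points)
    if node.count <= leaf_size or node.radius == 0.0:
        return node
    # Two pivots: the point farthest from an arbitrary one, then the farthest from that.
    a = max(points, key=lambda p: math.dist(points[0], p))
    b = max(points, key=lambda p: math.dist(a, p))
    left = [p for p in points if math.dist(p, a) <= math.dist(p, b)]
    right = [p for p in points if math.dist(p, a) > math.dist(p, b)]
    node.children = [build_tree(left, leaf_size), build_tree(right, leaf_size)]
    node.points = []
    return node


def kernel_sum(node, query, sigma, theta=0.5):
    """Approximate the sum of exp(-||query - p||^2 / (2 sigma^2)) over the node's points."""
    d = math.dist(query, node.center)
    if node.radius <= theta * d:
        # Tight, distant ball: pretend all of its points sit at the center.
        return node.count * math.exp(-d * d / (2.0 * sigma * sigma))
    if not node.children:
        # Leaf that is too close to summarize: evaluate its points exactly.
        return sum(
            math.exp(-math.dist(query, p) ** 2 / (2.0 * sigma * sigma))
            for p in node.points
        )
    return sum(kernel_sum(child, query, sigma, theta) for child in node.children)


random.seed(0)
data = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(500)]
root = build_tree(data)
query = data[0]

for sigma in (0.5, 1.0, 2.0):  # arbitrary bandwidths, one per "scale"
    approx = kernel_sum(root, query, sigma)
    exact = kernel_sum(root, query, sigma, theta=0.0)  # theta=0 disables the approximation
    print(f"sigma={sigma}: approx={approx:.3f} exact={exact:.3f}")
```

Setting `theta` to 0 makes the traversal visit every point, so it doubles as an exact reference; larger values trade accuracy for fewer visited nodes. The sketch does not demonstrate the complexity figures quoted in the abstract.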