
Feature embedding techniques for enhancing deep learning models on tabular data

(2024)

Files

Tchoupe_18442100_2024.pdf
  • Open access
  • Adobe PDF
  • 1.78 MB

Details

Abstract
Many datasets are stored in tabular form, where rows correspond to samples and columns to attributes or features. While deep learning models excel on other data modalities such as images or text, they often underperform on tabular data compared to traditional models such as gradient-boosted tree ensembles (e.g., XGBoost). This is partly due to the heterogeneous nature of tabular features: each feature may represent a different variable, either categorical or numerical, which complicates the learning process for neural network-based models. An important step in any deep learning pipeline for tabular data is therefore to encode categorical features in a form the network can use and to project all features into a more homogeneous space that is better suited to learning. A common approach is to one-hot encode categorical features and let the first few layers of the network learn suitable feature representations in a supervised manner during training. Self-supervised learning has led to considerable performance gains in natural language processing: a decade ago with static embedding techniques such as Word2Vec, and more recently with contextual embedding techniques that are key to the success of large language models like GPT. The goal of this thesis is to explore such self-supervised feature embedding techniques for tabular data and to evaluate their impact on the performance of both simple deep learning architectures (e.g., MLPs) and more advanced Transformer-style architectures across various downstream tasks. The primary focus is on understanding how these feature embedding techniques affect the learning process and how they can be leveraged to improve performance. We have developed a self-supervised embedding technique inspired by Word2Vec and, through our experiments, obtained insights that can aid the development of better deep learning models for tabular data.
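
To make the idea concrete, below is a minimal, hypothetical sketch of how a Word2Vec-style self-supervised embedding could be applied to a table; it is not the embedding technique developed in the thesis. Each row is treated as a "sentence" of column=value tokens, and a skip-gram model is trained so that values co-occurring in the same rows receive nearby vectors. The toy DataFrame, column names, and hyperparameters are all illustrative assumptions.

```python
# Sketch only: Word2Vec-style embedding of tabular feature values
# (illustrative, not the thesis's actual method).
import pandas as pd
from gensim.models import Word2Vec

# Toy table with categorical and binned numerical features (assumed example).
df = pd.DataFrame({
    "color":     ["red", "blue", "red", "green"],
    "size":      ["S", "M", "M", "L"],
    "price_bin": ["low", "high", "low", "high"],
})

# Each row becomes a sequence of "column=value" tokens.
sentences = [
    [f"{col}={val}" for col, val in row.items()]
    for _, row in df.iterrows()
]

# Skip-gram Word2Vec over rows; the window spans the whole row so every
# pair of values within a row acts as a (target, context) pair.
model = Word2Vec(
    sentences,
    vector_size=16,            # embedding dimension (illustrative)
    window=len(df.columns),
    min_count=1,
    sg=1,                      # skip-gram objective
    epochs=50,
)

# The learned vector for a feature value can replace its one-hot encoding.
print(model.wv["color=red"])
```

Vectors learned this way could then be concatenated or averaged per row and fed to an MLP or Transformer-style model in place of one-hot inputs, which is the general setting the thesis evaluates.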