Leonardo S. Paulucio, Thiago M. Paixão, Rodrigo F. Berriel, Alberto F. De Souza, Claudine Badue and Thiago Oliveira-Santos
Paper published in 2020 International Joint Conference on Neural Networks (IJCNN)
DOI: 10.1109/IJCNN48605.2020.9207093.
Natural Language Processing (NLP) has received increasing attention in the past few years. In part, this is due to the huge volume of data made available every day on the internet, which has increased the need for automatic tools capable of analyzing and extracting relevant information, especially from text. In this context, text classification has become one of the most studied tasks in the NLP domain. The objective is to assign predefined categories or labels to texts or sentences. Important applications include sentence classification, sentiment analysis, and spam detection, among many others. This work proposes an automatic system for categorizing products using only their titles. The proposed system employs a state-of-the-art deep neural network to extract features from the titles, which are then used as input to different machine learning models. The system is evaluated on the large-scale Mercado Libre dataset, which has characteristics common to real-world problems, such as imbalanced classes and unreliable labels, in addition to a large number of samples: 20,000,000 in total. The results show that the proposed system correctly categorizes the products with a balanced accuracy of 86.57% on the local test split of the Mercado Libre dataset. It also surpassed the fourth place on the public ranking of the MeLi Data Challenge with a balanced accuracy of 91.19%, less than 1% behind the winner.
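Because the dataset has imbalanced classes, the results are reported as balanced accuracy. The snippet below is a minimal sketch of that metric under its usual definition (the mean of per-class recall); it is not the authors' evaluation code, and the function name `balanced_accuracy` and the toy category labels are assumptions made for illustration.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, computed over the classes present in y_true.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1

    # Every class contributes equally, regardless of how many samples it has.
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy example (made-up labels): a classifier that always predicts the
# majority category looks good on plain accuracy but not on balanced accuracy.
y_true = ["PHONES"] * 8 + ["BOOKS"] * 2
y_pred = ["PHONES"] * 10

plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
print(f"plain accuracy:    {plain:.2f}")
print(f"balanced accuracy: {balanced:.2f}")
```

In the toy example, the rare category is never predicted, so its recall of zero pulls the balanced score well below the plain accuracy. This is why the metric suits a dataset with imbalanced classes.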
- Mercado Libre Data Challenge: Link
- Model weights: Download
- Local splits: Splits created from the original Mercado Libre train set to perform the model fine-tuning: Download
- Processed splits: Splits after the preprocessing stage: Download
Folds:
- Folds created from both train and validation local splits: Download
- Folds created from all data: Download
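This page does not state how the folds were built or how many there are. As a rough sketch of one common approach for imbalanced data, the snippet below assigns samples to folds per category so that each fold keeps a similar class mix. The function name `stratified_folds`, the choice of five folds, and the toy labels are all assumptions for illustration, not a description of the downloadable files.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Return a fold index for every sample, balancing classes across folds.

    Hypothetical helper: the real folds may have been created differently.
    """
    rng = random.Random(seed)

    # Group sample indices by their category label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    fold_of = [None] * len(labels)
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal the shuffled samples of this class round-robin over the folds.
        for position, idx in enumerate(indices):
            fold_of[idx] = position % n_folds
    return fold_of


# Toy usage with made-up categories: 10 samples of "A" and 5 of "B".
labels = ["A"] * 10 + ["B"] * 5
folds = stratified_folds(labels, n_folds=5)
for k in range(5):
    members = [labels[i] for i, f in enumerate(folds) if f == k]
    print(k, sorted(members))
```

With this assignment, each fold can serve in turn as the held-out part while the remaining folds are used for training.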
This study was financed in part by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brasil (CAPES) – Finance Code 001; Conselho Nacional de Desenvolvimento Científico e Tecnológico – Brasil (CNPq) – grants 311654/2019-3, 200864/2019-0 and 311504/2017-5; and Fundação de Amparo à Pesquisa do Espírito Santo (FAPES), Brazil – grant 84304057/18.
If you find this useful, consider citing:
@INPROCEEDINGS{paulucio2020ijcnn,
author = {L. S. {Paulucio}
and T. M. {Paixão}
and R. F. {Berriel}
and A. F. {De Souza}
and C. {Badue}
and T. {Oliveira-Santos}},
booktitle={2020 International Joint Conference on Neural Networks (IJCNN)},
title={Product Categorization by Title Using Deep Neural Networks as Feature Extractor},
year={2020},
pages={1-7},
}