Proceedings of the 2019 International Conference on Mathematics, Big Data Analysis and Simulation and Modelling (MBDASM 2019)

A Novel Feature Selection Method Based on Category Distribution Ratio in Text Classification

Authors
Pujian Zong, Jian Bian
Corresponding Author
Pujian Zong
Available Online October 2019.
DOI
10.2991/mbdasm-19.2019.45
Keywords
text classification; feature selection; feature reduction
Abstract

In text classification, texts are represented as a high-dimensional, sparse matrix whose dimensionality equals the total number of distinct terms across all texts. Using every term for classification degrades both accuracy and efficiency. A feature selection algorithm selects the features most relevant to the text categories and thereby reduces the dimensionality of the text representation vector. In this paper, we propose a new feature ranking metric, the category distribution ratio (CDR), which takes into account a term's true positive rate, its false positive rate, and the difference between them when estimating the term's significance. To evaluate the effectiveness of the proposed feature selection algorithm, we compare its performance against six metrics (balanced accuracy measure (ACC), odds ratio (OR), Gini index (GI), max-min ratio (MMR), normalized difference measure (NDM), and chi-square (CHI)) on three benchmark data sets (20 Newsgroups, Ohsumed, Reuters-21578) using multinomial naive Bayes, support vector machine, and k-nearest neighbor classifiers. The experimental results show that the macro F1 score obtained with CDR feature selection is higher than that obtained with the other six metrics.
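The quantities the abstract builds on can be illustrated with a minimal sketch. It computes a term's true positive rate (the fraction of positive-class documents containing the term) and false positive rate (the fraction of negative-class documents containing it), then ranks terms by their absolute difference, which is the balanced accuracy measure used as one of the baselines. The CDR formula itself is not given on this page, so it is deliberately not implemented here; the function names, the set-of-terms document representation, and the toy data are all assumptions made for illustration.

```python
from typing import List, Set, Tuple


def term_rates(docs: List[Set[str]], labels: List[bool], term: str) -> Tuple[float, float]:
    """Return (tpr, fpr): the share of positive and of negative documents containing `term`."""
    pos = [d for d, y in zip(docs, labels) if y]
    neg = [d for d, y in zip(docs, labels) if not y]
    tp = sum(term in d for d in pos)  # positive documents that contain the term
    fp = sum(term in d for d in neg)  # negative documents that contain the term
    tpr = tp / len(pos) if pos else 0.0
    fpr = fp / len(neg) if neg else 0.0
    return tpr, fpr


def rank_by_balanced_accuracy(docs: List[Set[str]], labels: List[bool], k: int) -> List[str]:
    """Rank vocabulary terms by |tpr - fpr| (a baseline metric, not CDR) and keep the top k."""
    vocab = set().union(*docs)
    scores = {}
    for term in vocab:
        tpr, fpr = term_rates(docs, labels, term)
        scores[term] = abs(tpr - fpr)
    # Sort by descending score; break ties alphabetically so the result is deterministic.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus: two "sports" documents (positive) and two "finance" documents.
docs = [{"ball", "game"}, {"ball", "team"}, {"stock", "market"}, {"stock", "game"}]
labels = [True, True, False, False]
selected = rank_by_balanced_accuracy(docs, labels, k=2)
```

In this toy corpus, "game" appears equally often in both classes and scores zero, while "ball" and "stock" each appear in only one class and rank highest. A metric such as CDR would plug into the same ranking loop by replacing the score expression.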

Copyright
© 2019, the Authors. Published by Atlantis Press.
Open Access
This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).


Volume Title
Proceedings of the 2019 International Conference on Mathematics, Big Data Analysis and Simulation and Modelling (MBDASM 2019)
Series
Advances in Computer Science Research
Publication Date
October 2019
ISSN
2352-538X

Cite this article

TY  - CONF
AU  - Pujian Zong
AU  - Jian Bian
PY  - 2019/10
DA  - 2019/10
TI  - A Novel Feature Selection Method Based on Category Distribution Ratio in Text Classification
BT  - Proceedings of the 2019 International Conference on Mathematics, Big Data Analysis and Simulation and Modelling (MBDASM 2019)
PB  - Atlantis Press
SP  - 195
EP  - 200
SN  - 2352-538X
UR  - https://doi.org/10.2991/mbdasm-19.2019.45
DO  - 10.2991/mbdasm-19.2019.45
ID  - Zong2019/10
ER  -