Proceedings of the 2015 4th National Conference on Electrical, Electronics and Computer Engineering

Research on Feature Selection and kNN Classification Method in Chinese Text Classification

Authors
Chao Xiao, Ping Wu
Corresponding Author
Chao Xiao
Available Online December 2015.
DOI
10.2991/nceece-15.2016.172
Keywords
Chinese text classification; feature selection; text similarity; kNN; unbalanced degree of term distribution
Abstract

Researchers in China and abroad have studied feature selection methods for Chinese text classification extensively, including document frequency (DF), information gain (IG), and the chi-square test (CHI). Building on this work, we propose a new selection method based on the unbalanced degree of term distribution, compare it with the other feature selection methods using the k-nearest-neighbor (kNN) algorithm, and find that it performs as well as CHI and IG. Experiments show that, whichever feature selection method is chosen, the gain in classification accuracy becomes very slight once the number of features reaches a certain value. Increasing the feature dimension further hardly improves classification performance, while the time consumed doubles. We therefore attempt to improve the kNN method by computing text similarity differently. The improved method represents each feature's weight as a bit string and computes the similarity of two documents from their bit-level representations, which remarkably reduces both the space required to store documents and the time needed to compute their similarity. Experiments confirm that the new kNN method greatly accelerates classification at the cost of a small loss in classification accuracy.
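
The abstract does not specify how weights are encoded as bits or how the bit-level similarity is defined, so the following is only a minimal sketch of the general idea under stated assumptions: each feature weight is reduced to a single bit (set when the weight exceeds a threshold), a document is packed into one Python integer, and similarity is taken as the Jaccard overlap of set bits. The names `to_bits`, `bit_similarity`, and `knn_predict`, the threshold, and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def to_bits(weights, threshold=0.0):
    """Pack a weight vector into an int; bit i is set when weights[i] > threshold.

    Assumption: one bit per feature. The paper may use more bits per weight.
    """
    bits = 0
    for i, w in enumerate(weights):
        if w > threshold:
            bits |= 1 << i
    return bits


def bit_similarity(a, b):
    """Jaccard overlap of set bits (an assumed measure); 0.0 if both are empty."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union


def knn_predict(train, query_bits, k=3):
    """Majority vote among the k training documents most similar to the query.

    `train` is a list of (bits, label) pairs.
    """
    ranked = sorted(
        train,
        key=lambda item: bit_similarity(item[0], query_bits),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy corpus over four selected features (illustrative weights only).
train = [
    (to_bits([0.9, 0.4, 0.0, 0.0]), "sports"),
    (to_bits([0.7, 0.0, 0.1, 0.0]), "sports"),
    (to_bits([0.0, 0.0, 0.8, 0.6]), "finance"),
    (to_bits([0.0, 0.2, 0.5, 0.9]), "finance"),
]
query = to_bits([0.8, 0.3, 0.0, 0.0])
predicted = knn_predict(train, query, k=3)
```

In this sketch each document occupies a single integer rather than a vector of floating-point weights, and a comparison needs only a bitwise AND, a bitwise OR, and two bit counts. This illustrates why a bit-level representation can save space and time while discarding some of the information carried by the original weights.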

Copyright
© 2016, the Authors. Published by Atlantis Press.
Open Access
This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).

Volume Title
Proceedings of the 2015 4th National Conference on Electrical, Electronics and Computer Engineering
Series
Advances in Engineering Research
Publication Date
December 2015
ISSN
2352-5401

Cite this article

TY  - CONF
AU  - Chao Xiao
AU  - Ping Wu
PY  - 2015/12
DA  - 2015/12
TI  - Research on Feature Selection and kNN Classification Method in Chinese Text Classification
BT  - Proceedings of the 2015 4th National Conference on Electrical, Electronics and Computer Engineering
PB  - Atlantis Press
SP  - 956
EP  - 962
SN  - 2352-5401
UR  - https://doi.org/10.2991/nceece-15.2016.172
DO  - 10.2991/nceece-15.2016.172
ID  - Xiao2015/12
ER  -