Chinese Words Segmentation Based on Double Hash Dictionary Running on Hadoop
- 10.2991/icectt-15.2015.88
- Search Engine, Words Segmentation, Hadoop, Double Hash
Word segmentation is an essential stage in building a search engine, and its quality directly affects search speed and precision. Traditional segmentation on a single PC can no longer meet the need when large amounts of data must be segmented, so a word segmentation tool that can handle big data is required. This study presents a Chinese word segmentation technique based on Hadoop. It combines a dictionary built with a double hash function, the forward maximum matching method, and MapReduce (MR) programming to perform parallel word segmentation on a distributed cluster, which greatly shortens segmentation time and increases efficiency. The approach provides a convenient and fast method for segmenting large quantities of text.
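The abstract does not spell out the dictionary layout, so the following is only a minimal sketch of how the pieces could fit together. It assumes a two-level hash keyed first by a word's leading character and then by its length, which is one common way to build a double hash dictionary and may differ from the authors' design. The names `build_dictionary`, `segment`, `map_line`, and the `max_len` parameter are hypothetical, and `map_line` only imitates the role of a MapReduce mapper in plain Python rather than using the Hadoop API.

```python
from collections import defaultdict


def build_dictionary(words):
    """Two-level hash (assumed layout): first character -> word length -> set of words."""
    index = defaultdict(lambda: defaultdict(set))
    for word in words:
        if word:
            index[word[0]][len(word)].add(word)
    return index


def segment(text, index, max_len=4):
    """Forward maximum matching: at each position, take the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        # First hash: narrow the search to words starting with this character.
        by_length = index.get(text[i], {})
        match = text[i]  # fall back to a single character if nothing matches
        # Second hash: try candidate lengths from longest to shortest.
        for length in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in by_length.get(length, ()):
                match = candidate
                break
        tokens.append(match)
        i += len(match)
    return tokens


def map_line(line, index):
    """Mapper-style step (hypothetical): emit (word, 1) pairs for one input line."""
    return [(token, 1) for token in segment(line.strip(), index)]
```

With the dictionary `["研究", "研究生", "生命", "起源"]`, `segment("研究生命起源", ...)` yields `["研究生", "命", "起源"]`, which shows the greedy longest-match behaviour of the forward method. In a cluster setting, each mapper would apply a function like `map_line` to its own share of the input lines, which is where the parallel speed-up would come from.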
- © 2015, the Authors. Published by Atlantis Press.
- Open Access
- This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
Cite this article
TY - CONF
AU - Chao Feng
AU - Baoan Li
PY - 2015/11
DA - 2015/11
TI - Chinese Words Segmentation Based on Double Hash Dictionary Running on Hadoop
BT - Proceedings of the 2015 International Conference on Electromechanical Control Technology and Transportation
PB - Atlantis Press
SP - 461
EP - 465
SN - 2352-5401
UR - https://doi.org/10.2991/icectt-15.2015.88
DO - 10.2991/icectt-15.2015.88
ID - Feng2015/11
ER -