Chinese Words Segmentation Based on Double Hash Dictionary Running on Hadoop

Chao Feng; Baoan Li

doi:10.2991/icectt-15.2015.88

Corresponding Author

Chao Feng

Available Online November 2015.

DOI: 10.2991/icectt-15.2015.88
Keywords: Search Engine, Words Segmentation, Hadoop, Double Hash
Abstract: Word segmentation is an essential stage in building a search engine, and its quality directly affects search speed and precision. Segmenting large amounts of data requires a tool that can handle big data, because traditional segmentation on a single PC can no longer meet this need. This study presents a Chinese word segmentation technique based on Hadoop. It combines a dictionary built with a double hash function, the maximum forward successive matching method, and MapReduce (MR) programming to perform word segmentation in parallel on a distributed cluster, which greatly shortens processing time and increases efficiency. The approach provides a convenient and fast way to segment large volumes of text.
Copyright: © 2015, the Authors. Published by Atlantis Press.
Open Access: This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
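
The dictionary lookup and matching step named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the hash functions or the dictionary layout, so everything here is an assumption: the two-level index (first character, then word length), the function names build_double_hash_dict and forward_max_match, and the reading of "maximum forward successive matching" as the standard greedy longest-match-first scan. It is not the authors' implementation.

```python
from collections import defaultdict


def build_double_hash_dict(words):
    """Build a two-level index: first character -> word length -> set of words.

    Assumed layout for illustration; the paper's actual hash design
    is not described in the abstract.
    """
    index = defaultdict(lambda: defaultdict(set))
    for word in words:
        if word:
            index[word[0]][len(word)].add(word)
    return index


def forward_max_match(text, index):
    """Segment text left to right, taking the longest dictionary word at each position."""
    tokens = []
    pos = 0
    while pos < len(text):
        # First-level lookup by the current character (no entry -> empty mapping).
        by_length = index.get(text[pos], {})
        match = text[pos]  # fall back to a single character
        # Second-level lookup: try candidate lengths from longest to shortest.
        for length in sorted(by_length, reverse=True):
            candidate = text[pos:pos + length]
            if candidate in by_length[length]:
                match = candidate
                break
        tokens.append(match)
        pos += len(match)
    return tokens


if __name__ == "__main__":
    dictionary = build_double_hash_dict(["研究", "研究生", "生命", "起源"])
    print(forward_max_match("研究生命起源", dictionary))
```

With this toy dictionary the scan yields 研究生, 命, 起源: the greedy rule takes the longest word first even where a shorter one would give a better split. In a MapReduce setting, each mapper could plausibly load such a dictionary once and call forward_max_match on every input line, though how the paper structures its jobs is not stated in the abstract.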

Volume Title: Proceedings of the 2015 International Conference on Electromechanical Control Technology and Transportation
Series: Advances in Engineering Research
Publication Date: November 2015
ISBN: 978-94-6252-124-7
ISSN: 2352-5401

Cite this article

TY  - CONF
AU  - Chao Feng
AU  - Baoan Li
PY  - 2015/11
DA  - 2015/11
TI  - Chinese Words Segmentation Based on Double Hash Dictionary Running on Hadoop
BT  - Proceedings of the 2015 International Conference on Electromechanical Control Technology and Transportation
PB  - Atlantis Press
SP  - 461
EP  - 465
SN  - 2352-5401
UR  - https://doi.org/10.2991/icectt-15.2015.88
DO  - 10.2991/icectt-15.2015.88
ID  - Feng2015/11
ER  -
