Proceedings of the 7th International Conference on Education, Management, Information and Mechanical Engineering (EMIM 2017)

An Algorithm to Extract and Judge the Main Text Based on the Law of Total Probability

Authors
Qingsong Lv, Shulin Cao, Yifan Wang, Qian Yin, Xin Zheng
Corresponding Author
Qingsong Lv
Available Online April 2017.
DOI
10.2991/emim-17.2017.19
Keywords
Web page extraction; Web page classification; Law of total probability
Abstract

Because web pages vary widely in content and have complex structure, a uniform algorithm for processing them is of great value. In this paper, we propose an algorithm, called the P value algorithm, to extract the main text of a web page. By calculating the P value of each tag in an HTML page, we can locate the main text. Moreover, the P value of the web page as a whole can also represent the probability that the page has main text. Experiments show that, without any prior knowledge, the accuracy of main-text extraction is 95.42% and the accuracy of judging whether a page has main text is 93.98%.
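
The abstract does not give the P value formula, so the following is only a minimal sketch of how a per-tag score could be built on the law of total probability, not the authors' method. It assumes that each tag's score is the sum over its parts (direct text and child tags) of P(main | part) weighted by that part's share of the tag's characters, that link text has P(main) = 0 and other text has P(main) = 1, and that the main text is the largest tag whose score reaches a threshold. The names `Node`, `Builder`, `evaluate`, `locate_main_text`, and the `threshold` parameter are all hypothetical.

```python
from html.parser import HTMLParser

VOID = {"br", "img", "hr", "meta", "link", "input"}  # tags without a closing tag


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []
        self.in_link = tag == "a" or (parent is not None and parent.in_link)
        self.text = 0       # characters of direct text outside links
        self.link_text = 0  # characters of direct text inside links
        self.total = 0      # all characters under this tag (set by evaluate)
        self.p = 0.0        # illustrative score for this tag (set by evaluate)


class Builder(HTMLParser):
    """Builds a simple tag tree with text lengths attached to each tag."""

    def __init__(self):
        super().__init__()
        self.root = self.cur = Node("root")

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            node = Node(tag, self.cur)
            self.cur.children.append(node)
            self.cur = node

    def handle_endtag(self, tag):
        node = self.cur
        while node is not None and node.tag != tag:  # find the matching open tag
            node = node.parent
        if node is not None and node.parent is not None:
            self.cur = node.parent

    def handle_data(self, data):
        if self.cur.tag in ("script", "style"):
            return
        n = len(data.strip())
        if self.cur.in_link:
            self.cur.link_text += n
        else:
            self.cur.text += n


def evaluate(node):
    """Assumed rule: P(node) = sum over parts of P(main | part) * P(part | node)."""
    total = node.text + node.link_text
    weighted = node.text * 1.0 + node.link_text * 0.0  # assumed leaf probabilities
    for child in node.children:
        c_total, c_p = evaluate(child)
        total += c_total
        weighted += c_total * c_p
    node.total = total
    node.p = weighted / total if total else 0.0
    return total, node.p


def locate_main_text(html, threshold=0.9):
    """Return (page-level score, largest tag with score >= threshold or None)."""
    builder = Builder()
    builder.feed(html)
    builder.close()
    evaluate(builder.root)
    best, stack = None, [builder.root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        if node.p >= threshold and (best is None or node.total > best.total):
            best = node
    return builder.root.p, best


if __name__ == "__main__":
    page = ('<html><body><div><a href="/">Home page</a></div>'
            '<div><p>' + "x" * 100 + '</p></div></body></html>')
    page_p, main = locate_main_text(page)
    print(page_p, main.tag if main else None)
```

In this sketch the score of the root tag plays the role of the page-level value, mirroring the abstract's point that one quantity can serve both to locate the main text and to judge whether a page has any.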

Copyright
© 2017, the Authors. Published by Atlantis Press.
Open Access
This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).

Volume Title
Proceedings of the 7th International Conference on Education, Management, Information and Mechanical Engineering (EMIM 2017)
Series
Advances in Computer Science Research
Publication Date
April 2017
ISBN
10.2991/emim-17.2017.19
ISSN
2352-538X

Cite this article

TY  - CONF
AU  - Qingsong Lv
AU  - Shulin Cao
AU  - Yifan Wang
AU  - Qian Yin
AU  - Xin Zheng
PY  - 2017/04
DA  - 2017/04
TI  - An Algorithm to Extract and Judge the Main Text Based on the Law of Total Probability
BT  - Proceedings of the 7th International Conference on Education, Management, Information and Mechanical Engineering (EMIM 2017)
PB  - Atlantis Press
SP  - 93
EP  - 96
SN  - 2352-538X
UR  - https://doi.org/10.2991/emim-17.2017.19
DO  - 10.2991/emim-17.2017.19
ID  - Lv2017/04
ER  -