Proceedings of the 2017 2nd International Conference on Machinery, Electronics and Control Simulation (MECS 2017)

General Simhash-based Framework for News Aggregators

Authors
Pengcheng Hu, Xiangdong You
Corresponding Author
Pengcheng Hu
Available Online June 2016.
DOI
10.2991/mecs-17.2017.152
Keywords
News Aggregator, Simhash, Deduplication, News Recommendation, Breaking Event Detection.
Abstract

A news aggregator typically indexes billions of news items from the Internet and tries to recommend news according to readers' intrinsic interests. Retrieval of similar news, deduplication, and event detection are common problems in aggregator systems, and related work is reported in [1], [2], [3], [4] and [5]. We propose a general simhash-based framework for news aggregators. The system does not need to process crawled news separately for retrieval, deduplication, and event detection: each piece of news is processed only once and requires no extra storage space. Duplicates and breaking events can be detected online, before newly crawled news is stored in the system's database. Machine learning is widely used in news aggregators for tasks such as topic classification, in which each piece of news is mapped to a fixed-length feature vector. Simhash fingerprints are generated from these feature vectors rather than from the original text of the news, so news retrieval, deduplication, and breaking news detection can be integrated into any running aggregator system without extra effort. Our aggregator has collected around 9.6 million news items from the Internet, and the framework functions well in a real-world scenario.
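
The idea of fingerprinting a fixed-length feature vector and comparing fingerprints by Hamming distance can be illustrated with a minimal sketch. This is not the authors' implementation: the 64-bit fingerprint width, the BLAKE2b hash of each feature index, the function names, and the distance threshold of 3 bits are all assumptions chosen for illustration.

```python
import hashlib


def _feature_hash(index: int, bits: int = 64) -> int:
    """Hash a feature index to a `bits`-wide integer (hash choice is an assumption)."""
    digest = hashlib.blake2b(
        index.to_bytes(8, "little"), digest_size=bits // 8
    ).digest()
    return int.from_bytes(digest, "little")


def simhash(features, bits: int = 64) -> int:
    """Compute a simhash fingerprint from a fixed-length feature vector.

    Each non-zero feature votes on every bit position: it adds its weight
    where its hash has a 1 bit and subtracts its weight where it has a 0 bit.
    The fingerprint keeps the bits whose total vote is positive.
    """
    totals = [0.0] * bits
    for index, weight in enumerate(features):
        if weight == 0:
            continue
        h = _feature_hash(index, bits)
        for b in range(bits):
            totals[b] += weight if (h >> b) & 1 else -weight
    fingerprint = 0
    for b, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << b
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: int, b: int, max_distance: int = 3) -> bool:
    """Treat fingerprints within `max_distance` bits as near-duplicates.

    The default threshold is an illustrative assumption, not a value
    taken from the paper.
    """
    return hamming_distance(a, b) <= max_distance
```

Because only the sign of each accumulated vote matters, scaling a feature vector by a positive constant leaves its fingerprint unchanged, and a newly crawled item could in principle be compared against stored fingerprints before it is written to the database.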

Copyright
© 2017, the Authors. Published by Atlantis Press.
Open Access
This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).


Volume Title
Proceedings of the 2017 2nd International Conference on Machinery, Electronics and Control Simulation (MECS 2017)
Series
Advances in Engineering Research
Publication Date
June 2016
ISBN
978-94-6252-352-4
ISSN
2352-5401

Cite this article

TY  - CONF
AU  - Pengcheng Hu
AU  - Xiangdong You
PY  - 2016/06
DA  - 2016/06
TI  - General Simhash-based Framework for News Aggregators
BT  - Proceedings of the 2017 2nd International Conference on Machinery, Electronics and Control Simulation (MECS 2017)
PB  - Atlantis Press
SP  - 310
EP  - 315
SN  - 2352-5401
UR  - https://doi.org/10.2991/mecs-17.2017.152
DO  - 10.2991/mecs-17.2017.152
ID  - Hu2016/06
ER  -