General Simhash-based Framework for News Aggregators
Pengcheng Hu, Xiangdong You
Available Online June 2016.
- https://doi.org/10.2991/mecs-17.2017.152
- News Aggregator, Simhash, Deduplication, News Recommendation, Breaking Event Detection.
- News aggregators typically index billions of news articles from the Internet and try to recommend news according to readers' intrinsic interests. Retrieval of similar news, deduplication, and event detection are common problems in aggregator systems, and related work has been reported. We propose a general simhash-based framework for news aggregators. The system does not need to process crawled news separately for retrieval, deduplication, and event detection; each piece of news is processed only once and requires no extra storage space. Duplicates and breaking events can be detected online, before newly crawled news is stored in the system's database. Machine learning is widely used in news aggregators for tasks such as topic classification, where each piece of news is mapped into a fixed-length feature vector. Simhash fingerprints are generated from these feature vectors rather than from the original text of the news, so news retrieval, deduplication, and breaking news detection can be integrated into any running aggregator system without extra effort. Our aggregator collected around 9.6 million news articles from the Internet, and the framework functions well in a real-world scenario.
- Open Access
- This is an open access article distributed under the CC BY-NC license.
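The abstract describes generating simhash fingerprints from fixed-length feature vectors and using them for deduplication. The following is a minimal sketch of that general idea, not the authors' implementation: it assumes a 64-bit fingerprint, hashes each feature index with MD5, and compares fingerprints by Hamming distance. The function names and the `max_distance=3` threshold are hypothetical choices made for illustration only.

```python
import hashlib

BITS = 64  # assumed fingerprint width; the abstract does not state one


def feature_hash(index: int) -> int:
    """Map a feature index to a stable 64-bit integer (illustrative choice of MD5)."""
    digest = hashlib.md5(str(index).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(vector) -> int:
    """Compute a simhash fingerprint from a fixed-length numeric feature vector."""
    totals = [0.0] * BITS
    for index, weight in enumerate(vector):
        if weight == 0:
            continue  # zero-weight features do not influence any bit
        h = feature_hash(index)
        for bit in range(BITS):
            # Add the weight where the feature's hash bit is 1, subtract it otherwise.
            if (h >> bit) & 1:
                totals[bit] += weight
            else:
                totals[bit] -= weight
    fingerprint = 0
    for bit, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(vec_a, vec_b, max_distance: int = 3) -> bool:
    """Treat two items as near-duplicates if their fingerprints are close.

    The threshold is a hypothetical value, not one taken from the paper.
    """
    return hamming_distance(simhash(vec_a), simhash(vec_b)) <= max_distance
```

Because the fingerprint depends only on the sign of each accumulated total, scaling a feature vector by a positive constant leaves its fingerprint unchanged, and similar vectors tend to produce fingerprints that differ in few bits. A real system would also need an index for finding fingerprints within a small Hamming distance without comparing every pair.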
Cite this article
TY  - CONF
AU  - Pengcheng Hu
AU  - Xiangdong You
PY  - 2016/06
DA  - 2016/06
TI  - General Simhash-based Framework for News Aggregators
BT  - Proceedings of the 2017 2nd International Conference on Machinery, Electronics and Control Simulation (MECS 2017)
PB  - Atlantis Press
SP  - 310
EP  - 315
SN  - 2352-5401
UR  - https://doi.org/10.2991/mecs-17.2017.152
DO  - https://doi.org/10.2991/mecs-17.2017.152
ID  - Hu2016/06
ER  -