An End-to-End Hybrid Document Intelligence System for Unstructured and Semi-Structured Data
- DOI
- 10.2991/978-94-6239-693-7_103
- Keywords
- Document Intelligence; OCR; EasyOCR; Unstructured Data; LLM; Information Extraction; Receipt Digitization
- Abstract
Extracting essential information from unstructured and mixed-format documents is a major challenge for many organizations. Documents such as scanned receipts, invoices, and native PDFs often contain noise, uneven layouts, faded text, and formatting differences, which make it difficult for traditional OCR and rule-based systems to read them accurately. In this work, we present a complete end-to-end Document Intelligence system that cleans and preprocesses documents, uses EasyOCR to extract text, groups tokens into lines and columns, and then applies rules to pull out key fields. For digital PDFs and Word files, the system extracts clean text and generates meaningful summaries using an LLM. The system also includes MongoDB for storing results, a FastAPI backend for processing, a user-friendly web interface, and a text-to-speech feature. Testing on real scanned receipts and native documents shows that the system extracts information more accurately, works better on low-quality inputs, and answers user questions reliably. Overall, this hybrid AI pipeline is extensible and useful for automating document understanding across different types of files.
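The line-grouping and rule-based extraction steps mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes tokens arrive in the `(bbox, text, confidence)` shape that EasyOCR's `Reader.readtext` returns, and the sample tokens, the `y_tolerance` threshold, the function names, and the `TOTAL_RULE` pattern are all hypothetical choices made for illustration.

```python
import re
from statistics import mean

# Hypothetical tokens in the (bbox, text, confidence) shape returned by
# EasyOCR's Reader.readtext; each bbox is four [x, y] corner points.
SAMPLE_TOKENS = [
    ([[10, 10], [90, 10], [90, 30], [10, 30]], "ACME", 0.98),
    ([[100, 12], [180, 12], [180, 32], [100, 32]], "STORE", 0.97),
    ([[120, 61], [180, 61], [180, 81], [120, 81]], "12.50", 0.95),
    ([[10, 60], [80, 60], [80, 80], [10, 80]], "TOTAL", 0.96),
]


def group_into_lines(tokens, y_tolerance=10.0):
    """Group tokens into text lines by vertical center, then order by x."""
    items = []
    for bbox, text, _conf in tokens:
        y_center = mean(point[1] for point in bbox)
        x_left = min(point[0] for point in bbox)
        items.append((y_center, x_left, text))
    items.sort()  # top to bottom, then left to right

    lines = []  # each line is a list of (y_center, x_left, text)
    for item in items:
        # Join the current line if the token sits close to its mean height.
        if lines and abs(item[0] - mean(t[0] for t in lines[-1])) <= y_tolerance:
            lines[-1].append(item)
        else:
            lines.append([item])

    return [
        " ".join(t[2] for t in sorted(line, key=lambda t: t[1]))
        for line in lines
    ]


# Assumed rule: the word "total" followed by an amount on the same line.
TOTAL_RULE = re.compile(r"\btotal\b\D*(\d+(?:[.,]\d{2})?)", re.IGNORECASE)


def extract_total(lines):
    """Return the amount from the last line matching TOTAL_RULE, or None."""
    for line in reversed(lines):
        match = TOTAL_RULE.search(line)
        if match:
            return float(match.group(1).replace(",", "."))
    return None


if __name__ == "__main__":
    receipt_lines = group_into_lines(SAMPLE_TOKENS)
    print(receipt_lines)
    print(extract_total(receipt_lines))
```

Searching from the bottom of the receipt is one possible heuristic, since totals usually appear near the end. The word boundary in the pattern keeps a line such as "SUBTOTAL" from matching.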
- Copyright
- © 2026 The Author(s)
- Open Access
- Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
Cite this article
TY - CONF
AU - Swapnaja Yadav
AU - Soleha Tamboli
AU - Shaivi Jaiswal
AU - Yash Lulla
AU - Shivam Angral
AU - Preeti Bailke
PY - 2026
DA - 2026/06/16
TI - An End-to-End Hybrid Document Intelligence System for Unstructured and Semi-Structured Data
BT - Proceedings of the International Conference on Intelligent Systems for a Sustainable Future (ISSF 2026)
PB - Atlantis Press
SP - 1069
EP - 1077
SN - 2589-4919
UR - https://doi.org/10.2991/978-94-6239-693-7_103
DO - 10.2991/978-94-6239-693-7_103
ID - Yadav2026
ER -