Proceedings of the International Conference on Intelligent Systems for a Sustainable Future (ISSF 2026)

An End-to-End Hybrid Document Intelligence System for Unstructured and Semi-Structured Data

Authors
Swapnaja Yadav1, *, Soleha Tamboli1, Shaivi Jaiswal1, Yash Lulla1, Shivam Angral1, Preeti Bailke2
1Department of IT, Vishwakarma Institute of Technology, Pune, India
2Assistant Professor, Department of IT, Vishwakarma Institute of Technology, Pune, India
*Corresponding author. Email: swapnaja.yadav242@vit.edu
Available Online 16 June 2026.
DOI
10.2991/978-94-6239-693-7_103
Keywords
Document Intelligence; OCR; EasyOCR; Unstructured Data; LLM; Information Extraction; Receipt Digitization
Abstract

Extracting key information from unstructured and mixed-format documents is a major challenge for many organizations. Documents such as scanned receipts, invoices, and native PDFs often contain noise, uneven layouts, faded text, and inconsistent formatting, which makes them difficult for traditional OCR and rule-based systems to read accurately. In this work, we present an end-to-end Document Intelligence system that cleans and preprocesses documents, extracts text with EasyOCR, groups the recognized tokens into lines and columns, and then applies rules to pull out key fields. For digital PDFs and Word files, the system extracts clean text and generates summaries using an LLM. The system also includes MongoDB for storing results, a FastAPI backend for processing, a web interface, and a text-to-speech feature. Testing on real scanned receipts and native documents shows that the system improves extraction accuracy, is more robust on low-quality inputs, and answers user questions reliably. Overall, this hybrid AI pipeline is extensible and useful for automating document understanding across different file types.
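The line-grouping and rule-based extraction steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `group_into_lines` and `extract_fields`, the tolerance value, the regular expressions, and the sample tokens are all assumptions made for illustration. The tokens are only shaped like the `(bbox, text, confidence)` tuples that EasyOCR's `readtext` returns, so the example runs without an OCR model.

```python
import re
from statistics import median

# Hypothetical tokens shaped like EasyOCR readtext() output:
# (bbox, text, confidence), where bbox holds four [x, y] corner points.
SAMPLE_TOKENS = [
    ([[10, 10], [120, 10], [120, 30], [10, 30]], "SUPER MART", 0.98),
    ([[200, 52], [260, 52], [260, 70], [200, 70]], "12.50", 0.95),
    ([[10, 50], [80, 50], [80, 70], [10, 70]], "TOTAL", 0.97),
    ([[10, 90], [60, 90], [60, 110], [10, 110]], "Date:", 0.93),
    ([[70, 91], [170, 91], [170, 111], [70, 111]], "16/06/2026", 0.91),
]

# Illustrative rules only; a real receipt parser would need many more.
FIELD_RULES = {
    "total": re.compile(r"\btotal\b\D*(\d+[.,]\d{2})", re.IGNORECASE),
    "date": re.compile(r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})\b"),
}


def group_into_lines(tokens, tolerance=0.5):
    """Group tokens with similar vertical centres into left-to-right lines."""
    items = []
    for bbox, text, _conf in tokens:
        xs = [point[0] for point in bbox]
        ys = [point[1] for point in bbox]
        items.append({
            "text": text,
            "x": min(xs),
            "cy": sum(ys) / len(ys),
            "height": max(ys) - min(ys),
        })
    if not items:
        return []

    # Two tokens share a line if their centres differ by less than a
    # fraction of the median token height (an assumed heuristic).
    threshold = tolerance * median(item["height"] for item in items)
    items.sort(key=lambda item: item["cy"])

    lines = [[items[0]]]
    for item in items[1:]:
        current = lines[-1]
        line_cy = sum(tok["cy"] for tok in current) / len(current)
        if abs(item["cy"] - line_cy) <= threshold:
            current.append(item)
        else:
            lines.append([item])

    return [
        " ".join(tok["text"] for tok in sorted(line, key=lambda tok: tok["x"]))
        for line in lines
    ]


def extract_fields(lines):
    """Apply each rule to every line and keep the first match per field."""
    fields = {}
    for line in lines:
        for name, pattern in FIELD_RULES.items():
            match = pattern.search(line)
            if match and name not in fields:
                fields[name] = match.group(1)
    return fields


if __name__ == "__main__":
    text_lines = group_into_lines(SAMPLE_TOKENS)
    print(text_lines)
    print(extract_fields(text_lines))
```

Grouping by vertical centre before applying rules lets a label such as "TOTAL" and its amount be matched on one line even when the OCR engine returns them as separate, unordered tokens.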

Copyright
© 2026 The Author(s)
Open Access
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

Volume Title
Proceedings of the International Conference on Intelligent Systems for a Sustainable Future (ISSF 2026)
Series
Atlantis Highlights in Intelligent Systems
Publication Date
16 June 2026
ISBN
978-94-6239-693-7
ISSN
2589-4919

Cite this article

TY  - CONF
AU  - Swapnaja Yadav
AU  - Soleha Tamboli
AU  - Shaivi Jaiswal
AU  - Yash Lulla
AU  - Shivam Angral
AU  - Preeti Bailke
PY  - 2026
DA  - 2026/06/16
TI  - An End-to-End Hybrid Document Intelligence System for Unstructured and Semi-Structured Data
BT  - Proceedings of the International Conference on Intelligent Systems for a Sustainable Future (ISSF 2026)
PB  - Atlantis Press
SP  - 1069
EP  - 1077
SN  - 2589-4919
UR  - https://doi.org/10.2991/978-94-6239-693-7_103
DO  - 10.2991/978-94-6239-693-7_103
ID  - Yadav2026
ER  -