Document Data Extraction With Optical Character Recognition

Sanghvi, Vidit

Please use this identifier to cite or link to this item: http://10.1.7.192:80/jspui/handle/123456789/11351

Title:	Document Data Extraction With Optical Character Recognition
Authors:	Sanghvi, Vidit
Keywords:	Computer 2020 Project Report Computer Project Report Project Report 2020 20MCE 20MCED 20MCED09 CE (DS) DS 2020
Issue Date:	1-Jun-2022
Publisher:	Institute of Technology
Series/Report no.:	20MCED09
Abstract:	Extracting data in digital form is an essential capability for companies that process documents. Many companies do this by manually typing document content into computers, which is time-consuming and tedious. Optical character recognition (OCR) has gained popularity in recent years after successfully converting content from scanned and non-scanned images and documents into digital format with good accuracy. We therefore explore popular approaches with the goal of building an application from scratch to parse the documents we have. We discuss those approaches and their performance; however, the paper focuses mainly on implementing a system that transforms the content and stores it in .xlsx (Excel) format. The main goal of this project is to reduce document processing time at a goods transportation facility, where staff currently review documents by hand to classify them as safe or not safe for transferring the goods. The data consists of native Portable Document Format (PDF) files as well as non-native documents. Parsing them poses challenges such as handling tabular content and the additional overhead of annotation time. Lastly, we analyse the results obtained from each approach.
URI:	http://10.1.7.192:80/jspui/handle/123456789/11351
Appears in Collections:	Dissertation, CE (DS)

Files in This Item:

File	Description	Size	Format
20MCED09.pdf	20MCED09	2.94 MB	Adobe PDF

IR @ Nirma University