Title: Document Data Extraction With Optical Character Recognition
Authors: Sanghvi, Vidit
Keywords: Computer 2020
Project Report
Computer Project Report
Project Report 2020
DS 2020
Issue Date: 1-Jun-2022
Publisher: Institute of Technology
Series/Report no.: 20MCED09;
Abstract: Extracting data in digital form is an essential capability for companies that process documents. Many companies do this by manually typing document content into computers, which is time-consuming and tedious. Optical character recognition (OCR) has gained popularity in recent years after successfully converting content from scanned and non-scanned images and documents into digital format with good accuracy, so we explore popular approaches with the goal of building an application from scratch to parse the documents we have. We discuss those approaches and their performance; however, the paper focuses mainly on implementing a system that transforms the content and stores it in .xlsx (Excel) format. The main goal of this project is to reduce the time spent processing documents at a goods transportation facility, where people currently classify each document as safe or not safe for transferring the goods. The data consists of both native Portable Document Format (PDF) files and non-native documents. Parsing them poses challenges such as handling tabular content and the additional time overhead of annotation. Lastly, we analyse the results obtained from each approach.
Appears in Collections: Dissertation, CE (DS)
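
The abstract names tabular content as one of the parsing challenges and .xlsx as the output format. The following is only a minimal sketch of that step, not the report's actual implementation: it assumes the OCR stage yields words with pixel coordinates, and the names `Word`, `group_into_rows`, `save_rows_xlsx`, and the `y_tolerance` parameter are hypothetical. The sample values are made up for illustration. Writing the spreadsheet assumes the third-party openpyxl package.

```python
from collections import namedtuple

# Hypothetical shape of one OCR result: the recognized text plus the
# top-left corner of its bounding box in pixels (an assumption; real OCR
# engines report more fields).
Word = namedtuple("Word", ["text", "x", "y"])


def group_into_rows(words, y_tolerance=10):
    """Group OCR words into table rows by vertical position.

    A word joins the current row when its y coordinate lies within
    y_tolerance pixels of the first word of that row. Each row is then
    ordered left to right.
    """
    rows = []
    current = []
    row_y = None
    for word in sorted(words, key=lambda w: (w.y, w.x)):
        if row_y is None or abs(word.y - row_y) > y_tolerance:
            if current:
                rows.append(current)
            current = [word]
            row_y = word.y
        else:
            current.append(word)
    if current:
        rows.append(current)
    return [[w.text for w in sorted(row, key=lambda w: w.x)] for row in rows]


def save_rows_xlsx(rows, path):
    """Write rows to an .xlsx file (assumes openpyxl is installed)."""
    # Imported lazily so the grouping step itself has no dependencies.
    from openpyxl import Workbook

    workbook = Workbook()
    sheet = workbook.active
    for row in rows:
        sheet.append(row)
    workbook.save(path)


# Illustrative, made-up OCR output for a two-row table.
sample = [
    Word("Qty", 200, 12), Word("Item", 20, 10),
    Word("Crates", 20, 42), Word("14", 200, 40),
]
table = group_into_rows(sample)
```

Calling `save_rows_xlsx(table, "output.xlsx")` would then store the reconstructed rows in a spreadsheet. Grouping on a fixed pixel tolerance is a deliberately simple choice; skewed scans or multi-line cells would need a more robust layout analysis.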

Files in This Item:
File: 20MCED09.pdf
Description: 20MCED09
Size: 2.94 MB
Format: Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.