AUTOMATED PDF EXTRACTION USING AWS TEXTRACT PYTHON CODE

July 14, 2023

Sogeti Labs

Introduction/Overview

Medical documents and patient files are among the most important documents in the insurance sector. Handling and copying them manually is time-consuming and takes up countless valuable working hours. But what if we told you that this can be automated, saving your business those hours? It is possible thanks to data science tools such as AWS Textract, a managed service that performs optical character recognition (OCR) and can be called from Python. In this project, we explain how to build an OCR workflow in Python with Textract and how we integrate it with MongoDB for storage in the cloud.

Business Case

Extraction of text from hospitalization claims

Problem Statement

Data extraction from PDFs is crucial for reorganising data according to your own requirements. In other document formats such as DOC, XLS or CSV, extracting a portion of the information is simple: just edit the data, or copy and paste it. With PDFs this is much harder. Copying and pasting does not preserve the original formatting and order, as anyone who has tried to extract a table from a PDF knows. When PDF data is extracted in bulk, these issues can cause errors, delays or cost overruns that could seriously affect your bottom line.

Proposed Solution

The proposed solution automates PDF extraction by calling AWS Textract from Python code. Textract accepts scans, PDFs, and photos as input, and it handles a range of document types, including those specific to financial services, insurance, and health care.
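As a rough illustration of what calling Textract from Python can look like, here is a minimal sketch using the boto3 `detect_document_text` operation. The function names `extract_lines` and `detect_text_in_s3` are our own, chosen for this example, and the sketch assumes a single-page document already stored in S3 and AWS credentials configured in the environment. It is not necessarily the exact code used in the project.

```python
def extract_lines(response):
    """Return the text of every LINE block in a Textract response, in order."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def detect_text_in_s3(bucket, key, client=None):
    """Run synchronous text detection on one S3 object and return its lines.

    `client` is optional so a stub can be passed in for testing; by default
    a real boto3 Textract client is created.
    """
    if client is None:
        import boto3  # imported lazily so extract_lines works without AWS

        client = boto3.client("textract")
    response = client.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return extract_lines(response)
```

Textract returns its results as a list of blocks (pages, lines, words), so keeping only the `LINE` blocks gives the text in a readable, line-by-line form.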

Introduction of Solution
Creating an AWS account
Creating an S3 bucket and uploading the bulk data
Creating IAM roles for the Lambda function and the AWS services it calls
Writing the Python code in the Lambda service
Executing the test
Obtaining the output on the instance terminal
Integrating with MongoDB for storage
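The Lambda step above can be sketched roughly as follows. This is a minimal example, assuming the function is triggered by S3 upload events and that its IAM role allows reading the bucket and calling Textract. The helper name `object_from_record` and the optional `textract` argument are our own additions for illustration and testing. When the code runs inside Lambda, the `print` output goes to the function's CloudWatch log stream.

```python
import json
from urllib.parse import unquote_plus


def object_from_record(record):
    """Return (bucket, key) for one S3 event record; keys arrive URL-encoded."""
    s3_info = record["s3"]
    return s3_info["bucket"]["name"], unquote_plus(s3_info["object"]["key"])


def lambda_handler(event, context, textract=None):
    """Extract text lines from every uploaded object named in the S3 event."""
    if textract is None:
        import boto3  # available by default in the Lambda Python runtime

        textract = boto3.client("textract")

    results = []
    for record in event.get("Records", []):
        bucket, key = object_from_record(record)
        response = textract.detect_document_text(
            Document={"S3Object": {"Bucket": bucket, "Name": key}}
        )
        lines = [
            block["Text"]
            for block in response.get("Blocks", [])
            if block.get("BlockType") == "LINE"
        ]
        # Log a short summary per document rather than the full text.
        print(json.dumps({"bucket": bucket, "key": key, "line_count": len(lines)}))
        results.append({"bucket": bucket, "key": key, "lines": lines})

    return {"statusCode": 200, "body": json.dumps(results)}
```

Running a test event from the Lambda console with a `Records` entry that points at an uploaded file is one way to carry out the "executing the test" step.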

Future/Long-Term Focus

We developed Python code for the automation, which helps whenever a client wants to run the extraction as a script
It maintains confidentiality
It allows documents to be uploaded in bulk in one go
It integrates with MongoDB for storage
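The bulk upload and MongoDB points above could look something like the sketch below. It assumes a boto3 S3 client and a PyMongo collection are passed in by the caller. The function names, the `claims/` key prefix, and the record fields are hypothetical choices made for this example, not the project's actual schema.

```python
from datetime import datetime, timezone
from pathlib import Path


def upload_pdfs(s3_client, folder, bucket, prefix="claims/"):
    """Upload every .pdf file in `folder` to `bucket`; return the keys used."""
    keys = []
    for path in sorted(Path(folder).glob("*.pdf")):
        key = f"{prefix}{path.name}"
        s3_client.upload_file(str(path), bucket, key)
        keys.append(key)
    return keys


def build_record(bucket, key, lines):
    """Build one MongoDB document for the text extracted from one file."""
    return {
        "bucket": bucket,
        "key": key,
        "lines": lines,
        "text": "\n".join(lines),
        "extracted_at": datetime.now(timezone.utc),
    }


def store_records(collection, records):
    """Insert the records in one call and return the generated ids."""
    if not records:  # insert_many rejects an empty list
        return []
    return collection.insert_many(records).inserted_ids
```

With real services, `s3_client` would typically come from `boto3.client("s3")` and `collection` from a `pymongo.MongoClient` connection. The database and collection names are left to the reader.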

Conclusion

The solution does not require human intervention for the extraction and validation process. Once the data is fed to the system, it extracts the text and pushes it to the inventory in the same flow.
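For this hands-off flow to cover multi-page PDFs, Textract's asynchronous operations are generally needed, since the synchronous call handles single-page documents. The sketch below starts a job and polls until it finishes. It is one possible approach under simple assumptions (polling rather than SNS notifications), and the function names and polling defaults are our own.

```python
import time


def collect_job_lines(textract, job_id, poll_seconds=5, max_polls=60):
    """Wait for an asynchronous Textract job, then return all LINE texts."""
    for _ in range(max_polls):
        response = textract.get_document_text_detection(JobId=job_id)
        status = response["JobStatus"]
        if status == "SUCCEEDED":
            break
        if status != "IN_PROGRESS":
            raise RuntimeError(f"Textract job {job_id} ended with status {status}")
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"Textract job {job_id} did not finish in time")

    lines = []
    while True:
        lines.extend(
            block["Text"]
            for block in response.get("Blocks", [])
            if block.get("BlockType") == "LINE"
        )
        token = response.get("NextToken")
        if not token:
            return lines
        # Results are paginated; follow NextToken until it is absent.
        response = textract.get_document_text_detection(
            JobId=job_id, NextToken=token
        )


def extract_multipage(textract, bucket, key, **poll_options):
    """Start text detection for one S3 object and return its lines."""
    job = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    return collect_job_lines(textract, job["JobId"], **poll_options)
```

Polling keeps the example short. In a production setup, a completion notification would avoid keeping a process waiting on each job.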

This article was submitted as part of the SogetiLabs India Hackathon’s blog and whitepaper contest.

Author – Shasi Kiranmai Bethineedi

With 5 years of experience in performance testing, I recently took on the role of Test Lead at Capgemini, where I’ve been working for the past 7 months. Currently, I am leading performance testing efforts for a multinational firm, having previously worked on projects related to Audit. My current focus includes overseeing the migration of on-prem 2019 servers and conducting proof of concept testing for cloud servers.

About the author

SogetiLabs gathers distinguished technology leaders from around the Sogeti world. It is an initiative explaining not how IT works, but what IT means for business.

