Automated PDF Extraction using AWS Textract Python code

Sogeti Labs
July 14, 2023


Medical documents and patient files are among the most important documents in the insurance sector. Handling and copying them manually is time-consuming and takes up valuable working hours. But what if this work could be automated, saving the business those hours? It is possible with Amazon Textract, an AWS service that performs optical character recognition (OCR) and can be called from Python. In this project, we explain how to build an OCR pipeline in Python on top of Textract and how we integrate it with MongoDB to store the results in the cloud.

Business Case

Extraction of text from hospitalization claims

Problem Statement

Data extraction from PDFs is crucial for reorganising data according to your own requirements. In other document formats such as DOC, XLS or CSV, extracting a portion of the information is simple: just edit the data or copy and paste it. This is much harder with PDFs. Copying and pasting does not preserve the original formatting and order, as anyone who has tried to extract a table from a PDF knows. When PDF data is extracted in bulk, these issues can cause errors, delays or cost overruns that could seriously affect your bottom line.

Proposed Solution

Automated PDF extraction using the AWS Textract service, driven by Python code. Textract accepts scans, PDFs and photos as input, and it handles a range of document types, including those specific to financial services, insurance and health care.

  • Introducing the solution
  • Creating an AWS account
  • Creating an S3 bucket and uploading the bulk data
  • Creating IAM roles for the Lambda function and the AWS services
  • Writing the Python code in the Lambda service
  • Executing the test
  • Obtaining the output on the instance terminal
  • Integrating with MongoDB for storage
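
The Lambda step in the list above might look roughly like the following. This is a minimal sketch, not the code from the original project. It assumes the function is triggered by an S3 upload event, that each document is an image or a single-page PDF (the synchronous `detect_document_text` call; multi-page PDFs need the asynchronous `start_document_text_detection` API instead), and that the Lambda role is allowed to read the bucket and call Textract. The names `extract_lines` and `lambda_handler` and the optional `textract_client` parameter are illustrative choices.

```python
import urllib.parse


def extract_lines(textract_response):
    """Return the text of every LINE block in a Textract response, in order."""
    return [
        block["Text"]
        for block in textract_response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def lambda_handler(event, context, textract_client=None):
    """Run Textract on each object named in an S3 event notification.

    textract_client is an assumption added for testability; Lambda itself
    only passes event and context, so the real client is created below.
    """
    if textract_client is None:
        import boto3  # imported lazily so the helpers can be exercised offline
        textract_client = boto3.client("textract")

    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        response = textract_client.detect_document_text(
            Document={"S3Object": {"Bucket": bucket, "Name": key}}
        )
        lines = extract_lines(response)
        print(f"{key}: {len(lines)} lines extracted")
        results.append({"bucket": bucket, "key": key, "lines": lines})
    return results
```

Passing the client in as an optional argument keeps the handler easy to test with a stand-in object, without touching AWS.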

Future/Long-Term Focus

  • We developed Python code for the automation. This helps whenever a client wants to run the extraction as a script
  • It maintains confidentiality
  • It allows a bulk load to be uploaded in one go
  • Integrating with MongoDB for storage
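
The bulk upload mentioned above could be scripted along these lines. This is only a sketch under assumptions: the function name `upload_pdfs`, the `claims/` key prefix and the optional `s3_client` parameter are hypothetical, and the bucket is assumed to exist already. It uses the boto3 S3 client's `upload_file` method.

```python
from pathlib import Path


def upload_pdfs(folder, bucket, s3_client=None, prefix="claims/"):
    """Upload every .pdf file in a local folder to an S3 bucket.

    Returns the list of object keys that were uploaded. The prefix and
    the injectable s3_client are illustrative assumptions.
    """
    if s3_client is None:
        import boto3  # imported lazily so the function can be tested offline
        s3_client = boto3.client("s3")

    uploaded = []
    # sorted() gives a stable, predictable upload order.
    for path in sorted(Path(folder).glob("*.pdf")):
        key = f"{prefix}{path.name}"
        s3_client.upload_file(str(path), bucket, key)
        uploaded.append(key)
    return uploaded
```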


The solution does not require human intervention for the extraction and validation process. Once the data is fed to the system, it extracts the text and pushes it to the inventory in the same flow.
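
As an illustration of the storage step, the extracted lines could be written to MongoDB roughly as follows. This is a sketch under assumptions, not the project's actual schema: the field names and the helpers `build_document` and `store_extraction` are hypothetical. The `collection` argument is expected to behave like a pymongo collection, of which only `insert_one` is used.

```python
from datetime import datetime, timezone


def build_document(bucket, key, lines):
    """Shape one extraction result as a MongoDB document (illustrative schema)."""
    return {
        "source": {"bucket": bucket, "key": key},
        "lines": lines,
        "text": "\n".join(lines),
        "extracted_at": datetime.now(timezone.utc),
    }


def store_extraction(collection, bucket, key, lines):
    """Insert one extraction result and return the id MongoDB assigned to it."""
    result = collection.insert_one(build_document(bucket, key, lines))
    return result.inserted_id
```

With pymongo, `collection` would typically be obtained as `MongoClient(uri)["claims_db"]["extracted_text"]`, where the connection URI and the database and collection names are placeholders.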


This article was submitted as part of the SogetiLabs India Hackathon’s blog and whitepaper contest.

Author – Shasi Kiranmai Bethineedi

With 5 years of experience in performance testing, I recently took on the role of Test Lead at Capgemini, where I’ve been working for the past 7 months. Currently, I am leading performance testing efforts for a multinational firm, having previously worked on projects related to Audit. My current focus includes overseeing the migration of on-prem 2019 servers and conducting proof of concept testing for cloud servers.

About the author

SogetiLabs gathers distinguished technology leaders from around the Sogeti world. It is an initiative explaining not how IT works, but what IT means for business.

