Tuesday, 4 May 2021

Convert Nested JSON to CSV using Boto3 and Pandas

PARSE THE NESTED JSON FILE PLACE IN THE S3 BUCKET TO CSV USING BOTO3 AND PANDAS JSON_NORMALIZE FUNCTION

This blog will explore how to convert the nested JSON file in the S3 bucket to a CSV file using Boto3 and Pandas.

It is easier to flatten the nested JSON using Pandas, the main element to flatten the nested JSON is  json_normalize from pandas.io.json.

json_normalize converts an array of nested JSON objects into a flat DataFrame with dotted-namespace column names.

Example data of the JSON file

{"xmp:CreatorTool":"Adobe InDesign CC 2015 (Macintosh)",
"dam:Physicalheightininches":"15.0",
"dam:Physicalwidthininches":"26.666666666666668",
"dam:Producer":"Adobe PDF Library 15.0",
"branding":"branding",
"dam:Trapped":"False",
"productionType":"Closed Caption",
"intellectualProperty":"GTM2",
"dc:format":"application/pdf",
"xmpMM:DocumentID":"xmp.id:3aad8938-517f-49c6-a0bb",
"GTMID":661845,"dam:extracted":"Thu Feb 11 2021 15:19:53 GMT+0000",
"xmp:CreateDate":"Mon Feb 06 2017 23:48:28 GMT+0000",
"xmpMM:RenditionClass":"proof:pdf",
"xmpMM:OriginalDocumentID":"xmp.did:02801174072068",
"xmp:ModifyDate":"Mon Feb 06 2017 23:50:38 GMT+0000",
"xmp:MetadataDate":"Mon Feb 06 2017 18:50:38 GMT-0500",
"xmpMM:DerivedFrom":
{"stRef:instanceID":"xmp.iid:5ce8ffe1-5066-4aa3-8fc8-9e860fc26829","xmpNodeType":"xmpStruct"
,"jcr:primaryType":"nt:unstructured","stRef:documentID":"xmp.did:b4995bec-895c-4b10-a0f1",
"stRef:originalDocumentID":"xmp.did:0280117083B9C41E4EEF8C","stRef:renditionClass":"default"}}

json_normalize from the pandas.io.json - Normalized JSON data
 json_normalize from the pandas.io.json - Normalized JSON data

Python Code To Convert  Nested JSON in S3 bucket to CSV file :


import
boto3 import json import csv import pandas as pd from pandas.io.json import json_normalize fname="/tmp/Sample.csv" s3Dev = boto3.client('s3',aws_access_key_id='awsAccessKeyDev'
,aws_secret_access_key='awsSecretAccessKeyDev')

#Retrieving the file from S3 bucket
obj = s3Dev.get_object(Bucket='bucket_name', Key='Folder/Sample.json')

#Streaming the JSON file data from S3 bucket
data = obj["Body"].read().decode()

#Converts the data in byte format to string to pass the data for normalization
json_data = json.loads(data)
  
print (json_data)

#Normalizes the data in array format  
normalized= pd.io.json.json_normalize(json_data)

print(normalized)

#Converts the normalized data into dataframe 
normalized.to_csv(fname,index=False, encoding='utf-8')

s3Dev.upload_file(fname, 'bucket_name' , 'Folder/Sample.csv')

Here are some other posts that might be helpful :

No comments:

Post a Comment