This post shows how to convert a nested JSON file stored in an S3 bucket to a CSV file using Boto3 and Pandas.
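Flattening nested JSON is easiest with Pandas. The key function is json_normalize, historically imported from pandas.io.json and available as pandas.json_normalize since pandas 1.0 (the pandas.io.json path is deprecated).
json_normalize converts a nested JSON object, or an array of such objects, into a flat DataFrame with dotted-namespace column names. For example, the key stRef:instanceID nested under xmpMM:DerivedFrom becomes the column xmpMM:DerivedFrom.stRef:instanceID.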
Example data from the JSON file:
{"xmp:CreatorTool":"Adobe InDesign CC 2015 (Macintosh)",
"dam:Physicalheightininches":"15.0",
"dam:Physicalwidthininches":"26.666666666666668",
"dam:Producer":"Adobe PDF Library 15.0",
"branding":"branding",
"dam:Trapped":"False",
"productionType":"Closed Caption",
"intellectualProperty":"GTM2",
"dc:format":"application/pdf",
"xmpMM:DocumentID":"xmp.id:3aad8938-517f-49c6-a0bb",
"GTMID":661845,
"dam:extracted":"Thu Feb 11 2021 15:19:53 GMT+0000",
"xmp:CreateDate":"Mon Feb 06 2017 23:48:28 GMT+0000",
"xmpMM:RenditionClass":"proof:pdf",
"xmpMM:OriginalDocumentID":"xmp.did:02801174072068",
"xmp:ModifyDate":"Mon Feb 06 2017 23:50:38 GMT+0000",
"xmp:MetadataDate":"Mon Feb 06 2017 18:50:38 GMT-0500",
"xmpMM:DerivedFrom":
{"stRef:instanceID":"xmp.iid:5ce8ffe1-5066-4aa3-8fc8-9e860fc26829",
"xmpNodeType":"xmpStruct",
"jcr:primaryType":"nt:unstructured",
"stRef:documentID":"xmp.did:b4995bec-895c-4b10-a0f1",
"stRef:originalDocumentID":"xmp.did:0280117083B9C41E4EEF8C",
"stRef:renditionClass":"default"}}
Figure: json_normalize from pandas.io.json - Normalized JSON data
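To make the flattening concrete, here is a minimal sketch that applies pd.json_normalize to a trimmed copy of the sample record. Only four of the original keys are kept for brevity, and the variable names record and flat are illustrative, not part of the original code.

```python
import pandas as pd

# Trimmed copy of the sample record: two top-level keys and one nested object.
record = {
    "GTMID": 661845,
    "dc:format": "application/pdf",
    "xmpMM:DerivedFrom": {
        "stRef:renditionClass": "default",
        "xmpNodeType": "xmpStruct",
    },
}

# pd.json_normalize is the top-level name available since pandas 1.0.
flat = pd.json_normalize(record)

# One row; nested keys are joined to their parent with a dot.
print(flat.shape)
print(sorted(flat.columns))
```

The nested object does not get a column of its own. Each of its keys becomes a separate column named parent.child, such as xmpMM:DerivedFrom.xmpNodeType, which is what makes the result writable as a flat CSV.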
Python Code To Convert Nested JSON in S3 bucket to CSV file:

import json

import boto3
import pandas as pd

fname = "/tmp/Sample.csv"

# Create the S3 client
s3Dev = boto3.client('s3', aws_access_key_id='awsAccessKeyDev',
                     aws_secret_access_key='awsSecretAccessKeyDev')

# Retrieve the file from the S3 bucket
obj = s3Dev.get_object(Bucket='bucket_name', Key='Folder/Sample.json')

# Read the streamed body and decode the bytes to a string
data = obj["Body"].read().decode()

# Parse the JSON string into Python objects
json_data = json.loads(data)
print(json_data)

# Flatten the nested JSON into a DataFrame
# (pd.json_normalize replaces pandas.io.json.json_normalize, deprecated since pandas 1.0)
normalized = pd.json_normalize(json_data)
print(normalized)

# Write the DataFrame to a local CSV file, then upload it to the bucket
normalized.to_csv(fname, index=False, encoding='utf-8')
s3Dev.upload_file(fname, 'bucket_name', 'Folder/Sample.csv')
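A possible variation, sketched here under a few assumptions: the CSV is built in memory and uploaded with put_object instead of going through a file in /tmp, and the S3 client is passed in so that credentials can come from the environment or an IAM role rather than being written into the script. The function name nested_json_to_csv and its parameters are hypothetical, chosen for illustration only.

```python
import io
import json

import pandas as pd


def nested_json_to_csv(s3_client, bucket, json_key, csv_key):
    """Read a JSON object from S3, flatten it, and write the CSV back to S3.

    s3_client is expected to behave like boto3.client("s3").
    """
    # Download the object and parse its body as JSON.
    obj = s3_client.get_object(Bucket=bucket, Key=json_key)
    json_data = json.loads(obj["Body"].read().decode("utf-8"))

    # Flatten nested keys into dotted column names.
    normalized = pd.json_normalize(json_data)

    # Serialize the DataFrame to CSV in memory, then upload the bytes.
    buffer = io.StringIO()
    normalized.to_csv(buffer, index=False)
    s3_client.put_object(
        Bucket=bucket,
        Key=csv_key,
        Body=buffer.getvalue().encode("utf-8"),
    )
    return normalized


# Usage sketch with placeholder names (not run here):
# import boto3
# s3 = boto3.client("s3")  # credentials resolved by boto3's default chain
# nested_json_to_csv(s3, "bucket_name", "Folder/Sample.json", "Folder/Sample.csv")
```

Skipping the temporary file can be convenient where local disk is limited, but it holds the whole CSV in memory, so the file-based upload_file approach may still be the better fit for large inputs.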
