This post explains how to convert a nested JSON file stored in an S3 bucket to a CSV file using Boto3 and Pandas.
Pandas makes it easy to flatten nested JSON. The key function is json_normalize, which is imported from pandas.io.json in older releases and is available as pandas.json_normalize since pandas 1.0.
json_normalize converts an array of nested JSON objects into a flat DataFrame with dotted-namespace column names.
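As a minimal sketch of that behaviour, the snippet below flattens a trimmed-down record modelled on the sample data in this post. It assumes pandas 1.0 or later, where the function is exposed as pd.json_normalize; the variable names are illustrative only.

```python
import pandas as pd

# A shortened record with one nested object (keys taken from the sample data)
record = {
    "GTMID": 661845,
    "dc:format": "application/pdf",
    "xmpMM:DerivedFrom": {
        "stRef:renditionClass": "default",
        "xmpNodeType": "xmpStruct",
    },
}

# json_normalize accepts a single dict or a list of dicts
flat = pd.json_normalize(record)

# Nested keys are joined to their parent key with a dot, for example
# "xmpMM:DerivedFrom.xmpNodeType"
print(list(flat.columns))
print(flat.shape)
```

The separator can be changed with the sep argument, for example pd.json_normalize(record, sep="_"), if dots are awkward in downstream column names.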
Example data of the JSON file
{"xmp:CreatorTool":"Adobe InDesign CC 2015 (Macintosh)", "dam:Physicalheightininches":"15.0", "dam:Physicalwidthininches":"26.666666666666668", "dam:Producer":"Adobe PDF Library 15.0", "branding":"branding", "dam:Trapped":"False", "productionType":"Closed Caption", "intellectualProperty":"GTM2", "dc:format":"application/pdf", "xmpMM:DocumentID":"xmp.id:3aad8938-517f-49c6-a0bb", "GTMID":661845,"dam:extracted":"Thu Feb 11 2021 15:19:53 GMT+0000", "xmp:CreateDate":"Mon Feb 06 2017 23:48:28 GMT+0000", "xmpMM:RenditionClass":"proof:pdf", "xmpMM:OriginalDocumentID":"xmp.did:02801174072068", "xmp:ModifyDate":"Mon Feb 06 2017 23:50:38 GMT+0000", "xmp:MetadataDate":"Mon Feb 06 2017 18:50:38 GMT-0500", "xmpMM:DerivedFrom": {"stRef:instanceID":"xmp.iid:5ce8ffe1-5066-4aa3-8fc8-9e860fc26829","xmpNodeType":"xmpStruct" ,"jcr:primaryType":"nt:unstructured","stRef:documentID":"xmp.did:b4995bec-895c-4b10-a0f1", "stRef:originalDocumentID":"xmp.did:0280117083B9C41E4EEF8C","stRef:renditionClass":"default"}}
Normalized JSON data produced by json_normalize from pandas.io.json
Python code to convert nested JSON in an S3 bucket to a CSV file:
import json

import boto3
import pandas as pd

fname = "/tmp/Sample.csv"

# Placeholder credentials; replace with your own
s3Dev = boto3.client('s3',
                     aws_access_key_id='awsAccessKeyDev',
                     aws_secret_access_key='awsSecretAccessKeyDev')

# Retrieve the JSON file from the S3 bucket
obj = s3Dev.get_object(Bucket='bucket_name', Key='Folder/Sample.json')

# Read the streamed body and decode the bytes to a string
data = obj["Body"].read().decode()

# Parse the JSON string into Python objects
json_data = json.loads(data)
print(json_data)

# Flatten the nested JSON into a DataFrame
# (pandas 1.0+; older versions use pandas.io.json.json_normalize)
normalized = pd.json_normalize(json_data)
print(normalized)

# Write the DataFrame to a local CSV file, then upload it to the bucket
normalized.to_csv(fname, index=False, encoding='utf-8')
s3Dev.upload_file(fname, 'bucket_name', 'Folder/Sample.csv')
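If writing a temporary file under /tmp is not desirable, the conversion step can be kept in memory. The sketch below is one possible variant, not part of the original script: the helper name nested_json_to_csv is hypothetical, and the commented-out S3 calls assume a client that picks up credentials from the environment rather than hard-coded keys.

```python
import io
import json

import pandas as pd


def nested_json_to_csv(raw: bytes, sep: str = ".") -> str:
    """Flatten nested JSON (one object or an array of objects) into CSV text."""
    records = json.loads(raw.decode("utf-8"))
    frame = pd.json_normalize(records, sep=sep)
    buffer = io.StringIO()
    frame.to_csv(buffer, index=False)
    return buffer.getvalue()


# Possible use with S3 (bucket and key names are the placeholders from the post):
#
#   import boto3
#   s3 = boto3.client("s3")
#   obj = s3.get_object(Bucket="bucket_name", Key="Folder/Sample.json")
#   csv_text = nested_json_to_csv(obj["Body"].read())
#   s3.put_object(Bucket="bucket_name", Key="Folder/Sample.csv",
#                 Body=csv_text.encode("utf-8"))
```

Keeping the conversion in a plain function also makes it easy to check against a small JSON sample locally before pointing it at a real bucket.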