Thursday 6 May 2021

Part 2 | Snowflake Developer Interview Questions

FREQUENTLY ASKED SNOWFLAKE CLOUD DEVELOPER INTERVIEW QUESTIONS

This post is a continuation of Snowflake Developer Interview Questions and Answers (Part 1).

Topics covered are the Snowflake NVL2 and NVL functions, warehouses, streams, external tables, types of tables in Snowflake, Time Travel, Fail-safe, and materialized views.

Let's jump right into the questions.

Q1: What is Snowflake URL?


URL format: account.region.snowflakecomputing.com

Sample URL: blogger.us-east-1.snowflakecomputing.com

Q2: Explain the difference between NVL2 and NVL function?


NVL2: Takes three parameters and returns one of the last two, based on the first.
Syntax : NVL2( <expr1>, <expr2>, <expr3> )
  • If expr1 is not null, it returns expr2
  • If expr1 is null, it returns expr3
NVL: Takes two parameters and returns one of them, based on the first.
Syntax : NVL( <expr1>, <expr2> )
  • If expr1 is not null, it returns expr1 itself
  • If expr1 is null, it returns expr2
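For example, a minimal sketch assuming a hypothetical customers table with a nullable phone column:

SELECT NVL(phone, 'unknown') FROM customers;            -- returns the phone if present, else 'unknown'

SELECT NVL2(phone, 'has phone', 'no phone') FROM customers;  -- 'has phone' when phone is not null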

Q3: Is it possible to suspend a warehouse manually?


Yes, a warehouse can be suspended manually in two ways.

1) Through the GUI: at the corner of the page you can see the Warehouses tab. Navigate to the Warehouses tab, select the warehouse, and click Suspend.


2) The second method is by executing a query, as below.
Syntax:

CREATE OR REPLACE WAREHOUSE name

WITH WAREHOUSE_SIZE = 'MEDIUM';

ALTER WAREHOUSE [ IF EXISTS ] name SUSPEND | RESUME;
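For instance, assuming a warehouse named my_wh already exists:

ALTER WAREHOUSE my_wh SUSPEND;

ALTER WAREHOUSE my_wh RESUME;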

Q4: Difference between DBMS and RDBMS?

DBMS:

  • Abbreviation - Database Management System
  • Data is stored in hierarchical form
RDBMS:
  • Abbreviation - Relational Database Management System
  • Data is stored in tabular form


Q5: What happens when you suspend a warehouse?

When you suspend a warehouse, Snowflake suspends all idle servers for the warehouse, but allows any servers that are executing statements to continue until the statements complete, at which time the servers are shut down and the warehouse changes to the suspended state.

Q6: What is stream?


A stream records changes made to a table, including information about inserts, updates, and deletes. The table for which changes are recorded is called the source table.
The stream adds three extra columns to the change records (METADATA$ACTION, METADATA$ISUPDATE, and METADATA$ROW_ID).
Based on these, it is determined whether the incoming record is new, deleted, or updated.


Syntax: CREATE STREAM stream_name ON TABLE table_name;

Metadata$Action | Metadata$IsUpdate | Scenario
INSERT | FALSE | Insert - new record
DELETE | FALSE | Delete - deleted record
INSERT | TRUE | Update - updated record (new values)
DELETE | TRUE | Update - updated record (old values)




Q7: What is an external table?


When queried, an external table reads data from a set of files in an external stage.

Syntax :  
CREATE OR REPLACE EXTERNAL TABLE ext_table

WITH LOCATION = @stage_name

AUTO_REFRESH = true

FILE_FORMAT = file_format_name;
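Once created, it can be queried like a regular table (using the ext_table defined above):

SELECT * FROM ext_table;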

Q8: What are the types of tables in Snowflake?

Temporary Table – Stores non-permanent, transitory data. Only exists within the session and will not be visible to other users.

Syntax : Create temporary table mytemptable (id number, creation_date date);

Permanent Table – Available until explicitly dropped.

Has a Fail-safe period.

Transient Table - Available until explicitly dropped. Doesn't have a Fail-safe period; preferred when the data doesn't require the same level of data protection and recovery as a permanent table.

Syntax : Create transient table mytranstable (id number, creation_date date);

Q9: Difference between Time Travel and Fail-safe?

Time Travel

  • Access historical data that has been changed or deleted at any point within a defined time period (SELECT, CREATE … CLONE, UNDROP)
  • Retention period – 1 to 90 days (default 1 day)
  • When the Time Travel retention period ends, the data moves to Fail-safe.

Fail safe

  • Historical data is protected in case of system failure
  • Data is recoverable only by Snowflake.
  • Retention period – 7 days for permanent tables, 0 days for transient tables
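A quick sketch of Time Travel in action, assuming a hypothetical table my_table:

SELECT * FROM my_table AT(OFFSET => -60*5);  -- query the data as it was 5 minutes ago

UNDROP TABLE my_table;  -- restore a dropped table within the retention period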

Q10: Difference between materialized view and normal view?


Materialized View | View
Data is stored on disk | Holds the data virtually
Updated periodically based on the query definition | Refreshed only when accessed
WHEN TO USE - The result doesn't change often | WHEN TO USE - The result changes often
The view is used often | The view is not used often
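As a minimal sketch, a materialized view over a hypothetical sales table:

CREATE MATERIALIZED VIEW mv_region_totals AS

SELECT region, SUM(amount) AS total_amount FROM sales GROUP BY region;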










Copy an Object between AWS Accounts using boto3


MOVE S3 OBJECT BETWEEN AWS ACCOUNTS USING BOTO3

Before moving on to the code, let's go over the /tmp directory provided by the Lambda function.

Temporary storage with /tmp

  • The Lambda execution environment provides a file system for your code to use at /tmp.
  • A fresh /tmp is provided each time a new execution environment is created.
  • It is important to know that /tmp is not recreated or cleared on each Lambda invocation; it persists for the lifetime of the execution environment.
  • It is safe to use /tmp in Lambda since this storage is not shared with other AWS customers.

Here I have taken a scenario to copy an S3 object from PROD to DEV using boto3 in a lambda function.

import boto3

# AWS credentials of the PROD environment
s3 = boto3.client('s3', aws_access_key_id='awsAccessKey',
                  aws_secret_access_key='awsSecretAccessKey')

# AWS credentials of the DEV environment
s3Dev = boto3.client('s3', aws_access_key_id='awsAccessKeyDev',
                     aws_secret_access_key='awsSecretAccessKeyDev')

def lambda_handler(event, context):

    # Downloads the file from the PROD S3 bucket and saves it in temporary storage
    s3.download_file('source_bucket', 'folder/source.csv', '/tmp/source.csv')

    print('File Downloaded')

    file = '/tmp/source.csv'

    # Uploads the file from temporary storage to the DEV S3 bucket
    s3Dev.upload_file(file, 'target_bucket', 'folder/target.csv')

    print('File Uploaded')

    return {
        'statusCode': 200
    }



Here are some other posts that might be helpful :

Convert Nested JSON to CSV using Python in S3


Wednesday 5 May 2021

Part 1 | Informatica Interview Questions

FREQUENTLY ASKED INFORMATICA INTERVIEW QUESTIONS


In this blog, we are going to explore some of the most frequently asked Informatica interview questions. 
Hoping that it will be helpful and wishing the best for your interview.

Brief: Informatica is one of the most widely used ETL data integration tools for loading and exporting data.
It has its own transformation language that you can use in your expressions, filters, and so on.

Let's move on to the questions.

Q1: What is a database, data warehouse, and data mart?


A database stores current, operational data; it is typically smaller than a data warehouse.

A data warehouse is used to store historical data.

And a data mart is a subset of a data warehouse, which is used for analytical purposes like prediction.

Q2: What are the types of dimension modeling? Or What are the different types of schema?


Dimensional modeling, or schema design, is of two types.
1) Star Schema
2) Snowflake Schema 

Detailed Explanation:
Star Schema:
  • Dimension tables are not normalized
  • Dimension tables are not linked to each other
Snowflake Schema:
  • Dimension tables are normalized
  • Dimension tables are divided into sub-dimensions


Q3: What are the different types of Slowly Changing Dimension (SCD)?


Type 0 - Fixed dimension. No changes allowed, i.e. the data never changes.

Type 1 - No history. Updates the record directly; only current data is stored, with no historical data.

Type 2 - Maintains history by storing the changed value as a new record.
It is of three types:
1) Flagging (Y/N)
2) Versioning (sequenced by numbers)
3) Effective date (start date and end date)

Type 3 - Previous value column. Maintains history as current and previous version columns.

Q4: What are the differences between connected lookup and unconnected lookup?


Connected Lookup | Unconnected Lookup
The input value is received directly from the pipeline | The input value is received from the result of a :LKP expression
Dynamic or static cache can be used | Only static cache can be used
Returns multiple columns | Returns only one column
If no match is found, the default value is returned | If no match is found, NULL is returned
Supports a user-defined default value | Doesn't support a user-defined default value


Q5: How to optimize a lookup transformation?


1) If the lookup table is in the same database as the source, perform the join in the database itself.
2) Apply a filter to minimize the lookup rows.
3) Optimize the lookup condition by suppressing the default ORDER BY clause in the lookup SQL override.

Q6: What are the different types of keys?


Primary key - Uniquely identifies each record. Doesn't allow NULL value.

Foreign key - Primary key from one table, which has a relationship with another table.

Surrogate key - An artificial key that is system-generated and not derived from the database data.

Unique key - Uniquely identifies each value. Allows NULL value.

Q7: What are the files created by Informatica during session runs?


1) Error log 
2) Session log
3) Workflow log
4) Bad file/ Rejection file

Q8: What is pushdown optimization and what are its limitations?

Pushdown optimization analyzes the mapping logic and session configuration to determine which transformation logic can be pushed to the database.

Types of pushdown optimization:

None - Does not push any logic to the database

Source - Pushes as much logic as possible to the source database

Target - Pushes as much logic as possible to the target database

Full - Pushes as much logic as possible to both the source and target databases

One limitation of pushdown optimization is that the Rank transformation cannot be pushed.

Q9: Based on what factors can session logs be retained in Informatica?


Session logs can be retained based on:
1) Number of runs
2) Timestamps

Session properties

This can be configured in the session properties under the Config Object tab, as shown in the image above.

Q10: What is PMCMD?


PMCMD is a built-in command-line program that communicates with the Integration Service to start, stop, schedule, or monitor workflows.
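For example, a hypothetical invocation to start a workflow (the service, domain, user, folder, and workflow names are all placeholders):

pmcmd startworkflow -sv IntegrationService -d Domain_Name -u Administrator -p password -f FolderName wf_load_customers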

Q11: What is a router transformation and what are the groups available in the router transformation?


Router transformation is an active and connected transformation that is similar to the Filter transformation.
But the router provides multiple output groups, and each output group can hold a different filter condition.
Unlike the Filter transformation, records that don't satisfy any of the conditions are routed to the default group.

Router property

Types of groups in the router,

1) Input group
2) Output group
3) User-defined group
4) Default group

Q12: Difference between Filter and Router transformation?


Filter | Router
Only one condition can be applied | Multiple conditions can be applied
Can route to only one target | Can route to multiple targets
Single output group | Multiple output groups (plus a default group)
Rejected rows cannot be captured | Rejected rows can be captured by the default group



Read More :

Part 1 | Snowflake Developer Interview Questions

FREQUENTLY ASKED SNOWFLAKE DEVELOPER INTERVIEW QUESTIONS

In this post, we are going to explore Snowflake interview questions. Topics covered are Snowflake architecture, warehouses, types of constraints, types of tables, types of stages, and Snowflake indexes.

This will majorly cover all the theoretical questions, starting with the basics and moving to advanced concepts. Hoping that it will be helpful and wishing the best for your interview.


Q1: What is Snowflake and how is it different from other databases?

Snowflake is a cloud-based data warehouse delivered as a service (SaaS) that requires no hardware or software installation.

Advantages of the Snowflake database are

  1. Built on the cloud
  2. Unlimited storage
  3. Data is encrypted by default
  4. Simple and transparent pricing
  5. Moves to idle mode when not in use
  6. Time Travel (1 - 90 days)

Q2: Explain the Snowflake architecture?


Snowflake architecture consists of three layers.
    1. Cloud Services
    2. Query Processing
    3. Database Storage

Snowflake Architecture
Detailed Explanation:
Cloud services

  • A collection of services that coordinate activities across Snowflake.
  • Services managed in this layer include: authentication, metadata management, query parsing and optimization, and access control

Query processing

  • Uses virtual warehouses. Each virtual warehouse is a compute cluster composed of multiple nodes allocated by Snowflake.
  • Each virtual warehouse is an independent compute cluster that does not share compute resources with other virtual warehouses. As a result, each virtual warehouse has no impact on the performance of the others.

Database storage

  • Snowflake reorganizes the data into its internal optimized, compressed, columnar format.
  • Data stored in Snowflake can be accessed only through SQL queries.

Q3: What is shared disk and shared-nothing architecture?

Shared disk – Has a central repository for persisted data that is accessible from all compute nodes.

Shared nothing – Uses massively parallel processing (MPP) compute clusters, where each node in the cluster stores a portion of the entire data set locally.

Q4: What are the different types of warehouse size?

  • XSMALL
  • SMALL
  • MEDIUM 
  • LARGE
  • XLARGE
  • XXLARGE 
  • XXXLARGE 
  • X4LARGE 
Most commonly, a medium-sized warehouse is used for heavy-duty workloads.
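A minimal sketch of creating a warehouse with a chosen size (my_wh is a placeholder name):

CREATE WAREHOUSE my_wh WITH WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;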

Q5: What are the types of constraints in Snowflake?


Unique – To uniquely identify a value; NULL is allowed.

Primary key – To uniquely identify each record; NULL is not allowed.

Foreign key - The primary key of another table, used to define the relationship between tables.

NOT NULL - NULL values are not allowed.
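A minimal sketch combining these constraints (the table and column names are hypothetical; note that Snowflake enforces only NOT NULL, while the other constraints are informational):

CREATE TABLE customers (id NUMBER PRIMARY KEY, email STRING UNIQUE, name STRING NOT NULL);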

Q6: Does Snowflake support indexes?

Snowflake does not use indexes. This is one of the things that makes Snowflake scale so well for arbitrary queries.

Q7: What is a surrogate key?


A surrogate key is an artificial (non-natural) key, which means it is system-generated rather than derived from the database data or values.

Q8: What are the different types of tables in Snowflake?

Temporary Table – Stores non-permanent, transitory data.

  • Only exists within the session and will not be visible to other users.

Permanent Table – Available until explicitly dropped.

  • Has a Fail-safe period.

Transient Table - Available until explicitly dropped.

  • Doesn't have a Fail-safe period; preferred when the data doesn't require the same level of data protection and recovery as a permanent table.

External Tables - When queried, an external table reads data from a set of files in an external stage.

Q9: What is the difference between NULL, Blank, and Zero?


NULL - Unavailable, Unassigned, Unknown, or Not applicable

Blank - It is treated as a character (' ')

Zero - It is a number

Q10: What is a stage and explain its types?


A stage is used to streamline bulk loading of data into, or unloading data out of, database tables.

There are two types of stage,
1. External stage 
2. Internal stage 

Detailed Explanation:
External Stage
The external stage is used to access files placed in external locations like S3, Google Cloud Storage, Azure, etc.

Internal Stage
With the internal stage, the files are stored within Snowflake itself.
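A minimal sketch of creating an external stage and loading from it (the bucket path, credentials, table name, and file format are placeholders):

CREATE STAGE my_ext_stage URL = 's3://mybucket/data/' CREDENTIALS = (AWS_KEY_ID = 'awsAccessKey' AWS_SECRET_KEY = 'awsSecretAccessKey');

COPY INTO my_table FROM @my_ext_stage FILE_FORMAT = (TYPE = CSV);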


Tuesday 4 May 2021

Convert Nested JSON to CSV using Boto3 and Pandas

PARSE THE NESTED JSON FILE PLACED IN THE S3 BUCKET TO CSV USING BOTO3 AND THE PANDAS JSON_NORMALIZE FUNCTION

This blog will explore how to convert the nested JSON file in the S3 bucket to a CSV file using Boto3 and Pandas.

It is easier to flatten nested JSON using Pandas; the main element for flattening nested JSON is the json_normalize function (pandas.json_normalize, formerly pandas.io.json.json_normalize).

json_normalize converts an array of nested JSON objects into a flat DataFrame with dotted-namespace column names.

Example data of the JSON file

{"xmp:CreatorTool":"Adobe InDesign CC 2015 (Macintosh)",
"dam:Physicalheightininches":"15.0",
"dam:Physicalwidthininches":"26.666666666666668",
"dam:Producer":"Adobe PDF Library 15.0",
"branding":"branding",
"dam:Trapped":"False",
"productionType":"Closed Caption",
"intellectualProperty":"GTM2",
"dc:format":"application/pdf",
"xmpMM:DocumentID":"xmp.id:3aad8938-517f-49c6-a0bb",
"GTMID":661845,"dam:extracted":"Thu Feb 11 2021 15:19:53 GMT+0000",
"xmp:CreateDate":"Mon Feb 06 2017 23:48:28 GMT+0000",
"xmpMM:RenditionClass":"proof:pdf",
"xmpMM:OriginalDocumentID":"xmp.did:02801174072068",
"xmp:ModifyDate":"Mon Feb 06 2017 23:50:38 GMT+0000",
"xmp:MetadataDate":"Mon Feb 06 2017 18:50:38 GMT-0500",
"xmpMM:DerivedFrom":
{"stRef:instanceID":"xmp.iid:5ce8ffe1-5066-4aa3-8fc8-9e860fc26829","xmpNodeType":"xmpStruct"
,"jcr:primaryType":"nt:unstructured","stRef:documentID":"xmp.did:b4995bec-895c-4b10-a0f1",
"stRef:originalDocumentID":"xmp.did:0280117083B9C41E4EEF8C","stRef:renditionClass":"default"}}

json_normalize from pandas - Normalized JSON data

Python Code To Convert  Nested JSON in S3 bucket to CSV file :


import boto3
import json
import pandas as pd

# Local temporary path for the generated CSV file
fname = "/tmp/Sample.csv"

# AWS credentials of the DEV environment
s3Dev = boto3.client('s3', aws_access_key_id='awsAccessKeyDev',
                     aws_secret_access_key='awsSecretAccessKeyDev')

# Retrieving the file from the S3 bucket
obj = s3Dev.get_object(Bucket='bucket_name', Key='Folder/Sample.json')

# Reading the JSON body from S3 and decoding the bytes to a string
data = obj["Body"].read().decode()

# Parsing the JSON string into a Python dictionary for normalization
json_data = json.loads(data)

print(json_data)

# Flattens the nested JSON into a flat DataFrame with dotted-namespace column names
normalized = pd.json_normalize(json_data)

print(normalized)

# Writes the normalized DataFrame to a CSV file in temporary storage
normalized.to_csv(fname, index=False, encoding='utf-8')

# Uploads the generated CSV to the S3 bucket
s3Dev.upload_file(fname, 'bucket_name', 'Folder/Sample.csv')

Here are some other posts that might be helpful :