
Monday, 5 July 2021

UNIX Interview Questions | ETL


UNIX Commands for ETL Developer/ Tester/ Support


  • LIST 

1) Unix command to list files sorted by modification time (oldest first)

ls -ltr

2) Unix command to list files by size (largest first)

ls -lS

3) Unix command to list hidden files 

ls -a | grep '^\.'
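The three list variants can be tried together in a scratch directory (a sketch; the file names are just examples):

```shell
#!/bin/sh
# Work in a throwaway directory so no real files are touched
dir=$(mktemp -d)
cd "$dir" || exit 1

touch visible.txt .hidden.txt
printf '%100s' '' > big.txt      # 100-byte file
printf '%10s' ''  > small.txt    # 10-byte file

ls -ltr                          # long listing, oldest first
ls -lS                           # long listing, largest first
ls -a | grep '^\.'               # hidden entries only
```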


  • KILL

1) UNIX command to kill the most recent background job

kill $!
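`$!` expands to the PID of the most recently started background job, so this kills only that job. A minimal sketch:

```shell
#!/bin/sh
sleep 60 &                 # start a background job
bgpid=$!                   # $! = PID of that job
kill "$bgpid"              # send SIGTERM to it
wait "$bgpid" 2>/dev/null  # reap the process
echo "killed $bgpid"
```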



  • PRINT

1) UNIX command to print the first line of the file

head -1 filename.txt
or 
sed -n '1p' filename.txt

2) UNIX command to print the last line of the file

tail -1 filename.txt
or 
sed -n '$p' filename.txt

3) UNIX command to print the Nth line of the file

sed -n '<n>p' filename.txt

<n> - line number
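The first/last/Nth-line commands can be checked against a small sample file (a sketch; the file is just an example):

```shell
#!/bin/sh
f=$(mktemp)
printf 'first\nsecond\nthird\n' > "$f"

head -1 "$f"         # prints: first
tail -1 "$f"         # prints: third
sed -n '2p' "$f"     # prints: second  (Nth line with n=2)
```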

4) UNIX command to print a word in reverse

echo 'word' | rev

5) UNIX command to print the last word of a line

echo 'C for Car' | rev | cut -f 1 -d ' ' | rev


  • REMOVE
1) UNIX command to remove the first line of the file 

sed '1d' filename.txt

2) UNIX command to remove the last line of the file 

sed -i '$d' filename.txt

3) UNIX command to remove the Nth line of the file

sed -i '<n>d' filename.txt

<n> - line number

4) UNIX command to remove empty lines from the file 

grep -v '^$' filename.txt

5) UNIX command to cut the Nth word of a line
 
cut -f <n> -d ' '

<n> - number of the word (field)
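A short sketch exercising the remove commands on a sample file (without -i, sed writes the result to stdout and leaves the file untouched):

```shell
#!/bin/sh
f=$(mktemp)
printf 'one\ntwo\n\nthree\n' > "$f"

sed '1d' "$f"                        # drops the first line
sed '$d' "$f"                        # drops the last line
grep -v '^$' "$f"                    # drops the empty line
echo 'C for Car' | cut -f 2 -d ' '   # prints: for (2nd word)
```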


  • COUNT 

1) UNIX command to find the number of lines in the file

wc -l filename.txt

2) UNIX command to find the length of the Nth line in the file

sed -n '<n>p' filename.txt | wc -L

<n> - line number (note: wc -L is a GNU extension)
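For example, with a three-line file (wc -L reports the longest-line length, so piping a single line gives that line's length):

```shell
#!/bin/sh
f=$(mktemp)
printf 'ab\nabcd\nabc\n' > "$f"

wc -l < "$f"                 # prints: 3 (number of lines)
sed -n '2p' "$f" | wc -L     # prints: 4 (length of line 2)
```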


  • OTHERS

1) View non-printable characters in the file

vi filename.txt 

In vi, non-printable characters are shown in control notation (e.g. ^M), and :set list makes end-of-line characters visible.

2) Remove BOM characters

sed -i '1s/^\xEF\xBB\xBF//' filename.txt
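A round-trip check for the BOM removal (sed -i and the \xEF hex escapes are GNU sed extensions):

```shell
#!/bin/sh
f=$(mktemp)
# Write a UTF-8 BOM (EF BB BF) followed by normal text
printf '\357\273\277hello\n' > "$f"

sed -i '1s/^\xEF\xBB\xBF//' "$f"   # strip the BOM from line 1 only
cat "$f"                           # prints: hello
```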


Wednesday, 26 May 2021

Types Of Dimension Tables with examples | Datawarehousing


DIMENSION:

A dimension table gives context to the facts.
It contains the keys and descriptive attributes for the measurements in the fact table.


TYPES OF DIMENSION:

Conformed dimension:

Conformed dimensions mean the exact same thing with every possible fact table to which they are joined.

Example: 

The date dimension table connected to the sales facts is identical to the date dimension connected to the inventory facts.

Junk dimension:

A junk dimension is a collection of random transactional codes, flags, and/or text attributes that are unrelated to any particular dimension. The junk dimension is simply a structure that provides a convenient place to store these junk attributes.

Example: 

Assume that we have a gender dimension and marital status dimension. In the fact table, we need to maintain two keys referring to these dimensions. Instead of that create a junk dimension that has all the combinations of gender and marital status (cross join gender and marital status table and create a junk table). Now we can maintain only one key in the fact table.

Degenerate dimension:

A degenerate dimension is a dimension that is derived from the fact table and doesn’t have its own dimension table.

Example: 

A transactional code in a fact table.


Role-playing dimension:

Dimensions that are often used for multiple purposes within the same database are called role-playing dimensions. 

Example: 

A date dimension can be used for “date of sale”, as well as “date of delivery”, or “date of hire”.

SCD – Slowly Changing Dimension

TYPE – 1:  Overwrites changes without preserving history.
TYPE – 2:  Maintains history by storing each changed value as a new record, tracked by:
  • Effective date
  • Flagging
  • Versioning

TYPE – 3:  Maintains history as current and previous value columns.




Tuesday, 25 May 2021

Optimization in Informatica | Interview Question

OPTIMIZING TARGET:

  • Indexes and key constraints slow down data loading, so drop them before the load and recreate them afterwards.
  • Use bulk loading. It loads the data without writing database logs, which makes it more efficient. 

OPTIMIZING SOURCE:

  • Avoid ORDER BY and GROUP BY in the source query, since sorting and grouping add overhead on the source database.
  • Use a fast export reader instead of the relational reader for the source where available.

OPTIMIZING MAPPING:

  • Filter data as early in the mapping as possible.

OPTIMIZING TRANSFORMATION:


AGGREGATOR:

  1. Use sorted input option
  2. Use simple values for GROUP BY columns
  3. Filter data before using the aggregator
  4. Use incremental aggregation option

JOINER:

  1. Use the source with fewer duplicate key values as the master
  2. Perform join in the database whenever possible
  3. Use sorted data 

LOOKUP:

  1. If the lookup table database is the same as the source, then perform join in the database itself
  2. Filter the lookup rows to reduce the cache size
  3. Optimize the lookup condition

NORMALIZER:

  1. Place the normalizer close to the target.

SEQUENCE GENERATOR:

  1. Make the transformation reusable

SORTER:

  1. Allocate enough memory to sort the data

SOURCE QUALIFIER:

  1. Add filter as much as possible
  2. Use distinct option



Performance Tuning of Informatica mapping | Interview Question

One of the most frequently asked Informatica interview topics is performance tuning. 

  • How do you optimize mapping?
  • How do you identify which part of mapping is taking more time to run?
  • How do you optimize a transformation in your mapping?
  • How do you improve the performance of a mapping?



First of all, what is performance tuning?

  • To identify and resolve bottlenecks 
  • Performance bottleneck can occur at target, source, mapping, and session.
In this post, we will see where the bottleneck is caused, how to identify and eliminate them.

1) TARGET BOTTLENECK

  • Caused while Integration Service writes to a target database.

How to identify Target Bottleneck?

  • Configure the session to write to a flat file. If performance increases, then you have a target bottleneck.
  • This means more time is spent on writer thread than transformation or reader thread.

How to eliminate Target Bottleneck?

  • Drop indexes and key constraints on the target before loading, and use bulk loading.

2) SOURCE BOTTLENECK

  • Caused while reading from the source database.

How to identify Source Bottleneck?

  • If the session reads from a flat file, then probably no source bottleneck.
  • If the source connection is relational, then add a filter to the source and perform a read test.
  • This means more time is spent on the reader thread than transformation or writer thread.

How to eliminate Source Bottleneck?

  • Optimize the source query, and add indexes on the columns used in filters and joins.

3) MAPPING BOTTLENECK

  • Caused while transforming data.

How to identify Mapping Bottleneck?

  • Check the session log to find how much time is spent on the transformation thread.
  • This means more time is spent on the transformation thread than the reader or writer thread.

How to eliminate Mapping Bottleneck?

  • By optimizing the transformation properties.

4) SESSION BOTTLENECK

  • If the above bottlenecks are eliminated then you are probably facing the session bottleneck.

How to identify Session Bottleneck?

  • Check the information about each transformation input, output, and error row details.

How to eliminate Session Bottleneck?

  • By optimizing the session properties.

Thursday, 6 May 2021

Part 2 | Snowflake Developer Interview Questions

FREQUENTLY ASKED SNOWFLAKE CLOUD DEVELOPER INTERVIEW QUESTIONS

In continuation of Snowflake Developer Interview Questions and Answers (Part 1).

Topics covered are the Snowflake NVL2 and NVL functions, warehouses, streams, external tables, types of tables in Snowflake, Time Travel, Fail-safe, and materialized views.

Let's jump right away into the question.

Q1: What is Snowflake URL?


Format URL: account.region.snowflakecomputing.com 

Sample URL: blogger.us-east-1.snowflakecomputing.com

Q2: Explain the difference between the NVL2 and NVL functions?


NVL2: Takes three arguments and returns one value 
Syntax : NVL2( <expr1>, <expr2>, <expr3>)
  • If expr1 is not null, it returns expr2
  • If expr1 is null, it returns expr3
NVL: Takes two arguments and returns one value 
Syntax : NVL( <expr1>, <expr2>)
  • If expr1 is not null, it returns expr1 itself
  • If expr1 is null, it returns expr2
[Image: NVL2, NVL query]

Q3: Is it possible to suspend a warehouse manually?


Yes, a warehouse can be suspended manually in two ways.

1) Through the GUI: navigate to the Warehouses tab, select the warehouse, and click Suspend.

[Image: Snowflake Warehouses tab]

2) Through SQL, by executing a query as below.
Syntax:

CREATE OR REPLACE WAREHOUSE name

WITH WAREHOUSE_SIZE = 'MEDIUM';

ALTER WAREHOUSE [ IF EXISTS ] name SUSPEND | RESUME;

Q4: Difference between DBMS and RDBMS?

DBMS :

  • Abbreviation - Database Management System
  • Data is stored in hierarchical form

RDBMS :

  • Abbreviation - Relational Database Management System
  • Data is stored in tabular form


Q5: What happens when you suspend a warehouse?

When you suspend a warehouse, Snowflake suspends all idle servers for the warehouse but allows any servers that are executing statements to continue until those statements complete; at that point the servers are shut down and the warehouse changes to the suspended state. 

Q6: What is a stream?


A stream records changes made to a table, including information about inserts, updates, and deletes. The table for which changes are recorded is called the source table.
The stream exposes three extra columns (METADATA$ACTION, METADATA$ISUPDATE, and METADATA$ROW_ID).
Based on these columns, it is decided whether the incoming record is new, deleted, or updated.


Syntax: CREATE STREAM stream_name ON TABLE table_name;

Metadata$Action | Metadata$IsUpdate | Scenario
INSERT          | FALSE             | Insert - new record
DELETE          | FALSE             | Delete - deleted record
INSERT          | TRUE              | Update - updated record (new values)
DELETE          | TRUE              | Update - updated record (old values)



Q7: What is an external table?


When queried, an external table reads data from a set of files in an external stage.

Syntax :  
CREATE OR REPLACE EXTERNAL TABLE ext_table

WITH LOCATION = @stage_name

AUTO_REFRESH = true

FILE_FORMAT = file_format_name;

Q8: What are the types of tables in Snowflake?

Temporary Table – Stores non-permanent, transitory data. Only exists within the session and is not visible to other users.

Syntax : Create temporary table mytemptable (id number, creation_date date);

Permanent Table – Available until explicitly dropped.

Has a fail-safe period.

Transient Table - Available until explicitly dropped. Doesn't have a fail-safe period; preferred when the data doesn't require the same level of data protection and recovery as a permanent table.

Syntax : Create transient table mytranstable (id number, creation_date date);

Q9: Difference between Time Travel and Fail-safe?

Time Travel

  • Access historical data that has been changed or deleted at any point within a defined time period. (SELECT, CREATE … CLONE, UNDROP) 
  • Retention period – 1 to 90 days (default 1 day)
  • When the Time Travel retention period ends, the data moves into Fail-safe.

Fail safe

  • Historical data is protected in case of system failure
  • Data is recoverable only by Snowflake.
  • Retention period – 7 days for permanent tables, 0 days for transient tables

Q10: Difference between materialized view and normal view?


Materialized View                                  | View
Data is stored on disk                             | Holds the data virtually
Updated periodically based on the query definition | Refreshed only when accessed
Use when the result doesn't change often           | Use when the result changes often
Use when the view is queried often                 | Use when the view is not queried often










Wednesday, 5 May 2021

Part 1 | Informatica Interview Questions

FREQUENTLY ASKED INFORMATICA INTERVIEW QUESTIONS


In this blog, we are going to explore some of the most frequently asked Informatica interview questions. 
Hoping that it will be helpful and wishing the best for your interview.

Brief: Informatica is one of the most widely used ETL/data integration tools for loading and exporting data. 
It has its own transformation language that you can use in expressions, filters, and such.

Let's move on to the questions.

Q1: What is a database, data warehouse, and data mart?


A database stores current, operational data; it is typically smaller than a data warehouse.

A data warehouse is used to store historical data.

And a data mart is a subset of a data warehouse, used for analytical purposes such as prediction.

Q2: What are the types of dimension modeling? Or What are the different types of schema?


Dimension modeling or schema is of two types.
1) Star Schema
2) Snowflake Schema 

Detailed Explanation:
Star Schema:
  • Dimensional tables are not normalized
  • Dimensional tables are not linked to each other 
[Image: Star Schema]
Snowflake Schema:
  • Dimension tables are normalized
  • Dimension tables are divided into sub-dimensions  
[Image: Snowflake Schema]


Q3: What are the different types of Slowly Changing Dimension (SCD)?


Type 0 - Fixed dimension. No changes allowed i.e. data never changes.

Type 1 - No history. Updates the record directly; only current data is stored, no historical data.

Type 2 - Maintains history by storing the changed value as a new record.
It is of three types.
1) Flagging (Y/N)
2) Versioning ( Sequenced by numbers)
3) Effective date ( Start date and End date)

Type 3 - Previous value column. Maintains history as the current and previous version column-wise.

Q4: What are the differences between connected lookup and unconnected lookup?


Connected Lookup                                    | Unconnected Lookup
Receives input directly from the pipeline           | Receives input from the result of a :LKP expression
Dynamic or static cache can be used                 | Only a static cache can be used
Returns multiple columns                            | Returns only one column
If no match, returns the user-defined default value | If no match, returns NULL
Supports user-defined default values                | Doesn't support user-defined default values


Q5: How to optimize a lookup transformation?


1) If the lookup table is in the same database as the source, then perform the join in the database itself.
2) Apply a filter to minimize the lookup rows.
3) Optimize the lookup condition (for example, comment out the default ORDER BY clause that Informatica generates).

Q6: What are the different types of keys?


Primary key - Uniquely identifies each record. Doesn't allow NULL value.

Foreign key - Primary key from one table, which has a relationship with another table.

Surrogate key - A system-generated key that is not derived from the business data.

Unique key - Uniquely identifies each value. Allows NULL value.

Q7: What are the files created by Informatica during session runs?


1) Error log 
2) Session log
3) Workflow log
4) Bad file/ Rejection file

Q8: What is pushdown optimization and its limitations?

Analyzes the logic, mapping, and session configuration to determine which logic can be pushed to the database.

Types of pushdown optimization:

None - Does not push any logic

Source - Pushes as much logic as possible to the source

Target - Pushes as much logic as possible to the target

Full - Pushes as much logic as possible to both source and target

A limitation of pushdown optimization is that the RANK transformation cannot be pushed.

Q9: Based on what factors can session logs be retained in Informatica?


The session logs can be retained based on:
1) Number of runs
2) Timestamps

[Image: Session properties]

It can be configured in the session properties under the Config Object tab, as shown in the image above.

Q10: What is pmcmd?


pmcmd is a built-in command-line program that communicates with the Integration Service to start, stop, schedule, or monitor a workflow.

Q11: What is a router transformation and what are the groups available in the router transformation?


Router transformation is an active and connected transformation, similar to the filter transformation.
But the router provides multiple output groups, and each output group can hold a different filter condition.
Unlike the filter transformation, all the records that don't satisfy any of the conditions are routed to the default group.

[Image: Router property]

Types of groups in the router,

1) Input group
2) Output group
3) User-defined group
4) Default group

Q12: Difference between filter and router transformation?


Filter                              | Router
Only one condition can be applied   | Multiple conditions can be applied
Can route rows to only one target   | Can route rows to multiple targets
Has a single output group           | Has multiple output groups
Rejected rows cannot be captured    | Rejected rows can be captured by the default group




Part 1 | Snowflake Developer Interview Questions

FREQUENTLY ASKED SNOWFLAKE DEVELOPER INTERVIEW QUESTIONS

In this post, we are going to explore Snowflake interview questions. Topics covered are Snowflake architecture, warehouses, types of constraints, types of tables, types of stages, and Snowflake indexes.

This will majorly cover all the theoretical questions, starting with the basics and moving to advanced concepts. Hoping that it will be helpful and wishing the best for your interview.


Q1: What is Snowflake and how it is different from other databases?

Snowflake is a cloud-based data-warehouse-as-a-service (SaaS) that requires no hardware or software installation.

Advantages of Snowflake databases are

  1. Build on cloud
  2. Unlimited storage
  3. Data is encrypted by default
  4. Simple and transparent pricing
  5. Moves to idle mode when not used
  6. Time travel (1 - 90 days)

Q2: Explain the Snowflake architecture?


Snowflake architecture consists of three layers.
    1. Cloud  Services
    2. Query Processing
    3. Database Storage

[Image: Snowflake Architecture]
Detailed Explanation:
Cloud services

  • Collection of services that coordinate activities across Snowflake.
  • Services managed in this layer include: authentication, metadata management, query parsing and optimization, and access control

Query processing

  • Uses virtual warehouses. Each virtual warehouse is a compute cluster composed of multiple nodes allocated by Snowflake.
  • Each virtual warehouse is an independent compute cluster that does not share compute resources with other virtual warehouses. As a result, each virtual warehouse has no impact on the performance of other virtual warehouses.

Database storage

  • Snowflake reorganizes the data into its internal optimized, compressed, columnar format.
  • Data stored in Snowflake can be accessed only through SQL queries.

Q3: What is shared disk and shared-nothing architecture?

Shared disk – Has a central repository for persisted data that is accessible from all compute nodes.

Shared nothing – Uses massively parallel processing (MPP) compute clusters, where each node in the cluster stores a portion of the entire data set locally.

Q4: What are the different types of warehouse size?

  • XSMALL
  • SMALL
  • MEDIUM 
  • LARGE
  • XLARGE
  • XXLARGE 
  • XXXLARGE 
  • X4LARGE 
Most commonly, a medium-sized warehouse is used for heavy workloads.

Q5: What are the types of constraints in Snowflake?


Unique – To uniquely identify value, NULL is allowed.

Primary key – To uniquely identify each record, NULL is not allowed.

Foreign key - Primary key of another table, used to define the relationship between the tables.

NOT NULL - NULL values are not allowed.

Q6: Does Snowflake support indexes?

Snowflake does not use indexes. This is one of the things that makes Snowflake scale so well for arbitrary queries.

Q7: What is a surrogate key?


A surrogate key is a system-generated key that is not derived from the database data or business values.

Q8: What are the different types of tables in Snowflake?

Temporary Table – Stores non-permanent, transitory data.

  • Only exists within the session, will not be visible to other users.

Permanent Table – Available until explicitly dropped.

  • Has fail-safe period.

Transient Table - Available until explicitly dropped.

  • Doesn't have a fail-safe period; preferred when the data doesn't require the same level of data protection and recovery as a permanent table.

External Tables - When queried, an external table reads the data set from a set of files in an external stage.

Q9: What is the difference between NULL, Blank, and Zero?


NULL - Unavailable, Unassigned, Unknown, or Not applicable

Blank - It is treated as a character (' ')

Zero - It is a number

Q10: What is a  stage and explain its types?


A stage is a location used to streamline bulk loading and unloading of data into and out of database tables.

There are two types of stage,
1. External stage 
2. Internal stage 

Detailed Explanation:
External 
The external stage is used to access the files placed in external locations like S3, Google Cloud, Azure, etc.

Internal Stage
With the internal stage, the files are stored in the Snowflake itself.