
Analyst Interview Questions

There are three significant areas to cover. The assumption is that questions in these areas will also provide data to assess leadership, culture fit, and communication skills.
First: Business Process Assessment. The candidate should be able to assess problems and opportunities in a case-study format.
Second: Technical Depth. The candidate will need to retrieve, manipulate, and evaluate large sets of data efficiently.
Third: Self-directed/Leadership. Will this candidate look for business opportunities with passion?
1. Business Process Assessment
Quick Questions
What are you reading currently?
What has influenced your business behavior most heavily in the last year?
What kind of process or project management training have you had?
What do you think of (Five Forces, Rational, UML, competitive advantage, Ted Levitt's criticism of the "product lifecycle", Six Sigma)?
How do you get to root cause for an issue such as
Looking for: Evidence that the person is growing and stretching in a direction that is successful at Amazon. Is the candidate reading The Learning Organization, The Innovator's Dilemma, or Who Moved My Cheese? Do they read the Economist, MIT Tech Review, HBR, or Newsweek? Do they recognize process terms?
Longer Questions: These should be of two types: simple "give me an equation for this situation" questions, and then vaguer case-type questions.
Simple Equations
Profitability = Revenue - Costs
Need Inventory = OH - Demand (bonus if time-phased and includes forecasts and in-transits)
Predicted OH Inventory = OH + In-transits + POs - Demand - Forecast
Healthy Inventory =

Start just asking for a simple definition, then start discussing factors with the candidate.
For example in profitability, what factors go into costs?
Looking for thoughtfulness & testing of assumptions. How does the candidate think
through the question - systematically or ad-hoc
Cases:
Skip suggested to preface questions of this type with "there is no right answer, I want to
use this as an example to see how you approach a problem" <insert Nimrod, Janice, Skip
questions here>
2. Technical Assessment Questions
Quick questions:
From a list of orders over the last week, using the tool of your choice:
1. Rank the orders by quantity
2. Average quantity for each vendor
3. Number of distinct vendors per week
4. Find and count lines in a log file that have a specific ASIN or user id
(In an onsite, you could have a data file on a laptop and say "show me".)
Looking for:
Unix - cut, cat, find, grep, sort
Excel/Access - basic functions, pivot tables, data structures, domain tables
SQL - nested queries, functions, basic joins
Perl - any scripting? RegExp
Reference - does the candidate know how to find help and admit boundaries?
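
As a rough illustration of the Unix route, the first three quick questions can be answered with sort, cut, and awk. The tab-separated layout (order_id, vendor, quantity) and the sample rows are assumptions made up for this sketch, not data from a real system.

```shell
# Hypothetical sample: order_id <tab> vendor <tab> quantity
orders=$(mktemp)
printf 'o1\tacme\t5\no2\tzenith\t12\no3\tacme\t7\no4\tnadir\t3\n' > "$orders"
tab=$(printf '\t')

# 1. Rank the orders by quantity, largest first (numeric sort on column 3).
ranked=$(sort -t "$tab" -k3,3nr "$orders")

# 2. Average quantity for each vendor.
avg_by_vendor=$(awk -F '\t' '{ sum[$2] += $3; n[$2]++ }
    END { for (v in sum) printf "%s\t%.1f\n", v, sum[v] / n[v] }' "$orders" | sort)

# 3. Number of distinct vendors in the file.
distinct_vendors=$(cut -f2 "$orders" | sort -u | wc -l | tr -d ' ')

echo "$ranked"
echo "$avg_by_vendor"
echo "distinct vendors: $distinct_vendors"
rm -f "$orders"
```

A candidate who reaches for a spreadsheet pivot table or a SQL GROUP BY instead is answering the same question; the point is whether they can get to the numbers quickly with a tool they know.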
Longer SQL/Unix Questions
1. I need to provide a report with the total units and the average cost of book orders by day of week over the last 10 weeks, by country.
A good answer will look something like:
select to_char(order_day, 'DY') as day_of_week, country,
       sum(quantity) as total_units, avg(item_cost) as avg_cost
from (
    select order_day, country, quantity,
           (quantity * cost) as item_cost
    from order_items
    where product = 'books'
    and order_day between sysdate - 70 and sysdate )
group by to_char(order_day, 'DY'), country
The candidate should recognize:
-cost must be calculated on an item basis before averaging, in a nested or inline query
-the outer query needs a group by on day of week and country
-sql functions exist for total, average, & date manipulation

For extra credit add a join - such as book name

1. "Say there's a text file of the form "userid-tab-command" that tracks all the
commands that a given user runs. How would you find out how many times user
"Bob" has run any command at all?"
A Good Answer: At its most basic:
"grep -c Bob filename" or "cat filename | grep Bob | wc -l".

If they understand that "Bob" could be part of the command, then the correct grep is
actually:
grep -c "^Bob" filename

to anchor the match to the user field. Even better, in case there's a user called "Bob" and another called
"BobH", they should do:
grep -c "^Bob<tab>" filename
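
The three levels of answer can be compared on a small sample. This is a sketch only: the log contents below are invented, and the tab is built with printf because typing a literal tab inside quotes is easy to get wrong.

```shell
# Hypothetical log: userid <tab> command
log=$(mktemp)
printf 'Bob\tls\nBobH\tcat notes\nalice\tmail Bob\nBob\tgrep foo\n' > "$log"
tab=$(printf '\t')

loose=$(grep -c 'Bob' "$log")          # any line containing Bob, even in the command
anchored=$(grep -c '^Bob' "$log")      # user field starts with Bob (still counts BobH)
exact=$(grep -c "^Bob${tab}" "$log")   # user field is exactly Bob

echo "loose=$loose anchored=$anchored exact=$exact"
rm -f "$log"
```

On this sample the counts are 4, 3, and 2, which makes the over-counting of the simpler patterns easy to discuss with the candidate.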

3. Self-directed/Leadership
Assessing these behaviors may occur throughout the questions of other areas.
-Look for an opportunity to challenge the candidate on something that is obviously right or true.
-Do they hold their ground?
-Do they get angry, or look to understand why they're being challenged?
-What does the resume say about the candidate?
-Did they found anything, start anything, or volunteer on something huge?

Is this person an autodidact?


Hiring Analysts


Companies that hire lots of analysts have the process down to a science, just as Amazon does for SDEs. The interview process at a Big 5 consulting firm is very defined and behaviorally focused, looking at capabilities. An analyst is a unique creature, but not impossible to find and assess.
During the interview loop, in addition to culture fit and interpersonal skills, a candidate should be reviewed on how they've displayed analyst-type competencies in the past AND asked to solve a problem that demonstrates the competency live. Analysts are usually good presenters, so just asking them about the past may not reveal the limits of their abilities.
Here are a couple frameworks of core Analyst Competencies:
--1. Think broad and deep: can take the big picture strategic business view and can also
dive into the details to understand a problem
2. Problem solving skills: can they structure and frame a problem, make estimates when
necessary, figure out the dataset needed (smallest, easiest dataset to draw solid
conclusions), get and analyze the data, summarize the conclusions and their reasoning
3. Communication skills: clear, organized, concise, ability to adapt to audience (VP to
SDE), think on the fly, thoughtful
4. Multi-tasking: can they juggle many issues at one time?
5. Independence: ability to work with minimal direction and ask for help when needed
6. Customer focus
7. Cultural fit: Team (COFS, SCOS,...) and Amazon
8. Leadership
--Find, Frame, Analyze, & Deliver within Amazon
Find Problems/Opportunities
An analyst should be able to recognize broken processes, bad processes, troubleshoot
processes. But also prioritize whether the proposal is polishing a pig or creating a golden
cow. Building pretty toys with no ROI is a waste of time. Given the business maturity at
Amazon, there are a lot of process improvements or new businesses where money can be
saved/found.

Past Example A candidate should be able to point to past projects where they:
-worked as support
-Saved x USD, n Minutes as a result of a process change

Have them explain their role, then ask what upstream/downstream business impacts
occurred after the change occurred. Drive into specifics
Problem Solving Provide a problem for them to solve:
-What should Amazon add to its site to deliver more on "Find, Discover & Buy anything online"?
-What is different for Amazon over Blockbuster Video?
-What impacts to supply chain & customer experience would be felt by adding an Amazon Air Travel Store?
-Given factors x, y, z, how would you calculate ROI on project Q to present to a sr. vice president.... in an hour?
This competency is a display of business competency - does the candidate see the big picture or get wrapped up in their project?
Potential Skills
-Process mapping: Can a candidate draw out a level 1, 2, 3 diagram? Do they understand ICOMs, or do they move into systems & dataflow?
-Basic Business Measures: ROI, DRP, Forecasting, S&OP
--Frame Model & Hypothesis
The analyst is part wizard, part math professor. They are called in to explain the past,
look into the crystal ball about the future, and draw a cool looking formula to make you
believe it. Acceptable skills vary here, an Ops Research will need to demonstrate different
skills than an MBA type or a Supply Chain type. But at the end of the day, an Amazon
analyst will need to be able to stand at a whiteboard and draw some algebraic looking
formula (sum of receive time - confirm time for n asins * min fifo cost layer.... )
Can they identify the opportunity (competency 1) then define and model it here?
Past Example A candidate should be able to explain past projects where they:
-developed a model (forecasting, spreadsheet, financial...)
-describe the tool(s) with which they've worked (AMPL,Excel,Pkg
Software)

Have them explain their role, Drive into specifics


Problem Solving Provide a problem for them to solve - tweak it for ecommerce:
-Why do split shipments matter?
-How would you build a forecasting model for new products with no history?
-What data does Amazon have that is unique, and how can this be used in Supply Chain?
-How many customers does a 2% damage rate to the top 10 best selling items at the top 4 FCs impact?

This competency is a display of analytic skills - does the candidate set assumptions, challenge the definitions, and display the ability to draft a reasonable model? Could they build a metrics package?
Potential Skills
-Modeling: Can a candidate draw out a forecast equation or a linear program?
-Advanced Business Measures: Time Value of Money
--Analyze
Once an analyst has built a model, no one is going to go get the data for them. The tools on hand will be limited or perhaps not available. To succeed, the analyst will need to identify and evaluate a data source, then either get the data themselves or negotiate for SDE time. Since SDE time is money, negotiating is usually the less preferred choice. The key elements here are the abilities to:
*Retrieve Data
*Evaluate Data Quality
*Handle Data Scale
So an analyst has found a good opportunity and determined how to quantify it, but how will the control be built on an ongoing basis?
Past Example A candidate should be able to explain past projects where they:
-built a tool or heavily configured software
-what were the shortcomings? how did they drive through their weaknesses
-What data gathering tools were used, how big was the data set

Have them explain their role, Drive into specifics


Problem Solving Provide a problem for them to solve - tweak it for ecommerce:
-If SQL is a listed skill, ask for a query that tests aggregation, functions, joins, & business definitions, e.g. write a query from an order items table that returns the average # of orders and the average cost of orders by product line over the last 15 weeks.
A great candidate should question the assumptions - why 15 weeks, why average, why aggregated at all? Follow up with "What decisions could I make with that data?"
-SDE design questions are good here too
This competency is a display of technical skills & business skills - could the candidate analyze a data set with 2 million rows? What conclusions do they draw from the results?
Potential Skills
-SQL
-Design
--Deliver

Once the analysis is completed, is it just a report on a shelf? What changed? Were cost
reductions actually realized? What form did the analysis results take - powerpoint, 3 ring
binder, email, whitepaper? Who saw them and what did they do? Is the candidate aware
of good visualization guidelines (Tufte, W. Cleveland) or do they LOVE powerpoint? At
Amazon, Analysts often present their own results - will the work stand up to scrutiny?
Past Example A candidate should be able to explain past projects where they:
-presented results in detail, in 15 minutes
-How did you get your points across in your allotted 10 minutes of
executive time?
-What data presentation tools were used?

Problem Solving Provide a problem for them to solve - tweak it for ecommerce:
-"You have 15 minutes tomorrow afternoon to report back to a VP about a question he asked you today regarding specific metric accuracy. Could you prepare an outline of your answer? What format would it be in? How would you follow up on your recommendations?"
Look for creativity
Potential Skills
-Creativity
-Effective communication
-A get-it-done attitude
Data Engineer Interview Questions

Contents

1 Sample Interview Questions for Data Engineering Candidates
1.1 DW Concepts
1.2 Tuning
1.3 SQL
1.4 Oracle
1.5 ETL
1.6 Linux/Unix
1.7 Teradata
1.8 Data Modeling
1.9 Additional Questions for DEIII (Level 6) Bar
1.9.1 Oracle
1.9.2 Architecture and design
1.10 Reporting Specific Interview Questions

Sample Interview Questions for Data Engineering Candidates
DW Concepts

What are the advantages of star schema design?
1. It allows business entities to map directly to the schema design, for highly optimized query performance.
2. It is widely supported by a number of BI tools.
3. It is the simplest data warehouse schema.

Can you describe the different types of slowly changing dimensions (Type I, II, III)? What are the key differences in their implementation?
1. Type I SCDs are dimensions where old data is overwritten with new data and no historical data is kept.
2. Type II SCDs are dimensions where multiple records are kept to track historical data. A 'Version' or 'Effective Date' column is a common way to preserve unlimited history, with a new record for each update.
3. Type III SCDs are dimensions where a limited amount of history is preserved by using separate columns. 'Original' or 'Previous' columns alongside the current column are common ways to track a limited number of changes.

What are the difficulties in implementing a Type II dimension table?
ANS: When new records are created to represent changes in a dimension table, the relationships between the fact tables and common keys can become inconsistent and lead to inaccurate results. Depending on the relationship between the dimension and fact table, the fact table may not capture all relevant dimensions when being queried.

Given a type II dimension table having a 32bit guid as the natural key, how would you design the fact tables to support both point in time as well as current hierarchy reporting?
ANS: Create a 'bridge' table to collect and assign keys to all unique combinations of the GUID and timestamp/level, and store the unique bridge key in the fact table.

What are semi-additive facts? Give some examples.
ANS: Semi-additive facts are facts that can be aggregated across some dimensions, but not meaningfully across others.
An example of this is Current Balance and Profit Margin. Current Balance is a semi-additive fact because it makes sense to add the Current Balance for all accounts at one point in time, but not over a period of time, whereas for Profit Margin, you may want both.
Another example of a semi-additive fact is Local Net Revenue. Local Net Revenue can only be aggregated in the context of the local currency to give an accurate calculation. This semi-additive fact would need to be (Local Net Revenue * Conversion Rate) to be a fully additive fact.

Take the source schema below.
1. Product table: Product_Id, Product_Name, Launch_Dt, Product_Price
2. Store table: Store_Id, Store_Name, Launch_Dt
3. SalesRep table: SalesRep_Id, SalesRep_Name, DateofJoin, StoreId
4. Orders table: Order_Id, Order_Date, Store_Id, SalesRep_Id, Total Amount, Total Quantity
5. Order Items table: Order_Id, Item_Id, Quantity, Amount

How would you approach building the DW schema for the above model?
ANS: Star schema, Snowflake schema, or the same model as the source.

What kind of factors should be considered while building the fact table? Would merging the Orders table and the Order Items table make sense or not?
ANS: Factors like storage, maintenance, reduced duplication, denormalization, volume, and backfill contribute to the design decision for the fact table in such a scenario.

How would you maintain the history of Product_Price?
ANS: A Type 2 SCD would solve the problem.

What is Full/Initial load & Incremental/Refresh load?
ANS: Initial load is when you are populating tables in the DW schema for the first time. Full load refers to populating the entire table, whether for the first time or to overwrite data. Incremental/Refresh load refers to populating tables with records that were not already in the tables.

What is the difference between OLTP and DW systems?
ANS: OLTP systems record transactions in real time and aim to automate the clerical data entry processes of a business entity. DW systems are a store of current and historical data extracted from external sources for aggregation and analytical querying.

What is the data type of the surrogate key?
ANS: Integer or Numeric.

What is a staging area? Do we need it? What is the purpose of a staging area?
ANS: A staging area is intermediate storage space between external sources and the DW. Yes, we need staging areas. The purposes of a staging area are:
1. Gathering data from different sources for transforming at different times
2. Letting OLTP systems quickly offload data when in need of free space
3. Comparing incoming data against current datasets within the DW
4. Pre-joining and aggregating data, as well as 'data cleansing'

How do you determine what records to extract in an Incremental/Refresh load?
ANS: By using Type II and Type III SCDs as a common key to determine the most recent or newly created records. A last-modified timestamp on the source rows is another common way to pick up only what changed.

What is a data mart?
ANS: A data mart is a subset of the DW, usually kept as a separate schema so that a specific business unit/group can modify and maintain the data within it without affecting other data marts or the DW.

What are push and pull ETL strategies?
ANS: Push and pull ETL strategies refer to the way in which data is transferred from source to ETL tool. Push ETL is when the external source sends data to the ETL tool. Pull ETL is when the ETL tool requests/retrieves data from the source.

You have an Item Orders fact table. Will you store the Product Group of the item in the fact? If so, why? If not, why not?
ANS: I would store the foreign key of the Product Group dimension table in the Item Orders fact table, so that it could be used for pivoting and reporting on Product Groups.

What are the Star and Snowflake schemas? What is the difference?
ANS: The Star schema consists of one or more Fact tables relating to any number of Dimension tables. The Snowflake schema is represented by centralized Fact tables related to dimensions on multiple levels. The main difference is that in a Star schema, each dimension has only one table. In a Snowflake schema, a dimension can be normalized into several related tables, only some of which join directly to the fact tables.

What does level of granularity in a fact table mean?
ANS: The level of granularity in a fact table refers to the detail and precision at which a fact is captured within a given context.

Tuning

If you have a poorly performing report/ETL process, how would you investigate and tune it, going all the way back to table design?

Explain plans - when tuning, what do you look for in an explain plan that screams red flags?

What if you didn't have indexes?

What about partitioning?

What about the Oracle-level join types (hash, nested loop), and when should each be used?

What are the different types of joins, and when should each be used?

SQL

DE 1 bar is questions 1-3

1. Given an orders (order_id, order_day) table, give the count(*) of orders last week.

SELECT COUNT(ORDER_ID) AS NUM_OF_ORDERS
FROM ORDERS
WHERE ORDER_DAY >= TRUNC(SYSDATE,'DAY') - 7
AND ORDER_DAY < TRUNC(SYSDATE,'DAY')

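2. Order_items table (item_id, order_id, qty)... sum of qty by month. (ORDER_DAY is not on order_items, so it comes from the orders table in question 1, joined on ORDER_ID.)

SELECT TO_CHAR(O.ORDER_DAY,'YYYY-MM') AS ORDER_MONTH, SUM(OI.QTY) AS MONTHLY_SUM
FROM ORDER_ITEMS OI
JOIN ORDERS O ON O.ORDER_ID = OI.ORDER_ID
GROUP BY TO_CHAR(O.ORDER_DAY,'YYYY-MM')

3. Order_items table (item_id, order_id, qty)... sum of qty by month when more than 50.

SELECT TO_CHAR(O.ORDER_DAY,'YYYY-MM') AS ORDER_MONTH, SUM(OI.QTY) AS MONTHLY_SUM
FROM ORDER_ITEMS OI
JOIN ORDERS O ON O.ORDER_ID = OI.ORDER_ID
GROUP BY TO_CHAR(O.ORDER_DAY,'YYYY-MM')
HAVING SUM(OI.QTY) > 50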

4. Pivot: using the data from #3, give me the data with the months as columns instead of rows.

SELECT
SUM(CASE WHEN ORDER_MONTH = '01' THEN MONTHLY_SUM END) AS JAN,
SUM(CASE WHEN ORDER_MONTH = '02' THEN MONTHLY_SUM END) AS FEB,
SUM(CASE WHEN ORDER_MONTH = '03' THEN MONTHLY_SUM END) AS MAR
FROM
(SELECT TO_CHAR(O.ORDER_DAY,'MM') AS ORDER_MONTH, SUM(OI.QTY) AS MONTHLY_SUM
FROM ORDER_ITEMS OI
JOIN ORDERS O ON O.ORDER_ID = OI.ORDER_ID
GROUP BY TO_CHAR(O.ORDER_DAY,'MM')
HAVING SUM(OI.QTY) > 50)

Alternatively: given item_properties (asin, binding, value), provide SQL that gives 1 row per asin.

SELECT
ASIN,
COUNT(CASE WHEN BINDING = 'DVD' THEN 1 END) AS NUM_OF_DVDS,
SUM(VALUE) AS ITEM_VALUE
FROM ITEM_PROPERTIES
GROUP BY ASIN

5. Given an orders table (order_id, order_day, billing_address_id, customer_id), provide the last billing address every customer used.

SELECT CUSTOMER_ID,
MAX(BILLING_ADDRESS_ID) KEEP (DENSE_RANK LAST ORDER BY ORDER_DAY) AS LAST_BILLING_ADDRESS_ID
FROM ORDERS
GROUP BY CUSTOMER_ID

Write a query for customers who bought both a year ago and yesterday.

Create a query whose result set contains a running total. Example table: orders (order_id, order_day, qty); running sum of qty by day.

SELECT ORDER_ID, ORDER_DAY, QTY, SUM(QTY) OVER (ORDER BY ORDER_DAY) AS RUNNING_TOTAL
FROM ORDERS

What are the differences between aggregate and analytic functions, and how does Oracle handle them differently?

ANS: Aggregate functions return one result per group of the result set, whereas analytic functions return a result for every row in each group, i.e. using analytic functions we may display group results alongside individual rows.

Given an orders table with order_id, customer_id and order_date with


the sample data
o

Order_id, Customer_id, order_date

O1, C1, 01-Jan-2000

O2, C2, 01-Jan-2002

O3, C3, 01-Apr-2002

O4, C4, 01-Apr-2003

O5, C4, 01-Jan-2006

O6,C1, 01-May-2006

Give SQL for the list of customer_ids who placed more than 1 order.

SELECT CUSTOMER_ID, COUNT(ORDER_ID) AS NUM_OF_ORDERS
FROM ORDERS
GROUP BY CUSTOMER_ID
HAVING COUNT(ORDER_ID) > 1

Give the SQL for the list of customer_ids who have placed at least 1 order in 2000 and at least 1 order in 2006.

SELECT CUSTOMER_ID
FROM ORDERS
WHERE TO_CHAR(ORDER_DATE,'YYYY') IN ('2000','2006')
GROUP BY CUSTOMER_ID
HAVING COUNT(DISTINCT TO_CHAR(ORDER_DATE,'YYYY')) = 2

Please write SQL which generates the number of orders for each year, 2000 to 2006.

SELECT
COUNT(CASE WHEN TO_CHAR(ORDER_DATE,'YYYY') = '2000' THEN 1 END) AS ORDERS_2000,
COUNT(CASE WHEN TO_CHAR(ORDER_DATE,'YYYY') = '2001' THEN 1 END) AS ORDERS_2001,
COUNT(CASE WHEN TO_CHAR(ORDER_DATE,'YYYY') = '2002' THEN 1 END) AS ORDERS_2002,
COUNT(CASE WHEN TO_CHAR(ORDER_DATE,'YYYY') = '2003' THEN 1 END) AS ORDERS_2003,
COUNT(CASE WHEN TO_CHAR(ORDER_DATE,'YYYY') = '2004' THEN 1 END) AS ORDERS_2004,
COUNT(CASE WHEN TO_CHAR(ORDER_DATE,'YYYY') = '2005' THEN 1 END) AS ORDERS_2005,
COUNT(CASE WHEN TO_CHAR(ORDER_DATE,'YYYY') = '2006' THEN 1 END) AS ORDERS_2006
FROM ORDERS

Display the employee records who joins the department before their
manager?
o

SELECT emp1.*

FROM EMPLOYEES emp1, EMPLOYEES emp2

WHERE emp1.MANAGER_ID = emp2.EMPLOYEE_ID

AND emp1.EMPLOYEE_JOIN_DATE < emp2.EMPLOYEE_JOIN_DATE

Display employee records getting more salary than the average salary in their department.

SELECT DEPT, EMPLOYEE, SALARY, DEPT_AVG_SALARY
FROM
(SELECT DEPT, EMPLOYEE, SALARY, AVG(SALARY) OVER (PARTITION BY DEPT) AS DEPT_AVG_SALARY
FROM EMPLOYEES)
WHERE SALARY > DEPT_AVG_SALARY

Display the highest paid employee in each department.

SELECT DEPT, EMPLOYEE, SALARY
FROM
(SELECT DEPT, EMPLOYEE, SALARY, RANK() OVER (PARTITION BY DEPT ORDER BY SALARY DESC) AS SALARY_RANK
FROM EMPLOYEES)
WHERE SALARY_RANK = 1

Display the 2nd highest paid employee in each department.

SELECT DEPT, EMPLOYEE
FROM
(SELECT DEPT, EMPLOYEE, RANK() OVER (PARTITION BY DEPT ORDER BY SALARY DESC) AS SALARY_RANK FROM EMPLOYEES)
WHERE SALARY_RANK = 2

(If two employees tie for the highest salary, RANK skips rank 2; DENSE_RANK avoids that.)

Select student_id, student_name from students where student_id = 1 and student_id = 2. What does the query return?

ANS: It returns nothing, since no single row can have student_id equal to both 1 and 2.

Can you insert data into a view?

ANS: Yes, provided the view is updatable (for example, a simple view over a single table).

What is a view? What is a materialized view? What is the difference between a view and a materialized view?

ANS: A view is a virtual table defined by a query, which can combine data from multiple tables. The query runs each time the view is referenced, so a view always reflects the current data in its underlying tables. A materialized view physically stores the query result, so it must be refreshed (on demand, on a schedule, or on commit) to pick up changes in the underlying tables.

What is a Cartesian product in SQL?

ANS: A Cartesian product pairs each row in one table with every row in each of the other tables listed in the query. This happens when no join condition is defined between the tables.

How do you find the number of rows in a table?

ANS: SELECT COUNT(*) FROM TABLE_NAME

What is the use of DESC in SQL?

ANS: DESC can be used to describe the structure of a table (DESC table_name in SQL*Plus), or to arrange records in descending order (ORDER BY ... DESC).

What is a merge statement? What is the requirement for a merge statement? Is a PK necessary for merge?

ANS: The MERGE statement is used to select rows from one or more sources for update or insertion into a table or view. You can specify conditions to determine whether to update or insert into the target table or view. You must have the INSERT and UPDATE object privileges on the target table and the SELECT object privilege on the source table. To specify the DELETE clause of the merge_update_clause, you must also have the DELETE object privilege on the target table. Another requirement is that the same row of the target table cannot be updated multiple times in the same MERGE statement, so the ON condition must match each target row to at most one source row. A primary key is not strictly required, but a unique or primary key on the join columns is the usual way to guarantee this.

What is DUAL? Is it a table? If so, what columns does it have? What's the data type?

ANS: DUAL is a small one-row table that Oracle creates along with the data dictionary. It has only one column, named DUMMY, which is a VARCHAR2(1) holding the value 'X'.

Give some examples where you have used analytic functions.

ANS: RANK and PERCENT_RANK are good analytic functions when you want to create column values based on the rest of the dataset and its relationship to each record.

Oracle

What is the difference between UNIQUE and PRIMARY KEY constraints?

ANS: You can have more than one UNIQUE constraint within a table, and its columns can be NULL, whereas there can only be one PRIMARY KEY constraint per table, and its columns cannot be NULL.

Differentiate between TRUNCATE and DELETE.

ANS: TRUNCATE is a DDL command and cannot be rolled back. DELETE is a DML command and can be rolled back. Both remove rows, but TRUNCATE always removes every row (it takes no WHERE clause) and does it faster.

What is the difference between UNION and UNION ALL?

ANS: UNION filters out duplicate rows to give DISTINCT results, while UNION ALL does not.

Differentiate between IN and EXISTS. Which is faster - IN or EXISTS?

ANS: IN evaluates the outer query against the full list of values produced within the clause. EXISTS evaluates the subquery for each outer row only until there is a match. EXISTS is often faster because evaluation can stop after the first match, whereas all values in an IN list may have to be examined. In practice the optimizer may rewrite either form, so compare the explain plans.

Difference between CHAR and VARCHAR2?

ANS: CHAR is a fixed-length data type, blank-padded to its declared length. VARCHAR2 is a variable-length data type that stores only the characters actually entered.

What is the NVL statement? How is it different from DECODE? Is it possible to implement NVL with DECODE?

ANS: NVL(FIELD_NAME, REPLACEMENT_VALUE) returns REPLACEMENT_VALUE if FIELD_NAME is NULL, and FIELD_NAME otherwise. It is different from DECODE in that DECODE has a general if-then-else structure. Yes, NVL can be implemented by DECODE using:
DECODE(FIELD_NAME, NULL, REPLACEMENT_VALUE, FIELD_NAME)

What are partitions?

ANS: A table partition is a collection of rows that is a subset of a user-created table.

What does ROLLBACK do?

ANS: The ROLLBACK statement is the inverse of the COMMIT statement. It undoes some or all database changes made during the current transaction.

What does COMMIT do?

ANS: COMMIT makes permanent the changes resulting from all SQL statements in the transaction.

Can a primary key contain more than one column?

ANS: Yes, but it is then referred to as a composite primary key. A simple primary key consists of a single column.

Which is faster, Insert or Delete?

ANS: Insert.

Is there any way we can change the column name in a table?

ANS: Yes. By using the ALTER TABLE ... RENAME COLUMN ... TO ... command.

Difference between CASE and DECODE?

ANS: DECODE can only work with scalar values and equality comparisons. CASE can work with predicates and subqueries, and handles NULL differently.

What's the difference between 10g and 11g partitioning?

ANS: 11g adds INTERVAL partitioning, which moves functionality currently handled by ETL pre-wrappers (creating partitions ahead of a load) into default RDBMS processing defined in data dictionary metadata (automatic partition creation).

What is meant by analyzing tables?

ANS: Analyzing a table involves collecting and interpreting statistics on a table, such as the following:
Collect or delete statistics about an index or index partition, table or table partition, index-organized table, cluster, or scalar object attribute.
Validate the structure of an index or index partition, table or table partition, index-organized table, cluster, or object reference (REF).
Identify migrated and chained rows of a table or cluster.

What is an Oracle hint? Is the hint a command, or does Oracle use it optionally?

ANS: A hint is a code snippet embedded as a comment in a SQL statement to suggest to Oracle how the statement should be executed. The optimizer follows a hint that is valid and applicable, but a hint that is misspelled or cannot be applied is silently ignored rather than raising an error.
Note: Hints should only be used as a last resort, if statistics were gathered and the query is still following a sub-optimal execution plan.

What is an Explain Plan?

ANS: An Explain Plan is the ordered set of steps the optimizer intends to use to access or modify data for a query, along with estimates of the time and cost of processing.

Difference between hash and nested loop joins?

ANS: Hash joins are used for joining large data sets. The optimizer uses the smaller of the two tables or data sources to build a hash table on the join key in memory. It then scans the larger table, probing the hash table to find the joined rows. Nested loop joins are used to join a small number of rows with a good driving condition between the two tables. The join drives from the outer loop to the inner loop: the inner loop is iterated for every row returned from the outer loop, ideally by an index scan.

The difference is in how these joins perform. Hash joins are optimal when joining large subsets of data, whereas nested loops are more efficient for smaller datasets that preferably have an index to use. For the DW, hash joins are generally recommended, as most tables are not small enough for the nested loop join to be efficient.

ETL
1. Add in worldwide reporting. How would that affect your ETL?

ANS: Your ETL will then have to be adjusted to ensure that the data is available for reporting in time for each of the different time zones.

2. Given a billion-row table, how do you add a new column and backfill the data from source without impacting the user?

ANS: Create a new table with the additional column and then backfill the data from the existing table. Once the backfill is complete, you can deprecate the original table and publish the new table with the additional column to the users. This approach causes no impact to the users, as you are creating a separate table to backfill (with the additional column) instead of attempting to perform an UPDATE on a billion-row table.

3. What is the best strategy to use when you have to delete 400 million rows from a billion-row table?

ANS: Create a new table and backfill it with the existing data from the original table, ideally copying only the rows to keep so that the 400 million unwanted rows never need a DELETE. Then publish the new table to the users, while deprecating the original.

Linux/Unix
1. cron

ANS: Cron is the time-based job scheduler in Unix-like computer operating systems. cron enables users to schedule jobs (commands or shell scripts) to run periodically at certain times or dates. It is commonly used to automate system maintenance or administration, though its general-purpose nature means that it can be used for other purposes, such as connecting to the Internet and downloading email.

2. combine 2 files

ANS: cat file1 file2 > mergedfile

3. dedupe #2

ANS: sort mergedfile | uniq
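
Questions 2 and 3 can be tried end to end on throwaway files. The file names and contents here are placeholders for this sketch.

```shell
work=$(mktemp -d)
printf 'apple\nbanana\n' > "$work/file1"
printf 'banana\ncherry\n' > "$work/file2"

# '>' truncates the target first, so re-running does not append a second copy
cat "$work/file1" "$work/file2" > "$work/mergedfile"

# sort -u gives the same result as sort | uniq for whole-line duplicates
sort -u "$work/mergedfile" > "$work/deduped"

merged_lines=$(wc -l < "$work/mergedfile" | tr -d ' ')
deduped_lines=$(wc -l < "$work/deduped" | tr -d ' ')
echo "merged=$merged_lines deduped=$deduped_lines"
rm -rf "$work"
```

A useful follow-up is why the sort is needed at all: uniq only collapses adjacent duplicate lines.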

4. pipes
   o ANS: A pipe (|) connects commands into a pipeline: the standard
     output of one command becomes the standard input of the next.
     Pipelines are most often used for text filtering, although they
     are not restricted to text streams.

5. remove a known row from a file too large for vi
   o ANS: The sed command provides an effective and versatile way
     of deleting one or more lines from a designated file. sed
     writes the result to standard output, so redirect it to a new
     file to keep the change.

     Example: To remove the 3rd line in a file: sed '3d' fileName.txt

     Example: To remove the last line in a file: sed '$d' fileName.txt
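
A minimal sketch of the sed examples above, assuming a small hypothetical fileName.txt in a scratch directory. It also shows deleting a row by its content, which is one reading of "a known row", and how to persist the change by writing to a new file and renaming it.

```shell
workdir=$(mktemp -d)
printf 'line1\nline2\nline3\nline4\n' > "$workdir/fileName.txt"

# sed streams the file line by line, so the file never has to fit in an editor buffer.
without_third=$(sed '3d' "$workdir/fileName.txt")          # delete line 3
without_last=$(sed '$d' "$workdir/fileName.txt")           # $ addresses the last line
without_match=$(sed '/^line2$/d' "$workdir/fileName.txt")  # delete rows matching a pattern

# To keep the change, write to a new file and then replace the original.
sed '3d' "$workdir/fileName.txt" > "$workdir/fileName.new" \
  && mv "$workdir/fileName.new" "$workdir/fileName.txt"
remaining_lines=$(wc -l < "$workdir/fileName.txt" | tr -d ' ')

rm -rf "$workdir"
```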

6. given a process named 'foo' - find and kill it
   o ANS: pkill foo

7. describe linux/unix permissions

8. given a large 1 column (for example used 'names') file, get a list of the
   duplicate values
   o ANS: sort file | uniq -d

What does ls do?
   o ANS: The ls command lists the files in a directory.

If a file has permissions 000, then who can access the file?
   o ANS: Only root can read/write the file, while only the owner
     (or root) can change the file's permissions. No one can
     execute the file.

What is the difference between grep and find commands?
   o ANS: grep is used to search for patterns in the contents of
     files, whereas find is used to search for files or directories.

What is redirection?
   o ANS: Redirection is when you change the standard input or
     output of a command to a user-specified location, typically a
     file (using <, > or >>). Pipes are the related mechanism for
     connecting one command to another.

Given that 3rd column is the primary key, how would you find if there
are duplicates in the file.
   o ANS: awk '{print $3}' file | sort | uniq -d

Count the number of lines in a file with a pattern given
   o ANS: grep -c "pattern" file.txt

Find a pattern in a file
   o ANS: grep is used to search for patterns in a file, e.g.
     grep "pattern" file.txt

What is piping?
   o ANS: Piping is when you connect the standard output of one
     command to the standard input of the next by using pipes.

How do you check for null value in a particular column in a file
   o ANS: You could use the awk NF variable, which gives the number
     of fields in the current record (NR, by contrast, is the record
     or line number). For example, if a file has 10 columns, then
     you would check whether any line has NF<10. In a delimited
     file you can also test the field itself, e.g.
     awk -F',' '$3 == ""' file
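
The file checks discussed in this section can be sketched together as follows. The comma-separated sample file and its column layout are made up for illustration; only awk, sort, uniq and grep from the answers are used.

```shell
workdir=$(mktemp -d)
# Hypothetical sample: column 3 is meant to be a unique key, column 2 may be empty.
printf 'a,x,101\nb,y,102\nc,,101\nd,z,103\n' > "$workdir/data.csv"

# Duplicate keys: print only column 3, sort so equal keys are adjacent,
# then keep the values that occur more than once.
dup_keys=$(awk -F',' '{ print $3 }' "$workdir/data.csv" | sort | uniq -d)

# Null check: report the line numbers (NR) of rows whose second column is empty.
null_rows=$(awk -F',' '$2 == "" { print NR }' "$workdir/data.csv")

# Count the lines that contain a pattern.
match_count=$(grep -c '101' "$workdir/data.csv")

echo "duplicate keys: $dup_keys; rows with empty column 2: $null_rows; matches: $match_count"
rm -rf "$workdir"
```

Note that with an explicit field separator an empty column still counts as a field, so testing the field itself is more reliable than comparing NF when the file is delimited.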

Teradata

Advantages of using Teradata over Oracle
   o ANS: The advantage of Teradata is that it uses an MPP
     architecture, so that a query running against large tables can
     run over multiple threads, hardware, etc.

Disadvantages of using Teradata as compared to Oracle
   o ANS: The disadvantage of Teradata is its limited ability to
     handle a large volume of simultaneous queries while processing
     them efficiently.

Data Modeling

The following are just definitions. Try to provide a real-life problem, such as how the
candidate would model the data so that you can report on delay times between order
state statuses (pending, success, error, etc.).
1. What are the primary differences between a transactional database
   and a data warehouse database?
   1. A transactional database is a relational database with
      normalized tables, whereas a data warehouse uses denormalized
      tables.
   2. A transactional database is highly volatile and designed to
      maintain the transactions of the business, whereas a data
      warehouse is non-volatile, with periodic updates.
   3. A transactional database is for OLTP. A data warehouse is for
      analysis.
   4. A transactional database holds functional data. A data
      warehouse database is subject oriented.
2. Differentiate Primary Key and Partition Key?
   o ANS: The primary key is the key we define on a table column or
     set of columns (composite PK) to make sure all the rows in a
     table are unique. The partition key is the key that we use to
     partition the table.
3. What is the difference between a Type 1 and Type 2 Dimension?
   1. Type I: Replace the old record with a new record with updated
      data, thereby losing the history. A data warehouse has a
      responsibility to track history effectively, which is where a
      Type I implementation fails.
   2. Type II: Create an additional dimension table record with the
      new value. This way we keep the history. We can determine
      which dimension record is current by adding a current-record
      flag or a timestamp to the dimension row.
4. What is the difference between Snowflake and Star Schema?
   What are the benefits of each?
   1. Star Schema
      1. A star join is a primary key to foreign key join of the
         dimension tables to a fact table.
      2. Provides a direct and intuitive mapping between the
         business entities being analyzed by end users and the
         schema design.
      3. Provides highly optimized performance for typical star
         queries.
      4. Is widely supported by a large number of business
         intelligence tools, which may anticipate or even require
         that the data warehouse schema contain dimension tables.
   2. Snowflake Schema
      1. Normalizes dimensions to eliminate redundancy. That is,
         the dimension data has been grouped into multiple tables
         instead of one large table. For example, a product
         dimension table in a star schema might be normalized
         into a products table, a product_category table, and a
         product_manufacturer table in a snowflake schema.
      2. It increases the number of dimension tables and requires
         more foreign key joins.
      3. The result is more complex queries and reduced query
         performance.

What is a context-driven data model? When would you need one?
   o ANS: A context-driven data model is based on contextual
     information to enhance the "understanding" of object-to-object
     associations, which measures similarities of data objects that
     are an abstraction from the real world. This model is good to
     use if you need to identify and understand the
     similarities/differences between data objects, which can help
     determine the relevancy of data during consumption.

What is the difference between dimensional modeling vs. ER modeling?
   o ANS: Dimensional modeling is very flexible from the user's
     perspective, and a dimensional data model maps directly onto
     star or snowflake schemas. An ER model is not designed for
     creating such schemas and does not involve converting
     normalized data into a denormalized form. ER modeling is used
     for OLTP databases, which are kept in 1st, 2nd or 3rd normal
     form, whereas dimensional modeling is used for data
     warehousing. In short, an ER model contains normalized data,
     whereas a dimensional model contains denormalized data.

Describe the normal forms? What is BCNF? 2nd normal form? 3rd
normal form?
   o ANS: The normal forms of relational database theory provide
     criteria for determining a table's degree of vulnerability to
     logical inconsistencies and anomalies.
        o Boyce-Codd normal form (BCNF) represents a table where
          every non-trivial functional dependency in the table is a
          dependency on a superkey.
        o 2nd normal form represents a table where no non-prime
          attribute in the table is functionally dependent on a
          proper subset of any candidate key.
        o 3rd normal form represents a table where every non-prime
          attribute is non-transitively dependent on every
          candidate key in the table. The attributes that do not
          contribute to the description of the primary key are
          removed from the table. In other words, no transitive
          dependency is allowed.

Describe a scenario where you would have to snowflake a model?
   o ANS: One scenario for creating a snowflake schema is if you
     have entities that have a parent-child relationship.

How do you design a data model for rapid changing dimensions?
   o ANS: Implement an internal audit column, such as Last_Updated
     or Created, to track each iteration of the rapidly changing
     dimensions.

1. Provide the candidate an OLTP model similar to Amazon ordering with 2-3
   dimensions (product/customer/merchant etc.). Ask the candidate to build
   a data model. If a header-detail approach is used, why? If not, why not?
   Advantages/disadvantages.
   o Add a value to the OLTP design that alters the grain of one
     associated dimension (e.g. new/used books). Where would the
     change be propagated to?

2. Provide dimensional data from an OLTP source in a key-value pair. Ask
   the candidate to de-norm / dimensionalize the data. Key questions: how
   to partition the data? Context-driven approach?

3. How do you handle many-to-many relationships in a star schema?
   o ANS: One way is by using bridge tables that hold at least the 2
     foreign keys from the 2 tables that have the M:M relationship.

What kind of design will you propose in source as well as data
warehouses for tables which have hard deletes occurring on a
regular basis in source systems?
   o ANS: A design where Type 2 SCDs are implemented in dimension
     tables and temporal fact tables are created to give DW users
     visibility of whether the facts have been hard deleted in the
     source systems.

Additional Questions for DEIII (Level 6) Bar

Oracle
1. You are starting a transaction and reading from an Oracle table and
   processing the data. How will you ensure that the select is consistent,
   i.e. that you ignore inserts, updates and deletes by any other
   transaction?
   o ANS: By running the transaction at the serializable isolation
     level (SET TRANSACTION ISOLATION LEVEL SERIALIZABLE), which
     prevents dirty reads, non-repeatable reads and phantom reads
     for the duration of the transaction.

2. What kind of logging system would you design for SQL and PL/SQL scripts
   so that all errors get logged in error tables? Provide at least two
   design solutions.
   o ANS: One design solution within Oracle is to create a stored
     procedure that can be called from any other package/procedure
     to gather data on an error/exception or user/system
     checkpoint and insert it into a specified table. Another
     solution, using UNIX, would be to create one script as the
     controller file that captures the unique identifiers for each
     error/checkpoint, while another script uses the data from the
     controller file to pull more data from Oracle's error logs.

3. Advantages and disadvantages of Oracle RAC systems.
   o ANS: RAC systems allow closer to 100% uptime, can scale with
     less hardware, and possibly even handle a larger load. On the
     other hand, RAC systems can be costly and more difficult to
     manage (training and troubleshooting). Also, RAC systems
     usually only improve availability, which is just one aspect of
     a well designed system.

4. Advantages of using Oracle vs other database systems
   o ANS: The advantages may differ, depending on which database
     system is being compared to Oracle. Each database system was
     designed with specific advantages and disadvantages that may
     outweigh or downplay the advantages of Oracle (which also
     depends on the intended application of the database system).

Architecture and design
1. We get click stream data on a daily basis from the source team. We need
   to design a data mart for storing and querying the raw data for a
   2-year history. Daily volume is around 600 million rows. What kind of
   solutions will you provide, database and non-database? Provide
   solutions which have a very low price/performance ratio.
2. The data warehouse receives orders from multiple ordering systems. This
   data needs to be stored, and sales commission needs to be paid by
   state, based on the $ sales made. A mapping table is present which
   associates state to sales manager. How would you implement such a
   solution? What are the major design decisions you will take to
   ensure that a payment is tracked?
   Follow-up question: What happens when a backfill happens?
3. Amazon receives products from vendors. Amazon cuts POs to vendors,
   which are fulfilled. Products get traffic, get ordered and are then
   shipped out of warehouses. We also receive customer returns. The
   objective: given that the base tables are there in the data warehouse,
   a vendor portal needs to be created which provides daily (d-o-d),
   weekly (w-o-w), monthly (m-o-m), quarterly (q-o-q) and yearly (y-o-y)
   metrics. The vendor portal will be accessed from external systems.
   What kind of end-to-end architecture would you design for such a
   scenario?
4. When designing a data warehouse solution (both ETL and reporting) for
   a company having businesses around the world, what are the major
   factors you need to consider?
   Follow-up: Have you architected a reporting solution? What were the
   challenges faced?
SQLInterviewQuestions

I've created this page as a place to put SQL puzzles to assign candidates who claim strong
SQL backgrounds as homework, or on-site, or phone screen questions (in decreasing
order of difficulty).

Homework Questions


1. Prove or disprove the following equation:
( X join(f(X,Y)) Y ) left join(g(Y,Z)) Z == X join(f(X,Y)) ( Y
left join(g(Y,Z)) Z )

where all the field names of X, Y, and Z are distinct. [Answer: true.
Argument via set-theoretic calculation. Incidentally, Oracle
Corporation's query-plan optimizer team is in a state of denial about
this equivalence.]
On-site Questions
1. Suppose I have two entities in my DB: Objects, and Tags. Suppose also
   that I have a mapping table ObjectTag which represents a many-to-many
   relationship between Objects and Tags. Now I wish to find, given
   a finite input list of Tag ids, the set of Objects which map to (a) any of
   the input tags [easy], and (b) all of the input tags [harder]. Can you do
   (a) and (b) with one query each?

   (a)
       select distinct o.*
       from Objects o join ObjectTag ot on o.id = ot.obj_id
       where ot.tag_id in ( <input list> )

   (b) Two ways, with the second worth many more points than the
       first in terms of elegance. Let n be the length of the input list:

       1. select distinct o.*
          from Objects o join ObjectTag ot1 on o.id = ot1.obj_id
                         join ObjectTag ot2 on o.id = ot2.obj_id
                         join ...
                         join ObjectTag ot<n> on o.id = ot<n>.obj_id
          where ot1.tag_id = <input 1> and ... and
                ot<n>.tag_id = <input n>

       2. select o.*
          from Objects o join (
              select count( tag_id ) tag_count, obj_id
              from ObjectTag
              where tag_id in ( <input list> )
              group by obj_id
              having count( tag_id ) = <n>
          ) ot on o.id = ot.obj_id

2. Suppose I have a table X with a numeric field N. How do I write a single
   query with one numeric query parameter such that if the parameter is
   some number m, the result will only contain rows where X.N = m, and
   if the parameter is null, the result will include all rows of X?

       select * from X where N = NVL( ?, N )      /* Oracle */
       select * from X where N = ISNULL( ?, N )   /* SQL Server */
       select * from X where N = IFNULL( ?, N )   /* MySQL */
       select * from X where N = COALESCE( ?, N ) /* PostgreSQL */

   Note that rows where N itself is null are not returned by any of
   these, since N = N is not true for nulls.

Phone Screen Questions

1. Suppose my enormous table X has a uniquely indexed string field
   "name". Now I want to find all records in X whose name field starts
   with the prefix 'foo'. What's the fastest query possible? [Answer:

       select * from X where X.name >= 'foo' and X.name < 'fop'

   as opposed to the much more common response

       select * from X where X.name like 'foo%'

   which may not be able to use the index on name.]

SQL

Base tables


Employee
empid (primary key)
name
title
salary
deptid (foreign key to Department.deptid)
mgrid (foreign key to Employee.empid)

Department
deptid (primary key)
deptname
Questions

Type: Join
Question: All employees from department = GFS
SQL: select emp.* from Employee emp, Department dept
     where emp.deptid = dept.deptid and dept.deptname = 'GFS'

Type: Group by
Question: Dept Name with number of employees
SQL: select deptname, count(empid) from Employee emp, Department dept
     where emp.deptid = dept.deptid group by deptname

Type: Group by having
Question: Dept Name with number of employees > 10
SQL: select deptname, count(empid) from Employee emp, Department dept
     where emp.deptid = dept.deptid group by deptname
     having count(empid) > 10

Type: Outer Join
Question: Dept Name with number of employees, include depts with no
          employees also
SQL: select deptname, count(empid) from Employee emp, Department dept
     where emp.deptid (+)= dept.deptid group by deptname

Type: Sub query
Question: Highest salary employee with dept name
SQL: select emp.*, dept.* from Employee emp, Department dept
     where emp.deptid = dept.deptid
     and salary = (select max(salary) from Employee)

Type: Self join
Question: Emp name & Mgr name
SQL: select emp.name, mgr.name from Employee emp, Employee mgr
     where emp.mgrid = mgr.empid

Type: Self join
Question: All employees reporting to Aneesh
SQL: select emp.name from Employee emp, Employee mgr
     where emp.mgrid = mgr.empid and mgr.name = 'Aneesh'

Type: Correlated subquery
Question: Employees with salary more than their managers
SQL: select emp.name from Employee emp
     where emp.salary > (select mgr.salary from Employee mgr
                         where emp.mgrid = mgr.empid)

Type: Hierarchical query
Question: All employees reporting to Ramya (directly or indirectly)
SQL: select name, mgrid from Employee
     start with name = 'Ramya' connect by prior empid = mgrid

Type: Is null
Question: Find the top most employee
SQL: select emp.* from Employee emp where mgrid is null

The above will cover some basic scenarios. If you want multiple joining conditions, you
could add another table, such as address, into the mix and create some joining
conditions. You can also ask about EXISTS, NOT EXISTS and other correlated subquery
conditions.
Ask some questions regarding partitioning. Say we have tables orders and customers,
orders has an order date, and there are performance issues: how would the candidate
improve this? They should arrive at partitioning by date. Maybe ask one question about
giving hints in a SQL query.
PipsInterviewQuestions

A few Interview questions in sections


Statistics

1. What is 'Simpson's paradox'? Give an example. Follow-up: How might this paradox
occur in continuous distributions?

SQL

1. Suppose you are aggregating shipping_addresses over customers; each customer has a
customer_id and each address has an address_id; customers may have multiple shipping
addresses.
We want to aggregate shipping address zip codes up to customers to choose a
'representative' zip code for each customer that can be used for model building.
There are three tables:

   o purchases - has customer purchases including shipping_address_id
     (key is purchase_id)
   o addresses - has address_id and postal_code (for US customers
     postal_code = zip_code; key is address_id)
   o zipcode2000census - zip code and demographic data on zip-code level
     (key is zip_code)

Create a SQL query that returns the most recently used zip code and the most commonly
used zip code for each customer. Join the results of this query with the census table to get
the medianHouseValue for the zip code for each customer.

Interview Question

Some Interview Questions

What command would I use to search for a specific string or regular
expression in a file?

What command would I use to change the permissions of a file?

What command would I use to find the names of all processes running
as a specific user?

How do you kill a process?

What are some common data structures in Java?

What is a binary tree?

What is a BST?

What are some of the tree traversals?

Give me a case where I would want to use a hash table.

What is the time complexity of retrieving an element from a hash table?

Give me a regex to match a 10-digit phone number of the form 555-555-5555.

Write a method to print out a binary tree's nodes in level-order.

Find the Nth element from the last in a linked list.

You are responsible for supporting a web service, ABC service. It is a
distributed service where incoming requests are load-balanced
between 10 different hosts. You get a notification that the service is
failing. Assuming you do not know much about the service (but you
know that you own it), how would you try to fix it?

I gave the candidate the structure of two database tables, "employees"
and "managers", and asked him to write a couple of SQL queries:

   employee => Eid, name, active, MiD
   manager => MId, name, active

   o Write a SQL query to find the total number of active employees.
   or
   o Write a SQL query to list all active employees and their
     managers.
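
For the phone-number regex question in the list above, one possible answer can be sketched with grep in extended-regex mode. The helper name is_phone and the sample inputs are made up for illustration; the pattern assumes the whole string must be exactly three digits, three digits and four digits separated by hyphens.

```shell
# Anchored with ^ and $ so that longer strings containing a phone number do not match.
pattern='^[0-9]{3}-[0-9]{3}-[0-9]{4}$'

# Hypothetical helper: succeeds (exit status 0) when its argument matches the pattern.
is_phone() {
  printf '%s\n' "$1" | grep -Eq "$pattern"
}

if is_phone '555-555-5555'; then valid_result=yes; else valid_result=no; fi
if is_phone '5555555555'; then plain_result=yes; else plain_result=no; fi
if is_phone '555-555-55555'; then long_result=yes; else long_result=no; fi
echo "$valid_result $plain_result $long_result"
```

A good follow-up for the candidate is how the pattern would change to accept optional separators or a country code.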

----------------------------------------------------------------------
DWInterviewCompetencies

DataWarehouse Interview Competencies
Contents

1 Competencies
   1.1 Data Engineering
   1.2 Data Modeling and Design
   1.3 Database Concepts
   1.4 Coding and Problem Solving
   1.5 Hiring Manager
   1.6 Bar Raiser
   1.7 DW Grid
   1.8 Competency - Interviewer Pool

Competencies

The following competencies have been identified as the ones each interviewer should
focus on for the DW Data Engineer role. Before looking into the competencies, please
abide by the following:

   o Please make sure you have only two competencies from the list below;
     if it's more, please ask your HM.
   o Please make sure you have a sufficient number of questions in each
     competency to support your vote.
   o If you have a serious data point on a skill set that is outside of
     your competency, please don't vote on it; instead keep it in Pros/Cons.

Data Engineering

This should include the operational Data Engineering skills that are required in the DW
world. ONLY HIRING MANAGER WILL DO THIS. Candidates should be comfortable
with partitioning, parallelism, impacts to objects (indexes, MVs), huge backfills,
handling of different granularities, etc.
Example questions:

Huge, billion-row, multi-terabyte data. Some section is corrupted; how
will you backfill only the affected rows?
   o Expect partitions, exchange partitions.
   o A DE II and DE III should be aware of the impact to indexes,
     global/local.

Tables in three clusters are out of sync; how will you correct it?
   o Expect more clarifying questions, like which one is correct;
     again, articulation around partitions.

Load errors during huge volumes
   o Duplicates
   o Data errors when values don't match the column data type definition

A big file of 500 M rows (200 GB); how will you load it into tables?
   o External table
   o How will the candidate parallelize?
   o Unix/OS-level familiarity to parallelize, etc.

A SQL statement producing 500 M rows, where writing takes a long time; what
is the candidate's thought process for making it better?

Data Modeling and Design

This includes both OLTP Data Modeling and DW Data Modeling and Design.
ONLY HIRING MANAGER and BR WILL DO OLTP DM; OTHERS PLEASE
ASK MORE DW DM and Design.

Please give votes as follows:
   o RAISES - In addition to giving you a correct answer, the
     candidate:
        o Resolves all ambiguities by himself
        o Asks lots of relevant questions
        o States his assumptions
        o Commits mistakes, but when probed understands and
          corrects them
        o Thinks more about extensibility and scalability (it's OK
          for you to probe him on this)
        o Doesn't give up when you make the problem more complex
   o MEETS - The candidate just gives you the correct answer and
     doesn't exhibit all of the above. It took a lot of probing to get
     the above things cleared up, but he picks up on your probing
     and solves the problem.
   o LOWERS - You know.

OLTP examples include:

Give a use case and ask the candidate to design a data model (some DMs that
we ask are bookmyshow.com, car pooling, table management in a
restaurant, etc.)

Any SQL questions from the DM, and your judgement of them, should go
ONLY in Pros or Cons.

DW DM and Design include:

ETL design (e.g. T_Changed) for denormalization of multiple tables into
one where updates happen asynchronously.

SCD type of implementation (customer address change); ask for the actual
implementation, deleting old rows, marking rows with a deleted flag.

Aggregate designs
   o How will you design a multi-granular table with some measures
     against 3 dimensions?
   o How will you select a row of a particular granularity (look for
     bitmap indexes on booleans that describe the granularity of
     that row)?

A daily query scanning 15 months to do YoY: how will the candidate
implement it? How will they make it fast? Any improvements, such as
processing only the changed rows, etc.

Database Concepts

This strictly includes only database concepts. Examples include:

Execution plans
Difference between hash and nested loop joins
Partitioning concepts
Parallelism concepts
Consistent reads (ORA-01555 "snapshot too old" errors)
Indexes, MVs, etc.
Distributed databases (pros and cons)

Coding and Problem Solving

This includes giving candidates problems and observing their approach and SQL coding
skills. My recommendation is to start with simple SQL coding questions and move on to
medium and complex problems that require intermediate designs and implementing them
with SQL code as well. You can also give problems that require procedural coding
(PL/SQL programming). In SQL, please look for minimal scans, effective joins, not
too many subqueries, set operators, temporary tables, WITH tables, etc.
Examples include:

You can start off with the top 10 salaries in an employee table
Self-join type of questions, employee-manager relation in the same
employee table
Joins, outer joins
CASE WHEN statements/DECODE
Analytical functions (lag, lead, ranks, row numbers, first value, last value,
etc.). If the candidate DOESN'T know analytical functions, IT'S
FINE. Ask more on solving using group bys.
Questions on rollup, cube, grouping sets, etc.
Pivoting, de-pivoting
Cartesian joins (yes, they serve some purpose too: row explosion!)
Designing intermediate structures, W_ tables, temp tables, etc.
Recursive functions in PL/SQL programming
Efficient FOR looping in PL/SQL programs (if there are two FOR loops, can it
be done in a single FOR loop, etc.)
Data structure usage in PL/SQL programs (cursors, tables, arrays, etc.)

Hiring Manager

Project management
Culture fit
HM can also pick any skill set from above just to be comfortable;
please include that in Pros/Cons.

Bar Raiser

BR competencies.

DW Grid

The DW Grid should look something like this:

Data Engineering   DM and Design (OLTP/DW)   Coding and PS   DB Concepts   HM        BR
2 Votes            2 Votes                   3 Votes         2 Votes       2 Votes   2 Votes

Total interviewers: A, B, C, D, HM, BR

DE - A B
DM - C D
Coding - A B C
DB Concepts - D HM
HM Round - HM BR
BR - BR HM
So we need 4 onsite interviewers + an HM + a BR. (The HM should do one of the
competencies as well.)

Competency - Interviewer Pool

Interviewer                 Data Engineering   Coding and PS   DB Concepts   DM (OLTP/DW)   General HM/BR skill sets
Venkatesh Mohan
Abhishek Agrawal
Rakesh Singh
Naidu Rongali
Aniruddha Vishnupurikar
Paparao Chinthagunti
Ankush Kuhar
Samar Sodhi

AmazonAnalyticsDEInterviewsQuestions

Contents

1 Amazon Analytics Data Engineer Interview Questions
   1.1 Outline of Phone Screen
   1.2 Questions by Subject Area
      1.2.1 BI Tools
      1.2.2 Reporting
      1.2.3 SQL
      1.2.4 Data Modelling
      1.2.5 Unix
      1.2.6 Oracle DB Technology
      1.2.7 Data Warehousing
      1.2.8 Essbase

Amazon Analytics Data Engineer Interview Questions

See Amazon_Analytics_DE_Interviews for a summary of the typical DE interview for BI
Reporting.

Outline of Phone Screen
1. 5 min - Introduction (hello and quick "who you are", describe job
   position)
2. 10 min - Ask about background and most recent experience
3. 25 min - Deep dive with questions
4. 5 min - "Why Amazon?", "any questions for me?", describe next steps

Questions by Subject Area
BI Tools

Basic
   o What is the purpose of BI?

Intermediate

Advanced

Reporting
1. What are drill down and drill across reports, what is the difference?
2. What is pivoting? How will you write a pivoting SQL?
3. What is a dashboard?
4. What is scorecarding?
5. Explain the Dimension Hierarchy!
6. OBIEE Product overview
7. Request processing flow in OBIEE, role of each layer
8. Different types of cache?
9. Level-based measures and preferred drill path
10. What is the shared logon property in the Connection pool setup?
11. Connection pool optimization
12. Difference between online and offline repository
13. Steps for MUD development
14. Steps for LDAP setup
15. What is Guided navigation and how does it work?
16. Variable types and usage (both at RPD and PS level)
SQL
1. What are the different join types?
2. Display the records of employees who joined the department before
   their manager.
3. Display the records of employees earning more than the average salary
   in their department.
4. Display the highest paid employee in each department.
5. Display the 2nd highest paid employee in each department.
6. Select student_id, student_name from students where student_id = 1
   and student_id = 2. What does the query return?
7. Given order and order items tables, select customer ids of customers
   who placed orders with more than 3 items (having or subquery)
8. What is the use of DESC in SQL?
9. How do you find the number of rows in a table?
10. What is a Cartesian product in SQL?
11. What is a view? What is a materialized view? What is the difference
    between a view and a materialized view?
12. Can you insert data into a view?
13. What is a merge statement? What is the requirement for a merge
    statement? Is a PK necessary for merge?
14. What is dual? Is it a table? If so, what columns does it have? What's
    the data type?

Basic
   o Describe different joins
   o Given order and order items tables, select customer ids of
     customers who placed orders with more than 3 items (having or
     subquery)
        o Create buckets for whether they placed more than 10 and
          more than 100 orders (case)

Intermediate
   o Difference between hash join and nested loops
   o Dedupe (analytics, temp tables, etc.)

Advanced
   o Describe the explain plan for a query (give query)

Data Modelling
1. How do you model a many-to-many relationship?
2. What is normalization? Denormalization?
3. What is 3NF? How do you normalize/denormalize?
4. What is a type-2 dimension? How many types are there?

Unix
1. What does ls do?
2. If a file has permissions 000, then who can access the file?
3. What is the difference between grep and find commands?
4. What is redirection?
5. What is piping?
6. How would you dedupe a text file?
7. How do you view in-use ports?
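
For question 6, candidates typically reach for shell one-liners such as `sort -u file` (which reorders lines) or `awk '!seen[$0]++' file` (which keeps the original order). The same order-preserving logic is sketched below in Python for comparison. The function names dedupe_lines and dedupe_file are made up for this example.

```python
from pathlib import Path

def dedupe_lines(lines):
    """Return the lines with later duplicates removed, keeping first-seen order."""
    seen = set()
    unique = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            unique.append(line)
    return unique

def dedupe_file(src, dst):
    """Read src, drop duplicate lines, and write the result to dst."""
    lines = Path(src).read_text().splitlines()
    unique = dedupe_lines(lines)
    Path(dst).write_text("".join(line + "\n" for line in unique))

print(dedupe_lines(["b", "a", "b", "c", "a"]))
```

A useful follow-up is to ask what happens when the file does not fit in memory, since the set of seen lines grows with the number of distinct lines.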

Oracle DB Technology


1. What is difference between UNIQUE and PRIMARY KEY constraints?
2. Differentiate between TRUNCATE and DELETE.
3. Differentiate between IN and EXISTS. Which is faster?
4. What is the difference between UNION and UNION ALL?
5. Difference between CHAR and VARCHAR2?
6. What is the NVL function? How is it different from DECODE? Is it
possible to implement NVL with DECODE?
7. What is the COALESCE function?
8. Difference between CASE and DECODE?
9. Is there any way we can change the column name in a table?
10. Which is faster: INSERT or DELETE?
11. Can a primary key contain more than one column?
12. What does COMMIT do?
13. What does ROLLBACK do?
14. What are partitions?
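
Questions 4 and 7 are easy to verify with a tiny experiment. The sketch below uses SQLite through Python's sqlite3 module as a stand-in, on the assumption that UNION, UNION ALL and COALESCE behave the same way for this simple case as they do in Oracle. The tables a and b are hypothetical. Oracle-specific functions such as NVL and DECODE are not available in SQLite, so they are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (v INTEGER);
CREATE TABLE b (v INTEGER);
INSERT INTO a VALUES (1), (2);
INSERT INTO b VALUES (2), (3);
""")

# UNION removes duplicate rows across the two result sets.
union_rows = conn.execute(
    "SELECT v FROM a UNION SELECT v FROM b ORDER BY v"
).fetchall()

# UNION ALL keeps every row, so the shared value 2 appears twice.
union_all_rows = conn.execute(
    "SELECT v FROM a UNION ALL SELECT v FROM b ORDER BY v"
).fetchall()

# COALESCE returns its first non-NULL argument and accepts any number of them.
first_non_null = conn.execute(
    "SELECT COALESCE(NULL, NULL, 'fallback')"
).fetchone()[0]

print(union_rows, union_all_rows, first_non_null)
```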
Data Warehousing
1. What is the data type of the surrogate key?
2. What is the difference between OLTP and DW systems?
3. What is Full/Initial load & Incremental/Refresh load?
4. What is a staging area? Do we need it? What is the purpose of a
staging area?
5. How do you determine what records to extract in Incremental/Refresh
load?
6. What is a data mart?
7. What are Star & Snowflake Schemas? What is the difference? When
do you use one or the other?
8. You have an Item Orders fact table: Will you store the Product group of
the item in the fact? If so why? Else why not?
9. What are push and pull ETL strategies?
10. What does level of granularity in a fact table mean?
11. What is the difference between the Inmon and Kimball methodologies?
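
For question 5, one common answer (among others, such as change data capture or change flags on the source) is a timestamp watermark: extract only rows modified since the last successful load, then advance the watermark. A minimal sketch of that idea follows. The function incremental_extract, the updated_at field and the sample rows are assumptions for illustration.

```python
from datetime import datetime

def incremental_extract(source_rows, last_watermark):
    """Return rows changed after last_watermark and the new watermark."""
    changed = [r for r in source_rows if r["updated_at"] > last_watermark]
    # If nothing changed, the watermark stays where it was.
    new_watermark = max(
        (r["updated_at"] for r in changed), default=last_watermark
    )
    return changed, new_watermark

source = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 5)},
    {"id": 3, "updated_at": datetime(2024, 1, 9)},
]

changed, watermark = incremental_extract(source, datetime(2024, 1, 3))
print([r["id"] for r in changed], watermark)
```

A good follow-up is to ask how the candidate would detect deleted source rows, which a timestamp watermark alone does not capture.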
Essbase
1. Difference between Block and Aggregate Storage?
2. What are Levels and Generations?
3. Can there be more than one Accounts Dimension in a cube?
4. Is it possible to have duplicate level-0 members in a cube?
5. If there are duplicate members in the cube, to which member does
Essbase attribute the fact value?
6. What is a Rule file?
7. Is it possible to update dimension values during fact load?

8. What is MaxL?
9. What is MDX?
10. What is aggregation?
11. Why is aggregation needed?
12. If new data is added to the cube, without adding new dim members, is
re-aggregation required?
13. What are query-based aggregation and stop-value-based aggregation?