

Question Answer ==========================================================


Phases vs Checkpoints
Phases - used to break the graph into pieces. Temporary files created during a phase are deleted after
the phase completes. Phases let you manage the resource-consuming (memory, CPU, disk) parts of the
application separately.
Checkpoints - created for recovery purposes. These are points where everything is written to disk. You
can recover to the latest saved checkpoint and rerun from it.
You can have phase breaks with or without checkpoints.
xfr
A new sandbox will have many directories: mp, dml, xfr, db, ... . xfr is a directory where you put files
with extension .xfr containing your own custom functions (and then use: include
"somepath/xfr/yourfile.xfr"). Usually XFR stores mapping.
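three types of parallelism
1) Data Parallelism - the data is partitioned, and copies of the same component process the partitions
simultaneously.
2) Component Parallelism - different components execute simultaneously on different branches of the graph.
3) Pipeline Parallelism - components connected in sequence run at the same time, each working on records
that the previous one has already passed along.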
MFS
Multi-File System
m_mkfs - create a multifile (m_mkfs ctrlfile mpfile1 ... mpfileN)
m_ls - list all the multifiles
m_rm - remove the multifile
m_cp - copy a multifile
m_mkdir - add more directories to an existing directory structure
Memory requirements of a graph
Each partition of a component uses: ~ 8 MB + max-core (if any).
Add the size of lookup files used in the phase (if multiple components use the same lookup, count it
only once).
Multiply by the degree of parallelism. Add up all components in a phase; that is how much memory
is used in that phase.
Select the largest-memory phase in the graph.
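
The estimate above can be sketched in Python. This is only a rough model: the phases, max-core values, and lookup sizes are made-up inputs, and it assumes each lookup is counted once per partition, which the text does not spell out.

```python
BASE_MB = 8  # rough per-partition overhead of a component, as quoted above

def phase_memory_mb(max_cores_mb, lookups_mb, parallelism):
    """Estimate memory for one phase.

    max_cores_mb: max-core in MB for each component (0 if it has none).
    lookups_mb:   dict of lookup name -> size in MB (each lookup counted once).
    """
    per_partition = sum(BASE_MB + mc for mc in max_cores_mb)
    per_partition += sum(lookups_mb.values())
    return per_partition * parallelism

def graph_memory_mb(phases, parallelism):
    """The graph needs as much memory as its hungriest phase."""
    return max(phase_memory_mb(mc, lk, parallelism) for mc, lk in phases)

# Hypothetical graph: two phases, 4-way parallel
phases = [
    ([100, 0, 0], {"cust_lookup": 50}),  # a Sort with 100 MB max-core, two plain components, one lookup
    ([64, 64], {}),                      # two Rollups with 64 MB max-core each
]
estimate = graph_memory_mb(phases, parallelism=4)
print(estimate, "MB")
```

With these inputs the first phase dominates, so it is the one to size the run for.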
How to calculate a SUM
SCAN
ROLLUP
SCAN WITH ROLLUP
Scan followed by Dedup sort, selecting the last record
dedup sort with null key
If we don't use any key in the sort component while using the dedup sort, all records fall into a
single group, and the output depends on the keep parameter:
first - only the first record
last - only the last record
unique_only - there will be no records in the output file (given more than one input record).
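
A small Python model of the keep parameter may make the null-key case easier to see. It is a sketch of the behaviour described above, not Ab Initio code; the function name and the sample rows are invented for illustration.

```python
from itertools import groupby

def dedup_sorted(records, key, keep):
    """Model of a dedup on key-sorted input; keep is 'first', 'last' or 'unique_only'."""
    out = []
    for _, group in groupby(records, key=key):
        group = list(group)
        if keep == "first":
            out.append(group[0])
        elif keep == "last":
            out.append(group[-1])
        elif keep == "unique_only":
            if len(group) == 1:      # only groups with a single record survive
                out.append(group[0])
        else:
            raise ValueError("unknown keep value: " + keep)
    return out

def null_key(record):
    return ()  # empty key: every record lands in the same group

rows = [10, 20, 30]
first_only = dedup_sorted(rows, null_key, "first")
last_only = dedup_sorted(rows, null_key, "last")
unique_only = dedup_sorted(rows, null_key, "unique_only")
```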
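Join on partitioned flow
file1 (A,B,C), file2 (A,B,D). We partition both files by "A", and then join by "A,B". Is it OK? Or
should we partition by "A,B"? Partitioning by "A" is enough for a correct join: records that match on
"A,B" also match on "A", so they land in the same partition. Partitioning by "A,B" may spread the data
more evenly.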
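
The claim can be checked with a short Python sketch: partition both inputs by A only, join each pair of partitions on (A, B), and compare with a plain serial join. The helper names and the sample records are hypothetical, and the hash partitioner is a stand-in for partition by key.

```python
from collections import defaultdict

def partition_by(records, key_fields, n):
    """Hash-partition records on key_fields into n partitions."""
    parts = [[] for _ in range(n)]
    for rec in records:
        parts[hash(tuple(rec[f] for f in key_fields)) % n].append(rec)
    return parts

def inner_join(left, right, key_fields):
    """Simple in-memory inner join on key_fields."""
    index = defaultdict(list)
    for r in right:
        index[tuple(r[f] for f in key_fields)].append(r)
    return [{**l, **r}
            for l in left
            for r in index.get(tuple(l[f] for f in key_fields), [])]

def canon(rows):
    """Order-independent form, so two results can be compared."""
    return sorted(tuple(sorted(r.items())) for r in rows)

file1 = [{"A": 1, "B": 1, "C": "x"}, {"A": 1, "B": 2, "C": "y"}, {"A": 2, "B": 1, "C": "z"}]
file2 = [{"A": 1, "B": 1, "D": "p"}, {"A": 2, "B": 1, "D": "q"}, {"A": 2, "B": 2, "D": "r"}]

serial = inner_join(file1, file2, ["A", "B"])
parallel = [row
            for p1, p2 in zip(partition_by(file1, ["A"], 2), partition_by(file2, ["A"], 2))
            for row in inner_join(p1, p2, ["A", "B"])]
print(canon(serial) == canon(parallel))
```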
checkin, checkout
You can do checkin/checkout using the wizard right from the GDE, using versions and tags.
how to have different passwords for QA and production
parameterize the .dbc file - or use an environment variable.
How to get records 50-75 out of 100
use scan and filter
m_dump <dml> <mfs file> -start 50 -end 75
use next_in_sequence() function and filter by expression component (next_in_sequence() >= 50
&& next_in_sequence() <= 75)
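
The sequence-number filter amounts to the following, shown here in Python rather than DML. The function name is made up, and the range is treated as inclusive at both ends.

```python
def records_between(records, start, end):
    """Return records whose 1-based sequence number lies in [start, end]."""
    return [rec for seq, rec in enumerate(records, start=1) if start <= seq <= end]

rows = list(range(1, 101))          # 100 records, numbered 1..100
selected = records_between(rows, 50, 75)
print(len(selected), selected[0], selected[-1])
```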
How to convert a serial file into MFS
create MFS, then use partition component
project parameters vs. sandbox parameters
When you check out a project into your sandbox - you get project parameters. Once in your sandbox -
you can refer to them as sandbox parameters.
Bad-Straight-flow
error you get when connecting mismatching components (for example, connecting a serial flow directly
to an mfs flow without using a partition component)
merging graphs
You cannot merge two Ab Initio graphs. You can use the output of one graph as input for another. You
can also copy/paste the contents between graphs. See also about using .plan
partitioning, re-partitioning, departitioning
partitioning - dividing a single flow of records (serial file, mfs) into multiple flows.
departitioning - removing partitioning (gather and merge components)
re-partitioning - changing the number of partitions (e.g., from 2 to 4 flows)
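
The three operations can be modelled with plain lists in Python. This is a toy sketch: round-robin is used as the partitioner, and gather here reads the partitions one after another, whereas a real Gather component reads them in arbitrary order.

```python
def partition_round_robin(records, n):
    """Partitioning: deal records out to n flows in turn."""
    parts = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        parts[i % n].append(rec)
    return parts

def gather(parts):
    """Departitioning: combine all flows back into a single flow."""
    return [rec for part in parts for rec in part]

def repartition(parts, n):
    """Re-partitioning: change the number of flows (modelled as gather + partition)."""
    return partition_round_robin(gather(parts), n)

rows = list(range(8))
two_way = partition_round_robin(rows, 2)
four_way = repartition(two_way, 4)
print(two_way, four_way)
```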
lookup file
for large amounts of data use an MFS lookup file (instead of serial)
indexing
No indexes as such. But there is an "output indexing" using reformat and doing the necessary coding in
the transform part.
Environment project
Environment project - special public project that exists in every Ab Initio environment. It contains all
the environment parameters required by the private or public projects which constitute the AI Standard
Environment.
Aggregate vs Rollup
Aggregate - old component
Rollup - newer, extended, recommended to use instead of Aggregate.
(built-in functions like sum, count, avg, min, max, product, ...)
EME, GDE, Co>Operating System
EME = Enterprise Metadata Environment. Functions (repository, version control, statistical
analysis, dependency analysis). It is on the server side and holds all the projects (metadata of
transformations, config info, source and target info: graph, dml, xfr, ksh, sql, etc.). This is where
you checkin/checkout. The /Project dir of the EME contains common directories for all application
sandboxes connected to it. It also helps in dependency analysis of the code. Ab Initio has a series of
air commands to manipulate repository objects.
GDE = Graphical Development Environment (on the client box)
Co>Operating System = Ab Initio server installed on top of the native (unix) OS on the server
fencing
fencing means job controlling on a priority basis.
In AI it actually refers to customized phase breaking. A well-fenced graph means that no matter what
the source data volume is, the process will not get caught in deadlocks. It actually limits the number of
simultaneous processes.
Fencing - changing the priority of a job
Phasing - managing the resources to avoid deadlocks.
For example, limiting the number of simultaneous processes
(by breaking the graph into phases, only 1 of which can run at any given time)
Continuous components
Continuous components - produce useful output while running continuously. For example,
Continuous rollup, Continuous update, batch subscribe
deadlock
Deadlock is when two or more processes are each waiting for a resource that another one holds, so
none of them can proceed. To avoid it use phasing and resource pooling.
environment
AB_HOME - where the Co>Operating System is installed
AB_AIR_ROOT - default location for the EME datastore
sandbox standard environment:
AI_SORT_MAX_CORE, AI_HOME, AI_SERIAL, AI_MFS, etc.
from unix prompt: env | grep AI
wrapper script
unix script to run graphs
multistage component
A multistage component is a component which transforms input records
in 5 stages (1. input selection, 2. temporary initialization, 3. processing,
4. output selection, 5. finalize). So it is a transform component which has
packages. Examples: Rollup, Scan, Normalize, and Denormalize Sorted.
Dynamic DML
Dynamic DML is used if the input metadata can change. Example: at
different times, different input files are received for processing which have
different dml. In that case we can use a flag in the dml: the flag is read
first from the received input file, and the dml corresponding to the flag
is then used.
fan in, fan out
fan out - partition component (increase parallelism)
fan in - departition component (decrease parallelism)
lock
a user can lock the graph for editing so that others will see the message
and cannot edit the same graph.
join vs lookup
Lookup is good for speed with small files (it will load the whole file in
memory). For large files use join. You may need to increase the max-core
limit to handle big joins.
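
The trade-off can be illustrated in Python. The sketch below contrasts an in-memory lookup (the whole small file held in a dictionary) with a sort-merge join that only walks two sorted inputs. It is a conceptual model with invented function names and data, not how the components are implemented.

```python
def lookup_enrich(big, small, key):
    """Lookup style: load the whole small file into memory, then probe it per record."""
    table = {rec[key]: rec for rec in small}
    return [{**rec, **table[rec[key]]} for rec in big if rec[key] in table]

def sort_merge_join(left, right, key):
    """Join style: sort both inputs and merge them; neither has to fit in a lookup table."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, j = [], 0
    for l in left:
        while j < len(right) and right[j][key] < l[key]:
            j += 1                      # skip right records with smaller keys
        k = j
        while k < len(right) and right[k][key] == l[key]:
            out.append({**l, **right[k]})
            k += 1
    return out

orders = [{"id": 3, "amt": 5}, {"id": 1, "amt": 7}, {"id": 3, "amt": 9}, {"id": 4, "amt": 1}]
customers = [{"id": 1, "name": "a"}, {"id": 3, "name": "c"}]

def by_id_amt(rec):
    return (rec["id"], rec["amt"])

via_lookup = sorted(lookup_enrich(orders, customers, "id"), key=by_id_amt)
via_join = sorted(sort_merge_join(orders, customers, "id"), key=by_id_amt)
print(via_lookup == via_join)
```

Both give the same matches; the difference is where the memory goes.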
multi update
multi update executes SQL statements - it treats each input record as a
completely separate piece of work.
scheduler
We can use Autosys, Control-M, or any other external scheduler.
We can take care of dependencies in many ways. For example, if
scripts should run sequentially, we can arrange for this in
Autosys, or we can create a wrapper script that lists the commands
one after another (command1.ksh; command2.ksh; etc.) and start the
wrapper itself with nohup in the background. We can even create a special
graph in Ab Initio to execute individual scripts as needed.
Api and Utility modes in input table
These are database interfaces (api - uses SQL, utility - bulk loads,
whatever the vendor provides)
lookup file
lookup file component. Functions: lookup, lookup_count,
lookup_next, lookup_match, lookup_local.
Lookups are typically used in combination with reformat
components.
Calling stored proc in DB
You can call a stored proc (for example, from an input component). In fact,
you can even write the SP in Ab Initio. Make it "with recompile" to assure
good performance.
Frequently used functions
string_ltrim, string_lrtrim, string_substring, reinterpret_as, today(), now()
data validation
is_valid, is_null, is_blank, is_defined
driving port
When joining inputs (in0, in1, ...) one of the ports is used as "driving" (by
default - in0). The driving input is usually the largest one. The
smallest can have the "Sorted-Input" parameter set to "Input need not be
sorted" because it will be loaded completely in memory.
Ab Initio vs Informatica for ETL
Ab Initio benefits: parallelism built in, multifile system, handles huge
amounts of data, easy to build and run. Generates scripts which can be
easily modified as needed (if something couldn't be done in the ETL tool
itself). The scripts can be easily scheduled using any external scheduler -
and easily integrated with other systems.
Ab Initio doesn't require a dedicated administrator.
Ab Initio doesn't have built-in CDC capabilities (CDC = Change Data
Capture).
Ab Initio allows you to attach error / reject files to each transformation and
capture and analyze the message and data separately (as opposed to
Informatica which has just one huge log). Ab Initio provides immediate
metrics for each component.
override key
override key option is used when we need to join 2 fields which have
different field names.
control file
control file should be in the multifile directory (contains the addresses of
the serial files)
max-core
max-core parameter (for example, sort 100 MBytes) specifies the amount
of memory used by a component (like Sort or Rollup) - per partition -
before spilling to disk. Usually you don't need to change it - just use the
default value. Setting it too high may degrade performance because
of OS swapping and degrading of the performance of other components.
Input Parameters
graph > select parameters tab > click "create" - and create a parameter.
Usage: $paramname. Edit > parameters. These parameters will be
substituted at run time. You may need to declare your parameter
scope as formal.
Error Trapping
Each component has reject, error, and log ports. Reject captures rejected
records, Error captures the corresponding error, and log captures the
execution statistics of the component. You can control the reject status of
each component by setting the reject threshold to either Never Abort, Abort
on first reject, or setting ramp/limit. You can also use the force_error()
function in a transform function.
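
For the ramp/limit setting, the sketch below assumes the commonly described rule that a component aborts once its reject count exceeds limit + ramp * (records processed so far). That rule is not stated in the text above, so treat it as an assumption and check it against your Co>Operating System documentation; the function name is made up.

```python
def should_abort(rejects, processed, limit, ramp):
    """Assumed rule: abort when rejects exceed limit + ramp * records processed so far."""
    return rejects > limit + ramp * processed

# limit=0, ramp=0 behaves like "Abort on first reject"
abort_on_first = should_abort(rejects=1, processed=10, limit=0, ramp=0.0)

# limit=10, ramp=0.01: after 1000 records the tolerance is about 20 rejects
within_tolerance = should_abort(rejects=5, processed=1000, limit=10, ramp=0.01)
over_tolerance = should_abort(rejects=25, processed=1000, limit=10, ramp=0.01)
```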
How to see resource usage
In GDE go to View > Tracking Details - you will see each component's CPU
and memory usage, etc.
assign keys component
Easy and saves development time. Need to understand how to feed parameters,
and you can't control it easily.
Join in DB vs join in Ab Initio
Scenario 1 (preferred): we run a query which joins 2 tables in the DB and gives
us the result in just 1 DB component.
Scenario 2 (much slower): we use 2 database components, extract all data -
and join them in Ab Initio.
Join with DB
not recommended if the number of records is big. It is better to retrieve the data out -
and then join in Ab Initio.
Data Skew
Parameter showing how unevenly data is distributed between partitions.
skew = (partition size - avg. partition size) * 100 / (size of the largest partition)
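
The formula can be tried out in Python on a few hypothetical partition sizes. The sketch assumes a non-empty list with at least one non-empty partition, and rounds to one decimal only for readability.

```python
def skew_percent(sizes):
    """Per-partition skew in percent: (size - average) * 100 / largest."""
    avg = sum(sizes) / len(sizes)
    largest = max(sizes)
    return [round((s - avg) * 100 / largest, 1) for s in sizes]

balanced = skew_percent([50, 50])
skewed = skew_percent([100, 100, 100, 300])
print(balanced, skewed)
```

A perfectly balanced layout gives 0 everywhere; a large positive value marks the partition that is doing most of the work.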
dbc vs cfg
.dbc - database configuration file (dbname, nodes, version, user/pwd) - resides in
the db directory
.cfg - any type of config file. For example, remote connection config (name of
remote server, user/pwd to connect to db, location of OS on remote machine,
connection method). The .cfg file resides in the config dir.
compilation errors
depth not equal, data format error, etc.
depth error: we get this error when two components are connected together but
their layouts do not match
types of partitions
broadcast, partition by expression, partition by round-robin, partition by key, partition with load balance
unused port
when joining, used records go to the output port, unused records - to the unused
port
tuning performance
Go parallel using partitioning. Round-robin partitioning gives good balance.
Use Multi-file system (MFS).
Use Ad Hoc MFS to read many serial files in parallel, and use concat component.
Once data is partitioned - do not switch it to serial and back. Repartition instead.
Do not access large files via NFS - use FTP instead.
Use lookup local rather than lookup (especially for big lookups).
Use rollup and Filter as soon as possible to reduce the number of records.
Ideally do it in the source (database?) before you get the data.
Remove unnecessary components. For example, instead of using filter by
exp, you can implement the same function in reformat/Join/Rollup.
Another example - when joining data from 2 files, use the union function
instead of adding an additional component for removing duplicates.
Use gather instead of concatenate.
It is faster to do a sort after a partition than to do a sort before a partition.
Try to avoid using a join with the "db" component.
When getting data from a database - make sure your queries are fast (use
indexes, etc.). If possible, do the necessary selection / aggregation / sorting in
the database before getting data into Ab Initio.
Tune max-core for optimal performance (for sort it depends on the size of
the input file).
Note - if an in-memory join cannot fit its non-driving inputs in the provided
MAX-CORE, then it will drop all the inputs to disk and in-memory does
not make sense.
Using phase breaks lets you allocate more memory to individual
components - thus improving performance.
Use a checkpoint after sort to land data on disk.
Use the Join and rollup in-memory feature.
When joining a very small dataset to a very large dataset it is more efficient
to broadcast the small dataset to MFS using the broadcast component, or use
the small file as a lookup. But for a large dataset don't use broadcast as a
partitioner.
Use Ab Initio layout instead of database default to achieve parallel loads.
Change the AB_REPORT parameter to increase the monitoring duration.
Use catalogs for reusability.
Components like join / rollup should have the option "Input must be sorted"
if they are placed after a sort component.
Minimize the number of sort components. Minimize usage of the sorted join
component, and if possible replace it by in-memory join/hash join. Use
only the required fields in the sort, reformat, and join components. Use "Sort within
Groups" instead of just Sort when data was already presorted.
Use phasing/flow buffers in case of merge sorted joins.
Minimize the use of regular expression functions like re_index in the
transfer functions.
Avoid repartitioning of data unnecessarily. When splitting records into
more than two flows, use Reformat rather than the Broadcast component.
For joining records from 2 flows use the Concatenate component ONLY when
there is a need to follow some specific order in joining records. If no order
is required then it is preferable to use the Gather component.
Instead of putting many Reformat components consecutively, use the output
indexes parameter in the first Reformat component and mention the
condition there.
delta table
Delta table maintains the sequencer of each data table.
Master (or base) table - a table on top of which we create a view
scan vs rollup
rollup - performs aggregate calculations on groups, scan - calculates cumulative
totals
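
The difference is easy to see on a tiny key-sorted input. The Python sketch below models both for a sum; it also shows why taking the last scan record of each group gives the rollup result. Function names and data are invented for illustration.

```python
from itertools import accumulate, groupby

def group_key(record):
    return record[0]

def rollup_sum(records):
    """One output record per key group: (key, total)."""
    return [(k, sum(amt for _, amt in grp)) for k, grp in groupby(records, key=group_key)]

def scan_sum(records):
    """One output record per input record: (key, running total within the group)."""
    out = []
    for k, grp in groupby(records, key=group_key):
        amounts = [amt for _, amt in grp]
        out.extend((k, total) for total in accumulate(amounts))
    return out

rows = [("a", 1), ("a", 2), ("b", 5), ("b", 1), ("b", 4)]   # already sorted by key
rolled = rollup_sum(rows)
scanned = scan_sum(rows)

# Keeping only the last scan record of each group reproduces the rollup
last_of_scan = [list(grp)[-1] for _, grp in groupby(scanned, key=group_key)]
print(rolled, scanned)
```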
packages
used in multistage components or transform components
Reformat vs "Redefine Format"
Reformat - deriving new data by adding/dropping fields
Redefine format - rename fields
Conditional DML
DML which is separated based on a condition
SORTWITHINGROUP
The prerequisite for using sortwithingroup is that the data is already sorted
by the major key. sortwithingroup outputs the data once it has finished
reading the major key group. It is like an implicit phase.
passing a condition as a parameter
Define a Formal Keyword Parameter of type string. For example, you call it
FilterCondition, and you want it to do filtering on COUNT > 0. In your
graph, in your "Filter by expression" component, enter the following condition:
$FilterCondition
Now on your command line or in a wrapper script give the following command
(the quotes keep the shell from treating > as output redirection):
YourGraphname.ksh -FilterCondition "COUNT > 0"
Passing file name as a parameter
#!/bin/ksh
# Run the project setup script for this environment
typeset PROJ_DIR=$(cd $(dirname $0)/..; pwd)
. $PROJ_DIR/ab_project_setup.ksh $PROJ_DIR
# Pass the two script parameters to the graphs as input file names
if [ $# -eq 2 ];
then
    INPUT_FILE_PARAMETER_1=$1
    INPUT_FILE_PARAMETER_2=$2
    # This graph is using the input file
    cd $AI_RUN
    ./my_graph1.ksh $INPUT_FILE_PARAMETER_1
    # This graph also is using the input file.
    ./my_graph2.ksh $INPUT_FILE_PARAMETER_2
    exit 0;
else
    # Wrong number of arguments: exactly two file names are expected
    echo Insufficient parameters
    exit 1;
fi
-------------------------------------
#!/bin/ksh
# Run the project setup script for this environment
typeset PROJ_DIR=$(cd $(dirname $0)/..; pwd)
. $PROJ_DIR/ab_project_setup.ksh $PROJ_DIR
# Export the script parameter 1 as INPUT_FILE_NAME
export INPUT_FILE_NAME=$1
# This graph is using the input file
cd $AI_RUN
./my_graph1.ksh
# This graph also is using the input file.
./my_graph2.ksh
exit 0;
How to remove header and trailer lines?
use conditional dml where you can separate detail from header and trailer. For
validations use reformat with count: 3 (out0: header, out1: detail, out2: trailer).
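
The three-way split and a typical validation can be sketched in Python. The record layout is purely hypothetical: the first character is assumed to be a record type (H, D or T), and the trailer is assumed to carry the detail count.

```python
def split_records(lines):
    """Route each line to header, detail or trailer by its (assumed) type character."""
    header, detail, trailer = [], [], []
    outputs = {"H": header, "D": detail, "T": trailer}   # out0, out1, out2
    for line in lines:
        outputs[line[0]].append(line)   # an unknown type character raises KeyError
    return header, detail, trailer

def trailer_count_ok(detail, trailer):
    """Validation: exactly one trailer, and its count matches the detail records."""
    return len(trailer) == 1 and int(trailer[0][1:]) == len(detail)

lines = ["H20090401", "Dalice", "Dbob", "T2"]
header, detail, trailer = split_records(lines)
print(detail, trailer_count_ok(detail, trailer))
```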
How to create a multi file system on Windows
first method: in GDE go to RUN > Execute Command - and run m_mkfs
c:control c:dp1 c:dp2 c:dp3 c:dp4
second method: double-click on the file component, and in the ports tab
double-click on partitions - there you can enter the number of partitions.
Vector
A vector is simply an array. It is an ordered set of elements of the same type (the type
can be any type, including a vector or a record).
Dependency Analysis
Dependency analysis will answer questions such as where the data comes from,
which applications produce this data, and which depend on it.
Surrogate key
There are many ways to create a surrogate key. For example, you can use the
next_in_sequence() function in your transform. Or you can use the "Assign key
values" component. Or you can write a stored procedure - and call it.
Note: if you use partitions, then do something like this:
(next_in_sequence()-1)*number_of_partitions()+this_partition()
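
A quick Python check shows why the partitioned formula yields unique keys. It assumes the per-partition counter starts at 1 and partitions are numbered from 0; the function and argument names are illustrative only.

```python
def surrogate_key(seq, n_partitions, partition):
    """seq: 1-based counter within a partition; partition: 0-based partition number."""
    return (seq - 1) * n_partitions + partition

# 3 partitions, 4 records each: every partition counts 1, 2, 3, 4 on its own
keys = sorted(surrogate_key(s, 3, p) for p in range(3) for s in range(1, 5))
print(keys)
```

Each partition takes every n-th value, so the keys interleave without collisions.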
.abinitiorc
This is a config file for Ab Initio - in the user's home directory and in
$AB_HOME/Config. It sets the abinitio home path, configuration variables
(AB_WORK_DIR, AB_DATA_DIR, etc.), login info (id, encrypted password),
login methods for hosts for execution (like EME host, etc.), etc.
.profile
your ksh init file (environment, aliases, path variables, history file settings,
command prompt settings, etc.)
data mapping, data modeling
How to execute the graph
From GDE - whole graph or by phases. From checkpoint. Also using ksh scripts.
Write Multiple Files
A component which allows writing simultaneously into multiple local files
Testing
Run the graph - see the results. Use components from the Validate category.
Sandbox vs EME
Sandbox is your private area where you develop and test. Only one project and
one version can be in the sandbox at any time. The EME Datastore holds the
versions of the code that have been checked into it (source control).
Layout
Where the data files are and where the components are running. For example, for
data - serial or partitioned (multi-file). The layout is defined by the location of the
file (or a control file for the multifile). In the graph the layout can propagate
automatically (for multifile you have to provide details).
Latest versions
April 2009: GDE ver. 1.15.6, Co>Operating System ver. 2.14.
Graph parameters
menu edit > parameters - allows you to specify private parameters for the graph.
They can be of 2 types - local and formal.
Plan>It
You can define pre- and post-processes, triggers. Also you can specify methods to
run on success or on failure of the graphs.
Frequently used components
input file / output file
input table / output table
lookup / lookup_local
reformat
gather / concatenate
join
runsql
join with db
compression components
filter by expression
sort (single or multiple keys)
rollup
trash
partition by expression / partition by key
running on hosts
The Co>Operating System is layered on top of the native OS (unix). When running from
GDE, GDE generates a script (according to "run" settings). The Co>Op system will
execute the scripts on different machines (using specified host settings and
connection methods, like rexec, telnet, rsh, rlogin) - and then return error or success
codes back.
conventional loading vs direct loading
This is basically an Oracle question - regarding the SQLLDR (SQL Loader) utility.
Conventional load - using insert statements. All triggers will fire, all constraints
will be checked, all indexes will be updated.
Direct load - data is written directly block by block. Can load into a specific
partition. Some constraints are checked, indexes may be disabled - need to specify
native options to skip index maintenance.
semi-join
in abinitio there are 3 types of joins: inner join, outer join, and semi join.
for inner join the 'record_requiredN' parameter is true for all "in" ports.
for outer join it is false for all the "in" ports.
for semi join it is true for the required input and false for the other
inputs.
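
The effect of the record_requiredN flags can be modelled in Python for a two-input join. This is a simplified sketch with invented names: output rows are (in0 record, in1 record) pairs, with None where an input has no match. What the text calls a semi join behaves here like a left outer join in SQL terms.

```python
from collections import defaultdict

def join(in0, in1, key, record_required0, record_required1):
    """Two-input join controlled by record_required flags (simplified model)."""
    by0, by1 = defaultdict(list), defaultdict(list)
    for r in in0:
        by0[r[key]].append(r)
    for r in in1:
        by1[r[key]].append(r)
    out = []
    for k in sorted(set(by0) | set(by1)):
        left, right = by0.get(k, []), by1.get(k, [])
        # Skip the key if a required input has no record for it
        if (record_required0 and not left) or (record_required1 and not right):
            continue
        for l in left or [None]:
            for r in right or [None]:
                out.append((l, r))
    return out

in0 = [{"id": 1}, {"id": 2}]
in1 = [{"id": 2}, {"id": 3}]

inner = join(in0, in1, "id", True, True)      # both inputs required
outer = join(in0, in1, "id", False, False)    # no input required
semi = join(in0, in1, "id", True, False)      # only in0 required
print(len(inner), len(outer), len(semi))
```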
http://www.geekinterview.com/Interview-Questions/Data-Warehouse/Abinitio/page10
