Phases vs Checkpoints
Phases - used to break the graph into pieces. Temporary files created during a phase are deleted after it completes. Phases let you manage the resource-consuming (memory, CPU, disk) parts of the application separately.
Checkpoints - created for recovery purposes. These are points where everything is written to disk. You can recover to the latest saved point and rerun from it.
You can have phase breaks with or without checkpoints.

xfr
A new sandbox will have many directories: mp, dml, xfr, db, ... . xfr is the directory where you put files with the extension .xfr containing your own custom functions (and then use: include "somepath/xfr/yourfile.xfr"). Usually XFR stores mapping.

three types of parallelism
1) Data parallelism - the data is partitioned and the partitions are processed simultaneously by separate copies of a component.
2) Component parallelism - components on different branches of the graph execute simultaneously.
3) Pipeline parallelism - components connected in sequence work simultaneously on different records of the same flow.

MFS
Multi-File System
m_mkfs - create a multifile (m_mkfs ctrlfile mpfile1 ... mpfileN)
m_ls - list all the multifiles
m_rm - remove the multifile
m_cp - copy a multifile
m_mkdir - add more directories to an existing directory structure

Memory requirements of a graph
Each partition of a component uses: ~ 8 MB + max-core (if any).
Add the size of the lookup files used in the phase (if multiple components use the same lookup, count it only once).
Multiply by the degree of parallelism.
Add up all components in a phase; that is how much memory is used in that phase.
Select the largest-memory phase in the graph.

How to calculate a SUM
SCAN
ROLLUP
SCAN WITH ROLLUP
Scan followed by Dedup Sort, selecting the last record

dedup sort with null key
If we don't use any key in the sort component while using the dedup sort, then the output depends on the keep parameter:
first - only the first record
last - only the last record
unique_only - there will be no records in the output file

Join on partitioned flow
file1 (A,B,C), file2 (A,B,D). We partition both files by "A", and then join by "A,B".
Is it OK? Or should we partition by "A,B"? Partitioning by "A" is sufficient for correctness: records that agree on "A,B" also agree on "A", so they always land in the same partition.

checkin, checkout
You can do checkin/checkout using the wizard right from the GDE, using versions and tags.

how to have different passwords for QA and production
Parameterize the .dbc file, or use an environment variable.

How to get records 50-75 out of 100
use scan and filter
m_dump <dml> <mfs file> -start 50 -end 75
use next_in_sequence() and a Filter by Expression component: store the result of next_in_sequence() in a field once per record (each call advances the sequence), then filter on that field, e.g. seq >= 50 && seq <= 75

How to convert a serial file into MFS
Create the MFS, then use a partition component.

project parameters vs. sandbox parameters
When you check out a project into your sandbox, you get project parameters. Once in your sandbox, you can refer to them as sandbox parameters.

Bad-Straight-flow
An error you get when connecting mismatching components (for example, connecting a serial flow directly to an mfs flow without using a partition component).

merging graphs
You cannot merge two Ab Initio graphs. You can use the output of one graph as input for another. You can also copy/paste the contents between graphs. See also about using .plan.

partitioning, re-partitioning, departitioning
partitioning - dividing a single flow of records (serial file, mfs) into multiple flows
departitioning - removing partitioning (gather and merge components)
re-partitioning - changing the number of partitions (e.g., from 2 to 4 flows)

lookup file
For large amounts of data use an MFS lookup file (instead of serial).

indexing
No indexes as such. But there is "output indexing" using Reformat and doing the necessary coding in the transform part.

Environment project
Environment project - a special public project that exists in every Ab Initio environment. It contains all the environment parameters required by the private or public projects which constitute the AI Standard Environment.

Aggregate vs Rollup
Aggregate - old component
Rollup - newer, extended, recommended to use instead of Aggregate.
(built-in functions like sum, count, avg, min, max, product, ...)

EME, GDE, Co-operating system
EME = Enterprise Meta>Environment (often expanded as Enterprise Metadata Environment). Functions: repository, version control, statistical analysis, dependency analysis. It is on the server side and holds all the projects (metadata of transformations, config info, source and target info: graph, dml, xfr, ksh, sql, etc.). This is where you checkin/checkout. The /Project dir of the EME contains common directories for all application sandboxes connected to it. It also helps in dependency analysis of code. Ab Initio has a series of air commands to manipulate repository objects.
GDE = Graphical Development Environment (on the client box)
Co-operating system = Ab Initio server installed on top of the native (unix) OS on the server

fencing
Fencing means job control on a priority basis. In AI it actually refers to customized phase breaking. A well-fenced graph means that no matter what the source data volume is, the process will not get caught in deadlocks. It actually limits the number of simultaneous processes.
Fencing - changing the priority of a job
Phasing - managing the resources to avoid deadlocks. For example, limiting the number of simultaneous processes (by breaking the graph into phases, only 1 of which can run at any given time).

Continuous components
Continuous components - produce useful output while running continuously. For example, Continuous Rollup, Continuous Update, Batch Subscribe.

Question                                             Answer
==================================================== ======

deadlock
Deadlock is when two or more processes each wait for a resource held by another, so none of them can proceed. To avoid it, use phasing and resource pooling.

environment
AB_HOME - where the co>operating system is installed
AB_AIR_ROOT - default location for the EME datastore
Sandbox standard environment: AI_SORT_MAX_CORE, AI_HOME, AI_SERIAL, AI_MFS, etc.
From the unix prompt: env | grep AI

wrapper script
A unix script to run graphs.

multistage component
A multistage component is a component which transforms input records in 5 stages (1. input select, 2. temporary initialization, 3. processing, 4.
output selection, 5. finalize). So it is a transform component which has packages. Examples: Rollup, Scan, Normalize, and Denormalize Sorted.

Dynamic DML
Dynamic DML is used if the input metadata can change. Example: at different times, different input files are received for processing, and they have different dml. In that case we can put a flag in the dml; the flag is read first from the received input file, and the corresponding dml is used according to the flag.

fan in, fan out
fan out - partition component (increases parallelism)
fan in - departition component (decreases parallelism)

lock
A user can lock the graph for editing so that others will see the message and cannot edit the same graph.

join vs lookup
Lookup is good for speed with small files (it loads the whole file into memory). For large files use join. You may need to increase the max-core limit to handle big joins.

multi update
Multi update executes SQL statements - it treats each input record as a completely separate piece of work.

scheduler
We can use Autosys, Control-M, or any other external scheduler. We can take care of dependencies in many ways. For example, if scripts should run sequentially, we can arrange for this in Autosys, or we can create a wrapper script that runs the commands one after another (command1.ksh; command2.ksh; etc., without a trailing &, which would start them in the background at the same time; run the wrapper itself under nohup if needed). We can even create a special graph in Ab Initio to execute individual scripts as needed.

Api and Utility modes in input table
These are database interfaces (api - uses SQL; utility - bulk loads, whatever the vendor provides).

lookup file
Lookup file component. Functions: lookup, lookup_count, lookup_next, lookup_match, lookup_local. Lookups are typically used in combination with Reformat components (the lookup functions are called from the transform).

Calling stored proc in DB
You can call a stored proc (for example, from an input component). In fact, you can even write the SP in Ab Initio. Make it "with recompile" to assure good performance.

Frequently used functions
string_ltrim, string_lrtrim, string_substring, reinterpret_as, today(), now()

data validation
is_valid, is_null, is_blank, is_defined

driving port
When joining inputs (in0, in1, ...), one of the ports is used as "driving" (by default - in0). The driving input is usually the largest one, whereas the smallest can have its "Sorted-Input" parameter set to "Input need not be sorted" because it will be loaded completely into memory.

Ab Initio vs Informatica for ETL
Ab Initio benefits: parallelism built in, multifile system, handles huge amounts of data, easy to build and run. Generates scripts which can be easily modified as needed (if something couldn't be done in the ETL tool itself). The scripts can be easily scheduled using any external scheduler and easily integrated with other systems.
Ab Initio doesn't require a dedicated administrator.
Ab Initio doesn't have built-in CDC capabilities (CDC = Change Data Capture).
Ab Initio lets you attach error/reject files to each transformation and capture and analyze the messages and data separately (as opposed to Informatica, which has just one huge log). Ab Initio provides immediate metrics for each component.

override key
The override key option is used when we need to join 2 fields which have different field names.

control file
The control file should be in the multifile directory (it contains the addresses of the serial files).

max-core
The max-core parameter (for example, sort 100 MBytes) specifies the amount of memory used by a component (like Sort or Rollup), per partition, before spilling to disk. Usually you don't need to change it - just use the default value. Setting it too high may degrade performance because of OS swapping and degraded performance of other components.

Input Parameters
graph > select parameters tab > click "create" - and create a parameter. Usage: $paramname. Edit > parameters. These parameters are substituted at run time. You may need to declare your parameter scope as formal.

Error Trapping
Each component has reject, error, and log ports.
Reject captures the rejected records, Error captures the corresponding error, and Log captures the execution statistics of the component. You can control the reject status of each component by setting the reject threshold to either Never Abort, Abort on first reject, or by setting ramp/limit. You can also use the force_error() function in the transform function.

How to see resource usage
In the GDE go to View > Tracking Details - you will see each component's CPU and memory usage, etc.

assign keys component
Easy and saves development time. You need to understand how to feed its parameters, and you can't control it easily.

Join in DB vs join in Ab Initio
Scenario 1 (preferred): we run a query which joins 2 tables in the DB and gives us the result in just 1 DB component.
Scenario 2 (much slower): we use 2 database components, extract all the data, and join them in Ab Initio.
The "Join with DB" component is not recommended if the number of records is big. It is better to retrieve the data out and then join in Ab Initio.

Data Skew
A parameter showing how unevenly data is distributed between partitions.
skew = (partition size - avg. partition size) * 100 / (size of the largest partition)

dbc vs cfg
.dbc - database configuration file (dbname, nodes, version, user/pwd) - resides in the db directory
.cfg - any type of config file, for example a remote connection config (name of the remote server, user/pwd to connect to the db, location of the OS on the remote machine, connection method). The .cfg file resides in the config dir.

compilation errors
depth not equal, data format error, etc.
depth error: raised when two components are connected together but their layouts do not match.

types of partitions
broadcast
partition by expression
partition by round-robin
partition by key
partition with load balance

unused port
When joining, used records go to the output port, unused records go to the unused port.

tuning performance
Go parallel using partitioning. Round-robin partitioning gives good balance.
Use the Multi-file system (MFS).
Use Ad Hoc MFS to read many serial files in parallel, and use the concat component.
Once data is partitioned, do not switch it to serial and back. Repartition instead.
Do not access large files via NFS - use FTP instead.
Use lookup local rather than lookup (especially for big lookups).
Use Rollup and Filter as soon as possible to reduce the number of records. Ideally do it in the source (database?) before you get the data.
Remove unnecessary components. For example, instead of using filter by exp, you can implement the same function in Reformat/Join/Rollup. Another example - when joining data from 2 files, use the union function instead of adding an additional component for removing duplicates.
Use gather instead of concatenate.
It is faster to do a sort after a partition than to do a sort before a partition.
Try to avoid using a join with the "db" component.
When getting data from a database, make sure your queries are fast (use indexes, etc.). If possible, do the necessary selection / aggregation / sorting in the database before getting data into Ab Initio.
Tune max-core for optimal performance (for sort it depends on the size of the input file).
Note - if an in-memory join cannot fit its non-driving inputs in the provided MAX-CORE, then it will drop all the inputs to disk, and in-memory does not make sense.
Using phase breaks lets you allocate more memory to individual components, thus improving performance.
Use a checkpoint after sort to land data on disk.
Use the Join and Rollup in-memory feature.
When joining a very small dataset to a very large dataset, it is more efficient to broadcast the small dataset to the MFS using the broadcast component, or to use the small file as a lookup. But for a large dataset don't use broadcast as a partitioner.
Use the Ab Initio layout instead of the database default to achieve parallel loads.
Change the AB_REPORT parameter to increase the monitoring duration.
Use catalogs for reusability.
Components like Join/Rollup should have the option "Input must be sorted" if they are placed after a sort component.
Minimize the number of sort components.
Minimize usage of the sorted join component, and if possible replace it with an in-memory join/hash join.
Use only the required fields in the Sort, Reformat, and Join components.
Use "Sort within Groups" instead of just Sort when the data was already presorted.
Use phasing/flow buffers in case of merge sorted joins.
Minimize the use of regular expression functions like re_index in the transform functions.
Avoid repartitioning data unnecessarily.
When splitting records into more than two flows, use Reformat rather than the Broadcast component.
For joining records from 2 flows, use the Concatenate component ONLY when there is a need to follow some specific order in joining records. If no order is required, then it is preferable to use the Gather component.
Instead of putting many Reformat components consecutively, use the output indexes parameter in the first Reformat component and mention the condition there.

delta table
A delta table maintains the sequencer of each data table.
Master (or base) table - a table on top of which we create a view.

scan vs rollup
rollup - performs aggregate calculations on groups
scan - calculates cumulative totals

packages
Used in multistage components or transform components.

Reformat vs "Redefine Format"
Reformat - deriving new data by adding/dropping fields
Redefine format - renaming fields

Conditional DML
DML which is separated based on a condition.

SORTWITHINGROUP
The prerequisite for using sortwithingroup is that the data is already sorted by the major key. sortwithingroup outputs the data once it has finished reading the major key group. It is like an implicit phase.

passing a condition as a parameter
Define a Formal Keyword Parameter of type string. For example, you call it FilterCondition, and you want it to do filtering on COUNT > 0 .
Also in your graph, in your "Filter by expression" component, enter the following condition: $FilterCondition
Now on your command line or in a wrapper script give the following command (the condition is quoted so that the shell does not treat > as a redirection):
YourGraphname.ksh -FilterCondition "COUNT > 0"

Passing file name as a parameter
    #!/bin/ksh
    # Run the project setup script for this environment
    typeset PROJ_DIR=$(cd "$(dirname "$0")/.." && pwd)
    . "$PROJ_DIR/ab_project_setup.ksh" "$PROJ_DIR"
    # Pass the two script parameters on to the graphs as arguments
    if [ $# -eq 2 ]; then
        INPUT_FILE_PARAMETER_1=$1
        INPUT_FILE_PARAMETER_2=$2
        cd "$AI_RUN"
        # This graph is using the first input file
        ./my_graph1.ksh "$INPUT_FILE_PARAMETER_1"
        # This graph is using the second input file
        ./my_graph2.ksh "$INPUT_FILE_PARAMETER_2"
        exit 0
    else
        echo "Insufficient parameters"
        exit 1
    fi
    -------------------------------------
    #!/bin/ksh
    # Run the project setup script for this environment
    typeset PROJ_DIR=$(cd "$(dirname "$0")/.." && pwd)
    . "$PROJ_DIR/ab_project_setup.ksh" "$PROJ_DIR"
    # Export the script parameter $1 as INPUT_FILE_NAME
    export INPUT_FILE_NAME=$1
    cd "$AI_RUN"
    # This graph is using the input file
    ./my_graph1.ksh
    # This graph also is using the input file
    ./my_graph2.ksh
    exit 0

How to remove header and trailer lines?
Use conditional dml where you can separate detail from header and trailer. For validations use Reformat with count 3 (out0: header, out1: detail, out2: trailer).

How to create a multi file system on Windows
first method: in the GDE go to RUN > Execute Command - and run
m_mkfs c:control c:dp1 c:dp2 c:dp3 c:dp4
second method: double-click on the file component, and in the ports tab double-click on partitions - there you can enter the number of partitions.

Vector
A vector is simply an array. It is an ordered set of elements of the same type (the type can be any type, including a vector or a record).

Dependency Analysis
Dependency analysis answers questions such as where the data comes from, which applications produce this data, and which applications depend on it.

Surrogate key
There are many ways to create a surrogate key. For example, you can use the next_in_sequence() function in your transform. Or you can use the "assign key values" component. Or you can write a stored procedure and call it.
Note: if you use partitions, then do something like this:
(next_in_sequence()-1)*no_of_partition()+this_partition()

.abinitiorc
This is a config file for Ab Initio - in the user's home directory and in $AB_HOME/Config. It sets the abinitio home path, configuration variables (AB_WORK_DIR, AB_DATA_DIR, etc.), login info (id, encrypted password), login methods for hosts for execution (like the EME host, etc.), etc.

.profile
Your ksh init file (environment, aliases, path variables, history file settings, command prompt settings, etc.)

data mapping, data modeling

How to execute the graph
From the GDE - whole graph or by phases. From a checkpoint. Also using ksh scripts.

Write Multiplefiles
A component which allows writing simultaneously into multiple local files.

Testing
Run the graph - see the results. Use components from the Validate category.

Sandbox vs EME
Sandbox is your private area where you develop and test. Only one project and one version can be in the sandbox at any time. The EME datastore holds all versions of the code that have been checked into it (source control).

Layout
Where the data files are and where the components are running. For example, for data - serial or partitioned (multi-file). The layout is defined by the location of the file (or a control file for the multifile). In the graph the layout can propagate automatically (for a multifile you have to provide details).

Latest versions
April 2009: GDE ver. 1.15.6, Co-operative system ver. 2.14.

Graph parameters
menu edit > parameters - allows you to specify private parameters for the graph. They can be of 2 types - local and formal.

Plan>It
You can define pre- and post-processes, triggers. Also you can specify methods to run on success or on failure of the graphs.

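The partition-aware formula from the "Surrogate key" entry above can be checked with a small sketch. This is plain Python, not Ab Initio DML: the function names are hypothetical, and it assumes that the per-partition sequence starts at 1 and that partitions are numbered from 0. Under those assumptions, keys generated independently in each partition never collide.

```python
def surrogate_key(seq: int, n_partitions: int, partition: int) -> int:
    """Mirror of (next_in_sequence() - 1) * no_of_partition() + this_partition().

    seq          -- per-partition counter, assumed to start at 1
    n_partitions -- degree of parallelism
    partition    -- partition number, assumed to be zero-based
    """
    return (seq - 1) * n_partitions + partition


def partition_keys(partition: int, n_partitions: int, n_records: int) -> list:
    """Keys one partition would assign to its first n_records records."""
    return [surrogate_key(seq, n_partitions, partition)
            for seq in range(1, n_records + 1)]


if __name__ == "__main__":
    n = 4
    for p in range(n):
        # Each partition strides through the key space in steps of n.
        print(p, partition_keys(p, n, 3))
```

Each partition owns one residue class modulo the number of partitions, which is why no coordination between partitions is needed.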
Frequently used components
input file / output file
input table / output table
lookup / lookup_local
reformat
gather / concatenate
join
runsql
join with db
compression components
filter by expression
sort (single or multiple keys)
rollup
trash
partition by expression / partition by key

running on hosts
The co>operating system is layered on top of the native OS (unix). When running from the GDE, the GDE generates a script (according to the "run" settings). The co>op system will execute the scripts on different machines (using the specified host settings and connection methods, like rexec, telnet, rsh, rlogin) and then return error or success codes back.

conventional loading vs direct loading
This is basically an Oracle question, regarding the SQLLDR (SQL Loader) utility.
Conventional load - uses insert statements. All triggers will fire, all constraints will be checked, all indexes will be updated.
Direct load - data is written directly, block by block. Can load into a specific partition. Some constraints are checked, indexes may be disabled - you need to specify native options to skip index maintenance.

semi-join
In abinitio there are 3 types of joins: inner join, outer join, and semi join.
For inner join the 'record_requiredN' parameter is true for all "in" ports.
For outer join it is false for all the "in" ports.
For semi join it is true for the required input port and false for the other ports.

http://www.geekinterview.com/Interview-Questions/Data-Warehouse/Abinitio/page10
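The record_requiredN settings described in the semi-join entry above can be illustrated with a two-input sketch. This is plain Python rather than Ab Initio DML, the function and field names are hypothetical, and "semi join" is used in this document's sense (one input required, the other optional).

```python
from collections import defaultdict


def join(in0, in1, key, required0, required1):
    """Two-input join driven by record_required-style flags.

    required0 / required1 -- True means a record must be present on that
    input for an output row to be produced for the key.
    Returns (key, record_from_in0_or_None, record_from_in1_or_None) tuples.
    """
    groups0, groups1 = defaultdict(list), defaultdict(list)
    for rec in in0:
        groups0[rec[key]].append(rec)
    for rec in in1:
        groups1[rec[key]].append(rec)

    out = []
    for k in sorted(set(groups0) | set(groups1)):
        left, right = groups0.get(k), groups1.get(k)
        if required0 and not left:
            continue  # in0 is required but has no record for this key
        if required1 and not right:
            continue  # in1 is required but has no record for this key
        for l in (left or [None]):
            for r in (right or [None]):
                out.append((k, l, r))
    return out


if __name__ == "__main__":
    customers = [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}]
    orders = [{"id": 2, "item": "p"}, {"id": 3, "item": "q"}]
    print("inner:", join(customers, orders, "id", True, True))
    print("outer:", join(customers, orders, "id", False, False))
    print("semi :", join(customers, orders, "id", True, False))
```

With both flags true only keys present on both inputs survive; with both false every key survives; with one flag true, every key of the required input survives and the other side is filled with None where it has no match.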