
PMI Virtual Library
© 2010 Ben Harden

Estimating Extract, Transform, and Load (ETL) Projects

By Ben Harden, PMP

In the consulting world, project estimation is a critical component required for the delivery of a successful project. If you estimate correctly, you will deliver a project on time and within budget; get it wrong and you could end up over budget, with an unhappy client and a burned-out team. Project estimation for business intelligence and data integration projects is especially difficult, given the number of stakeholders involved across the organization as well as the unknowns of data complexity and quality. Add to this mix a firm fixed price RFP (request for proposal) response for a client your organization has not done work for, and you have the perfect climate for a poor estimate. In this article, I share my thoughts about the best way to approach a project estimate for an extract, transform, and load (ETL) project.

For those of you not familiar with ETL, it is a common technique used in data warehousing to move data from one database (the source) to another (the target). In order to accomplish this data movement, the data first must be extracted out of the source system—the “E.” Once the data extract is complete, data transformation may need to occur. For example, it may be necessary to transform a state name to a two-letter state code (Virginia to VA)—the “T.” After the data have been extracted from the source and transformed to meet the target system requirements, they can then be loaded into the target database—the “L.”

Before starting your ETL estimation, you need to understand what type of estimate you are trying to produce. How precise does the estimate need to be? Will you be estimating effort, schedule, or both? Will you build your estimate top down or bottom up? Is the result being used for inclusion in an RFP response or will it be used in an unofficial capacity? By answering these questions, you can assess risk and produce an estimate that best mitigates that risk.

In many cases, the information you have to base your estimate on is high level, with only a few key data points to go on, and you do not have either the time or ability to ask for more details. In these situations, the response I hear most often is that an estimate cannot be produced. I disagree! As long as the precision of the estimate produced is understood by the customer, there is value in the estimate and it should be done. The alternative to a high-level estimate is none at all, and as someone who has to deliver on the estimate, I would rather have a bad estimate with clear assumptions than no baseline at all. The key is being clear about how the estimate should be used and what the limitations are. I have found that one of the best ways to frame the accuracy of the estimate with the customer and project team is through the use of assumptions. Every estimate is built with many assumptions in mind, and having them clearly laid out almost always generates good discussion and, eventually, a more refined and accurate estimate.
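To make the extract-transform-load flow described above concrete, here is a minimal Python sketch; the customer records, the state lookup, and the in-memory "databases" are invented for illustration and stand in for real source and target systems.

```python
# A minimal sketch of the E-T-L flow described above; the records, the
# state lookup, and the in-memory "databases" are invented for the example.

STATE_CODES = {"Virginia": "VA", "Maryland": "MD"}

def extract(source_rows):
    """E: pull the raw rows out of the source system."""
    return list(source_rows)

def transform(rows):
    """T: replace each state name with its two-letter code."""
    return [{**row, "state": STATE_CODES[row["state"]]} for row in rows]

def load(rows, target_table):
    """L: write the transformed rows into the target structure."""
    target_table.extend(rows)

source = [{"customer": "Acme", "state": "Virginia"}]
target = []
load(transform(extract(source)), target)
print(target)  # [{'customer': 'Acme', 'state': 'VA'}]
```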
A common question that comes up during the estimation process is effort versus schedule; in other words, how many hours will the work take versus the duration it will take to complete the effort. To simplify the estimating process, I start with a model that delivers the effort and completely ignores the schedule. Once the effort has been refined, it can be taken to the delivery team for a secondary discussion on overlaying the estimated effort across time.

Once you know what type of estimate you are trying to deliver and who your audience is, you can begin the process of effectively estimating the work. All too often, this up-front thinking is ignored and the resulting estimate does not meet expectations.

I’ve reviewed a number of the different ETL estimating techniques available and have found some to be extremely complex and others more straightforward. Then there is the theory of estimating and the tried-and-true models of Wide Band Delphi and COCOMO. All of these theories are interesting and have value, but they don’t easily produce the data to support the questions I am always asked in the consulting world: How much will this project cost? How many people will you need to deliver it? What does the delivery schedule look like? I have discovered that most models focus on one part of the effort (generally development) but neglect to include requirements, design, testing, data stewardship, production deployment, warranty support, and so forth. When estimating a project in the consulting world, we care about the total cost, not just how long it will take to develop the ETL code.

Estimating an ETL Project
In the ETL space I use two models (top down and bottom up) for my estimation, if I have been provided enough data to support both; this helps better ground the estimate and confirms that there are no major gaps in the model.

Estimating an ETL Project Using a Top Down Technique
To start a top down estimate, I break down the project by phase and then add in key oversight roles that don’t pertain specifically to any single phase (e.g., project manager, technical lead, subject matter expert, operations, etc.). Once I have the phases that relate to the project I am estimating for, I estimate each phase vertically as a percentage of the development effort, as shown in the chart below. Everyone has a different idea about what percentage to use in the estimate and there is no one right answer. I start with the numbers below and tweak them accordingly, based on the project environment and resource experience.

    Phase                Percentage of Development
    Requirements         50% of Development
    Design               25% of Development
    Development          (baseline)
    System Test          25% of Development
    Integration Test     25% of Development

Once I have my verticals established, I break my estimate horizontally into low, medium, and high, using the percentages below:

    Complexity     Percent of Medium
    Low            50% of Medium
    Medium         N/A
    High           150% of Medium

Generally, when doing a high-level ETL estimate, I know the number of sources I am dealing with and, if I’m lucky, I also have some broad-stroke level of complexity information. Once I have my model built out, as described above, I work with my development team to understand the effort involved for a single source. I then take the numbers of sources and plug them into my model, as shown below (Figure 1). If I don’t have complexity information, I simply record the same numbers of sources in the low, medium, and high columns to give me an estimate range of +/−50%.

I now have a framework I can share with my team to shape my estimate. After my initial cut, I meet with key team members to review the estimate, and I inevitably end up with a revised estimate and, more importantly, a comprehensive set of assumptions. There is no substitute for socializing your estimate with your team or with a group of subject matter experts; they are closest to the work and have input and ideas that help refine the estimate into something that is accurate and defendable when cost or hours are challenged by the client.

    Sourcing Data:
    Task                                              Low (Hrs)   Medium (Hrs)   High (Hrs)
    Requirements and Data Mapping                        3.0          6.0            9.0
    High Level Design                                    4.0          4.0            8.0
    Technical Design                                     4.0          8.0           12.0
    Development & Unit Testing                          16.0         24.0           40.0
    System/QA Test                                       8.0         12.0           20.0
    Integration Test and Production Rollout Support      9.0         12.0           18.0
    Tech Lead Support                                    4.4          6.6           10.7
    Project Management Support                           2.2          3.3            5.4
    Subject Matter Expert                                4.4          6.6           10.7

    Totals Per Source:
    Total Hours                                         55.0         82.5          133.8
    Total Days                                           6.9         10.3           16.7
    Total Weeks                                          1.4          2.1            3.3

    Sourcing Totals:
    Number of Sources                                    2.0          4.0            3.0
    Total Effort (Hours)                               110.0        330.0          401.3
    Total Days                                          13.8         41.3           50.2
    Total Weeks                                          2.8          8.3           10.0

Figure 1: Sample Top Down Estimate.
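The top down arithmetic above can be sketched in a few lines: phases sized as percentages of the development effort, low and high bands sized as percentages of the medium estimate, and per-source hours multiplied by the number of sources. The 24-hour development figure is a sample input, and the oversight roles from Figure 1 are omitted to keep the sketch short.

```python
# Sketch of the top down model described above: each phase is sized as a
# percentage of the development effort ("verticals"), the low and high
# bands are sized as percentages of the medium estimate ("horizontals"),
# and per-source hours are multiplied by the number of sources.

PHASE_PCT_OF_DEV = {
    "Requirements": 0.50,
    "Design": 0.25,
    "Development": 1.00,   # the baseline the other phases hang off
    "System Test": 0.25,
    "Integration Test": 0.25,
}
COMPLEXITY_PCT_OF_MEDIUM = {"Low": 0.50, "Medium": 1.00, "High": 1.50}

def top_down_estimate(dev_hours_medium, sources_by_complexity):
    """Total estimated hours per complexity band across all sources."""
    per_source_medium = sum(
        pct * dev_hours_medium for pct in PHASE_PCT_OF_DEV.values()
    )
    return {
        band: COMPLEXITY_PCT_OF_MEDIUM[band] * per_source_medium * count
        for band, count in sources_by_complexity.items()
    }

# Sample: 24 medium development hours, and Figure 1's 2/4/3 source mix.
estimate = top_down_estimate(24.0, {"Low": 2, "Medium": 4, "High": 3})
print(estimate)  # {'Low': 54.0, 'Medium': 216.0, 'High': 243.0}
```

Because every phase hangs off the single development number, refining that one input (or the percentages) reprices the whole model, which is what makes the technique fast to iterate with a team.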

Estimating an ETL Project Using a Bottom Up Estimate
When enough data are available to construct a bottom up estimate, this estimate can provide a powerful model that is highly defendable. To start a bottom up ETL estimate, a minimum of two key data elements are required: the number of data attributes required and the number of target structures that exist. Understanding the target data structure is a critical input to ETL estimation, because data modeling is a time-consuming and specialized skill that can have a significant impact on the cost and schedule.

When starting a bottom up ETL estimate, it is important to break up the attributes into logical blocks of information. If a data warehouse is the target, subject areas work best as starting points for segmenting the estimation. A subject area is a logical grouping of data within the warehouse and is a great way to break down the project into smaller chunks that align with how you will deliver the work. Once you have a logical grouping of how the data will be stored, break down the number of attributes into the various groups, noting the percentages of attributes that do not have a target data structure.

    Model Inputs: Number of Attributes by Subject Area
    Target Model                 Number of Data Attributes   Percentage of Unmodeled Attributes
    Subject Area 1                          200                           100%
    Subject Area 2                          400                            25%
    Subject Area 3                          150                           100%
    Subject Area 4                          200                            50%
    Subject Area 5                           50                            10%
    Subject Area 6                           50                           100%
    Subject Area 7                           20                           100%
    Total Number of Attributes             1070

Once you have defined the target data subject areas, attributes, and percentages of data modeled, the time spent per task, per attribute can be estimated. It is important to define all tasks that will be completed during the life cycle of the project. Clearly defining the assumptions around each task is also critical, because consumers of the model will interpret the tasks differently.

In the example shown, there is a calculation that adjusts the modeling hours based on the percentage of attributes that are not modeled, giving more modeling time as the percentage increases. This technique can be used for any task that has a large variance in effort based on an external factor.

    Task                           Hours Per Attribute
    Requirements & Mapping                2
    High Level Design                     0.1
    Technical Design                      0.5
    Data Modeling                         1
    Development & Unit Testing            1
    System Test                           0.5
    User Acceptance Testing               0.25
    Production Support                    0.2
    Tech Lead Support                     0.5
    Project Management Support            0.5
    Product Owner Support                 0.5
    Subject Matter Expert                 0.5
    Data Steward Support                  0.5

To complete the effort, estimate the hours per task that can be multiplied by the total number of attributes to get effort by task. In addition, the tasks can be broken out across the expected project resource role, providing a jump start on how the effort should be scheduled. As with any estimate, I always add a contingency factor at the bottom to account for unforeseen risk.

    Effort Summary
    Role                        Effort (Hours)   Effort (Days)
    Business System Analyst         2140.0           267.5
    Developer                       1886.0           235.8
    Tester                           475.0            59.4
    Tech Lead                        535.0            66.9
    Project Manager                  535.0            66.9
    Product Owner                    652.5            81.6
    Data Steward                     685.0            85.6
    Data Modeler                     610.0            76.3
    Subject Matter Expert            535.0            66.9
    SubTotal                        8053.5          1006.7
    Contingency                      805.4           100.7
    Grand Total                     8858.9          1107.4

Comparing a top down estimate with a bottom up estimate will provide two good data sets that can drive discussion about the quality of the estimate as well as uncover additional assumptions.
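A minimal sketch of the bottom up calculation, assuming a trimmed-down task list: attributes multiplied by hours per attribute for each task, the data modeling task scaled by the unmodeled percentage, and a flat contingency at the end. The rates echo the hours-per-attribute table, but the subject areas and 10% contingency are illustrative, not the article's full worked example.

```python
# Sketch of the bottom up model described above: effort is attributes
# multiplied by hours per attribute for each task, the data modeling task
# is scaled by the percentage of attributes with no target structure yet,
# and a flat contingency is added at the end. The task list is trimmed
# and the subject areas are illustrative.

HOURS_PER_ATTRIBUTE = {
    "Requirements & Mapping": 2.0,
    "Data Modeling": 1.0,           # applied to unmodeled attributes only
    "Development & Unit Testing": 1.0,
    "System Test": 0.5,
}
CONTINGENCY = 0.10  # flat factor for unforeseen risk

def bottom_up_estimate(subject_areas):
    """subject_areas: list of (attribute_count, pct_unmodeled) pairs."""
    total_attrs = sum(count for count, _ in subject_areas)
    unmodeled_attrs = sum(count * pct for count, pct in subject_areas)
    hours = 0.0
    for task, rate in HOURS_PER_ATTRIBUTE.items():
        # modeling effort grows with the share of unmodeled attributes
        basis = unmodeled_attrs if task == "Data Modeling" else total_attrs
        hours += basis * rate
    return hours * (1 + CONTINGENCY)

# Two sample subject areas: 200 attributes fully unmodeled, 400 at 25%.
effort_hours = bottom_up_estimate([(200, 1.00), (400, 0.25)])
print(round(effort_hours, 1))  # 2640.0
```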

Scheduling the Work
Once the effort estimate is complete (regardless of the type), I can start thinking about how much time and how many resources are needed to complete the project. Generally, the requestor of the estimate has an expected delivery date in mind and I know the earliest time we can start the work. With those two data points, I can calculate the number of business days I have to deliver the project and get a rough order of magnitude estimate of the resources required.

The first thing I do is map the phases established in the effort estimate to the various project team roles (BSA, developer, tester, etc.). Once I break down the effort into roles, I can then divide the effort by the number of days available in the project to get the expected number of resources required. In the example below (Figure 2), I shorten the time that the BSA, developer, and tester will work, taking into account that each life cycle phase does not run for the duration of the project. At this stage, I also take into consideration the cost of each resource and add in a contingency factor. This method allows for the ability to adjust the duration of the project without impacting the level of effort needed to complete the work.

Using the techniques described above provides you with the flexibility to easily answer the “what if” questions that always come up when estimating work. By keeping the effort and the schedule separate, you have total control over the model.

Delivering on the Estimate
Once the effort and duration of the project are stabilized, a project planning tool (e.g., Microsoft Project) can be used to dive into the details of the work breakdown structure and further map out the details of the project.

It is important to continue to validate your estimate throughout the project. As you finish each project phase, revisiting the estimate to evaluate assumptions and estimating factors will help make future estimates better, which is especially important if you expect to do additional projects in the same department.

Conclusion
In my experience, bottom up estimates produce the most accurate results, but often the information required to produce such an estimate is not available. The bottom up technique allows the work to be broken down to a very detailed level. To effectively estimate bottom up ETL projects, the granularity needed is typically the number of reports, data elements, data sources, or the metrics required for the project.

When a low level of detail is not available, using a top down technique is the best option. Top down estimates are derived using a qualitative model and are more likely to be skewed based on the experience factor of the person doing the estimate. I find that these estimates are also much more difficult to defend because of their qualitative nature. When doing a top down estimate for a proposal, I like to include additional money in the budget for contingency to cover the unknowns that certainly lie in the unknown details.

There is an argument that a bottom up estimate is no more precise than a top down estimate. The thinking here is that with a lower level of detail, you make smaller estimating errors more often, netting the same result as the large errors made in a top down approach. Although this is a compelling argument (and why I do both estimates when I can), the more granular the estimate you have, the quicker you can identify flaws and make corrections. With a top down estimate, errors take longer to be revealed and are harder to correct.

An estimate is only as good as the data used to start it and the assumptions captured. Providing clear and consistent estimates helps build credibility with business customers and clients, provides a concrete, defendable position on how you plan to deliver against scope, and serves as a constant reminder of the impact of additional scope. No matter how easy or small a project appears to be, always start with an estimate and be prepared for that estimate to need fine tuning as new information becomes available.

    Role                      Effort (Hours)   Effort (Days)   Target Days   # Resources Needed   Rate      Cost
    Business System Analyst       1833.0           229.1            71              3.2          $10.00    $18,330.00
    Developer                     2420.0           302.5            71              4.2          $15.00    $36,300.00
    Tester                        1756.0           219.5            71              3.1          $12.00    $21,072.00
    Tech Lead                      600.9            75.1            86              0.9          $14.00     $8,412.60
    Project Manager                300.5            37.6            86              0.4          $18.00     $5,408.10
    Subject Matter Expert          600.9            75.1            86              0.9          $20.00    $12,018.00
    SubTotal                      7511.3           938.9                           12.7                    $ -
    Contingency                    751.1            93.9            86              1.1          $14.83    $11,141.69
    Grand Total                   8262.4          1032.8                           13.8                    $11,141.69

Figure 2: Effort Summary.
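The effort-to-head-count arithmetic behind Figure 2 can be sketched as follows: convert each role's effort hours to days, divide by the business days that role is available, and multiply hours by a rate for cost. The 8-hour day and the BSA row's numbers (1,833 hours, 71 target days, $10.00 rate) are the sample values from the figure, not recommendations.

```python
# Sketch of the scheduling step shown in Figure 2: divide a role's effort
# by the business days available to get a rough head count, and apply a
# rate to get cost. Inputs mirror the figure's sample BSA row.

HOURS_PER_DAY = 8.0

def resources_needed(effort_hours, target_days):
    """Rough order of magnitude head count for one role."""
    effort_days = effort_hours / HOURS_PER_DAY
    return effort_days / target_days

def role_cost(effort_hours, hourly_rate):
    """Loaded cost of the role's effort at the given rate."""
    return effort_hours * hourly_rate

bsa_heads = resources_needed(1833.0, 71)
bsa_cost = role_cost(1833.0, 10.00)
print(round(bsa_heads, 1), round(bsa_cost, 2))  # 3.2 18330.0
```

Because duration only appears in the divisor, stretching or compressing the target days changes the head count without touching the underlying effort, which is the property the article relies on when answering "what if" questions.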

Glossary of Terms

ETL – Extract, Transform, and Load. A technique used to move data from one database (the source) to another database (the target).

Business Intelligence – A technique used to analyze data to support better business decision making.

Data Integration – The process of combining data from multiple sources to provide end users with a unified view of the data.

Data Steward – The person responsible for maintaining the metadata repository that describes the data within the data warehouse.

Data Warehouse – A repository of data designed to facilitate reporting and business intelligence analysis.

RFP – A request for proposal (RFP) is an early stage in the procurement process, issuing an invitation for suppliers, often through a bidding process, to submit a proposal on a specific commodity or service.

Source – An ETL term used to describe the source system that provides data to the ETL process.

Subject Area – A term used in data warehousing that describes a set of data with a common theme or set of related measurements (e.g., customer, account, or claim).

Target – An ETL term used to describe the database that receives the transformed data.

About the Author
Ben Harden, PMP, is a manager in the Data Management and Business Intelligence practice of the Richmond, Virginia–based consulting firm CapTech. He specializes in the project management and delivery of data integration and business intelligence projects for Fortune 500 organizations. Mr. Harden has successfully managed data-related projects in the health care, financial services, telecommunications, and governmental sectors and can be reached via e-mail at bharden@captechconsulting.com.

PMI Virtual Library | www.PMI.org | © 2010 Ben Harden