
SSIS / ETL Example: Yahoo Equity & Mutual Fund Price History

by Terrance A. Snyder

The Concept
I have been working on generating trailing return history for portfolios composed of various financial products, from annuities to equities to mutual funds (open-ended, ETF, closed-ended, etc). The problems usually come down to where to get this information and how to store it, and that's where this concept comes into play. The core focus here is to get and effectively store the entire range of historical prices (volume, open, close) for a rather large set of equities and mutual funds. To do this we need a feed of data. Where I work this doesn't exactly exist, so I did a quick scan of the internet and found the Yahoo price history feed.

Yahoo Price History Feed

The Yahoo price history feed takes in a symbol and the date range you would like to get back, and returns a nicely formatted CSV download that can be used to extract the price history. An example of this URL is listed below.

http://ichart.finance.yahoo.com/table.csv?s={symbol}&a={startMM}&b={startDD}&c={startYYYY}&d={endMM}&e={endDD}&f={endYYYY}&g={res}&ignore=.csv

* symbol = The ticker symbol to look up (GOOG, for example)
* startMM = The start month with zero-based index (00 = Jan, 01 = Feb, 02 = Mar ... 11 = Dec)
* startDD = The start day padded to two digits (01, 02, 31)
* startYYYY = The start year (2009, 2008, etc)
* endMM = Usually today's month with zero-based index (00 = Jan, 01 = Feb, 02 = Mar ... 11 = Dec)
* endDD = Usually today's day padded to two digits (01, 02, 31)
* endYYYY = Usually today's year (2009, 2008, etc)
* res = The resolution you want returned: d for daily, w for weekly, m for monthly, or y for yearly

Example: Download GOOG
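As a rough illustration of how the parameters above fit together, here is a minimal Python sketch that only builds the URL string (it does not call the feed). The function name `build_price_history_url` is made up for this example; the one real trap it shows is the zero-based month.

```python
from datetime import date
from urllib.parse import urlencode

# Base address taken from the example URL in this post.
BASE_URL = "http://ichart.finance.yahoo.com/table.csv"


def build_price_history_url(symbol, start, end, res="d"):
    """Build the price history URL for one symbol (hypothetical helper)."""
    params = [
        ("s", symbol),
        ("a", f"{start.month - 1:02d}"),  # month is zero-based: 00 = Jan
        ("b", f"{start.day:02d}"),        # day padded to two digits
        ("c", str(start.year)),
        ("d", f"{end.month - 1:02d}"),    # zero-based again for the end month
        ("e", f"{end.day:02d}"),
        ("f", str(end.year)),
        ("g", res),                       # resolution, e.g. d, w or m
        ("ignore", ".csv"),
    ]
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_price_history_url("GOOG", date(2009, 1, 2), date(2009, 9, 9)))
```

Note that January comes out as a=00 and September as d=08, which is easy to get wrong when building the URL by hand.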

The Database Implementation

Product Dimension Table

To perform our daily download of feeds we need a couple of things. The first is a table that lists all the symbols that can and should be priced every day. Currently I read from a quick table like this:

CREATE TABLE [dbo].[product_dim](
    [product_id] [int] IDENTITY(1,1) NOT NULL,
    [cusip] [varchar](20) NULL,
    [symbol] [varchar](8) NULL,
    [isin] [varchar](20) NULL,
    CONSTRAINT [pk_product_dim] PRIMARY KEY CLUSTERED
    (
        [product_id] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

Granted, this is a very, very basic product dimension table. You should certainly have more information than this in your data warehouse for a financial product. Now that we have a table to read the symbols from, we need a place to store the results of calling Yahoo.

SQL Table to Store Price History

This is the table we will be inserting into; again, it is basic. When we get to performance we'll talk about partitioning the table based on the date id.

CREATE TABLE [dbo].[product_price_fact](
    [product_id] [int] NOT NULL,
    [price_dt_id] [int] NOT NULL, -- simple date id like 20090901
    [price_open] [decimal](14, 2) NULL,
    [price_low] [decimal](14, 2) NULL,
    [price_high] [decimal](14, 2) NULL,
    [price_close] [decimal](14, 2) NULL,
    [price_adj_close] [decimal](14, 2) NULL,
    [price_volume] [bigint] NULL,
    CONSTRAINT [pk_product_price_fact] PRIMARY KEY CLUSTERED
    (
        [product_id] ASC,
        [price_dt_id] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
)
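The price_dt_id column stores a date as a plain integer such as 20090901. A small Python sketch of that encoding, with hypothetical helper names `to_date_id` and `from_date_id`, might look like this:

```python
from datetime import date


def to_date_id(d):
    """Encode a date as a YYYYMMDD integer, e.g. 2009-09-01 -> 20090901."""
    return d.year * 10000 + d.month * 100 + d.day


def from_date_id(date_id):
    """Decode a YYYYMMDD integer back into a date."""
    year, rest = divmod(date_id, 10000)
    month, day = divmod(rest, 100)
    return date(year, month, day)


if __name__ == "__main__":
    print(to_date_id(date(2009, 9, 1)))
    print(from_date_id(20090901))
```

Because the integer sorts in the same order as the date it encodes, taking MAX(price_dt_id) for a product gives its most recent priced day.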

The SSIS Package

The SSIS package is a 2008 SSIS package built using VS2008. It has been uploaded to my SkyDrive account. An overview is as follows.

SSIS Yahoo Package

The following components make up this SSIS package.

Product DIM Cache

This data flow grabs all the products that need to be priced, along with the date they were last priced.

Download History

A C# SSIS script that downloads the price history for each symbol and CUSIP, using the last price date as the start and today's date as the end. It also augments the CSV file it writes to include the product id from the product_dim table as well as the CUSIP used to identify the symbol. This makes our SSIS package more performant when doing the inserts. It relies on a stored procedure to get all products and the date of their last price history update.

-- ================================================
-- Template generated from Template Explorer using:
-- Create Procedure (New Menu).SQL
--
-- Use the Specify Values for Template Parameters
-- command (Ctrl-Shift-M) to fill in the parameter
-- values below.
--
-- This block of comments will not be included in
-- the definition of the procedure.
-- ================================================
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
-- =============================================
-- Author: Terrance A. Snyder
-- Create date: 2009/09/09
-- Description: gets product price history with the last price date.
-- =============================================
CREATE PROCEDURE SSIS_GetProductPriceHistory
AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;

    select
        product_dim.product_id
        , product_dim.cusip
        , product_dim.symbol
        , MAX(product_price_fact.price_dt_id) [last_price_dt_id]
    from product_dim with (nolock)
    left join product_price_fact with (nolock) on
        product_price_fact.product_id = product_dim.product_id
    where
        product_dim.cusip is not null and product_dim.symbol is not null
    group by
        product_dim.product_id, product_dim.cusip, product_dim.symbol
    order by MAX(product_price_fact.price_dt_id) desc

END
GO
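The actual Download History step is a C# script inside the package, which is not reproduced here. As a rough sketch of the "augment the CSV" idea only, the Python below prepends the product id and CUSIP to every row of a downloaded file. The function name `augment_price_csv`, the sample header, and the sample values are all assumptions made for illustration.

```python
import csv
import io


def augment_price_csv(csv_text, product_id, cusip):
    """Prepend product_id and cusip columns to downloaded CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")

    header = next(reader, None)
    if header is None:
        return ""  # nothing was downloaded, so nothing to augment

    writer.writerow(["product_id", "cusip"] + header)
    for row in reader:
        if row:  # skip blank lines
            writer.writerow([product_id, cusip] + row)
    return out.getvalue()


if __name__ == "__main__":
    # Assumed header layout and made-up values, for illustration only.
    sample = (
        "Date,Open,High,Low,Close,Volume,Adj Close\n"
        "2009-09-09,1.00,2.00,0.50,1.50,100,1.50\n"
    )
    print(augment_price_csv(sample, 7, "000000000"))
```

Carrying the keys in the file means the insert step does not have to look the product up again for every row, which is the performance point made above.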

For-Each Enumerator

This is a file for-each enumerator over every *.csv file found in the specified output directory. It walks the directory and runs the Update data flow task for each file. Once the Update data flow task completes, it deletes the CSV file.

Update Data Flow Task

This is the heart and soul of the package, as it contains most of the SSIS logic. We take each CSV file, clean it up to ensure data integrity, and insert only those rows which are new (in case we are working on old files or got terminated mid-stream).
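In the package this logic lives in SSIS components, not in code. Purely as a sketch of the control flow (loop over files, keep only new rows, delete each file when done), a Python equivalent could look like the following. The function name `load_new_rows`, the `product_id` and `Date` column names, and the YYYY-MM-DD date format are assumptions for this example.

```python
import csv
from pathlib import Path


def load_new_rows(output_dir, existing_keys):
    """Collect rows whose (product_id, date id) key is not already stored.

    existing_keys is a set of (product_id, price_dt_id) tuples and is
    updated in place, so re-running over an old file adds nothing twice.
    """
    new_rows = []
    for csv_path in sorted(Path(output_dir).glob("*.csv")):
        with csv_path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                # Assumes dates look like 2009-09-09, giving 20090909.
                key = (int(row["product_id"]), int(row["Date"].replace("-", "")))
                if key not in existing_keys:
                    existing_keys.add(key)
                    new_rows.append(row)
        csv_path.unlink()  # delete the file once it has been processed
    return new_rows
```

Checking every row against the existing keys is what makes the load safe to re-run after an interrupted job.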

Performance Notes
There are two performance improvements you can make here. The first is to get the product_price_fact table into a partitioned view; I cover this in the SSIS Tips/Tricks post from before. The second is in the SSIS package itself: rather than using the for-each enumerator, we could use a MULTIFLATFILE connection. This makes SSIS do a bulk operation against all files; however, my current machine lacks the RAM for this. Please note that if you download the entire history the first time you will get roughly 4GB of data or more, depending on how many symbols you are quoting.