New York Business Intelligence Message Board The New York Business Intelligence Meetup Discussion Forum › Design patterns for ETL mappings

Design patterns for ETL mappings

Sudeep Amin
Posted Mar 5, 2009 1:29 PM
user 7048142
New York, NY
Post #: 1
Send an Email Post a Greeting
I am just putting this out there to see what people have done to streamline/standardize their ETL processes. I would be interested to know if people have been thinking around design patterns for ETL mappings and if yes, what kind of design patterns people have adopted for their ETL mappings?

Based on what I have seen in the past things that I want to avoid
- Multiple mappings to load one target table. I believe, from a maintenance standpoint 1 mapping per source table is ideal. In some cases there maybe efficiencies in loading 2 or 3 tables that get data from the same source tables. But in Datawarehouse applications typically it is the other way round as the exercise of denormailzation results in consolidation. Clubbing multiple targets in 1 mapping makes maintenance difficult and changes related to one target could inadvertently result in errors in some other part of the mapping.
- Different mappings for incremental and full loads. My philosophy is that if you need a full load then there is a problem with your incremental loads and usually a full load can mask issues with the incremental loads. In most cases the desirable outcome is a mapping that does both incremental and full load. Full load being classified as a incremental from 'beginning of time'.But that brings up the next point ...
- How do you handle an incremental strategy that is independant of source data anomalies i.e. lack of reliable last update date/time or indication of what changed via triggers or separate tables holding change information. If the incremental strategy relied on a last updated timestamp then things could be relatively simple. But as we all know this is never true in the real world. Either tables do not have a timestamp but if they do they are not reliable. Bad code sometimes can cause the timestamp to be update only when certain columns are changed but not others. Another strategy could be a full compare i.e. compare the target to the source every time. This has the advantage of being 'self-correcting'. If your target is truncated the next load will be a full load. But this method can be prohibitive with large datasets. And if you want to load data near-real time this may be the least desirable option.

Other list of desired items are long, but that could be for another discussion or down this thread if it gets interesting!

Discussions welcome...
Powered by mvnForum

Offer a perk for our members and get exposure.

Offer a perk →
Other nearby
Meetups
Why these groups?
x

The Meetup Groups shown here are topically similar to New York Business Intelligence.

Groups are more likely to be displayed here if they:

  • have a Meetup scheduled
  • have a high rating
  • have a group photo
  • are "public" and not "private"
  • have shown they are likely to stick around (older than 30 days)
Find more Meetup Groups
near New York

Log in

  • Not registered with us yet?
or

Log in to Meetup with your Facebook account.

Sign up

or

Join this Meetup Group even quicker with your Facebook account.

By clicking the "Sign up using Facebook" or "Sign up" buttons above, you agree to Meetup's Terms of Service