The last months I have been looking into a customer’s ETL process. They have a Siebel OLTP and Analytics (now OBI 10) system (Data warehouse) and every night an ETL process runs in order to maintain the Data warehouse. When I arrived at this customer, the ETL was not a big problem: it ran within the window although at first sight it did look like there were some inefficiencies. But since it was not the highest priority performance problem, I did not watch it very closely. In a Siebel environment, there are always several performance issues to look at, so I got busy enough. However, at one time the volume of the data in this fast growing system became a problem for the ETL: with more frequency, the ETL ran into the morning. Especially when some not foreseen large updates or inserts were done on OLTP and the statistics on the ETL”Image” tables (the tables that capture the records changed and to be sent to the Data Warehouse) did not reflect the real situation. To be short, there were/are several problems with the daily ETL, and this just my point of view:
- Statistics on the entire database were gathered every weekend, even on very volatile tables like the ETL Image tables and the Data warehouse Staging tables. This lead to problems once and a while with higher volume inserts/updates/deletes on the source system
- Every night there were/are some aggregate tables that have their indexes dropped, are completely TRUNCATEd, are refilled with the entire dataset again and have the indexes created on them again. That sounded to me like a resource waste.
- Everything is done in the ETL tool, but this also means that some workload that would fit perfectly on the database server alone, is done with the tool as not more than a throughput system with hardly any added functionality except for ‘visibility’. What I mean by this I will show later on. I do not mean that an ETL tool in itself is a bad tool!
For the first problem, a colleague gave the advice to delete and lock the statistics on those volatile tables and rely on Dynamic Sampling. And indeed, that sounded like the best approach for such volatile tables. We tested this a few weeks on Pre-production systems and we never ran into a problem again with that. So that was moved into Production also. Dynamic sampling is a great addition to the Oracle database. It is not intended for every table and query, because that would be too time consuming and CPU intensive. But for situations where the time spent parsing is allowed to be a little higher in favor of the execution plan calculation being much more accurate (that is: for some less frequently and somewhat longer running queries) it is alright. So you will not really want to use it in a typical OLTP application, but for parts of the system that are more batch oriented (if you have the time, you may also be able to run a DBMS_STATS estimate before executing the queries). Note (added 19-02-2009) : I came across a beautifull paper written by Karen Morton about statistics. It also covers a chapter on dynamic sampling with its additional feature to handle predicate correlations and the optimizer’s unawareness about it. Thanks Karen!
The second problem was something I recommended a kind of fast refresh mechanism for. And since FAST refreshable materialized views are a built-in feature of the Oracle RDBMS, I looked into that first. I think it is the best approach to start looking at what the database already provides, instead of developing your own mechanisms and algorithms. Only when something is really unavailable in the database itself, I start looking at the best way to do it myself.
The problem with the daily DROP (all) INDEX, TRUNCATE TABLE, SELECT FROM <source>, INSERT INTO <target table>, CREATE (all) INDEX is:
- Even if there are no updates/insert/deletes at all or only a few, the entire process is executed
- The process gets heavier and longer as data volume grows, not as changes grow: the elapsed time is a function of entire data volume, not of changes each day
- Downtime: you cannot run reports on a TRUNCATEd table or you can get bad performance when indexes have not been created yet. So anyone running reports as long as the ETL is not finished, is vulnerable to no results or bad performance, leading to frustration
- These very resource intensive operations mean less resources for other useful workload
- It is done using the ETL tool: one process reads the source data and passes it to another process that does the (Direct Path) Insert. It does not add real value to it, so why would we use the ETL tool to pull and push large volumes of data over the network: this adds much time to response times and is also hurting other processes
- And more….
The reason to use an ETL tool for everything is in my opinion more or less one of convenience and visibility: there is one place to look at for definitions of the ETL processes. It is considered more visible than having PL/SQL code or database functionality hidden in the database. I can understand that, still I think at the end it does not hold as an argument when it comes to performance. In my general opinion: far too much work is done in client and middle tiers and the database is often not used for the things it is good at. But I am very biased: I am a database guy (who also has done lots of development though, so I hope to be able to be somewhat objective).
There were two obviously long running workflows: one was taking about 8 hours in total, one was taking 5-6 hours. The one taking 8 hours was a FULL reload of three aggregation tables: an aggregation per month, one per trimester and one per year. So these each contained a GROUP BY and all were only dependent on the same fact table. The 5-6 hour workflow was a sequence of table reloads, some with GROUP BY SQL, some only containing expensive joins. The picture shows the flow needed and the approximate time spent in each step for the 5-6 hour workflow. Step 1 (Aggregate of Domain 1) and 2 (Helper Tables 1-4) tables were also TRUNCATED every night.
Rewriting these workflows to two FAST refreshable materialized views workflows was not even a very difficult task. Yes, I encountered some challenges, but they were solvable:
- some constructs in the source query had to be rewritten or replaced by an expression maintained at the fact table level
- some columns had to be eliminated, to be able to even create the materialized view because it contained too many columns in the GROUP BY: in a materialized view based on a GROUP BY query, Oracle creates a UNIQUE index on the group by columns, called something like I_SNAP$_<table_name>. Well, you cannot have more than 32 columns in an index…luckily there were some columns that could easily be removed or rewritten to a MAX(<column_name>) aggregate function.
- I thought I would be smart to have nested materialized views for the first 8 hour workflow: the Monthly MV that is based on the Fact table, then the Trimester MVbased on the Monthly MV and the Year MV based on the Trimester MV. That works, but at least in our case the refresh times were dramatic: as soon as a DELETE is done against the parent table (the Monthly MV in case of the Trimester MV), Oracle decided to delete all records in the partition this record was in and refresh the entire partition. I was not able to find out why. Soon I decided to base all three on the Fact table directly. That makes it even more parallel enabled: the refresh of all three can be done in parallel. It only meant I had to ask for two additional columns in the FACT table because the value expressing the Trimester and Year was an expression and this prohibited a FAST refresh after Delete/Update. You may wonder: will there be no deletes when everything is based on the Fact table alone? Indeed: we know that only Inserts and Updates are taking place on the fact table. But with a GROUP BY materialized view, deletes will happen so if another MV is based on that one, it well have to manage deletes. And that is something the Trimester MV was very bad at in that initial setup. Now that is is based on the Fact table, there are no issues with any deletes. One day I should try to figure out why those deletes were processed so inefficiently.
- For the 6 hour workflow shown in the picture I had to replace the Domain 1 and Helper 1-4 tables with FAST refreshable Materialized views as well, even though their “Full” refresh times were not very long: but you cannot base a FAST REFRESHABLE MV on tables that get TRUNCATED.
The first workflow was put into production when the ETL almost every day finished after 09:00 in the morning. After going into production, the average end time was around 05:30-06:00! Instead of 8 hours continuous workload every night, there was less than 30 minutes workload left. And most of the times even just 10 minutes. And that is mostly because a GROUP BY materialized view does not refresh that fast.
By then the last running workflow was that second one. And yesterday this was put into User Acceptance Test. It was a challenge to create all the necessary scripts that would work in every situation, for every table owner (in Development, Integration Testing and (Pre)Production the table owners are different) and could be executed by the offshore team, but the scripts executed rather well. There was one minor issue with partitions that I overlooked but after solving that, the scripts ran just fine. And my colleagues that needed to adjust the Loader mappings and workflows also did a good job. And the first result was very fine: instead of 6 hours of workload for the final aggregate table, the materialized view that replaced it was refreshed in…10 seconds (it refreshes faster than the GROUP BY MV’s because it is just a simple join result). Now these 10 seconds were just for a few updates/inserts, but tests on Development showed that even more than 100.000 inserts and updates on the Fact table made the refresh go in about 10 minutes at most. Now compare that to 6 hours and also see that the elapsed time is dependent on the amount of changes: 10 seconds for a few changes, 10 minutes for a daily average of changes. I know what I prefer…and it is not the fixed and ever increasing 5-6 hours, which is now available for other ETLworkload. Yesterday we also thought of a way to make sure that even in cases where the number of updates get extremely high and a FAST refresh is worse than a COMPLETE refresh: since the ETL tool calls a procedure to refresh our materialized view, I have all the information I need to determine if a COMPLETE refresh will be best or the FAST refresh: I can take a look in the materialized view log to see the outstanding number of changes to apply…with this in place we can be certain the refresh will never take more than say: 2 hours in the very worst case.
To me , the above reflects some of the simple guidelines I try to adhere to:
- Eliminate unnecessary workload from the system
- If something can be done entirely in the database, do it in the database. There is no need for records to flow to and from a client application (over the network) if it does not add value
- Use bulk processing as much as possible instead of row-by-row (also called slow-by-slow by many people)