Wed 3 Mar 2010
ETL Best Practices: Data-Flows vs. T-SQL-Statements
Posted by Ralf under IT, SQL Server
No Comments
Often ETL design is realized using set-based data processing (SQL-Statements). Especially developers with a database professional background are induced to extract, load and transform (ETL) data using Stored Procedures. On the other side there are advantages of record-based data processing.
The following slides give an overview and recommendation about usage scenarios. There are some tips which should be considered when comparing set-based operations and record-based operations.
T-SQL-Statements: Advantages
- SQL will outperform data-flows (if data is transferred table-to-table on the same server)
- Using transactions is simple
- SQL (Stored Procedures) can be debugged
- Set-based operation vs. record-based operation (seems to be more performing)
- SQL tools support statement generation
- Stored Procedures are supported by internal Optimization Engine (execution plan)
- DB experts can design data processing tasks
- Optimized algorithm for Sorting / Grouping statements (e.g. GROUP BY, DISTINCT, ORDER BY)
-
Data from heterogeneous sources (e.g. non-SQL-sources)
-
Data transfer machine-to-machine
-
Data transforms are flexible, transparent
(SSIS: buffers for heavy operations / T-SQL: temporary tables, cursors) -
Broad transformation functionality (fuzzy logic, aggregations, derived columns, scripts)
-
Easier handling of „bad“ data (different data sinks, exception handling)
-
.Net Data providers as alternative to OLE DB Providers (e.g. mySAP .Net Data Provider)
-
Visual representation of work flow / data flow à better maintenance, understanding of data-flows
(since they are a bit self-documenting) -
Parallel execution / step-by-step execution (Checkpoints) of data-flows
-
Easier debugging using events („component <component name> (xxxx) wrote xxxx rows“)
-
Easier handling of different data types (conversion, …)
-
Use bulk updates instead of single statement updates
-
OLE DB Command with parametrized query usually slow (no bulk operation)
-
Load changed records to staging table and perform set-based update with a stored procedure
-
Prefer usage of set-based updates (stored procedures) over OLE DB Command
Load only changed rows: “Delta Detection”
-
If possible extract and load only a delta set
Eliminating Unnecessary Work
-
Data columns and rows need space within the buffer (performance)
-
Only extract and process needed data
-
Use optimized SQL Statements for data extraction (up-to-date statistics required)
Job dependencies
-
Dependencies among ETL jobs (e.g. job “B” cannot start while job “A” is not finished)
-
Master packages should be used with maximum degree of parallelism
-
Implement parallel execution (MaxEngineThreads, MaxConcurrentExecutables)
Asynchronous transformations (bottlenecks)
-
slowest parts within ETL process
-
Asynchronous operations (block data flow) vs. synchronous operations (row by row basis)
-
E.g. Sort Transformation and Aggregate Transformation (ok for small data sets)
-
Blocking transformations should be avoided wherever possible
-
Merge Join: delivered data should be pre-sorted (by stored procedures or select queries)
-
Prefer stored procedures to inline selects and ensure that join keys are indexed
(then internal DB Optimizing Engine is used)
Slow database operations
-
Reason: concurrency, integrity checks, indexes, triggers
-
Data processing outside database (better performance)
-
Disable integrity checking (disable constraint) before load in target database
-
Disable DB triggers before load in target database
-
Drop indexes before load and recreate them after (e.g. SQL: drop index / create index)
-
Parallel bulk operations (if table is partitioned or there are no indexes)
Note: parallel load attempts into same table/partition can cause locks (on data rows or indexes)
Cross-DB Joins
-
Slow down ETL extracts considerably
-
Use staging tables (schemas should be used to separate staging areas)
Transaction Protocol
-
Disable Transaction Protocol (not convenient for OLAP)
Using SQL Statement to retrieve data
-
Avoid using table or view access mode in OLE DB Source
(otherwise a row set is opened for both to retrieve column meta data and the rows) -
Use SQL Statement: 10times faster(SQL Statements use internal DB Optimizer, but SSIS cannot optimize statements)
Heavy data load operations
-
Use staging tables and SSIS Checkpoints (only available for Control Flow)
-
Avoid buffer swapping (slow) to disk (use performance counters to monitor)
-
Reduce columns covered by Lookup transformation to the minimal possible set (natural key and surrogate key)
-
Use Merge Join in preference to Lookup transformation when possible
Auditing and Logging
-
Use event log for exception and data quality handling
-
Use validations (e.g. no. of extracts equals no. of inserts)
Conclusion
-
Generally use record-based operations (ETL-Tools)
-
Intelligent usage of set-based operations (e.g. for bulk operations: table-to-table)
-
No data transformations within set-based-operations
-
Sorting / Aggregation statements should be done in set-based operations (at source extraction)
-
Staging tables should be used
-
Asynchronous operations (Sorting, Aggregating) should be avoided when possible
-
Consequently eliminate unnecessary work
-
Only extract needed data from sources
-
Try to pre-sort Merge Join input flows
Slides (Download)






















