June 24, 2005

Performance characteristics of Integration Services data sorting in SQL Server 2005

Abstract
The author analyzes performance characteristics of Integration Services built-in sorting component and suggests some guidelines regarding conditions when the use of the built-in sorting component is or is not advisable.

Introduction
As a long term enthusiast of SQL Server, I was excited to test drive the pre-release build of Microsoft SQL Server 2005. This release includes a major break through in data integration capabilities in form of Integration Services component. Integration Services provides an integrated work flow and data flow environment and a set of components that allow the user to graphically define some very sophisticated data processing processes.

One of the most basic data processing tasks is sorting. Because of Integration Services' modular nature, the task of sorting a set of records in a file is handled in three steps. First, the user adds the Flat File Source component to the data flow. It is configured and plugged in to the Sort component. Finally, the Sort component is plugged in to the Flat File Destination component.

The source component reads rows from the disk file and loads them into the data processing pipeline. The sorting component actually sorts the rows and the destination component writes the sorted rows back to disk file.

Measurement Method
Sorting performance was measured on 1 GB, 2 GB, 3 GB, 5 GB, 10 GB and 30 GB files generated using PennySort dataset generator program. These files are composed of 100-byte rows with 10-byte sort key. For timing, the value reported by DTEXEC utility upon completion of the package was used. Dataset generator program and Integration Services packages necessary to reproduce these results are available upon request.

The system used for these measurements was 2P Intel Xeon 3.06 GHz 2 GB RAM with IDE drives with Hyper-Threading enabled. The cost of the system was $4000. For comparison of results with PennySort records, this systems' 3 year lifetime (94,608,000 seconds) divided by 400,000 pennies yields a budget of 236 seconds of use for the cost of a penny. Therefore, if we wished to run PennySort Daytona benchmark using Integration Services, we would measure the number of gigabytes sorted in 236 seconds. In 2005, PennySort Daytona record stands at 16 GB in time of 996 seconds on 1P system costing $950.

Results using Integration Services Sort

1 GB 170 sec
2 GB 5336 sec
3 GB 29400 sec

Analysis and Conclusion
The built-in sorting component of Integration Services is a good choice for internal sorting workloads, in other words, datasets that are small enough to fit entirely in the available physical memory. Performance deteriorates quickly and non-linearly when the dataset exceeds the size of physical memory. The degradation is spectacular, as tripling the input size from 1 GB to 3 GB causes the sort to take 175 times longer. Therefore, usage of the built-in sorting component of Integration Services is not recommended when there is any risk of dataset exceeding the size of physical memory. It appears the algorithm used has some real problems under those conditions. When using the built-in sorting component, special care must be taken to absolutely ensure that the dataset will not under any conditions grow larger than physical memory because if that occurs, the resulting exponential increase in the runtime might collapse your ETL process schedule.

Having obtained these surprising results, I got interested in researching Integration Services extensibility model to see whether a more suitable external sorting algorithm could be introduced into the data processing pipeline. In fact, the answer was yes. Integration Services provides all necessary hooks to replace the built-in sorting component with a user-defined component. I developed the ExaSort component that achieves much better scaling over larger datasets. The results using the same packages as above but using the ExtraSort component were as follows:

1 GB 51 sec
2 GB 105 sec
3 GB 180 sec
5 GB 370 sec
10 GB 720 sec
15 GB 1090 sec
30 GB 2850 sec

In conclusion, Integration Services provides an excellent data processing environment and any shortfalls in the components that are bundled with the product can be rectified using pluggable third-party components.
Posted 20 years, 5 months ago on June 24, 2005
The trackback url for this post is http://blog.ivolva.com/bblog/trackback.php/3/

Comments have now been turned off for this post