We have this database which is about 300 gig. A substantial portion of this data is logging data, and a substantial portion of the logging data is simply heartbeat records. Why on earth the home page logs to the database is beyond me, but moving on…
There are roughly 280 million records in one table in particular that need to be moved to an archiving database on the same server. Usually, my approach is to BCP out the data that I want to keep and BULK INSERT it back into a new table, but I decided to try a new approach based upon Henk van der Valk’s article here.
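For reference, the usual BCP/BULK INSERT round trip looks roughly like this; the server, database, table, and file names are placeholders, not the actual objects from this database:

```sql
-- Run from a command prompt: export the rows to keep in native format.
-- (ServerName, LogDB, dbo.LogTable, and the file path are all hypothetical.)
-- bcp "SELECT * FROM LogDB.dbo.LogTable WHERE KeepFlag = 1" queryout C:\temp\keep.dat -n -S ServerName -T

-- Then, in T-SQL, load the file into the archive table.
-- TABLOCK helps qualify the load for minimal logging.
BULK INSERT ArchiveDB.dbo.LogTable
FROM 'C:\temp\keep.dat'
WITH (DATAFILETYPE = 'native', TABLOCK);
```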
This approach uses multiple streams in parallel in SSIS to SELECT the data out into the new table. This is accomplished by using the MODULO operator against an IDENTITY field in the table, partitioning the streams based upon the remainder when the IDENTITY value is divided by the number of streams. Here’s an example:
CREATE TABLE dbo.TestModulo(
    ID INT IDENTITY(1,1),
    VAL VARCHAR(10)
)
GO

INSERT INTO dbo.TestModulo(VAL)
SELECT 'Test'
GO 100

SELECT TOP 5 * FROM dbo.TestModulo WHERE (ID % 4) = 3 OPTION (MAXDOP 1)
SELECT TOP 5 * FROM dbo.TestModulo WHERE (ID % 4) = 2 OPTION (MAXDOP 1)
SELECT TOP 5 * FROM dbo.TestModulo WHERE (ID % 4) = 1 OPTION (MAXDOP 1)
SELECT TOP 5 * FROM dbo.TestModulo WHERE (ID % 4) = 0 OPTION (MAXDOP 1)
As you can see in the image below, this partitions the table up into disjoint sets based upon the remainder of the ID field divided by the modulus (4 in this example), so every row lands in exactly one stream.
So, in SSIS I created a Data Flow Task containing 4 OLE DB Sources. Each source holds one of the SQL SELECT statements that use the MODULO operation to divide up the sets of data. I then dragged the output of all 4 to a UNION ALL transformation, and the output of the UNION ALL goes to the SQL Server Destination. Of note, you can only use the SQL Server Destination if the destination table is on the local server; if not, you must use an OLE DB Destination.
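Applied to the archive move, each of the four OLE DB Sources gets a query along these lines; the table and column names here are illustrative, the real table just needs an IDENTITY column to partition on:

```sql
-- Source 1 of 4: each stream reads a disjoint quarter of the rows.
-- OPTION (MAXDOP 1) keeps each query on a single scheduler, so the
-- parallelism comes from the four SSIS streams rather than from
-- SQL Server query parallelism.
SELECT ID, LogDate, Message       -- illustrative columns
FROM dbo.LogTable                 -- hypothetical source table
WHERE (ID % 4) = 0
OPTION (MAXDOP 1);

-- The remaining three sources use (ID % 4) = 1, 2, and 3.
```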
The result of this was moving 227 million rows in about an hour. Not bad! When doing this via BCP/BULK INSERT, the BCP out alone took over an hour, and the import took another 45 minutes on top of that, so this approach essentially cut the total loading time by about half!
Another thing I did to speed up the transformation was to increase the network packet size on the connections from 4,096 bytes (the SQL Server default) to 32,767 bytes, the maximum (also mentioned in Henk van der Valk’s excellent post).
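The packet size is set on the SSIS connection manager, or equivalently as a connection string property; the server and database names below are placeholders:

```sql
-- In the SSIS connection manager (or the connection string), e.g.:
--   Data Source=MyServer;Initial Catalog=ArchiveDB;Packet Size=32767;...
-- Afterwards, you can confirm the negotiated size per connection:
SELECT session_id, net_packet_size
FROM sys.dm_exec_connections;
```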