The purpose of this section is to provide an understanding of the process of loading large amounts of data. The examples used will be of a prototype using xPonent data.
Process Summary
1. Build base tables that have the same characteristics of columns and data types. The FILEGROUPs must be the same as the ending partitions. There should be one work table for each of partitions needed for the final table. No primary keys, foreign keys or indexes should be applied on the work tables before loading. Partitioning is also not applied at this time.
2. Load separate tables based on partition values. Loads can be run simultaneously.
3. Apply the Primary Key to each individual table. Validate data loaded meets the partition specification by issuing a Check Constraint for the range of values. These processes can also be run simultaneously.
4. Merge tables into one partitioned table using the PARTITION SWITCH.
5. Rename the partitioned work table to the Production table name.
6. Add additional indexing, check constraints, foreign keys, etc.
Large Data Load Process Diagram
xPonent Prototype Description
The xPonent Prototype was designed to determine how fast large amounts of data could be imported into SQL Server. The prototype used 100 million rows as a sample. Three work tables were created representing three partitions for three separate years of data. The final result of the entire process is a fact table (with three partitions), Primary and Foreign Keys and other constraints defined and the appropriate indexing included.
SSIS Package Data Flow
Timings
Below is a breakdown on the timings for loading the data an ancillary steps performed in the prototype.
1. 19 mins – Loading 100 million rows into 3 work tables.
2. 10 mins – Create Primary Key, Clustered (non partitioned) indexes on work tables
3. 3 mins – Constraint Checking on each work tables (verify partition ranges)
4. 1 min – Switch Partitions
5. 23 mins – Add other indexes to the combined partitions (now one table)
6. 3 mins – Add foreign keys
So, 19 mins to load three separate tables plus 40 mins for the other parts gives us about 1 hour to load 100,000,000 rows. We may be able to cut down the time of index creation as these were created serially with the work tables. We may also be able to create all the indexes in parallel on the work tables.
Other performance timing enhancements can be performed as referenced in these write-ups:
http://msdn.microsoft.com/en-us/library/dd537533(SQL.100).aspx
http://msdn.microsoft.com/en-us/library/dd425070.aspx
Another consideration for performance in running the import was the location of the file to be imported. This has to be on a server in the same domain with a fast network connection between servers. Or, better yet, on the same box as the SSIS package being run.
Work Tables
The DDL for each the Work tables is as follows (changing only the year imbedded in the name of the table and the FILEGROUP name):
CREATE TABLE [ETL].[wkRxVolumeByWeek_2009](
[ftRxVolumeByWeekKey] [uniqueidentifier] NOT NULL,
[quarterStartDt] [datetime] NOT NULL,
[dmPlanTrakPayerKey] [uniqueidentifier] NULL,
[dmPlanTrakProductKey] [uniqueidentifier] NULL,
[dmPrescriberKey] [uniqueidentifier] NULL,
[dmWeekKey] [uniqueidentifier] NULL,
[newRxAmt] [decimal](19, 6) NULL,
[newTMUQty] [decimal](19, 6) NULL,
[newUnitQty] [decimal](19, 6) NULL,
[totalRxAmt] [decimal](19, 6) NULL,
[totalTMUQty] [decimal](19, 6) NULL,
[totalUnitQty] [decimal](19, 6) NULL,
[rowCreatedTS] [datetime] NOT NULL,
[rowCreatedID] [varchar](80) NOT NULL,
[rowAlteredTS] [datetime] NOT NULL,
[rowAlteredID] [varchar](80) NOT NULL
) ON [FG_PTDATA2009]
Data Distribution
Below is a breakdown on the distribution of the data being loaded for the prototype (by year):
Year RowCnt
2008 16,347,744
2009 50,002,696
2010 33,649,560
Combined Partitioned Table
The DDL for the resulting partition table (before renaming) is as follows:
CREATE TABLE [ETL].[wkRxVolumeByWeek](
[ftRxVolumeByWeekKey] [uniqueidentifier] NOT NULL,
[quarterStartDt] [datetime] NOT NULL,
[dmPlanTrakPayerKey] [uniqueidentifier] NULL,
[dmPlanTrakProductKey] [uniqueidentifier] NULL,
[dmPrescriberKey] [uniqueidentifier] NULL,
[dmWeekKey] [uniqueidentifier] NULL,
[newRxAmt] [decimal](19, 6) NULL,
[newTMUQty] [decimal](19, 6) NULL,
[newUnitQty] [decimal](19, 6) NULL,
[totalRxAmt] [decimal](19, 6) NULL,
[totalTMUQty] [decimal](19, 6) NULL,
[totalUnitQty] [decimal](19, 6) NULL,
[rowCreatedTS] [datetime] NOT NULL,
[rowCreatedID] [varchar](80) NOT NULL,
[rowAlteredTS] [datetime] NOT NULL,
[rowAlteredID] [varchar](80) NOT NULL,
CONSTRAINT [PK_ftRxVolumeByWeek] PRIMARY KEY CLUSTERED
([ftRxVolumeByWeekKey] ASC, [quarterStartDt] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PTYrPeriodDataScheme]([quarterStartDt])
) ON [PTYrPeriodDataScheme]([quarterStartDt])
GO
CREATE NONCLUSTERED INDEX [IX_ftRxVolumeByWeekBydmPlanTrakPayerKey] ON [ETL].[wkRxVolumeByWeek] ( [dmPlanTrakPayerKey] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PTYrPeriodDataScheme]([quarterStartDt])
GO
CREATE NONCLUSTERED INDEX [IX_ftRxVolumeByWeekBydmPlanTrakProductKey] ON [ETL].[wkRxVolumeByWeek] ( [dmPlanTrakProductKey] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PTYrPeriodDataScheme]([quarterStartDt])
GO
CREATE NONCLUSTERED INDEX [IX_ftRxVolumeByWeekBydmPrescriberKey] ON [ETL].[wkRxVolumeByWeek] ( [dmPrescriberKey] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PTYrPeriodDataScheme]([quarterStartDt])
GO
CREATE NONCLUSTERED INDEX [IX_ftRxVolumeByWeekBydmWeekKey] ON [ETL].[wkRxVolumeByWeek] ( [dmWeekKey] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PTYrPeriodDataScheme]([quarterStartDt])
GO
…and Foreign Keys and any other Constraints.
Commands for Combining Work Tables to One Partitioned Table
The SQL commands used for partition switching from separate tables into one combined partitioned table are as follows:
-- Perform Constraint Checking on table data for tables that become Partitions
ALTER TABLE ETL.wkRxVolumeByWeek_2008 WITH CHECK ADD CONSTRAINT CK_ftRxVolumeByWeek_2008
CHECK (quarterStartDt >= '01-01-2008' AND quarterStartDt <= '12-31-2008') GO ALTER TABLE ETL.wkRxVolumeByWeek_2009 WITH CHECK ADD CONSTRAINT CK_ftRxVolumeByWeek_2009 CHECK (quarterStartDt >= '01-01-2009' AND quarterStartDt <= '12-31-2009') GO ALTER TABLE ETL.wkRxVolumeByWeek_2010 WITH CHECK ADD CONSTRAINT CK_ftRxVolumeByWeek_2010 CHECK (quarterStartDt >= '01-01-2010' AND quarterStartDt <= '12-31-2010')
GO
-- Add Primary Keys on Individual Work Tables
ALTER TABLE [ETL].[wkRxVolumeByWeek_2008] ADD CONSTRAINT [PK_ftRxVolumeByWeek_2008] PRIMARY KEY CLUSTERED
(
[ftRxVolumeByWeekKey] ASC,
[quarterStartDt] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90)
ON [FG_PTDATA2008]
GO
ALTER TABLE [ETL].[wkRxVolumeByWeek_2009] ADD CONSTRAINT [PK_ftRxVolumeByWeek_2009] PRIMARY KEY CLUSTERED
(
[ftRxVolumeByWeekKey] ASC,
[quarterStartDt] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90)
ON [FG_PTDATA2009]
GO
ALTER TABLE [ETL].[wkRxVolumeByWeek_2010] ADD CONSTRAINT [PK_ftRxVolumeByWeek_2010] PRIMARY KEY CLUSTERED
(
[ftRxVolumeByWeekKey] ASC,
[quarterStartDt] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90)
ON [FG_PTDATA2010]
GO
-- Switch Partitions
ALTER TABLE ETL.wkRxVolumeByWeek_2008 SWITCH TO ETL.wkRxVolumeByWeek PARTITION 1
GO
ALTER TABLE ETL.wkRxVolumeByWeek_2009 SWITCH TO ETL.wkRxVolumeByWeek PARTITION 2
GO
ALTER TABLE ETL.wkRxVolumeByWeek_2010 SWITCH TO ETL.wkRxVolumeByWeek PARTITION 3
GO
Friday, January 28, 2011
SSIS script, Read and Write Variable functions
Script task to read and write variables.
To read
fileLoc = ReadVariable("FileFolderLoc")
To Write
WriteVariable("SubDirectoryObject", arr)
Private Function ReadVariable(ByVal varName As String) As Object
Dim result As Object
Dim vars As Variables
Dim emptyBytes(0) As Byte
Dim logMsg As String
Dim logCd As Integer
Try
Dts.VariableDispenser.LockForRead(varName)
Dts.VariableDispenser.GetVariables(vars)
Try
result = vars(varName).Value
Catch ex As Exception
logMsg = "*** CANNOT READ VARIABLE " + varName + " ***"
logCd = 1001
Dts.Log(logMsg, logCd, emptyBytes)
Dts.Events.FireError(logCd, " ", logMsg, " ", -1)
Dts.TaskResult = ScriptResults.Failure
Throw ex
Finally
vars.Unlock()
End Try
Catch ex As Exception
logMsg = "*** CANNOT LOCK VARIABLE " + varName + " FOR READ ***"
logCd = 1002
Dts.Log(logMsg, logCd, emptyBytes)
Dts.Events.FireError(logCd, " ", logMsg, " ", -1)
Dts.TaskResult = ScriptResults.Failure
Throw ex
End Try
Return result
End Function
Private Sub WriteVariable(ByVal varName As String, ByVal varValue As Object)
Dim vars As Variables
Dim emptyBytes(0) As Byte
Dim logMsg As String
Dim logCd As Integer
' If the variable you are updating is in the parent package,
' DO NOT list the variable in the child package and
' DO NOT include it in the package configuration.
Try
Dts.VariableDispenser.LockForWrite(varName)
Dts.VariableDispenser.GetVariables(vars)
Try
vars(varName).Value = varValue
Catch ex As Exception
logMsg = "*** CANNOT UPDATE VARIABLE " + varName + " WITH VALUE " + varValue.ToString + " ***"
logCd = 1003
Dts.Log(logMsg, logCd, emptyBytes)
Dts.Events.FireError(logCd, " ", logMsg, " ", -1)
Dts.TaskResult = ScriptResults.Failure
Throw ex
Finally
vars.Unlock()
End Try
Catch ex As Exception
logMsg = "*** CANNOT LOCK VARIABLE " + varName + " FOR WRITE ***"
logCd = 1004
Dts.Log(logMsg, logCd, emptyBytes)
Dts.Events.FireError(logCd, " ", logMsg, " ", -1)
Dts.TaskResult = ScriptResults.Failure
Throw ex
End Try
End Sub
To read
fileLoc = ReadVariable("FileFolderLoc")
To Write
WriteVariable("SubDirectoryObject", arr)
Private Function ReadVariable(ByVal varName As String) As Object
Dim result As Object
Dim vars As Variables
Dim emptyBytes(0) As Byte
Dim logMsg As String
Dim logCd As Integer
Try
Dts.VariableDispenser.LockForRead(varName)
Dts.VariableDispenser.GetVariables(vars)
Try
result = vars(varName).Value
Catch ex As Exception
logMsg = "*** CANNOT READ VARIABLE " + varName + " ***"
logCd = 1001
Dts.Log(logMsg, logCd, emptyBytes)
Dts.Events.FireError(logCd, " ", logMsg, " ", -1)
Dts.TaskResult = ScriptResults.Failure
Throw ex
Finally
vars.Unlock()
End Try
Catch ex As Exception
logMsg = "*** CANNOT LOCK VARIABLE " + varName + " FOR READ ***"
logCd = 1002
Dts.Log(logMsg, logCd, emptyBytes)
Dts.Events.FireError(logCd, " ", logMsg, " ", -1)
Dts.TaskResult = ScriptResults.Failure
Throw ex
End Try
Return result
End Function
Private Sub WriteVariable(ByVal varName As String, ByVal varValue As Object)
Dim vars As Variables
Dim emptyBytes(0) As Byte
Dim logMsg As String
Dim logCd As Integer
' If the variable you are updating is in the parent package,
' DO NOT list the variable in the child package and
' DO NOT include it in the package configuration.
Try
Dts.VariableDispenser.LockForWrite(varName)
Dts.VariableDispenser.GetVariables(vars)
Try
vars(varName).Value = varValue
Catch ex As Exception
logMsg = "*** CANNOT UPDATE VARIABLE " + varName + " WITH VALUE " + varValue.ToString + " ***"
logCd = 1003
Dts.Log(logMsg, logCd, emptyBytes)
Dts.Events.FireError(logCd, " ", logMsg, " ", -1)
Dts.TaskResult = ScriptResults.Failure
Throw ex
Finally
vars.Unlock()
End Try
Catch ex As Exception
logMsg = "*** CANNOT LOCK VARIABLE " + varName + " FOR WRITE ***"
logCd = 1004
Dts.Log(logMsg, logCd, emptyBytes)
Dts.Events.FireError(logCd, " ", logMsg, " ", -1)
Dts.TaskResult = ScriptResults.Failure
Throw ex
End Try
End Sub
Copy File to all the subfolders
An example to copy a file from one location to all the subfolders for a target location.
Step 1 : Script , to get the list of all the subfolders and assigning to a object.
Dts.TaskResult = ScriptResults.Success
Dim fileLoc, fileName As String
fileLoc = ReadVariable("FileFolderLoc")
fileName = ReadVariable("FileName")
Dim diSource As DirectoryInfo = New DirectoryInfo(fileLoc)
Dim arr As New System.Collections.ArrayList()
For Each subdir In diSource.GetDirectories()
arr.Add(subdir.Name)
Next
WriteVariable("SubDirectoryObject", arr)
Step 2 : For each Loop container as below
.
Step 1 : Script , to get the list of all the subfolders and assigning to a object.
Dts.TaskResult = ScriptResults.Success
Dim fileLoc, fileName As String
fileLoc = ReadVariable("FileFolderLoc")
fileName = ReadVariable("FileName")
Dim diSource As DirectoryInfo = New DirectoryInfo(fileLoc)
Dim arr As New System.Collections.ArrayList()
For Each subdir In diSource.GetDirectories()
arr.Add(subdir.Name)
Next
WriteVariable("SubDirectoryObject", arr)
Step 2 : For each Loop container as below
Step 3 : File copy to task to copy the files to the subfolder
Subscribe to:
Posts (Atom)