The purpose of this section is to provide an understanding of the process of loading large amounts of data. The examples used will be of a prototype using xPonent data.
Process Summary
1. Build base tables that have the same characteristics of columns and data types. The FILEGROUPs must be the same as the ending partitions. There should be one work table for each of partitions needed for the final table. No primary keys, foreign keys or indexes should be applied on the work tables before loading. Partitioning is also not applied at this time.
2. Load separate tables based on partition values. Loads can be run simultaneously.
3. Apply the Primary Key to each individual table. Validate data loaded meets the partition specification by issuing a Check Constraint for the range of values. These processes can also be run simultaneously.
4. Merge tables into one partitioned table using the PARTITION SWITCH.
5. Rename the partitioned work table to the Production table name.
6. Add additional indexing, check constraints, foreign keys, etc.
Large Data Load Process Diagram
xPonent Prototype Description
The xPonent Prototype was designed to determine how fast large amounts of data could be imported into SQL Server. The prototype used 100 million rows as a sample. Three work tables were created representing three partitions for three separate years of data. The final result of the entire process is a fact table (with three partitions), Primary and Foreign Keys and other constraints defined and the appropriate indexing included.
SSIS Package Data Flow
Timings
Below is a breakdown on the timings for loading the data an ancillary steps performed in the prototype.
1. 19 mins – Loading 100 million rows into 3 work tables.
2. 10 mins – Create Primary Key, Clustered (non partitioned) indexes on work tables
3. 3 mins – Constraint Checking on each work tables (verify partition ranges)
4. 1 min – Switch Partitions
5. 23 mins – Add other indexes to the combined partitions (now one table)
6. 3 mins – Add foreign keys
So, 19 mins to load three separate tables plus 40 mins for the other parts gives us about 1 hour to load 100,000,000 rows. We may be able to cut down the time of index creation as these were created serially with the work tables. We may also be able to create all the indexes in parallel on the work tables.
Other performance timing enhancements can be performed as referenced in these write-ups:
http://msdn.microsoft.com/en-us/library/dd537533(SQL.100).aspx
http://msdn.microsoft.com/en-us/library/dd425070.aspx
Another consideration for performance in running the import was the location of the file to be imported. This has to be on a server in the same domain with a fast network connection between servers. Or, better yet, on the same box as the SSIS package being run.
Work Tables
The DDL for each the Work tables is as follows (changing only the year imbedded in the name of the table and the FILEGROUP name):
CREATE TABLE [ETL].[wkRxVolumeByWeek_2009](
[ftRxVolumeByWeekKey] [uniqueidentifier] NOT NULL,
[quarterStartDt] [datetime] NOT NULL,
[dmPlanTrakPayerKey] [uniqueidentifier] NULL,
[dmPlanTrakProductKey] [uniqueidentifier] NULL,
[dmPrescriberKey] [uniqueidentifier] NULL,
[dmWeekKey] [uniqueidentifier] NULL,
[newRxAmt] [decimal](19, 6) NULL,
[newTMUQty] [decimal](19, 6) NULL,
[newUnitQty] [decimal](19, 6) NULL,
[totalRxAmt] [decimal](19, 6) NULL,
[totalTMUQty] [decimal](19, 6) NULL,
[totalUnitQty] [decimal](19, 6) NULL,
[rowCreatedTS] [datetime] NOT NULL,
[rowCreatedID] [varchar](80) NOT NULL,
[rowAlteredTS] [datetime] NOT NULL,
[rowAlteredID] [varchar](80) NOT NULL
) ON [FG_PTDATA2009]
Data Distribution
Below is a breakdown on the distribution of the data being loaded for the prototype (by year):
Year RowCnt
2008 16,347,744
2009 50,002,696
2010 33,649,560
Combined Partitioned Table
The DDL for the resulting partition table (before renaming) is as follows:
CREATE TABLE [ETL].[wkRxVolumeByWeek](
[ftRxVolumeByWeekKey] [uniqueidentifier] NOT NULL,
[quarterStartDt] [datetime] NOT NULL,
[dmPlanTrakPayerKey] [uniqueidentifier] NULL,
[dmPlanTrakProductKey] [uniqueidentifier] NULL,
[dmPrescriberKey] [uniqueidentifier] NULL,
[dmWeekKey] [uniqueidentifier] NULL,
[newRxAmt] [decimal](19, 6) NULL,
[newTMUQty] [decimal](19, 6) NULL,
[newUnitQty] [decimal](19, 6) NULL,
[totalRxAmt] [decimal](19, 6) NULL,
[totalTMUQty] [decimal](19, 6) NULL,
[totalUnitQty] [decimal](19, 6) NULL,
[rowCreatedTS] [datetime] NOT NULL,
[rowCreatedID] [varchar](80) NOT NULL,
[rowAlteredTS] [datetime] NOT NULL,
[rowAlteredID] [varchar](80) NOT NULL,
CONSTRAINT [PK_ftRxVolumeByWeek] PRIMARY KEY CLUSTERED
([ftRxVolumeByWeekKey] ASC, [quarterStartDt] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PTYrPeriodDataScheme]([quarterStartDt])
) ON [PTYrPeriodDataScheme]([quarterStartDt])
GO
CREATE NONCLUSTERED INDEX [IX_ftRxVolumeByWeekBydmPlanTrakPayerKey] ON [ETL].[wkRxVolumeByWeek] ( [dmPlanTrakPayerKey] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PTYrPeriodDataScheme]([quarterStartDt])
GO
CREATE NONCLUSTERED INDEX [IX_ftRxVolumeByWeekBydmPlanTrakProductKey] ON [ETL].[wkRxVolumeByWeek] ( [dmPlanTrakProductKey] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PTYrPeriodDataScheme]([quarterStartDt])
GO
CREATE NONCLUSTERED INDEX [IX_ftRxVolumeByWeekBydmPrescriberKey] ON [ETL].[wkRxVolumeByWeek] ( [dmPrescriberKey] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PTYrPeriodDataScheme]([quarterStartDt])
GO
CREATE NONCLUSTERED INDEX [IX_ftRxVolumeByWeekBydmWeekKey] ON [ETL].[wkRxVolumeByWeek] ( [dmWeekKey] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PTYrPeriodDataScheme]([quarterStartDt])
GO
…and Foreign Keys and any other Constraints.
Commands for Combining Work Tables to One Partitioned Table
The SQL commands used for partition switching from separate tables into one combined partitioned table are as follows:
-- Perform Constraint Checking on table data for tables that become Partitions
ALTER TABLE ETL.wkRxVolumeByWeek_2008 WITH CHECK ADD CONSTRAINT CK_ftRxVolumeByWeek_2008
CHECK (quarterStartDt >= '01-01-2008' AND quarterStartDt <= '12-31-2008') GO ALTER TABLE ETL.wkRxVolumeByWeek_2009 WITH CHECK ADD CONSTRAINT CK_ftRxVolumeByWeek_2009 CHECK (quarterStartDt >= '01-01-2009' AND quarterStartDt <= '12-31-2009') GO ALTER TABLE ETL.wkRxVolumeByWeek_2010 WITH CHECK ADD CONSTRAINT CK_ftRxVolumeByWeek_2010 CHECK (quarterStartDt >= '01-01-2010' AND quarterStartDt <= '12-31-2010')
GO
-- Add Primary Keys on Individual Work Tables
ALTER TABLE [ETL].[wkRxVolumeByWeek_2008] ADD CONSTRAINT [PK_ftRxVolumeByWeek_2008] PRIMARY KEY CLUSTERED
(
[ftRxVolumeByWeekKey] ASC,
[quarterStartDt] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90)
ON [FG_PTDATA2008]
GO
ALTER TABLE [ETL].[wkRxVolumeByWeek_2009] ADD CONSTRAINT [PK_ftRxVolumeByWeek_2009] PRIMARY KEY CLUSTERED
(
[ftRxVolumeByWeekKey] ASC,
[quarterStartDt] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90)
ON [FG_PTDATA2009]
GO
ALTER TABLE [ETL].[wkRxVolumeByWeek_2010] ADD CONSTRAINT [PK_ftRxVolumeByWeek_2010] PRIMARY KEY CLUSTERED
(
[ftRxVolumeByWeekKey] ASC,
[quarterStartDt] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90)
ON [FG_PTDATA2010]
GO
-- Switch Partitions
ALTER TABLE ETL.wkRxVolumeByWeek_2008 SWITCH TO ETL.wkRxVolumeByWeek PARTITION 1
GO
ALTER TABLE ETL.wkRxVolumeByWeek_2009 SWITCH TO ETL.wkRxVolumeByWeek PARTITION 2
GO
ALTER TABLE ETL.wkRxVolumeByWeek_2010 SWITCH TO ETL.wkRxVolumeByWeek PARTITION 3
GO
No comments:
Post a Comment