Load Text Files To A Database 10 Times Faster
By Robbe D. Morris
While completing a contract project for an Outbound Telemarketing company recently, I ran into a situation where a creative solution to an age old problem was needed. We desperately needed to improve the performance of our desktop application that read in fixed position, fixed length, ASCII text files and handled various inserts and updates to multiple tables for each record. To complicate matters, the load process had to take the following requirements into account:
1.
All access to the SQL Server database had to be done remotely. Local access was not allowed.
2.
With each record read in from the text file, conditional statements as well as table inserts with the newly generated foreign keys needed to be handled accordingly.
3.
It had to be fast enough to be feasible over a very slow network.
4.
Needed to be simple enough for our computer operators to run without support from the programming group.
5.
Be able to insert characters that would typically foul up standard SQL insert/update statements.
These requirements really left me with the option of creating a desktop app that could read the text files line by line, pull out specific pieces of data by position on the record, and then perform the various inserts and updates required. Typically, an approach using the ADO Command object and SQL Server stored procedures would be used on a record by record basis. Of course, loading 50,000 to 500,000 records can be quite time consuming and I needed an approach that would be much faster. So, I put on my thinking cap and began to look for alternatives that could at least double the speed of our current processes. As it turned out, I was able to increase that speed by at least a factor of ten just working against a local instance of SQL Server. Depending on the size of the text file (ie number of records to process), the speed improvements gradually improved above and beyond the factor of 10 when applied to our remote database servers over the network.
Here’s how I did it. I assumed that the largest time consumer in the process was the repeated trips to the database for each record. A little test I put together confirmed that. With this in mind, a test app was written to compare the current approach against one that would send a thousand records at a time to the database to be processed. Rather than parsing the text file locally in the desktop app for each record, I created a process to package up a thousand records at a time and send the whole string to a similar stored procedure. The stored procedure would iterate through the string and parse out the individual pieces of data and react accordingly. A key point to mention is that I used a slight variation of the string builder class mentioned in another article here on EggHeadCafe: A Fast String Class for ASP Pages. Using standard string concatenation techniques would have dramatically impacted the performance gains in this endeavor.
In your implementations, you may have to adjust the number of records you can send to the stored procedure at any given time. If you don’t need unicode support, you can set the internal string to be varchar and support 8000 bytes. Otherwise, if you use nvarchar, you may need to reduce that. You may also opt to use a slightly different approach to parse the data directly from the varchar input parameter directly. In either event, it is just something you need to consider. The fewer trips you make to the database, the faster your process will be.
All of the test code has been included here for your review. You’ll notice that I used a column in the tables called RecordNum. I’ve often found it useful to flag records as they are being loaded to make it much easier to break out large text files into smaller groups of records to be used on predictive dialers. As always, if you would like to comment on this article, feel free to email me at the address at the bottom of this article.
Visual Basic basMain.bas Module (Primary Test Oriented Code)
Option Explicit Public Const mksTestFileName = "egghead.txt" Public UI As cUI Public Enum ekTestOptions EggHeadTest1 = 1 EggHeadTest2 = 2 End Enum Public Sub Main() Set UI = New cUI UI.AppPath = App.Path & "\" If InStr(1, UI.AppPath, "VB98", vbTextCompare) > 0 Then UI.AppPath = "C:\EggHeadTest\" ' This is to prevent VB from using the default Visual Studio as it's root ' when developing in the IDE. Simply change the path to fit the ' folder where your source code resides. This has no effect on ' production executables. End If DeleteExistingTestRecords CreateTestFile UI.TimerInit LoadTest ekTestOptions.EggHeadTest1 Debug.Print "Single Records: " & UI.TimerGetElapsedTime UI.TimerInit LoadTest ekTestOptions.EggHeadTest2 Debug.Print "Multiple Records: " & UI.TimerGetElapsedTime ExitApp End Sub Public Sub ExitApp() Set UI = Nothing Err.Clear End End Sub Public Sub DeleteExistingTestRecords() On Error GoTo ErrHandler If UI.DBOpen = True Then UI.ADOCon.Execute "delete from eggheadtest1" UI.ADOCon.Execute "delete from eggheadtest1trans" UI.ADOCon.Execute "delete from eggheadtest2" UI.ADOCon.Execute "delete from eggheadtest2trans" End If ErrHandler: If Err.Number <> 0 Then MsgBox "Failure to delete test records. " & vbCrLf & Err.Description On Error Resume Next UI.DBClose Err.Clear End Sub Public Sub CreateTestFile() Dim lTot As Long Dim lCnt As Long Dim oFS As Scripting.FileSystemObject Dim oFSFile As Scripting.TextStream On Error GoTo ErrHandler Const ForWriting = 2 Set oFS = New Scripting.FileSystemObject If oFS.FileExists(UI.AppPath & mksTestFileName) = True Then oFS.DeleteFile UI.AppPath & mksTestFileName, True Set oFSFile = oFS.OpenTextFile(UI.AppPath & mksTestFileName, ForWriting, True) lTot = 20005 ' Record length = 100 For lCnt = 1 To lTot oFSFile.Write "4075551212" & "Rush" & Space(6) & "Limbaugh" & Space(2) & "12 Main St" & Space(2) & Space(44) & "Test" & vbCrLf DoEvents Next oFSFile.Close ErrHandler: If Err.Number <> 0 Then MsgBox "Failure to open file for writing. " & vbCrLf & Err.Description On Error Resume Next Set oFSFile = Nothing Set oFS = Nothing Err.Clear End Sub Public Sub LoadTest(ByRef ekTestOption As ekTestOptions) Dim sRecord As String Dim lTotRecCnt As Long Dim lRecLength As Long Dim sMassRec As String Dim lMassRecCnt As Long Dim lStartRecord As Long Dim lRecordCnt As Long Dim lRecordCntMax As Long Dim lRecs As String On Error GoTo ErrHandler ' If we use the Open/For Input method, it allows us to reference one record at ' a time .v loading up a 40-50MB file into ' a string variable with the Scripting.FileSystemObject. But, remember that it ' removes the vbcrlf that was at the end of each ' record. Thus, we adjusted the previous record length of 100 to 98. UI.HourGlass True lRecLength = 98 lRecordCntMax = 1000 UI.DBOpen UI.DBClearParameters Open UI.AppPath & mksTestFileName For Input As #1 Do Until EOF(1) lTotRecCnt = lTotRecCnt + 1 Line Input #1, sRecord DoEvents Select Case ekTestOption Case ekTestOptions.EggHeadTest1 With UI.ADOCom .CommandText = "spEggHeadTest" & ekTestOption .CommandType = adCmdStoredProc .Parameters.Append .CreateParameter("@RETURN_VALUE", adInteger, adParamReturnValue, 0) .Parameters.Append .CreateParameter("@Phone", adVarWChar, adParamInput, 10, Mid(sRecord, 1, 10)) .Parameters.Append .CreateParameter("@FName", adVarWChar, adParamInput, 50, Mid(sRecord, 11, 10)) .Parameters.Append .CreateParameter("@LName", adVarWChar, adParamInput, 50, Mid(sRecord, 21, 10)) .Parameters.Append .CreateParameter("@RecordData", adVarWChar, adParamInput, 500, Mid(sRecord, 31, 64)) .Parameters.Append .CreateParameter("@LastPieceOfRecord", adVarWChar, adParamInput, 50, Mid(sRecord, 95, 4)) .Parameters.Append .CreateParameter("@RecordNum", adInteger, adParamInput, , lTotRecCnt) .Execute lRecs, , adExecuteNoRecords End With UI.DBClearParameters Case ekTestOptions.EggHeadTest2 lMassRecCnt = lMassRecCnt + 1 If lMassRecCnt = 1 Then UI.StringBuilderInit lStartRecord = lTotRecCnt End If UI.StringBuilderAppend sRecord If lTotRecCnt Mod lRecordCntMax = 0 Then sMassRec = UI.StringBuilderConcat UI.StringBuilderInit With UI.ADOCom .CommandText = "spEggHeadTest" & ekTestOption .CommandType = adCmdStoredProc .Parameters.Append .CreateParameter("@RETURN_VALUE", adInteger, adParamReturnValue, 0) .Parameters.Append .CreateParameter("@RecLength", adInteger, adParamInput, , lRecLength) .Parameters.Append .CreateParameter("@StartRecord", adInteger, adParamInput, , lStartRecord) .Parameters.Append .CreateParameter("@CurRecordCnt", adInteger, adParamInput, , lMassRecCnt) .Parameters.Append .CreateParameter("@RecordData", adLongVarWChar, adParamInput, 1073741823, sMassRec) .Execute lRecs, , adExecuteNoRecords End With UI.DBClearParameters lMassRecCnt = 0 End If End Select Loop Close #1 If ekTestOption = ekTestOptions.EggHeadTest2 Then If ((lMassRecCnt < lRecordCntMax) And (lMassRecCnt > 0)) Then sMassRec = UI.StringBuilderConcat With UI.ADOCom .CommandText = "spEggHeadTest" & ekTestOption .CommandType = adCmdStoredProc .Parameters.Append .CreateParameter("@RETURN_VALUE", adInteger, adParamReturnValue, 0) .Parameters.Append .CreateParameter("@RecLength", adInteger, adParamInput, , lRecLength) .Parameters.Append .CreateParameter("@StartRecord", adInteger, adParamInput, , lStartRecord) .Parameters.Append .CreateParameter("@CurRecordCnt", adInteger, adParamInput, , lMassRecCnt) .Parameters.Append .CreateParameter("@RecordData", adLongVarWChar, adParamInput, 1073741823, sMassRec) .Execute lRecs, , adExecuteNoRecords End With UI.DBClearParameters UI.StringBuilderInit End If End If ErrHandler: UI.HourGlass False If Err.Number <> 0 Then MsgBox "Error loading test file. " & vbCrLf & Err.Description On Error Resume Next Close #1 UI.DBClose Err.Clear End Sub
Visual Basic cUI.cls (Support Class Module With Common Methods)
Option Explicit Private msStringBuilderArray() As String Private mlStringBuilderGrowthRate As Long Private mlStringBuilderNumItems As Long Private mlStartTimer As Long Public AppPath As String Public ADOCon As ADODB.Connection Public ADOCom As ADODB.Command Private Sub Class_Initialize() Set ADOCon = New ADODB.Connection Set ADOCom = New ADODB.Command End Sub Private Sub Class_Terminate() On Error Resume Next ADOCon.Close Set ADOCom = Nothing Set ADOCon = Nothing Err.Clear End Sub Public Function DBOpen() As Boolean On Error Resume Next Me.ADOCon.Close Err.Clear On Error GoTo ErrHandler DBOpen = False Me.ADOCon.Open "Provider=SQLOLEDB;Data Source=(local); User ID=yourusr;Password=yourpwd;Initial Catalog=YOURDB" Set ADOCom.ActiveConnection = Me.ADOCon DBOpen = True ErrHandler: If Err.Number <> 0 Then MsgBox "Unable to connect to " & vbCrLf & Err.Description Err.Clear End Function Public Sub DBClose() On Error Resume Next Me.ADOCon.Close Err.Clear End Sub Public Sub HourGlass(ByVal fOn As Boolean) If fOn = True Then Screen.MousePointer = vbHourglass Else Screen.MousePointer = vbDefault End If End Sub Public Function WriteFile(ByVal sFilePathAndName As String, ByVal sFileContents As String) As Boolean Dim oFS As Scripting.FileSystemObject Dim oFSFile On Error GoTo ErrHandler WriteFile = False Const ForWriting = 2 Set oFS = New Scripting.FileSystemObject If oFS.FileExists(sFilePathAndName) = True Then oFS.DeleteFile sFilePathAndName, True Set oFSFile = oFS.OpenTextFile(sFilePathAndName, ForWriting, True) oFSFile.Write (sFileContents) oFSFile.Close WriteFile = True ErrHandler: If Err.Number <> 0 Then MsgBox "Failure to open file for writing. " & vbCrLf & sFilePathAndName & vbCrLf & Err.Description end if On Error Resume Next Set oFSFile = Nothing Set oFS = Nothing Err.Clear End Function Public Sub TimerInit() mlStartTimer = Timer End Sub Public Function TimerGetElapsedTime() As Long TimerGetElapsedTime = Timer - mlStartTimer End Function Public Sub CatStr(ByRef sOrgVal As String, ByRef sVal As String) ' Look at the egghead article for using arrays as a string builder. ' Since this sub is not part of the article's speed evaluation, I didn't bother. sOrgVal = sOrgVal & sVal End Sub Public Sub DBClearParameters() Dim nCnt On Error Resume Next If Me.ADOCom.Parameters.Count < 1 Then Exit Sub For nCnt = 0 To Me.ADOCom.Parameters.Count Me.ADOCom.Parameters.Delete 0 Next Err.Clear End Sub Public Sub StringBuilderInit() On Error Resume Next Erase msStringBuilderArray Me.StringBuilderInit Err.Clear mlStringBuilderGrowthRate = 50: mlStringBuilderNumItems = 0 ReDim msStringBuilderArray(mlStringBuilderGrowthRate) End Sub Public Sub StringBuilderAppend(ByVal strValue As String) If mlStringBuilderNumItems > UBound(msStringBuilderArray) Then ReDim Preserve msStringBuilderArray(UBound(msStringBuilderArray) + mlStringBuilderGrowthRate) end if msStringBuilderArray(mlStringBuilderNumItems) = strValue: mlStringBuilderNumItems = mlStringBuilderNumItems + 1 End Sub Public Function StringBuilderConcat() ReDim Preserve msStringBuilderArray(mlStringBuilderNumItems) StringBuilderConcat = Join(msStringBuilderArray, "") End Function
SQL Server Table Schemas And Stored Procedures
CREATE TABLE [dbo].[EggHeadTest1] ( [RecordID] [int] IDENTITY (1, 1) NOT NULL , [Phone] [nvarchar] (10) COLLATE SQL_Latin1_General_CP1_CI_AS NULL , [FName] [nvarchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL , [LName] [nvarchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL , [RecordData] [nvarchar] (500) COLLATE SQL_Latin1_General_CP1_CI_AS NULL , [LastPieceOfRecord] [nvarchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL , [RecordNum] [int] NOT NULL , [LastUpdateTime] [datetime] NULL ) ON [PRIMARY] GO CREATE TABLE [dbo].[EggHeadTest1Trans] ( [TransID] [int] IDENTITY (1, 1) NOT NULL , [RecordID] [int] NOT NULL , [DispositionID] [int] NOT NULL , [CallDateTime] [datetime] NULL , [LastUpdateTime] [datetime] NOT NULL ) ON [PRIMARY] GO CREATE TABLE [dbo].[EggHeadTest2] ( [RecordID] [int] IDENTITY (1, 1) NOT NULL , [Phone] [nvarchar] (10) COLLATE SQL_Latin1_General_CP1_CI_AS NULL , [FName] [nvarchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL , [LName] [nvarchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL , [RecordData] [nvarchar] (500) COLLATE SQL_Latin1_General_CP1_CI_AS NULL , [LastPieceOfRecord] [nvarchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL , [RecordNum] [int] NOT NULL , [LastUpdateTime] [datetime] NULL ) ON [PRIMARY] GO CREATE TABLE [dbo].[EggHeadTest2Trans] ( [TransID] [int] IDENTITY (1, 1) NOT NULL , [RecordID] [int] NOT NULL , [DispositionID] [int] NOT NULL , [CallDateTime] [datetime] NULL , [LastUpdateTime] [datetime] NOT NULL ) ON [PRIMARY] GO ALTER TABLE [dbo].[EggHeadTest1] WITH NOCHECK ADD CONSTRAINT [PK_EggHeadTest1] PRIMARY KEY CLUSTERED ( [RecordID] ) ON [PRIMARY] GO ALTER TABLE [dbo].[EggHeadTest1Trans] WITH NOCHECK ADD CONSTRAINT [DF_EggHeadTest1Trans_DispositionID] DEFAULT (0) FOR [DispositionID], CONSTRAINT [PK_EggHeadTest1Trans] PRIMARY KEY CLUSTERED ( [TransID] ) ON [PRIMARY] GO ALTER TABLE [dbo].[EggHeadTest2] WITH NOCHECK ADD CONSTRAINT [PK_EggHeadTest2] PRIMARY KEY CLUSTERED ( [RecordID] ) ON [PRIMARY] GO ALTER TABLE [dbo].[EggHeadTest2Trans] WITH NOCHECK ADD CONSTRAINT [DF_EggHeadTest2Trans_DispositionID] DEFAULT (0) FOR [DispositionID], CONSTRAINT [PK_EggHeadTest2Trans] PRIMARY KEY CLUSTERED ( [TransID] ) ON [PRIMARY] GO CREATE PROCEDURE dbo.spEggHeadTest1 ( @Phone nvarchar(10), @FName nvarchar(50), @LName nvarchar(50), @RecordData nvarchar(500), @LastPieceOfRecord nvarchar(50), @RecordNum int ) AS DECLARE @rc int DECLARE @NewID int BEGIN TRANSACTION EggHeadTest insert EggHeadTest1 (Phone,FName,LName,RecordData,LastPieceOfRecord, RecordNum,LastUpdateTime) values (@Phone,@FName,@LName,@RecordData,@LastPieceOfRecord, @RecordNum,GetDate()) if (@@ERROR <> 0) BEGIN ROLLBACK TRANSACTION EggHeadTest select @rc = @@Error Goto OnExit END SELECT @NewID = @@IDENTITY insert EggHeadTest1Trans (RecordID,DispositionID, CallDateTime,LastUpdateTime) values (@NewID,0,null,GetDate()) if (@@ERROR <> 0) BEGIN ROLLBACK TRANSACTION EggHeadTest select @rc = @@Error Goto OnExit END COMMIT TRANSACTION EggHeadTest OnExit: RETURN @rc GO SET QUOTED_IDENTIFIER OFF GO SET ANSI_NULLS ON GO SET QUOTED_IDENTIFIER OFF GO SET ANSI_NULLS OFF GO CREATE PROCEDURE dbo.spEggHeadTest2 ( @RecLength int, @StartRecord int, @CurRecordCnt int, @RecordData ntext ) AS DECLARE @rc int DECLARE @TotRecCnt int DECLARE @NewID int DECLARE @nCnt int DECLARE @RecStart int DECLARE @TmpData varchar(8000) Select @TotRecCnt = @StartRecord - 1 Select @RecStart = 1 Select @nCnt = 1 BEGIN TRANSACTION EggHeadTest WHILE (@nCnt <= @CurRecordCnt) BEGIN Select @TotRecCnt = @TotRecCnt + 1 /* Increment our record count for the whole file */ Select @TmpData = SUBSTRING(@RecordData,@RecStart,@RecLength) insert EggHeadTest2 (Phone,FName,LName,RecordData, LastPieceOfRecord,RecordNum,LastUpdateTime) values (SUBSTRING(@TmpData,1,10),SUBSTRING(@TmpData,11,10), SUBSTRING(@TmpData,21,10), SUBSTRING(@TmpData,31,64), SUBSTRING(@TmpData,95,4),@TotRecCnt, GetDate()) if (@@ERROR <> 0) BEGIN ROLLBACK TRANSACTION EggHeadTest select @rc = @@Error Goto OnExit END SELECT @NewID = @@IDENTITY insert EggHeadTest2Trans (RecordID,DispositionID, CallDateTime,LastUpdateTime) values (@NewID,0,null,GetDate()) if (@@ERROR <> 0) BEGIN ROLLBACK TRANSACTION EggHeadTest select @rc = @@Error Goto OnExit END Select @nCnt = @nCnt + 1 Select @RecStart = @RecStart + @RecLength END /* End of loop */ COMMIT TRANSACTION EggHeadTest OnExit: RETURN @rc GO SET QUOTED_IDENTIFIER OFF GO SET ANSI_NULLS ON GO