Synchronizing a table based on a flat files is one of the more common tasks when dealing with databases. It should be a straight forward task to delete removed records, update changed records and insert new records but I often encounter horrible ways of achieving this goal.
I recently came across an implementation using a DTS package with a data-driven task in SQL Server 2000. For each line in the file it would launch several queries to detect if it was a new, updated or unchanged record (all glued together with VBScript) and then launch the appropriate query to update the database.
Searching a bit on the internet I came across this simple solution. The best way of doing this fast and reliably is by using the JOIN and LEFT JOIN clauses in your UPDATE, INSERT and DELETE commands to determine the status of the record. I would like to show my implementation of slightly more complicated requirements.
First start by bulk inserting your data file into a staging table that has no constraints or indexes. This is the fastest way to get your data into the database. For the synchronisation queries (see below), you might want to consider using transactions if there is a risk of data corruption. You could also gain some speed by disabling the indexes while running the insert query.
In the following examples I will try to synchronize the accounts table using the tmp_accounts table as the staging table.
Deleting removed rows
DELETE accounts FROM accounts acc LEFT JOIN tmp_accounts tmp ON acc.category = tmp.category AND acc.code = tmp.code WHERE acc.code IS NULL OR acc.category IS NULL
Fields code and category together uniquely define an account.
Using the LEFT JOIN with the FROM clause will ensure that all records from the accounts table are selected, even if they don't exist in tmp_accounts. The records that don't exist in tmp_accounts will have a NULL value for every field. So all records that exist in the accounts table but have NULL values in their tmp_accounts fields, need to be deleted.
Updating changed rows
UPDATE accounts SET description = tmp.description, type = tmp.type, section = tmp.section, audit_date = tmp.audit_date, audit_user = tmp.audit_user FROM accounts acc JOIN tmp_accounts tmp ON acc.category = tmp.category AND acc.code = tmp.code WHERE acc.description <> tmp.description OR acc.type <> tmp.type OR acc.section <> tmp.section
By using the JOIN clause, we are sure to work only on records that exist in both the accounts table and the tmp_accounts table. Accounts need to be update in case their description, type or section has changed.
Audit_user and audit_date are automatically filled in the tmp_accounts table using a DEFAULT and are always kept up-to-date in the accounts table.
Inserting new rows
INSERT INTO accounts ( category, code, description, type, section, title_only, division, override, allow_manual_entry, audit_date, audit_user ) SELECT tmp.category, tmp.code, tmp.description, tmp.type, tmp.section, tmp.title_only, tmp.division, tmp.override, tmp.allow_manual_entry, tmp.audit_date, tmp.audit_user FROM tmp_accounts tmp LEFT JOIN accounts acc ON tmp.category = acc.category AND tmp.code = acc.code WHERE acc.code IS NULL OR acc.category IS NULL
Using a LEFT JOIN makes sure we are working with all records in the tmp_accounts table, even if they don't exist in the accounts table. Records that don't exist in the accounts table will have NULL values for all of their fields. So all records that exist in tmp_accounts and have NULL values for the accounts fields, have to be inserted.
Audit_user and audit_date are automatically filled in the tmp_accounts table using a DEFAULT and are always kept up-to-date in the accounts table