Thursday, June 2, 2011

How to compare the record differences of two similar tables - Part 1 of 2


Click here to skip to the code snippet.

The CHECKSUM TABLE result sets of two similar tables only indicate if there are differences between the two tables. It does not tell you what exactly the differences are between the two. Does `tableA` have an updated value for one of its records that `tableB` does not have? Or does `tableA` have an extra row that `tableB` does not have? Depending on specific business requirements, the CHECKSUM TABLE statement may be sufficient. If you need to determine what the actual differences are, there is a way to go about doing this automatically and dynamically by creating a stored procedure.

This SELECT statement that uses a UNION ALL clause provides the same functionality in simply determining if there are differences between two tables:

    SELECT *
    FROM (
      SELECT *
      FROM `tableA`
      SELECT *
      FROM `tableB`)
    AS `compareTables`
    GROUP BY `column1`, `column2`, `columnN`
    HAVING COUNT(*) = 1),

This statement provides an automated check in the sense that you won't have to manually compare the CHECKSUMs of `tableA` and `tableB`. Also, unlike using a CHECKSUM TABLE statement, a SELECT statement allows the result to be stored INTO a variable. If the statement returns TRUE then there is a difference found between the two tables, if not it will return FALSE. The GROUP BY clause will need a comma-separated list of all the columns of the tables since MySQL will not permit an asterisk (*) wildcard in this part of the statement.

To fully automate the above formula, it can be converted into a function and the writing of the column names for the GROUP BY clause can be retrieved using the getColumnList() user defined function. Then re-write the SELECT statement as a dynamic PREPARE-EXECUTE construct. The two tables `tableA` and `tableB` can be the parameters to be passed and the TRUE/FALSE values to be the RETURN.

To determine what the actual differences are, a more complex SELECT statement is needed. Based on the first iteration, we can derive from it and make it more useful:

SELECT MIN(`compareTables`) AS 'Tables Compared',
  `column1`, `column2`, `columnN`
  SELECT '`tableA`' AS `compareTables`,
    `column1`, `column2`, `columnN`
  FROM `tableA`
  SELECT '`tableB`' AS `compareTables`,
    `column1`, `column2`, `columnN`
  FROM `tableB`)
AS `compareTables`
GROUP BY `column1`, `column2`, `columnN`

A sample result set will look like this:

Tables ComparedcustomerNumbercontactLastNamecontactFirstName

The first two rows compared `tableA` and `tableB`. Note that the difference between the two tables was found in the values for the `contactLastName` column. The last row shows `tableA` with no matching comparison with `tableB`. This means that the record in `tableA` does not exist in `tableB`.

In the continuation of this entry, a stored procedure that automates the writing of the SELECT statement is given. The names of the tables to be compared and their column names can be dynamically generated so that the SELECT statement won't have to be re-written for each different case that a comparison of two similar tables is made.


Faisal said...

I need to compare the record differences of two similar tables.I Saw you example its great.But I need only record from a table. Like only Change and new from new table( table B)
Here Table A = Old table and Table B = New Table. I want to show Table B which is not match in Table A.

0v34c10ck said...

Hi Faisal,

Thank you.

If what you need are just the new records, try this:

FROM `tableB` B
WHERE B.`primaryKeyColumn`
SELECT `primaryKeyColumn`
FROM `tableA` A
WHERE B.`primaryKeyColumn` = A.`primaryKeyColumn`

Where you mentioned `tableB` is the one where the new records are in. You will need to input the Primary Key column/s.

If you want to make this statement dynamic, re-write it as a dynamic SQL statement in a stored procedure and you can use the getPKColumns() user defined function to get the Primary Keys. You can find it here:

Anonymous said...

I know it's been a while since anyone commented, but I would love a version of this that only displays differences between fields (value changes) and not missing or extra rows. (i.e. ignore new data or missing data because I only care if there's a change to existing data) Is that possible with just a small tweak?