Pages

Thursday, June 2, 2011

How to compare the record differences of two similar tables - Part 1 of 2

Permalink: http://bit.ly/1z7MNGQ



Click here to skip to the code snippet.

The CHECKSUM TABLE result sets of two similar tables only indicate if there are differences between the two tables. It does not tell you what exactly the differences are between the two. Does `tableA` have an updated value for one of its records that `tableB` does not have? Or does `tableA` have an extra row that `tableB` does not have? Depending on specific business requirements, the CHECKSUM TABLE statement may be sufficient. If you need to determine what the actual differences are, there is a way to go about doing this automatically and dynamically by creating a stored procedure.

This SELECT statement that uses a UNION ALL clause provides the same functionality in simply determining if there are differences between two tables:

  SELECT
  IF(
    EXISTS(
    SELECT *
    FROM (
      SELECT *
      FROM `tableA`
      UNION ALL
      SELECT *
      FROM `tableB`)
    AS `compareTables`
    GROUP BY `column1`, `column2`, `columnN`
    HAVING COUNT(*) = 1),
  TRUE, FALSE);

This statement provides an automated check in the sense that you won't have to manually compare the CHECKSUMs of `tableA` and `tableB`. Also, unlike using a CHECKSUM TABLE statement, a SELECT statement allows the result to be stored INTO a variable. If the statement returns TRUE then there is a difference found between the two tables, if not it will return FALSE. The GROUP BY clause will need a comma-separated list of all the columns of the tables since MySQL will not permit an asterisk (*) wildcard in this part of the statement.

To fully automate the above formula, it can be converted into a function and the writing of the column names for the GROUP BY clause can be retrieved using the getColumnList() user defined function. Then re-write the SELECT statement as a dynamic PREPARE-EXECUTE construct. The two tables `tableA` and `tableB` can be the parameters to be passed and the TRUE/FALSE values to be the RETURN.

To determine what the actual differences are, a more complex SELECT statement is needed. Based on the first iteration, we can derive from it and make it more useful:

SELECT MIN(`compareTables`) AS 'Tables Compared',
  `column1`, `column2`, `columnN`
FROM (
  SELECT '`tableA`' AS `compareTables`,
    `column1`, `column2`, `columnN`
  FROM `tableA`
  UNION ALL
  SELECT '`tableB`' AS `compareTables`,
    `column1`, `column2`, `columnN`
  FROM `tableB`)
AS `compareTables`
GROUP BY `column1`, `column2`, `columnN`
HAVING COUNT(*) = 1;

A sample result set will look like this:

Tables ComparedcustomerNumbercontactLastNamecontactFirstName
`tableA`103fooCarine
`tableB`103SchmittCarine
`tableA`999foobarfoobarz

The first two rows compared `tableA` and `tableB`. Note that the difference between the two tables was found in the values for the `contactLastName` column. The last row shows `tableA` with no matching comparison with `tableB`. This means that the record in `tableA` does not exist in `tableB`.



In the continuation of this entry, a stored procedure that automates the writing of the SELECT statement is given. The names of the tables to be compared and their column names can be dynamically generated so that the SELECT statement won't have to be re-written for each different case that a comparison of two similar tables is made.