DB Consistency Analysis / Maatkit

Intro

I’ve been struggling to keep our MIT database mirror synchronized with the BNL master, and I wanted to write about some steps we (STAR) might take to do a better job of keeping our slaves synchronized. The problem that I’m worried about is the situation where, according the the Heartbeat Page, a slave is up-to-date with robinson, but in reality the slave somehow become silently corrupted.

Initial Checksums

It turns out that this problem is actually pretty common. I found what seems to be a slick set of utilities called Maatkit that will calculate checksums of every table in a DB and look for differences between replicated DBs. I ran mk-table-checksum on the following servers

  • robinson.star.bnl.gov:3306
  • db01.star.bnl.gov:3316
  • db02.star.bnl.gov:3316
  • db03.star.bnl.gov:3316
  • rhig.physics.yale.edu:3316
  • star1.lns.mit.edu:3316

and attached the output below as initial_checksum.txt. None of the slaves in that list are fully in sync with robinson according to those checksums. db02 and db03 come much closer than the others; in db02’s case only Calibrations_tracker.{schema,ssdHitError} are different from robinson. I verified for a few cases that differences actually do exist in the tables when the checksums don’t match.

Resynchronization and Results

Maatkit also provides a utility (mk-table-sync) which will determine and optionally execute the SQL commands needed to re-sync one server against another. I used this utility to re-sync star1 against robinson — it takes quite a while. I then ran mk-table-checksum and mk-checksum-filter again, and attached the output as checksum_after_sync_filtered.txt. Unfortunately, robinson and star1 still don’t have perfect agreement according to the checksums. I’m not sure what tables like Nodes, NodeRelation, and tableCatalog do, but I noticed the following “physics” tables still did not have matching checksums:

  • Calibrations_ftpc.ftpcGasOut
  • Calibrations_rich.trigDetSums
  • Calibrations_svt.svtPedestals
  • RunLog_onl.beamInfo
  • RunLog_onl.biFitParams
  • RunLog_onl.starMagOnl
  • RunLog_onl.zdcFitParams

Now comes a weird part: I tried a SELECT * FROM biFitParams on star1 and on robinson, and in that case there was no difference in the output. I’m not sure how the checksums could still be different in that case. I also tried diff’ing the starMagOnl tables; the only difference I found was one server reported some currents as “-0.0000000000” and the other one reported “0.0000000000” (no leading minus sign).

Summary

So, I realize the results aren’t 100% conclusive, but I still believe that these maatkit scripts would be a valuable addition to STAR’s QA toolkit. They definitely helped me correct a variety of real problems with our MIT database mirror.

It’s straightforward to take the output of mk-table-checksum and mk-checksum-filter and programmatically put it on a webpage; in fact, Mike Betancourt wrote up a little sed script to do just that. I think we should try scheduling ~daily checksum calculations and posting any discrepancies to the Heartbeat webpage automatically.