Garbage In, Garbage Out From Databases, Too
Whenever I bring up validating the data the database returns to the application, developers tell me it’s a waste of resources. However, I think Robin Harris might agree with me.
From “Data corruption is worse than you know”:
CERN found an overall byte error rate of 3 * 10^7, a rate considerably higher than numbers like 10^14 or 10^12 spec’d for components would suggest. This isn’t sinister.
It’s the BER of each link in the chain from CPU to disk and back again plus the fact that for some traffic, such as transferring a byte from the network to a disk, requires 6 memory r/w operations. That really pumps up the data volume and with it the likelihood of encountering an error.
From “50 ways to lose your data”:
Software: drives contain small computers that run on several hundred thousand lines of code. Is that code bug free? Need you ask? Among the more common bugs - and let’s not get started on the less common ones - are:
- New code that fixes a problem and accidentally breaks old code
- Putting the right data in the wrong place.
- Phantom writes that are reported as written but, oops!, aren’t.
- Cache management bugs that munge data, or return correct data to the wrong place.
- OK, this is less common, but sometimes the on-disk ECC miscorrects the data. ECC is software, right? How do you know it always works correctly? You don’t.
For the most part, he focuses on how data gets corrupted in the storage system, but data within databases can get corrupted in additional ways, including:
- You let me test it, and I inserted invalid data. Sometimes the dev team would fix the issue, but not clean the data, which meant that the application would bomb out when trying to load the corrupt data from the database. Usually during a demo. Had the application checked the data when reading it from the database, it could have handled my troublesome habits.
- Data migrations or batch data loads dump unfiltered or incorrect data in. Because some third shift desk jockey working from a command line is, like developers, LIKE THE GODS! and would never make a mistake or try something he or she did not understand.
- Integrated applications aren’t as bulletproof as yours; that is, they have 4″ mesh data validation where your wonderful project has 2″ mesh. Your partners and Web services are allowing crap into your database which can choke your application.
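For the skeptics, here is a minimal sketch of what checking data on the way out of the database might look like. Everything in it is made up for illustration: the customers table, the Customer class, the validate_row and load_customers helpers, and the rules themselves (an email needs an "@", an age has to be a whole number from 0 to 150). Your application would bring its own rules. The point is only that a bad row gets set aside with a reason instead of blowing up the load.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Customer:
    # Hypothetical record type, invented for this example.
    id: int
    email: str
    age: int


def validate_row(row):
    """Return a list of problems found in one (id, email, age) row."""
    _cust_id, email, age = row
    problems = []
    # Assumed rule: an email must be a string containing "@".
    if not isinstance(email, str) or "@" not in email:
        problems.append("bad email")
    # Assumed rule: an age must be an integer between 0 and 150.
    if not isinstance(age, int) or not 0 <= age <= 150:
        problems.append("bad age")
    return problems


def load_customers(conn):
    """Read every row; return (valid customers, rejected ids with reasons)."""
    good, rejected = [], []
    query = "SELECT id, email, age FROM customers ORDER BY id"
    for row in conn.execute(query):
        problems = validate_row(row)
        if problems:
            # Set the row aside instead of letting it crash the application.
            rejected.append((row[0], problems))
        else:
            good.append(Customer(*row))
    return good, rejected


def demo():
    """Build a throwaway in-memory table holding one good row and three bad ones."""
    conn = sqlite3.connect(":memory:")
    # Columns are left untyped so SQLite stores the bad values as given.
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email, age)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [
            (1, "ann@example.com", 34),
            (2, "not-an-email", 29),     # slipped past somebody's 4-inch mesh
            (3, "bob@example.com", -5),  # batch load gone wrong
            (4, None, "forty"),          # yours truly, testing
        ],
    )
    return load_customers(conn)


if __name__ == "__main__":
    good, rejected = demo()
    print("loaded:", good)
    print("rejected:", rejected)
```

Whether the rejected rows get logged, quarantined, or shown to the user is a design decision this sketch leaves open. What matters is that the demo keeps running.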
It’s a good idea to be suspicious of all data and make sure it’s correct. However, that costs time and money, and developers and customer-facing yes men would rather spend their time drinking at lunch.
Kind of like QA, actually, in that regard. Because whiskey makes us meaner.
September 19th, 2007 at 11:33 am
Oh, you missed the money quote:
“When [CERN] checked 8.7 TB of user data for corruption - 33,700 files - they found 22 corrupted files, or 1 in every 1500 files.”
Data repositories of 8.7TB and larger aren’t unheard of these days, and that number is only going to seem more and more normal as time goes on.