QRecall Community Forum

Some serious problems
Forum Index » Problems and Bugs
Author Message
Adrian Chapman


Joined: Aug 16, 2010
Messages: 72
Offline
James

I capture my Users directory to an archive hosted on a Drobo-FS and this has performed well for some time. I also capture my System volume and other Macs to their own archives on the Drobo.

My Users archive has recently suffered some serious, and it would seem irreparable, problems, such that I now have an archive with 157 layers of which 89 are showing as damaged. For my most recent repair attempt I manually deleted all the index files in the archive; the repair ran through but reported a number of errors in the archive. I then ran my Users archive action, which seems to have run without incident, although the progress bar in the activity window is only about 40% of the way across when the archive is being closed, and the closing process seems to take an inordinately long time. However, the capture seems to have completed without any other issues.

I am very reluctant to write off this archive but right now I don't see what else I can do. I have submitted a report.
James Bucanek


Joined: Feb 14, 2007
Messages: 1568
Offline
Adrian,

Thanks for posting your problem and for the diagnostic report.

My analysis is that you're having data integrity problems with your NAS drive. My first guess would be cross-linked files, which can cause writes into one file to pollute a different file with unrelated data.

The first problem was encountered on 2012-06-28 14:08, when the verify failed:
2012-06-28 15:44:17.686 Archive contains unused space that has not been erased

2012-06-28 15:49:13.313 Pos: 245397938824
2012-06-28 15:49:13.313 Length: 1555152

Under normal circumstances, QRecall fills each unused area of the archive with zeros. It then wraps a small record header around each area to mark where it begins and how long it is.

During the verify, QRecall confirms that all unused areas of the archive are, indeed, still filled with zeros. In your case, there was something other than zeros in that region of the archive. This indicates that either some other process has written data into that region of your archive, or the sectors that store that data have become damaged or corrupted.
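For illustration only, here is a minimal sketch of what such a free-space check amounts to (Python, with a made-up region list; QRecall's actual record format is internal and not shown in this thread):

# Hypothetical sketch: confirm that recorded free-space regions still read back as all zeros.
# The (position, length) pairs are invented for illustration.
def verify_free_space(archive_path, free_regions, chunk=1024 * 1024):
    damaged = []
    with open(archive_path, "rb") as f:
        for pos, length in free_regions:
            f.seek(pos)
            remaining = length
            clean = True
            while remaining > 0 and clean:
                data = f.read(min(chunk, remaining))
                if not data or data.count(0) != len(data):  # short read or a non-zero byte
                    clean = False
                remaining -= len(data)
            if not clean:
                damaged.append((pos, length))
    return damaged

# e.g. verify_free_space("repository.data", [(245397938824, 1555152)])

A failure like the one above simply means a region that should have read back as all zeros did not.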

Subsequent attempts to repair the archive showed similar problems, but this time the damage was in regions of the archive that contained real data:
2012-06-28 17:53:37.933 2464 bytes at 248688532320

2012-06-28 17:54:20.271 13601632 bytes at 250084036664
2012-06-28 17:55:19.349 32744 bytes at 252668562552
2012-06-28 17:55:19.349 16 bytes at 252668595320
2012-06-28 17:58:06.164 4104 bytes at 259718257104

The next repair reported more lost data:
2012-06-28 21:00:20.971 16400 bytes at 245687567768

2012-06-28 21:00:53.052 20280 bytes at 246963311616
2012-06-28 21:00:53.395 2664 bytes at 246963332080

The really important thing to note here is that the file positions in the second report come before the ones in the first report (245687567768 is well below 248688532320). If the data on the drive were stable, the first repair would have reported these errors too. Instead, the first repair read this area of the archive and found it to be sound, but the next time you ran the repair that previously sound area had data errors.

There are two explanations for this.

The drive system could simply be dying a slow death, randomly scrambling or corrupting data.

More likely, the volume structure of the NAS is damaged. If the allocation map is invalid, it could cause newly written data (i.e. the index files that get recreated during the repair) to inadvertently overwrite perfectly good data in the primary archive data file. In effect, the repair is destroying the archive.

The solution to the latter problem is to use a disk repair utility to ensure the volume structure of the NAS is OK.
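To make the cross-link idea concrete, here is a toy model (purely illustrative, not the Drobo's actual file system) in which a corrupted allocation map lets two files claim the same block, so writing the rebuilt index clobbers archive data:

# Toy model of a cross-linked allocation map (illustration only; names are made up).
disk = bytearray(8)                     # pretend device with 8 one-byte blocks
alloc_map = {
    "repository.data": [2, 3, 4],       # archive data occupies blocks 2-4
    "index.new":       [4, 5],          # corrupted map re-uses block 4
}

def write_file(name, payload):
    for block, byte in zip(alloc_map[name], payload):
        disk[block] = byte

write_file("repository.data", b"ABC")   # archive data written first
write_file("index.new", b"XY")          # repair rebuilds the index...
print(bytes(disk[2:5]))                 # b'ABX' -- the archive's last byte is gone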

As you continued to run more repair actions, QRecall continued to find new areas of your archive that were damaged. Damage to a few records can indirectly impact the contents of multiple layers, so it doesn't take long before most of the layers in your archive are affected by at least one of these problems. You can review the repair log to discover exactly which files/folders were affected (in most cases, it was just a few files that were lost during each repair).

- QRecall Development -
Adrian Chapman


Joined: Aug 16, 2010
Messages: 72
Offline
Well, that certainly explains the problem. Unfortunately, the Drobo-FS has its own internal file system and, from what little I know, the method of checking can be destructive of data! Great!

Sometimes I think these systems are too damned clever for their own good.

Thanks anyway James
James Bucanek


Joined: Feb 14, 2007
Messages: 1568
Offline
Just a follow up:

While reviewing the rest of your diagnostic report, I noticed this error during your last verify action:
2012-06-30 14:07:41.546 Failed

2012-06-30 14:07:41.546 cannot read file
2012-06-30 14:07:41.546 Error: I/O error (bummers)
2012-06-30 14:07:41.546 Length: 524288
2012-06-30 14:07:41.547 Pos: 242321696144
2012-06-30 14:07:41.547 File: repository.data
An I/O error means the drive was physically unable to read the requested data.

It could mean hardware problems, or it could be caused by a corrupted volume structure (a file position gets translated into a track/sector that doesn't exist and the drive responds with an I/O error). Because I/O errors are ambiguous, this doesn't really narrow down what the exact problem is. But it is one more piece of evidence that points to a problem with the NAS.
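As a rough sketch (not something QRecall does for you), you could try re-reading that exact region outside of QRecall; the offset and length below come from the log entry above:

# Attempt the failed read directly; an EIO here means the device itself refused the read.
import errno

def try_read(path, pos, length):
    try:
        with open(path, "rb") as f:
            f.seek(pos)
            data = f.read(length)
        print("read", len(data), "of", length, "bytes at", pos)
    except OSError as e:
        if e.errno == errno.EIO:
            print("I/O error at", pos, "- the device could not return this data")
        else:
            raise

# try_read("repository.data", 242321696144, 524288)

If the same read fails repeatedly, that points at the drive or volume rather than at anything QRecall wrote.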

- QRecall Development -
James Bucanek


Joined: Feb 14, 2007
Messages: 1568
Offline
Adrian Chapman wrote:Unfortunately the Drobo-FS has its own internal file system and from what little I know, the method of checking can be destructive of data! Great!

From the sound of this Drobo support article, you might be able to fix it with OS X.

- QRecall Development -
Adrian Chapman


Joined: Aug 16, 2010
Messages: 72
Offline
It seems this is not applicable to the Drobo-FS, James. The Drobo-FS only has an ethernet port and it is supposed to do its own file system check upon booting.

I did a firmware update on it recently and I am beginning to wonder if that is the root of the problem.

I am currently running a QRecall verify on my other archives to make sure they are OK. So far I haven't discovered any other file corruption.
Adrian Chapman


Joined: Aug 16, 2010
Messages: 72
Offline
James

Just a quick follow up. I have found no other data corruption in any files on my Drobo and I have started a new Users directory archive.

The only other thing that springs to mind is that I deleted a lot of content from the archive that subsequently became corrupted, which resulted in my weekly verify/merge/compact actions starting a big compact. Because this was taking so long, I set it to run each night and stop at 0800. The corruption issue started after I embarked on this series of actions.

I'm not suggesting that QRecall is at fault per se, but I wonder if there is a connection somewhere.

Adrian
James Bucanek


Joined: Feb 14, 2007
Messages: 1568
Offline
Adrian Chapman wrote:I'm not suggesting that QRecall is at fault per se, but I wonder if there is a connection somewhere.

Very likely. The compact action moves a lot of data. In fact, it will typically rewrite almost every block in the archive. Contrast that with a typical capture or merge action, which might modify no more than one ten-thousandth of the archive.

If you have cross-linked files, the compact action will be the one to expose it.
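If you want to check the NAS independently of QRecall, a crude write-and-read-back test (just a sketch; the path, file count, and sizes are arbitrary) imitates what a compact does and shows whether the volume returns what was written:

# Write files of random data, then read them back and compare checksums.
# Any mismatch points to unstable or cross-linked storage, not to QRecall.
import hashlib, os

def stress_test(directory, file_count=8, size=64 * 1024 * 1024):
    digests = {}
    for i in range(file_count):
        path = os.path.join(directory, "stress_%d.bin" % i)
        data = os.urandom(size)
        with open(path, "wb") as f:
            f.write(data)
        digests[path] = hashlib.sha256(data).hexdigest()
    bad = []
    for path, expected in digests.items():
        with open(path, "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() != expected:
                bad.append(path)
    return bad

# e.g. stress_test("/Volumes/Drobo-FS/qrecall-test")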

- QRecall Development -
 