QRecall

Ralph Strauch

I seem to have an archive that is beyond repair. Attempts to repair it or even copy it to another drive result in the drive suddenly taking itself off-line. I originally thought it was a bad drive, but in light of the overall behavior I think there might be a problem with the archive file itself. I'm describing it here in case it's useful for you to know about.

For the past three years I've been backing up two Macs to a WD MyBook Pro, consisting of two 500gb drives in a Raid 1 configuration. (I originally bought the unit thinking I could use the drives separately but that didn't turn out to be true.) The system has worked well, by and large. It did occasionally give me network or drive errors that would corrupt my archive, but qrecall was always able to repair it. Several weeks ago the failures got worse, with the drive regularly unmounting itself during a backup. So I set it aside and shifted to a new backup drive, thinking that the problem was with the enclosure and the drives themselves were probably fine. (This happened about the time I upgraded to Lion, I believe.)

This week I got around to taking the drives out of the case and trying them both individually. Each exhibited the same behavior as they had in the case. When I tried to open the archive qrecall would tell me it needed to be reindexed, and when I tried to repair it the drive would unmount itself part way through. Both drives checked out OK using Disk Utility and Tech Tool Pro. It seemed unlikely to me that both drives would suffer the same failure at the same time. I erased one of the drives and ran a new backup on it, and that worked fine. I then tried copying the archive from the other drive to a new drive. Other files on the drive copied properly, but part way through the archive copy the source drive dismounted and left an incomplete file on the destination -- just as it had done during repair attempts.

I'll send you a report on this from qrecall. My concern is with the possibility that the archive somehow got so corrupted that it can't be repaired. I can probably get along without ever needing to go back to the older files that I lost, but it shakes what had been my complete faith in qrecall. If you have any idea of what might have happened, I'd appreciate that.

Ralph

James Bucanek

Ralph,

Thanks for posting your issue, although I'm sorry to hear you're having such problems.

I suspect the problem is with your interface (ATA, SATA, FireWire, ...). The problem isn't the archive, it's that QRecall can't reliably read data from the device(s) and those devices are randomly dropping offline.

Let's take your issues one at a time:

I seem to have an archive that is beyond repair.

Technically, that's virtually impossible.

QRecall archives are carefully crafted so that no one incident of lost data will corrupt the entire archive. An archive is, essentially, a huge set of individual data records. The repair process really does nothing more than read the entire record set from start to finish, noting which data records are valid and intact. It then creates a new index that stitches the valid records back together into a functioning archive and erases everything else. For an archive to be damaged "beyond repair" it would require that almost every single data record in the entire archive was either damaged or unreadable. Unless your hard drive was driven over with a truck, that seems unlikely.
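To make the repair idea concrete, here is a minimal sketch in Python. It is not QRecall's actual file format or code: the `REC1` marker, the header layout, and the names `pack_record` and `rebuild_index` are all made up for illustration. It only shows the general technique described above: read every record from start to finish, keep the ones whose checksum verifies, and build a fresh index from those.

```python
import struct
import zlib

# Hypothetical record layout (NOT QRecall's real format):
# 4-byte marker, 4-byte payload length, 4-byte CRC32 of the payload.
MAGIC = b"REC1"
HEADER = struct.Struct(">4sII")


def pack_record(payload: bytes) -> bytes:
    """Wrap a payload in the hypothetical header."""
    return HEADER.pack(MAGIC, len(payload), zlib.crc32(payload)) + payload


def rebuild_index(data: bytes) -> list:
    """Scan the whole record set and return (offset, length) for each valid record."""
    index = []
    pos = 0
    while pos + HEADER.size <= len(data):
        magic, length, crc = HEADER.unpack_from(data, pos)
        start = pos + HEADER.size
        payload = data[start:start + length]
        if magic == MAGIC and len(payload) == length and zlib.crc32(payload) == crc:
            # Record is intact: remember it and step to the next one.
            index.append((pos, length))
            pos = start + length
        else:
            # Record is damaged: skip ahead to the next marker, if any.
            nxt = data.find(MAGIC, pos + 1)
            if nxt == -1:
                break
            pos = nxt
    return index
```

In this toy model, one damaged record costs only that record. The records before and after it still end up in the new index, which is why a single incident of lost data should not take out the whole archive.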

Several weeks ago the failures got worse, with the drive regularly unmounting itself during a backup.

I think that this is the core problem, but I can't tell you what's causing it.

When I tried to open the archive qrecall would tell me it needed to be reindexed

That's simply because the archive had already been damaged, and had never been successfully repaired (for one reason or another). QRecall can't browse an unindexed archive.

and when I tried to repair it the drive would unmount itself part way through.

QRecall will not cause your drive to unmount itself. But a drive that does unmount itself will certainly tank whatever QRecall action is in progress.

Well, that first statement isn't entirely true. QRecall reads data from files using the same OS functions that every other application you own uses. One significant difference is that QRecall reads and writes a whole lot of data. Applications that read a few megabytes aren't likely to encounter a problem, but read a few gigabytes and you'll eventually run into something. So it's possible for QRecall to exacerbate an existing problem, but it can't cause it.

Looking at your log files, that's exactly what appears to be happening here. The last repair you attempted shows a pattern of data read failures, timeouts, and volumes spontaneously unmounting.

2011-08-22 18:05:12.708 -0700 ------- Repair bu-archive.quanta

QRecall starts the repair process at 18:05 by reading the primary archive file.

Everything seems to run fine for about fifteen minutes, until ...

2011-08-22 18:20:08.259 -0700 Caution Data problems found
2011-08-22 18:20:08.259 -0700 Warning Invalid data
2011-08-22 18:20:08.260 -0700 Details 32784 bytes at 12817714648
2011-08-22 18:20:08.756 -0700 Details 32784 bytes at 12822828952
2011-08-22 18:20:09.532 -0700 Details 24880 bytes at 12839047576
2011-08-22 18:20:13.020 -0700 Details 32784 bytes at 12935966552
2011-08-22 18:20:14.038 -0700 Details 32784 bytes at 12957223328

a burst of data errors is encountered. QRecall is either getting corrupted data during the read process, or the data in these records was previously scrambled by some other problem.

About twenty-six minutes after this string of data errors, the volume unmounts itself:

2011-08-22 18:46:53.872 -0700 #debug# DiskDisappeared /dev/disk2s3
2011-08-22 18:46:53.875 -0700 #debug# removed volume 'spare'
2011-08-22 18:47:03.897 -0700 #debug# volume /Volumes/spare unmount notification
2011-08-22 18:47:03.901 -0700 #debug# unmounted volume /Volumes/spare

About half an hour later, the following is logged:

2011-08-22 19:17:06.069 -0700 Caution listener threw exception in message progressUpdate:
2011-08-22 19:17:06.069 -0700 Details NSDistantObject (0x1023012a0) is invalid (no connection)
2011-08-22 19:17:06.069 -0700 #debug# NSObjectInaccessibleException exception

These messages tell me that the communications link between the repair process and the user interface timed out. This usually means that one of the processes terminated abnormally, but it can also happen if one of the processes stops communicating for more than a couple of minutes. That tells me that the repair process was likely "hung" waiting for the drive to return data.

I suspect that the drive and/or interface is having problems reading some data. It either loses connection with the drive (causing hung processes) or panics and unmounts the volume, or both. Two more bits of information in the log suggest this. First is the string of volume mount and unmount events logged:

2011-08-22 17:28:03.347 -0700 #debug# volume /Volumes/spare mount notification
2011-08-22 17:29:12.919 -0700 #debug# volume /Volumes/spare unmount notification
2011-08-22 17:29:27.452 -0700 #debug# volume /Volumes/spare mount notification
2011-08-22 17:29:27.617 -0700 #debug# volume /Volumes/spare unmount notification
2011-08-22 17:29:27.725 -0700 #debug# volume /Volumes/spare mount notification
2011-08-22 17:32:24.371 -0700 #debug# volume /Volumes/spare unmount notification
2011-08-22 17:32:50.309 -0700 #debug# volume /Volumes/spare mount notification
2011-08-22 17:32:50.380 -0700 #debug# volume /Volumes/spare unmount notification
2011-08-22 17:33:18.777 -0700 #debug# volume /Volumes/spare mount notification
2011-08-22 17:34:06.202 -0700 #debug# volume /Volumes/spare unmount notification
2011-08-22 17:34:19.468 -0700 #debug# volume /Volumes/spare mount notification
2011-08-22 17:34:27.942 -0700 #debug# volume /Volumes/spare unmount notification
2011-08-22 17:34:32.833 -0700 #debug# volume /Volumes/FantomHD unmount notification
2011-08-22 17:34:32.907 -0700 #debug# volume /Volumes/MBPclone unmount notification
2011-08-22 17:38:58.680 -0700 #debug# volume /Volumes/spare mount notification
2011-08-22 17:39:17.792 -0700 #debug# volume /Volumes/spare unmount notification
2011-08-22 17:39:33.070 -0700 #debug# volume /Volumes/spare mount notification
2011-08-22 17:39:33.236 -0700 #debug# volume /Volumes/spare unmount notification
2011-08-22 17:39:33.350 -0700 #debug# volume /Volumes/spare mount notification
2011-08-22 17:40:28.764 -0700 #debug# volume /Volumes/rstrauch mount notification
2011-08-22 17:41:24.718 -0700 #debug# volume /Volumes/spare unmount notification
2011-08-22 17:43:07.119 -0700 #debug# volume /Volumes/spare mount notification
2011-08-22 17:43:55.041 -0700 #debug# volume /Volumes/spare unmount notification
2011-08-22 17:44:09.609 -0700 #debug# volume /Volumes/spare mount notification
2011-08-22 17:46:20.231 -0700 #debug# volume /Volumes/spare unmount notification
2011-08-22 17:47:14.190 -0700 #debug# volume /Volumes/spare mount notification
2011-08-22 17:48:40.586 -0700 #debug# volume /Volumes/spare unmount notification
2011-08-22 18:01:55.406 -0700 #debug# volume /Volumes/spare mount notification
2011-08-22 18:47:03.897 -0700 #debug# volume /Volumes/spare unmount notification
2011-08-22 20:42:27.107 -0700 #debug# volume /Volumes/spare mount notification
2011-08-22 20:43:50.591 -0700 #debug# volume /Volumes/spare unmount notification
2011-08-23 01:36:29.097 -0700 #debug# volume /Volumes/spare1 mount notification
2011-08-23 08:53:02.072 -0700 #debug# volume /Volumes/spare1 unmount notification
2011-08-23 10:05:30.332 -0700 #debug# volume /Volumes/Files mount notification
2011-08-23 10:38:14.210 -0700 #debug# volume /Volumes/spare1 mount notification

While many of these are obviously normal activity, particularly since you said you were moving volumes around, some are clearly not. Of particular note are the mount events that follow unmount events by a fraction of a second. This is not normal and indicates that the volume spontaneously unmounted itself and then immediately remounted.
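For anyone who wants to check their own log for this pattern, here is a rough Python sketch. It assumes the log lines look exactly like the mount/unmount lines quoted above. The function name `rapid_remounts` and the one-second threshold are my own choices for illustration, not anything built into QRecall.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
# 2011-08-22 17:29:27.617 -0700 #debug# volume /Volumes/spare unmount notification
LINE = re.compile(r"^(\S+ \S+ \S+) #debug# volume (.+) (mount|unmount) notification$")


def rapid_remounts(lines, threshold=timedelta(seconds=1)):
    """Return (volume, mount_time, gap) for mounts that closely follow an unmount."""
    last_unmount = {}  # volume path -> time of its most recent unmount
    hits = []
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue  # not a mount/unmount notification
        when = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f %z")
        volume, event = m.group(2), m.group(3)
        if event == "unmount":
            last_unmount[volume] = when
        elif volume in last_unmount:
            gap = when - last_unmount.pop(volume)
            if gap < threshold:
                hits.append((volume, when, gap))
    return hits
```

Run against the listing above with the one-second threshold, this flags two remounts of /Volumes/spare, at 17:29:27.725 and 17:39:33.350, each roughly a tenth of a second after the preceding unmount.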

Your log is also full of -35 (Volume not found) and -43 (File not found) errors, indicating that the archive's volume was not mounted at the time the action attempted to do something.

Other files on the drive copied properly, but part way through the archive copy the source drive dismounted and left an incomplete file on the destination -- just as it had done during repair attempts.

This tells me that the Finder is encountering exactly the same problem that the QRecall repair action did: it was reading the file just fine, then it encountered data read errors, then the drive unmounted. This points to a problem with the drive, the controllers, or the I/O subsystem.

If it becomes possible to recover the data that's on the drive(s), I suspect that the bulk of the archive is fine, could be repaired, and eventually put back into service.

Ralph Strauch

Thanks for the detailed reply. The problems were caused, apparently, by the WD Raid enclosure I was using. Both drives seem to be fine individually now that I've removed them from the enclosure. I thought that the archive itself was unreadable when I couldn't copy it to another drive, but I think now that my power cord had pulled loose, causing the computer to shut down and interrupt the copy. I've now successfully repaired the archive and everything is back to normal.

Here's another question, though. I'd like to use the two drives I pulled from the Raid unit to keep two copies of the archive -- storing one offsite and switching them periodically. Will I encounter any problems if I name both drives and archives identically, and use the same Action to backup to whichever drive happens to be connected? That seems like it should work if all the status information is stored in the archive, and it's certainly simpler than having a separate action for each drive.

Ralph

James Bucanek

Ralph Strauch wrote: I've now successfully repaired the archive and everything is back to normal.

That's great news!

Here's another question, though. I'd like to use the two drives I pulled from the Raid unit to keep two copies of the archive -- storing one offsite and switching them periodically.

This is an excellent strategy. I personally prefer this approach over RAID 1 for redundancy.

Will I encounter any problems if I name both drives and archives identically, and use the same Action to backup to whichever drive happens to be connected?

Most likely, yes. Mac OS X is pretty smart about not being fooled by different volumes and/or files with the same name. It will undoubtedly recognize that the volumes are different and will probably refuse to use one of them outright.

it's certainly simpler than having a separate action for each drive.

While it seems like a lot of work, that shouldn't be difficult to set up.

Let's say you have two archives on two volumes, named "A" and "B".

Mount volume A and configure all of your QRecall actions the way you want them. In the Schedule section of each item, add the "Ignore if no archive" condition.

Unmount volume A and mount volume B.

Select all of the actions for archive A and click the "duplicate" button at the bottom of the actions window, and then click on the "edit" button. In each action, change the archive from A to B. When saving, do not let QRecall update the archive for all similar actions, or else it will "fix" the ones for archive A.

Now you have two identical sets of actions, which will only run when their respective volume is mounted.

You can even get clever and use QRecall as a reminder to rotate your backup sets. Set up a compact action for each archive. Schedule them using the interval schedule to run every 14 days, starting on alternate weeks. Change their schedule condition to "Hold while no archive". Now, once a week QRecall will start the compact (which will appear in the activity window) for that week's archive and patiently wait for you to mount it.
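The arithmetic behind the alternating schedule can be sketched in a few lines of Python. This is only a model of the calendar logic, not something QRecall runs. The function name `due_archive` and the example start date are assumptions for illustration.

```python
from datetime import date, timedelta


def due_archive(today, start_a, interval_days=14):
    """Return "A" or "B" if that archive's compact is scheduled for today, else None.

    Archive A's action first runs on start_a; archive B's action is offset
    by half the interval, so the two alternate week by week.
    """
    start_b = start_a + timedelta(days=interval_days // 2)
    for name, start in (("A", start_a), ("B", start_b)):
        days = (today - start).days
        if days >= 0 and days % interval_days == 0:
            return name
    return None


# Example with a hypothetical first run on Monday, 5 September 2011.
first_run = date(2011, 9, 5)
for week in range(4):
    day = first_run + timedelta(weeks=week)
    print(day.isoformat(), due_archive(day, first_run))
```

With a 14-day interval and a 7-day offset, each Monday belongs to exactly one archive, so whichever compact is sitting in the activity window tells you which drive to mount that week.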

Bruce Giles

James Bucanek wrote:

Ralph Strauch wrote: Will I encounter any problems if I name both drives and archives identically, and use the same Action to backup to whichever drive happens to be connected?

Most likely, yes. Mac OS X is pretty smart about not being fooled by different volumes and/or files with the same name. It will undoubtedly recognize that the volumes are different and will probably refuse to use one of them outright.

I have an interesting solution to the problem. Unfortunately, I can't tell you exactly how I did it -- it was a fortunate accident. But it might be useful to someone...

I have two external FW800 drives I use as rotating backups. They have different names -- "Xserve Backup A" and "Xserve Backup B". Each drive, however, has an archive file with the same name -- "Xserve Backup A.quanta". Yes, you read it correctly. Drive "Xserve Backup B" has an archive file named "Xserve Backup A.quanta".

The interesting part is that I have only one set of actions, which "automagically" modify themselves to work with whichever backup drive is mounted. I don't know how they do that. When drive "Xserve Backup A" is mounted, my actions all point to:

/Volumes/Xserve Backup A/QRecall Backups/Xserve Backup A.quanta

When drive "Xserve Backup B" is mounted, my actions all point to:

/Volumes/Xserve Backup B/QRecall Backups/Xserve Backup A.quanta

If I edit an action, I can actually see the wording in the archive location changes, depending which drive is mounted.

I *think* what happened was that I originally created one backup drive, got it all configured in QRecall, then used SuperDuper to clone it to a second drive. Then I changed the name of the second drive, but didn't change anything else. Anyway, it's worked perfectly that way for several years. I don't, however, ever mount both backup drives on the Mac at the same time. Seems like that might be asking for trouble.

-- Bruce

Peter N Lewis

James Bucanek wrote:
Select all of the actions for archive A and click the "duplicate" button at the bottom of the actions window, and then click on the "edit" button. In each action, change the archive from A to B. When saving, do not let QRecall update the archive for all similar actions, or else it will "fix" the ones for archive A.

Are there plans to make this automatic? It seems to me, if your Archive field allowed multiple selections like the Items to Capture does, you'd be able to do this within QRecall rather than the hackish solution of duplicating all actions?

James Bucanek

Peter N Lewis wrote: Are there plans to make this automatic?

It's an interesting idea, but I don't have any plans to implement actions that target multiple archives, primarily because I don't see any way of implementing such a feature without creating a lot of ambiguities. (Ambiguities and backups aren't a good mix.)

It seems to me, if your Archive field allowed multiple selections like the Items to Capture does, you'd be able to do this within QRecall rather than the hackish solution of duplicating all actions?

You're capturing to two archives, so you should have two sets of actions—one set for each archive. Duplicating the first set of actions isn't so much of a "hack" as it is simply a way of saving time when setting up the second set of actions (as opposed to creating the second set from scratch too).

Peter N Lewis

James Bucanek wrote:

Peter N Lewis wrote: Are there plans to make this automatic?

It's an interesting idea, but I don't have any plans to implement actions that target multiple archives, primarily because I don't see any way of implementing such a feature without creating a lot of ambiguities. (Ambiguities and backups aren't a good mix.)

It seems to me, if your Archive field allowed multiple selections like the Items to Capture does, you'd be able to do this within QRecall rather than the hackish solution of duplicating all actions?

You're capturing to two archives, so you should have two sets of actions—one set for each archive. Duplicating the first set of actions isn't so much of a "hack" as it is simply a way of saving time when setting up the second set of actions (as opposed to creating the second set from scratch too).

I'm not sure how it's ambiguous - it captures to the first archive that is available (or if you prefer, it captures to all archives that are available, which would make sense too - rerun the action for each specified archive).

It becomes a hack when you have three or four rotating backups, and need to create three or four copies of each action on each Mac.

What happens when you want to add a new excluded folder? You have to add it on three or four actions. Same if you want to change the schedule.

James Bucanek

Peter N Lewis wrote: It becomes a hack when you have three or four rotating backups, and need to create three or four copies of each action on each Mac.

I feel we're getting the "solution" cart in front of the "need" horse. While some kind of "multi-archive" action might make this kind of backup strategy easier to set up and maintain, it adds a lot of complexity for a very small benefit (in my not-so-humble opinion).

The real need is to have redundant, rotating, cascading, or fail-over backups. For that, I have a couple of new ideas on the drawing board that I think will address those needs much more directly.