QRecall

Charles,

An e-mail notification is a good suggestion. I also plan to make errors more prominent in the monitor window, possibly forcing the window to the front when a warning or error is detected. At the very least, the monitor window will reopen when an error occurs, which should make it harder to miss.

I like to think of the verify action as a "peace of mind" feature. QRecall constantly checks the integrity of the archive whatever it is doing (capture, merge, recall, etc.) and will stop immediately if it finds anything amiss.

So at first blush the verify appears superfluous. But no action touches everything in an archive. The verify is the only action that ensures that everything in the archive is valid, readable, and has not been altered or corrupted in any way. I tend to schedule verifies to run at least once a week, even once a day on small archives.

Tip: If your archive is on another computer, create the verify action to run on the server. It will run much faster than it would over a network connection. The same is true for merge and compact actions.

Ralph Strauch wrote: Here's log for my previous posting. I attached it to that posting and it showed up as attached, but the post itself showed up with <br>'s in place of the returns, I guess because I wrote it in a text processor and pasted it in.

No, this is a bug in the JForum software that runs this forum. It can usually be avoided by unchecking the "Disable HTML in the message" option. I don't know why.

Hopefully this will get fixed when we upgrade to the next version of JForum.

Ralph,

I'm profoundly sorry to be the bearer of bad news (again), but using QRecall with Apple's iDisk service does not work well.

There was a long thread on this some time back (Online Storage). To avoid the pain of reading the whole thing, I'll summarize:

Apple's iDisk, Amazon's Simple Storage Service, and similar services are all based on the WebDAV protocol. This is a file transfer protocol originally designed to make it easier to maintain web sites. Over time, it's been co-opted into a general purpose remote file storage service.

The problem is that WebDAV is not a remote file protocol, and it cannot make small modifications to a large file remotely, which is exactly what QRecall does all the time. To simulate the ability to arbitrarily modify files, the OS silently downloads the entire remote file (the QRecall archive in this case) onto your local drive, allows you to update it locally, then copies the modified file back up to the server when the file is closed. For really small files, this is fine. But for making small changes to large files it is horrifically inefficient.
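To put rough numbers on that, here is a minimal Python sketch comparing the bytes moved by a true in-place update with the download, modify, upload cycle described above. The function names and the example sizes are made up for illustration; they are not measurements of iDisk or of QRecall.

```python
def bytes_moved_in_place(change_size: int) -> int:
    """A true remote file protocol only sends the bytes that changed."""
    return change_size


def bytes_moved_whole_file(file_size: int, change_size: int) -> int:
    """Whole-file emulation: download the file, then upload the modified copy."""
    return file_size + (file_size + change_size)


GB = 1000 ** 3
MB = 1000 ** 2

archive_size = 10 * GB   # hypothetical archive size
new_data = 5 * MB        # hypothetical amount of data added by one capture

in_place = bytes_moved_in_place(new_data)
whole_file = bytes_moved_whole_file(archive_size, new_data)
print(f"in place:   {in_place / MB:.0f} MB")
print(f"whole file: {whole_file / MB:.0f} MB")
```

The exact figures don't matter. The point is that the whole-file cost scales with the size of the archive rather than with the size of the change.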

What you observed was QRecall capturing all of the documents to the local copy of the archive. Since this was a local copy, the capture also happened very quickly, making you think "wow, this is really fast!" When the capture was complete, the archive file was closed and the actual work of copying the data to the iDisk began. What you perceived as QRecall "hanging" was actually the WebDAV service copying the finished archive to the iDisk server. This is consistent with the observed behavior and the I/O errors in your log file.

For small incremental backups of important documents, I highly recommend using a small USB thumb drive or just creating a small archive on the same disk. For an off-site backup, occasionally copy the archive to your iDisk.

The "waiting for archive" occurs when QRecall believes that some other process has the archive document open for reading or writing. Could you describe the arrangement (whether the archive is on an external drive or a network volume, etc.)? It would also be extremely helpful if you sent me your log file (~/Library/Logs/QRecall/QRecall.log).

The type of identity key shouldn't make any difference in this case.

I'm curious as to why the problem occurred, but I suspect that a restart will solve it.

Ralph,

Thanks for sending me your log files.

It's hard to tell what's going on, but whatever is happening appears to be related to your external hard drive unmounting in the middle of a QRecall action.

For example, the log you sent shows the Repair starting at 2008-03-13 01:48:19. Three minutes later (01:51:48), the volume "spare" (the one containing the archive) was unmounted. The operating system will not normally allow a volume to be unmounted as long as there are files open on that volume. I can only conclude that the hard drive or its interface was physically disconnected or powered down. This will most certainly cause any capture or repair action to fail and leave the archive in an incomplete state.

You also mentioned that you forced the action to quit, but had to restart your computer anyway. Make sure you are stopping or quitting the QRecallHelper process. Quitting the QRecall application (from the Dock or any other way) will have no effect on any archive commands or actions that are running. Each archive operation runs in a separate process (this is how actions run in the background, or even when you are logged out) and will continue to run after the QRecall application quits. If you can't get an action to stop using the stop button, launch Activity Monitor, locate the QRecallHelper process, and quit that.

From what you've sent me, I still suspect that your external drive is disconnecting spontaneously while QRecall is writing to it and this is the root of your problems.

Ralph Strauch wrote: I'm attaching a file containing the log starting at the point that I mounted the backup drive.

The log file didn't get attached. You can just mail it to james@qrecall.com and I'll take a look at it.

I restarted the backup at 2008-03-13 01:43:51.499 and it failed due to an invalid header length. I started a repair at 2008-03-13 01:48:19.045 and went to bed. When I got up the next morning I found that the repair had failed almost immediately at 2008-03-13 01:48:20.562 due to "Found 24 bytes of invalid data at file position 2424."

That, by itself, will not cause the repair to fail. It's just an informational message telling you what QRecall found wrong with the archive. The repair is specifically designed to discard errors and corrupted data and reconstruct the archive from all of the data that is still OK. What you want to look for is either "Repair complete" or "Repair failed".
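If you'd rather not scan the log by eye, the check can be sketched in a few lines of Python. This is only an illustration: it looks for the two phrases quoted above and nothing else, the function name is my own, and the sample lines contain only messages already quoted in this thread (no assumptions are made about the rest of the log format).

```python
from typing import Iterable, Optional


def repair_outcome(lines: Iterable[str]) -> Optional[str]:
    """Return the last repair result found in the lines, or None if there is none."""
    outcome = None
    for line in lines:
        if "Repair complete" in line:
            outcome = "complete"
        elif "Repair failed" in line:
            outcome = "failed"
    return outcome


sample = [
    "Found 24 bytes of invalid data at file position 2424.",  # informational only
    "Repair failed",
]
print(repair_outcome(sample))
```

To run it against a real log, pass an open file object for ~/Library/Logs/QRecall/QRecall.log instead of the sample list.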

It also looks from the log as if the backup disk was dismounted shortly thereafter. It was not mounted when I checked it in the morning.

If the volume unmounted, then that's a good reason why the Repair would fail.

To help you interpret the log, MBPHD is the source drive on my MacBook Pro being backed up. "spare" is the backup volume, and the backup disk also contains volumes "leopard" and "Mac OS X Install DVD." The unmounted backup disk is the same condition I described on March 8. I assumed then that it was a problem with my drive, but now seeing it occur again immediately following a Qrecall failure I wonder if that is the case. Can you tell from the logs? Are some of these problems with my system, or is it all due to QRecall?

My money is still on a problem with that drive or your FireWire interface. QRecall will never cause a drive to dismount. However, a drive that spontaneously dismounts will certainly wreak havoc with any application that is writing to it, including QRecall. The error you encountered before was due to a write failure on that drive. Since the same thing happened again, I assume that QRecall encountered another write failure while trying to write to that drive.

In any case, I think QRecall needs a softer failure mode, allow the program to exit properly without requiring a force quit and sometimes even a reboot.

It does. If a call to the operating system fails, QRecall will exit gracefully. However, if a call to the operating system hangs there's nothing QRecall can do about it; it isn't in control of the process anymore. I might be able to tell more when I see the log file.

The log file also includes a lot of entries for time periods when QRecall isn't active. Is this normal?

Yes. First, QRecall is always active. The scheduler and monitor processes run continuously. I've left a lot of debugging messages in the log to help diagnose any potential problems. Diagnostic messages that provide little use over time eventually get turned off.

Andre wrote: Is there any solution, so the monitors stay in sleep modus

Yes, but not until the next release.

A few users were complaining that their system was going to sleep during a capture. Code was added that tells the system it is busy and to stay awake while a command is running. This has the unfortunate side effect of waking up the display as well.

This happens to me too and I find it very annoying. An expert preference was added that will disable the notifications that tell the system to stay awake. Expect that version within the next couple of weeks.

Ralph Strauch wrote: 2008-03-06 01:08:22.152 -0800 Failure Could not capture Localizable.strings
2008-03-06 01:08:22.405 -0800 Details archive I/O error
2008-03-06 01:08:22.406 -0800 Details Cause: <IO> cannot read envelope data { PkgNumber=2128924, Class=FileSource@0x133a70(/Volumes/spare/bu-archive.quanta/repository.data), API=FSReadFork, Pos=45118276456, OSErr=-5000, Length=8828, File=/Volumes/spare/bu-archive.quanta/repository.data }

As I suspected, the problem was not with the Localizable.strings file. The problem was that QRecall lost access to the archive (the root problem was "archive I/O error").

What's odd is the error code. Error code -5000 is normally an access error to a file on a network volume, not a USB device. But then again, you said that you're using some special USB drivers? That might be a factor.
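For anyone who wants to pick these records apart programmatically, here is a minimal Python sketch that splits the "{ Key=Value, ... }" part of the Cause line quoted above into a dictionary. The parsing rule (fields separated by a comma and a space) is an assumption based on this one record, not a documented log format, and parse_cause is a name I made up.

```python
import re

# The Cause text exactly as quoted in the log excerpt above.
cause = (
    "<IO> cannot read envelope data { PkgNumber=2128924, "
    "Class=FileSource@0x133a70(/Volumes/spare/bu-archive.quanta/repository.data), "
    "API=FSReadFork, Pos=45118276456, OSErr=-5000, Length=8828, "
    "File=/Volumes/spare/bu-archive.quanta/repository.data }"
)


def parse_cause(text: str) -> dict:
    """Split the '{ Key=Value, ... }' part of a Cause line into a dictionary."""
    match = re.search(r"\{(.*)\}", text)
    if match is None:
        return {}
    fields = {}
    for pair in match.group(1).split(", "):
        key, _, value = pair.strip().partition("=")
        fields[key] = value
    return fields


fields = parse_cause(cause)
print(fields["API"], fields["OSErr"], fields["File"])
```

The File field shows that the failing read was on the archive itself (repository.data), not on the Localizable.strings source file.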

Ralph Strauch wrote: Wednesday evening I initiated a backup to an external WD hard drive. Qrecall ran for a couple of hours and then hung. I unsuccessfully attempted to rerun it and eventually realized that my USB port had failed and the external HD was no longer mounted, so that was the cause of the backup failure. (The USB failure must have been a driver failure, because after a restart the port is working again.)

This will certainly cause the archive to be invalid afterwards. QRecall was unable to finalize the archive. Any subsequent attempt to use it will require that you first repair it so that QRecall knows that it's in a valid state before it begins.

The archive was corrupted (header length invalid) and would not reindex. I didn't have enough free space to repair it, so I trashed it, started a new backup, and went to bed.

You could have repaired the archive in situ. The default repair options are conservative so the "Copy recovered data to new archive" is on by default. But for most situations this isn't necessary. Unchecking that option will allow you to repair the archive without copying it first. I have a to-do item to leave that option unchecked by default in the next release.

In the morning I found that it had hung at 1am with a "Cannot capture localizable strings" error referring to a Japanese lprog file in a printer driver. I excluded that file from capture (though it had been previously captured without error) and tried again.

Failing to read or capture a source file should not (by itself) cause the capture to hang or the archive to be corrupted. More than likely something else (like what happened earlier) occurred while the localizable strings file was being captured. I doubt there's anything wrong with that file. Your log files (which you are welcome to send to me) should explain the root cause.

Again, I found that my archive was corrupted with a "header length invalid" error. This time I repaired the archive and recaptured, and everything seemed to work -- except for the layer information in my current archive. My current archive is 87gb, and shows that it contains two layers. Layer 1, which I think is the repaired version of my 44gb corrupted capture, shows and unknown time of capture and 0gb in size, while layer 2 shows the time of my latest capture and 52gb size, the same size shown for the capture in the log.

This would be consistent with an incomplete layer that was recovered during the repair (although I'm surprised the layer doesn't say "-damaged-" or "-incomplete-"). The size of the first layer is impossible to determine because it never finished.

I tried reindexing the archive to see if this would correct the layer 1 info, and it doesn't.

Reindexing won't help, because that information doesn't exist. The first capture never finished, so the summary statistics (capture date, number/size of captured items, etc.) were never generated or recorded.

Since the layer is incomplete, I would suggest that you merge it with the second (complete) layer. Assuming that you don't have any files in the first layer that you need to preserve, merging it will discard the incomplete layer information.

So I think I've probably got a good archive now, and the layer info just got lost in the repair process. I'm a bit concerned, though, because the 44gb first capture and 52gb second capture together only give me an 87gb archive. Am I missing something here?

Unlike a regular file system, a QRecall archive is a database of file information and data. Sizes get confusing, because multiple items can all share the same data. So capturing 30GB, 20GB, then 10GB won't produce a 60GB archive. The archive could be much smaller than that. In the extreme case where all of the files contain zeros, the archive could be less than a megabyte.

QRecall is conservative. Since the first layer wasn't captured completely, there's probably lots of missing file and directory information. When the second layer was captured, all of those missing items would have been captured again. But QRecall first captures the data of a file and then records the file information, which means that a significant part of the second 52GB capture was already in the archive and would have been considered duplicate data.

If you're interested in the details, increase the detail level in the log viewer and expand the "Captured X items, X.XX GB (XX% duplicate)" line. Inside this record you'll find the amount of data captured (the total data read in new items or items that QRecall suspects have changed), written (the amount of new data written to the archive), and duplicate (captured data that turned out to already be in the archive). Adding up the written values should approximate the total size of your archive, assuming you haven't compacted it.
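The relationship between captured, written, and duplicate can be illustrated with a toy content-addressed store in Python. This is a sketch of the general de-duplication idea only: the class name, the 4-byte block size, and the use of SHA-256 are assumptions for the example and say nothing about how QRecall actually stores or compares data.

```python
import hashlib


class TinyArchive:
    """Toy content-addressed store: identical blocks are stored only once."""

    def __init__(self) -> None:
        self.blocks = {}      # digest -> block data
        self.captured = 0     # bytes read from source items
        self.written = 0      # bytes of new data added to the archive
        self.duplicate = 0    # bytes that were already in the archive

    def capture(self, data: bytes, block_size: int = 4) -> None:
        """Split data into blocks and store only the blocks not seen before."""
        for start in range(0, len(data), block_size):
            block = data[start:start + block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.captured += len(block)
            if digest in self.blocks:
                self.duplicate += len(block)
            else:
                self.blocks[digest] = block
                self.written += len(block)


archive = TinyArchive()
archive.capture(b"AAAABBBBCCCC")   # first capture: three new blocks
archive.capture(b"AAAABBBBDDDD")   # second capture: only DDDD is new
print(archive.captured, archive.written, archive.duplicate)
```

In this toy, captured always equals written plus duplicate, and only the written total contributes to the size of the store. That is why two captures of 44GB and 52GB can add up to an archive well under 96GB.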

David Stevens wrote: If I understand correctly, a single key would allow me to back up all 3 machines to their own individual archives.

You understand correctly.

What is the advantage of having 3 keys and a single "accessible to all machines" archive?

At this time, the primary advantage is space efficiency. For example, most of the operating system in the G4 iMac and the Mini are the same. QRecall would only store one copy of the common parts of the OS in the archive. Similarly, if you duplicate large iTunes or iPhoto libraries by copying them to each computer, you could have gigabytes of duplicate data.

Disadvantage: only one computer can capture to the archive at a time.

So it primarily depends on how tight your archive storage space is. I'd suggest starting with one key and three archives. See how much space the archives use. You can then decide if you're going to run out of backup storage space. You can always buy one or two additional keys later and switch to using shared archives.

Judith Blair wrote: My question/problem: the hourly capture of my internal drive takes about an hour, even after the first capture.

QRecall is reexamining every item on the entire volume to determine what has, and has not, changed. There are literally hundreds of thousands of files in the operating system, and every one has to be checked.

I suggest creating two captures: One that runs hourly (or semi-hourly) that captures only your Documents or home folder. There's little reason to recapture your entire operating system every hour. Have the second capture run once a day and capture your entire boot volume.

Time Machine avoids this problem by using something called "File System Events" which keeps track of which folders on a volume have recently changed.

A future version of QRecall will use this same facility to quickly determine what has, and has not, changed on the volume making large recapture actions substantially faster. (Sorry Tiger users, the FSEvents service was only made available to third-party developers in Leopard. QRecall will still have to do it "the hard way" in Tiger.)
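For anyone curious what "the hard way" amounts to, here is a minimal Python sketch of change detection by full scan: record the size and modification time of every file, then compare against the previous pass. Which attributes QRecall actually compares is not stated here, so treat size and modification time as an assumption; the function names are my own.

```python
import os
import tempfile


def snapshot(root: str) -> dict:
    """Record (size, modification time) for every file under root."""
    state = {}
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            info = os.stat(path)
            state[path] = (info.st_size, info.st_mtime_ns)
    return state


def changed_items(before: dict, after: dict) -> list:
    """Paths that are new or whose size/mtime differ from the earlier snapshot."""
    return sorted(p for p, meta in after.items() if before.get(p) != meta)


# Demonstration on a throwaway folder with two files, one of which is edited.
with tempfile.TemporaryDirectory() as root:
    for name in ("a.txt", "b.txt"):
        with open(os.path.join(root, name), "w") as handle:
            handle.write("original")
    before = snapshot(root)
    with open(os.path.join(root, "b.txt"), "w") as handle:
        handle.write("edited, and longer")
    after = snapshot(root)
    changed = [os.path.basename(p) for p in changed_items(before, after)]
    print(changed)
```

Note that snapshot has to visit every file even when only one has changed. A change-notification service lets the scanner skip the folders that were never touched.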

I also wanted to make one comment about performance. I appreciate what you're trying to accomplish by storing the archive on an encrypted disk image, but speed is not one of the benefits.

Besides writing to a network volume, every access has to go through the disk image driver which is notoriously slow. Add to that the encryption/decryption overhead and you have a recipe for truly mediocre performance.

Rodd,

Thanks for sending me your logs and additional information privately. The problem is indeed related to a single folder with over 100,000 files.

The solution was remarkably simple: I just deleted the code that throws an error when there are more than 63,000 items in a package list. It turns out this was old code, left over from the days when package lists couldn't exceed that number. But package lists were reworked ages ago and can now grow to more than 30,000,000 records.

If you have a folder with more than 30,000,000 files, then we might be in trouble.

Please download QRecall 1.0.0b4 and let me know how it does. To replace your current version with the new one:

- Open your Applications folder
- Drag your existing QRecall to the Trash
- Open the disk image
- Copy the new version from the disk image to your Applications folder
- Launch the new version

jch wrote: What if I used external hard drives instead and just rotated those off-site??

That would work just fine.

Does QRecall support multiple backup sets (one per external drive)?

QRecall doesn't care how many archives you have or how you organize them.

If you did this I would suggest creating two sets of capture and maintenance actions, one for each archive (let's call them A and B). In each action, add the "Ignore If No Archive" condition. When the drive with A is plugged in, all of the A actions will run and the B actions will be ignored. When you remove A and replace it with B the opposite will happen.
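The effect of that condition can be sketched in Python: each action names its archive, and an action only runs when its archive is actually reachable. This is a rough model of the behavior described above, not QRecall's scheduler; the action names, the archive names, and the temporary folder standing in for a mounted drive are all hypothetical.

```python
import os
import tempfile


def runnable_actions(actions: dict) -> list:
    """Return the names of actions whose archive path currently exists."""
    return sorted(name for name, archive in actions.items() if os.path.exists(archive))


with tempfile.TemporaryDirectory() as mounted:
    # Stand-ins for two rotated drives: only "A" is present right now.
    actions = {
        "Capture A": os.path.join(mounted, "A.quanta"),
        "Capture B": os.path.join(mounted, "missing-drive", "B.quanta"),
    }
    os.mkdir(actions["Capture A"])
    ready = runnable_actions(actions)
    print(ready)
```

Swapping the drives simply flips which set of actions finds its archive, so neither schedule needs to be edited.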

Remember not to leave a hard drive sitting unused for too long. The surface of a hard disk is (literally) fluid and is not designed for extended storage without being turned on. As long as each drive gets spun up and used at least once a month, you should be OK.

The "missing file icon" messages aren't serious. They just mean that when asked for the icon of a file, the operating system returned an error instead of an icon. This typically means inconsistencies in the launch services database.

The $64,000 question (no pun intended) is, "do you have a directory with more than 63,000 files in it?" I'm honestly not sure if I've ever tested that case.

The "package list exceed 63K elements" just means that a package list exceeded 63K elements. Package lists are lists of packages; they could represent a list of files or directories, but are most often used to list the data blocks that belong to a file. When they get too big they transform into a tree of package lists, so they should never exceed the 63K limit.
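To illustrate the general idea of a list that turns into a tree once it gets too big, here is a small Python sketch. It is not QRecall's actual data structure or on-disk format; build_list_tree and the tiny limit of 3 are made up so the result is easy to read.

```python
def build_list_tree(items: list, limit: int) -> list:
    """Group items into nested lists so that no single list holds more than limit entries."""
    if limit < 2:
        raise ValueError("limit must be at least 2")
    level = list(items)
    # Keep wrapping the current level in chunks until the top list fits the limit.
    while len(level) > limit:
        level = [level[i:i + limit] for i in range(0, len(level), limit)]
    return level


tree = build_list_tree(list(range(10)), limit=3)
print(tree)
```

With a scheme like this, no individual list ever grows past the limit regardless of how many items are stored, which is why hitting the 63K error suggests something unexpected.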

There are a couple of potential problems here. It could be that some geometry of your hard drive has found a flaw in QRecall's package list management, be it a directory with 63,000 files or who knows what. However, it could also be a problem writing to a disk image. It would be easy to test that by repeating your test writing directly to a volume instead of a disk image on a volume.

I'm going to assume that the 63K error occurred during the capture, which means that the archive is most likely damaged. If you want to continue to use this archive it will have to be repaired.

It would also be helpful if you could send me your complete log file, via private mail if you prefer, following the repair. I'm interested in exactly what the complete order of events was and what problems the repair finds.