QRecallDownloadIdentity KeysForumsSupport
  [Search] Search   [Recent Topics] Recent Topics   [Hottest Topics] Hottest Topics   [Groups] Back to home page 
interrupted backups and corrupted archives  XML
Forum Index » Problems and Bugs
Author Message
Ralph Strauch



Joined: 24-Oct-07 22:17
Messages: 194
Offline

I've had a problem this week with interrupted backups and corrupted archives. Part of the problem was with my computer and qrecall may have been fine, but I'd like the describe what happened and see what you think.

Wednesday evening I initiated a backup to an external WD hard drive. Qrecall ran for a couple of hours and then hung. I unsuccessfully attempted to rerun it and eventually realized that my USB port had failed and the external HD was no longer mounted, so that was the cause of the backup failure. (The USB failure must have been a driver failure, because after a restart the port is working again.)

The archive was corrupted (header length invalid) and would not reindex. I didn't have enough free space to repair it, so I trashed it, started a new backup, and went to bed. In the morning I found that it had hung at 1am with a "Cannot capture localizable strings" error referring to a Japanese lprog file in a printer driver. I excluded that file from capture (though it had been previously captured without error) and tried again.

Again, I found that my archive was corrupted with a "header length invalid" error. This time I repaired the archive and recaptured, and everything seemed to work -- except for the layer information in my current archive. My current archive is 87gb, and shows that it contains two layers. Layer 1, which I think is the repaired version of my 44gb corrupted capture, shows and unknown time of capture and 0gb in size, while layer 2 shows the time of my latest capture and 52gb size, the same size shown for the capture in the log. I tried reindexing the archive to see if this would correct the layer 1 info, and it doesn't.

So I think I've probably got a good archive now, and the layer info just got lost in the repair process. I'm a bit concerned, though, because the 44gb first capture and 52gb second capture together only give me an 87gb archive. Am I missing something here?

Ralph
James Bucanek



Joined: 14-Feb-07 10:05
Messages: 1473
Offline

Ralph Strauch wrote:Wednesday evening I initiated a backup to an external WD hard drive. Qrecall ran for a couple of hours and then hung. I unsuccessfully attempted to rerun it and eventually realized that my USB port had failed and the external HD was no longer mounted, so that was the cause of the backup failure. (The USB failure must have been a driver failure, because after a restart the port is working again.)
This will certainly cause the archive to be invalid afterwards. QRecall was unable to finalize the archive. Any subsequent attempt to use it will require that you first repair it so that QRecall knows that its in a valid state before it begins.

The archive was corrupted (header length invalid) and would not reindex. I didn't have enough free space to repair it, so I trashed it, started a new backup, and went to bed.
You could have repaired the archive in situ. The default repair options are conservative so the "Copy recovered data to new archive" is on by default. But for most situations this isn't necessary. Unchecking that option will allow you to repair the archive without copying it first. I have a to-do item to leave that option unchecked by default in the next release.

In the morning I found that it had hung at 1am with a "Cannot capture localizable strings" error referring to a Japanese lprog file in a printer driver. I excluded that file from capture (though it had been previously captured without error) and tried again.
Failing to read or capture a source file should not (by itself) cause the capture to hang or the archive to be corrupted. More than likely something else (like what happened earlier) occurred while the localizable strings file was being captured. I doubt there's anything wrong with that file. Your log files ? which you are welcome to send to me ? should explain the root cause.

Again, I found that my archive was corrupted with a "header length invalid" error. This time I repaired the archive and recaptured, and everything seemed to work -- except for the layer information in my current archive. My current archive is 87gb, and shows that it contains two layers. Layer 1, which I think is the repaired version of my 44gb corrupted capture, shows and unknown time of capture and 0gb in size, while layer 2 shows the time of my latest capture and 52gb size, the same size shown for the capture in the log.
This would be consistent with an incomplete layer that was recovered during the repair (although I'm surprised the layer doesn't say "-damanged-" or "-incomplete-"). The size of the first layer is impossible to determine because it never finished.

I tried reindexing the archive to see if this would correct the layer 1 info, and it doesn't.
Reindexing won't help, because that information doesn't exist. The first capture never finished, so the summary statistics (capture date, number/size of captured items, etc.) was never generated or recorded.

Since the layer is incomplete, I would suggest that you merge it with the second (complete) layer. Assuming that you don't have any files in the first layer that you need to preserve, merging it with discard the incomplete layer information.

So I think I've probably got a good archive now, and the layer info just got lost in the repair process. I'm a bit concerned, though, because the 44gb first capture and 52gb second capture together only give me an 87gb archive. Am I missing something here?
Unlike a regular file system, a QRecall archive is a database of file information and data. Sizes get confusing, because multiple items can all share the same data. So capturing 30GB, 20GB, then 10GB won't produce a 60GB archive. The archive could be much smaller than that. In the extreme case where all of the files contains zeros, the archive could be less than a megabyte.

QRecall is conservative. Since the first layer wasn't captured completely, there's probably lots of missing file and directory information. When the second layer was captured, all of those missing items would have been captured again. But QRecall first captures the data of a file and then records the file information, which means that a significant part of second 52GB capture was already in the archive and would have been considered duplicate data.

If you're interested the details, increase the detail level in the log viewer and expand the "Captured X items, X.XX GB (XX% duplicate)" line. Inside this record you'll find the amount of data captured (the total data read in new items or items that QRecall suspect have changed), written (the amount of new data written to the archive), and duplicate (captured data that turned out to already be in the archive). Adding up the written values should approximate the total size of your archive, assuming you haven't compacted it.

- QRecall Development -
[Email]
Ralph Strauch



Joined: 24-Oct-07 22:17
Messages: 194
Offline

James,

Thanks for the prompt reply and all the information. I merged the two layers, as you suggested, and the archive looks fine now.

Here are the logs around the time that QRecall was unable to process the lproj file and hung up, in case the information is of value to you. The failure occurred at 2008-03-06 01:08:22.

2008-03-05 22:00:00.362 -0800 Minutia Waiting for permission to open archive [2.733106.569.8]
2008-03-05 22:00:06.079 -0800 Minutia Acquired permission to open archive [2.733106.569.9]
2008-03-05 22:00:06.081 -0800 Capture MBPHD [2.733106.569.10.1]
2008-03-05 22:00:06.093 -0800 Details MBPHD [2.733106.569.10.1.1]
2008-03-06 01:08:22.284 -0800 #debug# Reopened log file [2.733106.569.12]
2008-03-06 01:08:22.152 -0800 Failure Could not capture Localizable.strings [2.733106.569.11]
2008-03-06 01:08:22.391 -0800 Details File: MBPHD:Libraryrinters:EPSON:InkjetPrinterrintingModuleXG900_Core.plugin:Contents:Resources:Japanese.lproj:Localizable.strings [2.733106.569.11.1]
2008-03-06 01:08:22.404 -0800 Caution Problems processing items [2.733106.569.13]
2008-03-06 01:08:22.405 -0800 Details archive I/O error [2.733106.569.13.1.1]
2008-03-06 01:08:22.405 -0800 Details Data exception [2.733106.569.13.1.1.1]
2008-03-06 01:08:22.405 -0800 Subrosa .Explain: datawrio [2.733106.569.13.1.1.2]
2008-03-06 01:08:22.406 -0800 Details Cause: <IO> cannot read envelope data { PkgNumber=2128924, Class=FileSource@0x133a70(/Volumes/spare/bu-archive.quanta/repository.data), API=FSReadFork, Pos=45118276456, OSErr=-5000, Length=8828, File=/Volumes/spare/bu-archive.quanta/repository.data } [2.733106.569.13.1.1.3]
2008-03-06 01:08:22.406 -0800 Details Path: MBPHD:Libraryrinters:EPSON:InkjetPrinterrintingModuleXG900_Core.plugin:Contents:Resources:Japanese.lproj [2.733106.569.13.1.2]
2008-03-06 01:08:22.406 -0800 Failure Could not process folder Japanese.lproj [2.733106.569.13.1]
2008-03-06 08:10:34.907 -0800 #debug# Reopened log file [1.733106.245.25]
2008-03-06 08:10:34.906 -0800 Minutia Saved Capture MBPHD to bu-archive.quanta action [1.733106.245.24]
2008-03-06 08:10:34.929 -0800 Details file: /Users/ralph/Library/Preferences/QRecall/Actions/bu-archive.quanta-capture (2).qraction [1.733106.245.24.1]


Ralph

This message was edited 1 time. Last update was at 08-Mar-08 16:49

James Bucanek



Joined: 14-Feb-07 10:05
Messages: 1473
Offline

Ralph Strauch wrote:2008-03-06 01:08:22.152 -0800 Failure Could not capture Localizable.strings
2008-03-06 01:08:22.405 -0800 Details archive I/O error
2008-03-06 01:08:22.406 -0800 Details Cause: <IO> cannot read envelope data { PkgNumber=2128924, Class=FileSource@0x133a70(/Volumes/spare/bu-archive.quanta/repository.data), API=FSReadFork, Pos=45118276456, OSErr=-5000, Length=8828, File=/Volumes/spare/bu-archive.quanta/repository.data }


As I suspected, the problem was not with the Localizable.strings file. The problem was that QRecall lost access to the archive (the root problem was "archive I/O error").

What's odd is the error code. Error code -5000 is normally an access error to a file on a network volume, not a USB device. But then again, you said that you're using some special USB drivers? That might be a factor.

- QRecall Development -
[Email]
Ralph Strauch



Joined: 24-Oct-07 22:17
Messages: 194
Offline

James Bucanek wrote:

What's odd is the error code. Error code -5000 is normally an access error to a file on a network volume, not a USB device. But then again, you said that you're using some special USB drivers? That might be a factor.


I think I was using a Firewire connection for that backup, because of my earlier problems with USB. I'm not using any special USB drivers, though. My USB port had gone dead during an earlier backup but then resuscitated itself after a system restart, so I thought that might have been a problem with the system drivers.

Ralph

This message was edited 1 time. Last update was at 08-Mar-08 19:34

Ralph Strauch



Joined: 24-Oct-07 22:17
Messages: 194
Offline

I'm continuing to have problems with QRecall hanging up and failing to exit properly, leaving me with a bad archive that needs repair.

I began a backup at 2008-03-12 23:54:54.687 and left it. I came back about an hour later to find that it had hung, with the activity monitor showing 99 items processed and the log showing a failure to make a filter item for an excluded folder that I had deleted from the source drive. I deleted that folder from the action exclusion list and tried to restart the backup at 2008-03-13 01:28:10. QRecall was hung so I had to force quit it. I think it wouldn't force quit properly and I had to reboot the computer to get it to run again, but I'm a bit hazy on those details right now.

I restarted the backup at 2008-03-13 01:43:51.499 and it failed due to an invalid header length. I started a repair at 2008-03-13 01:48:19.045 and went to bed. When I got up the next morning I found that the repair had failed almost immediately at 2008-03-13 01:48:20.562 due to "Found 24 bytes of invalid data at file position 2424." It also looks "from the log as if the backup disk was dismounted shortly thereafter. It was not mounted when I checked it in the morning.

To help you interpret the log, MBPHD is the source drive on my MacBook Pro being backed up. "spare" is the backup volume, and the backup disk also contains volumes "leopard" and "Mac OS X Install DVD." The unmounted backup disk is the same condition I described on March 8. I assumed then that it was a problem with my drive, but now seeing it occur again immediately following a Qrecall failure I wonder if that is the case. Can you tell from the logs? Are some of these problems with my system, or is it all due to QRecall?

In any case, I think QRecall needs a softer failure mode, allow the program to exit properly without requiring a force quit and sometimes even a reboot.

I'm attaching a file containing the log starting at the point that I mounted the backup drive. Entries for the events described above have been marked with astericks *** to make them easier to locate. The log file also includes a lot of entries for time periods when QRecall isn't active. Is this normal?

Ralph

This message was edited 1 time. Last update was at 13-Mar-08 12:15

James Bucanek



Joined: 14-Feb-07 10:05
Messages: 1473
Offline

Ralph Strauch wrote:I'm attaching a file containing the log starting at the point that I mounted the backup drive.
The log file didn't get attached. You can just mail it to james@qrecall.com and I'll take a look at it.

I restarted the backup at 2008-03-13 01:43:51.499 and it failed due to an invalid header length. I started a repair at 2008-03-13 01:48:19.045 and went to bed. When I got up the next morning I found that the repair had failed almost immediately at 2008-03-13 01:48:20.562 due to "Found 24 bytes of invalid data at file position 2424."
That, by itself, will not cause the repair to fail. It's just an informational message telling you what QRecall found wrong with the archive. The repair is specifically designed to discard errors and corrupted data and reconstruct the archive from all of the data that still OK. What you want to look for is either "Repair complete" or "Repair failed".

It also looks "from the log as if the backup disk was dismounted shortly thereafter. It was not mounted when I checked it in the morning.
If the volume unmounted, then that's a good reason why the Repair would fail.

To help you interpret the log, MBPHD is the source drive on my MacBook Pro being backed up. "spare" is the backup volume, and the backup disk also contains volumes "leopard" and "Mac OS X Install DVD." The unmounted backup disk is the same condition I described on March 8. I assumed then that it was a problem with my drive, but now seeing it occur again immediately following a Qrecall failure I wonder if that is the case. Can you tell from the logs? Are some of these problems with my system, or is it all due to QRecall?
My money is still on a problem with that drive or your FireWire interface. QRecall will never cause a drive to dismount. However, a drive that spontaneously dismounts will certain wreak havoc with any application that is writing to it, including QRecall. The error you encountered before was due to a write failure on that drive. Since the same thing happened again, I assume that QRecall encountered another write failure while trying to write to that drive.

In any case, I think QRecall needs a softer failure mode, allow the program to exit properly without requiring a force quit and sometimes even a reboot.
It does. If a call to the operating system fails, QRecall will exit gracefully. However, if a call to the operating system hangs there's nothing QRecall can do about it; it isn't in control of the process anymore. I might be able to tell more when I see the log file.

The log file also includes a lot of entries for time periods when QRecall isn't active. Is this normal?
Yes. First, QRecall is always active. The scheduler and monitor processes run continuously. I've left a lot of debugging messages in the log to help diagnose any potential problems. Diagnostic messages that provide little use over time eventually get turned off.

- QRecall Development -
[Email]
James Bucanek



Joined: 14-Feb-07 10:05
Messages: 1473
Offline

Ralph,

Thanks for sending me your log files.

It hard to tell what's going on, but whatever is happening appears to be related to your external hard drive unmounting in the middle of a QRecall action.

For example, the log you sent shows the Repair starting at 2008-03-13 01:48:19. Three minutes later (01:51:48), the volume "spare" (the one containing the archive) was unmounted. The operating system will not normally allow a volume to be unmounted as long as there are file open on that volume. I can only conclude that the hard drive or its interface was physically disconnected or powered down. This will most certainly cause any capture or repair action to fail and leave the archive in an incomplete state.

You also mentioned that you forced the action to quit, but had to restart your computer anyway. Make sure you are stopping or quitting the QRecallHelper process. Quitting the QRecall application (from the Dock or any other way) will have no effect on any archive commands or actions that are running. Each archive operation runs in a separate process (this is how actions run in the background, or even when you are logged out) and will continue to run after the QRecall application quits. If you can't get an action to stop using the stop button, launch Activity Monitor, locate the QRecallHelper process, and quit that.

From what you've sent me, I still suspect that your external drive is disconnecting spontaneously while QRecall is writing to it and this is the root of your problems.

- QRecall Development -
[Email]
Ralph Strauch



Joined: 24-Oct-07 22:17
Messages: 194
Offline

I continue to have problems with interrupted backups and corrupted archives, this time with a remote backup of my documents folder to my wife's iDisk. The backup appearer to proceed normally but hung up at the end, and the archive somehow disappeared. The sequence of events is outlined below, and the log file covering those events is attached. Key events described below are marked in the log file with three asterisks (***).

At 2008-03-21 17:53 I mounted the iDisk (/Volumes/mstrauch) with 5.8gb of free space. At 18:08 I created "rdocs archive.quanta," and at 18:09 began to capture my 3.4gb documents folder to that archive. The capture appeared to be going well, with the blue progress bar filling in as the capture progressed. I went away from the computer and came back after a couple of hours to find the blue bar had progressed to the end and the activity monitor said it was capturing a file that was the 2nd to last item in the documents folder contents list. It stayed like that until 22:09 when I decided that it was completely hung and tried to cancel the capture. Qrecall acknowledged my cancel requests but remained hung with the same display in the activity monitor. I attempted to force quit Qrecall, but it failed to quit and remained in the dock. I shut the lid of my laptop and went to bed.

This morning when I got up and opened the computer (07:44), Qrecall shut itself down. My idisk showed a empty archive (0kb). I tried to open the archive at 8:01 and was told it needed to be reindexed. When I reindexed it, it was empty, and now occupied 16kb, which I presume is from the index. There are no log entries between start of the capture last night at 18:09 and my attempt to cancel the hang-up at 22:09, to indicate what happened during that period. I suppose the hangup could have been caused by a network failure or the idisk unmounting itself, but there's no indication of that in the log, and the iDisk was mounted at 22:09. The backup appeared to be working properly earlier on and my available disk space is now only 2.4gb, which is about what it should be after the backup. But I can't tell what happened to the archive or the disk space. I thought about deleting the archive to see if that would give me back my disk space, but decided to see first if that file would be useful to you.
 Filename strauch log 0321.txt [Disk] Download
 Description log file from remoter backup failure
 Filesize 15 Kbytes
 Downloaded:  3218 time(s)

This message was edited 2 times. Last update was at 22-Mar-08 13:31

Ralph Strauch



Joined: 24-Oct-07 22:17
Messages: 194
Offline

Here's log for my previous posting. I attached it to that posting and it showed up as attached, but the post itself showed up with <br>'s in place of the returns, I guess because I wrote it in a text processor and pasted it in. I then edited the post here to put in unix returns and in the process, lost the attachment. So here it is again
 Filename strauch log 0321.txt [Disk] Download
 Description log file for failed remote backup
 Filesize 15 Kbytes
 Downloaded:  3340 time(s)

This message was edited 2 times. Last update was at 22-Mar-08 16:23

James Bucanek



Joined: 14-Feb-07 10:05
Messages: 1473
Offline

Ralph,

I'm profoundly sorry to be the bearer of bad news (again), but using QRecall with Apple's iDisk service does not work well.

There was a long thread on this some time back (Online Storage). To avoid the pain of reading the whole thing, I'll summarize:

Apple's iDisk, Amazon's Simple Storage System, and similar services are all based on the WebDAV protocol. This is file transfer protocol originally designed to make it easier to maintain web sites. Over time, it's been co-opted into a general purpose remote file storage service.

The problem is that WebDAV is not a remote file protocol and it cannot make small modifications in a large file remotely — which is exactly what QRecall does all the time. To simulate the ability to arbitrarily modify files, the OS silently downloads the entire remote file (the QRecall archive in this case) onto your local drive, allows you to update it locally, then copies the modified file back up to the server when the file is closed. For really small file, this is fine. But for making small changes to large files it is horrifically inefficient.

What you observed was QRecall capturing all of the documents to the local copy of the archive. Since this was a local copy, it also happens very quickly making you think "wow, this is really fast!" When the capture was complete, the archive file was closed and the actual work of copying the data to the iDisk began. What you perceived as QRecall "hanging" was actually the WebDAV service copying the finished archive to the iDisk server. This is consistent with the observed behavior and the I/O errors in your log file.

For small incremental backups of important documents, I highly recommend a small USB thumb drive or just create a small archive on the same disk. For an off-sight backup, occasionally copy the archive to your iDisk.

- QRecall Development -
[Email]
James Bucanek



Joined: 14-Feb-07 10:05
Messages: 1473
Offline

Ralph Strauch wrote:Here's log for my previous posting. I attached it to that posting and it showed up as attached, but the post itself showed up with <br>'s in place of the returns, I guess because I wrote it in a text processor and pasted it in.
No, this is a bug in the JForum software that runs this forum. It can usually be avoided unchecking the "Disable HTML in the message" option. I don't know why.

Hopefully this will get fixed when we upgrade to the next version of JForum.

- QRecall Development -
[Email]
Ralph Strauch



Joined: 24-Oct-07 22:17
Messages: 194
Offline

James Bucanek wrote:
I'm profoundly sorry to be the bearer of bad news (again), but using QRecall with Apple's iDisk service does not work well.

There was a long thread on this some time back (Online Storage). To avoid the pain of reading the whole thing, I'll summarize:


Thanks for the info. I'm sorry I didn't come across the other thread before I tried the remote backup. I guess I'll just keep using Apple's Backup for my idisk backup of critical files.

I have one remaining problem I wonder if you can offer any advice about. I've deleted the empty qrecall archive from the idisk, but it still shows only 2.4gb of free space rather than the 5.8gb I had before the backup. Apparently it allocated 3.4gb of space for the qrecall archive even though it didn't use it, still keeps that space as used, even though it doesn't show up as allocated to any particular folder.. Do you know anything I can do to get that space back?

Thanks,
Ralph
James Bucanek



Joined: 14-Feb-07 10:05
Messages: 1473
Offline

Ralph Strauch wrote:I have one remaining problem I wonder if you can offer any advice about. I've deleted the empty qrecall archive from the idisk, but it still shows only 2.4gb of free space rather than the 5.8gb I had before the backup. Apparently it allocated 3.4gb of space for the qrecall archive even though it didn't use it, still keeps that space as used, even though it doesn't show up as allocated to any particular folder. Do you know anything I can do to get that space back?
According to this technical support article all you should have to do is wait 24 hours:

If you recently had an unsuccessful upload to your iDisk,
wait 24 hours, then check the storage space again.
Sometimes an unsuccessful upload can result in iDisk
usage to be reported incorrectly, and waiting 24 hours
allows for any invisible files potentially generated by the
unsuccessful upload to be detected and removed.


After another day if the free space doesn't appear, I'd contact the .Mac technical support. But it should take care of itself.

- QRecall Development -
[Email]
Ralph Strauch



Joined: 24-Oct-07 22:17
Messages: 194
Offline

I had another network interruption this week, again resulting in a corrupted archive. This one happened when my wife put her iMac to sleep while my MacBook Pro was backing up over airport to an external drive mounted on the iMac.

I grant that this can be seen as a case of operator error that shouldn't happen, but inetwork interruptions of one kind or another are not so rare events that they should be discounted. Is it possible to make Qrecall more forgiving of network failures so they don't result in corrupted archives?

Ralph

This message was edited 1 time. Last update was at 03-Apr-08 17:22

 
Forum Index » Problems and Bugs
Go to:   
Powered by JForum 2.1.8 © JForum Team