QRecall

James Bucanek

I recommend running a verify to determine if the archive is OK or needs repair. I can't tell from the information here.

Also, send a diagnostic report from the MBP so I can look into the details.

Ralph Strauch

I should have mentioned that I've sent a report. I'll move the archive to the MBP (USB3) and start the verify now.

Ralph Strauch

Verify failed after 7 minutes. I've sent another report and have started a repair action.

Edit: After just a couple of minutes the repair action quit with an error message that it had lost contact with process, no entry in the log. I've restarted the repair and it looks like it's running normally.

Ralph Strauch

I've now repaired my archive and run a manual backup of my MBP and everything seems fine. The repair log shows a long string of chunks of invalid data throughout the archive, along with one instance of "Communication problems with command listener on port QRecallApplication" and three of "Distributed objects message send timed out" at the end of the log. I'll send a Report.

You've said elsewhere that the invalid data will be discarded and that the archive will contain only valid files, and I'm a bit confused about the result of that process. Will I lose files which were found to contain invalid data, or only the versions of those files which contained the invalid data?

This problem has occurred only with scheduled overnight MBP backups, so I'll move the drive back to the iMac and let the iMac scheduled overnight backups run, but just run MBP backups manually during the day for a while.

As I said in the thread on Encryption, <http://forums.qrecall.com/posts/list/567.page>, I've been mounting my backup drives on the MBP because my Airport Extreme can't read a FileVault encrypted drive. But if this problem is caused by backing up one computer to a drive mounted on another, perhaps shifting from FileVault to QRecall encryption and going back to mounting the drive on the AE would be a solution. What do you think?

James Bucanek

Ralph Strauch wrote: You've said elsewhere that the invalid data will be discarded and that the archive will contain only valid files, and I'm a bit confused about the result of that process. Will I lose files which were found to contain invalid data, or only the versions of those files which contained the invalid data?

During a repair, QRecall scans every byte of your archive's repository.data file. Normally, every byte is accounted for. If the repair can't make sense of a string of bytes, this range of the file is erased and you get a "Data problems found" message in the log.

The question, however, is whether any of this "damaged" data means anything important. If the bad data was originally a data or file record, then it will result in the loss of one or more files. That, in turn, could leave one or more folders missing a file. You need to look beyond the "Invalid data" messages in the log and look for folder and file specific messages. For example, if a bad byte was found in the middle of a data record, that would result in the loss of a file. That lost file will record a "Discarded incomplete file" message in the log, along with the details of which file that was lost.

There are no such messages in your log, so none of the "bad" data in your archive was related to anything important. This isn't entirely surprising. After compact and merge actions finish, there are lots of regions in the file that are left empty or contain unreferenced data, the loss of which won't have any consequences for your archive.

As I said in the thread on Encryption, <http://forums.qrecall.com/posts/list/567.page>, I've been mounting my backup drives on the MBP because my Airport Extreme can't read a FileVault encrypted drive. But if this problem is caused by backing up one computer to a drive mounted on another, perhaps shifting from FileVault to QRecall encryption and going back to mounting the drive on the AE would be a solution. What do you think?

Technically, there is very little difference between these two solutions. In both situations, you have an external hard drive mounted on a server (either the iMac or the Airport Basestation), running an AppleShare server, being accessed over a network by a remote client (the MacBook Pro).

The advantage of using the Airport Basestation is that the basestation runs 24/7, so you don't have to worry about when the iMac is running.

The advantage of using the iMac is that actions will be faster on the iMac (because it's directly connected to the volume) and will be more reliable (local busses almost always have lower error rates than LANs).

Ralph Strauch

I've had another "problem closing archive" failure, where it looks from the log like the entire backup was written to the archive before the failure occurred. I tried to repair the archive and that failed as well, leaving me with long lists of invalid data and error corrections.

This was a backup of my MBP run over the network with the archive drive mounted on my iMac. I moved the disk to the MBP for the repair, to take advantage of USB3 speeds. I was running b25. This is a different backup drive from the one that exhibited this problem in November.

As soon as I send the report I'll update to b26 and run another repair.

James Bucanek

Ralph,

Everything I see in the log would indicate I/O problems with the drive.

The capture starts at 12-10 10:35 and runs just fine until 10:44:21 when QRecall encounters a "Device Not Configured" error trying to write to the archive. This is a fatal I/O error that's usually associated with a device that has gone off-line. There's no way to recover from that kind of error, so the capture bails. During its exit, it tries to close the archive but encounters another I/O failure, "Operation timed out". That's when it logged the "Problem closing archive" message.

You then started a repair at 17:09. That ran swimmingly until about 21:42. That's when the error correction messages start. These are not unexpected. Since the capture was interrupted while writing to the data files, it's likely that there's a mismatch between the data that was written and the error correction codes. So what undoubtedly happened is that some of the data doesn't match some of the correction codes. This is fine, and the repair will reconstruct the codes as needed. This is confirmed, to a degree, because the invalid correction codes were for an offset very close to the offset being written when the capture failed.
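To make the data/code mismatch concrete, here's a minimal sketch of the general idea. It is not QRecall's actual format or algorithm: it uses a plain CRC-32 per block (which only detects a mismatch, it can't correct anything) as a stand-in, and the block size and function names are made up for illustration.

```python
import zlib

BLOCK = 4  # tiny block size, for illustration only (an assumption)


def codes_for(data):
    """Compute one CRC-32 per fixed-size block of data."""
    return [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]


def mismatched_blocks(data, codes):
    """Return the indexes of blocks whose stored code no longer matches."""
    fresh = codes_for(data)
    return [i for i, (new, old) in enumerate(zip(fresh, codes)) if new != old]


# Data and codes start out in agreement.
data = bytearray(b"AAAABBBBCCCC")
codes = codes_for(data)

# Simulate an interrupted write: the last block of data reaches the disk,
# but the action dies before the matching code is updated.
data[8:12] = b"DDDD"
stale = mismatched_blocks(data, codes)

# A "repair" in this toy model simply recomputes the codes from the data.
codes = codes_for(data)
after_repair = mismatched_blocks(data, codes)

if __name__ == "__main__":
    print("blocks with stale codes:", stale)
    print("after recomputing codes:", after_repair)
```

In this toy model only the block that was being written when the "capture" stopped shows a mismatch, which is the same pattern as the invalid codes clustering near the offset where the capture failed.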

Once the repair is past that bad patch, it resumes running and encounters no other issues until it's almost finished at 12-11 02:49. At this point, everything goes south. As it tries to finalize the archive it encounters a string of "No such volume", "No such directory", and "Device not configured" errors, as if the volume had been spontaneously unmounted or powered down. Needless to say, the repair wasn't successful.

Ralph Strauch

I tried a couple more times to repair my archive and both attempts failed, so I guess this archive is toast. I'm not convinced, though, that the problem is with the drive. In the two months I have been using QRecall v2.0 beta, this particular problem -- where a backup that appears to be successfully completed and written to the archive experiences a fatal error as the archive is being closed -- has occurred 7 times, involving two different archives and backup drives, run both over the network and directly mounted to the computer.

Both drives have been given a clean bill of health by Disk Utility, Tech Tool Pro, and DiskWarrior. The failure always seems to occur at the same point in the backup being performed -- after QRecall has recorded all the statistics for the backup and is closing the file -- so it seems unlikely that it could be due to a mechanical or network problem unrelated to the backup task.

Here's a list of the backups that have experienced this problem.

Action 2015-11-08 11:12:38 ------- Capture to 3rd backup.quanta
2015-11-08 11:32:24 Captured 4207 items, 1.07 GB (50% duplicate)
2015-11-08 11:38:12 Failure Problem closing archive

Action 2015-11-11 06:24:16 ------- Capture to 3rd backup.quanta
2015-11-11 07:26:51 Captured 787 items, 447.8 MB (83% duplicate)
2015-11-11 07:39:03 Failure Problem closing archive

Action 2015-11-16 03:04:33 ------- Capture to 2nd backup.quanta
2015-11-16 07:31:53 Captured 750 items, 440.8 MB (56% duplicate)
2015-11-16 07:42:48 Failure Problem closing archive

Action 2015-11-19 03:37:19 ------- Capture to 2nd backup.quanta
2015-11-19 07:53:57 Captured 1785 items, 610.6 MB (61% duplicate)
2015-11-19 08:47:46 Failure Problem closing archive

Action 2015-11-30 08:56:33 ------- Capture to 2nd backup.quanta
2015-11-30 09:08:35 Captured 1732 items, 417 MB (52% duplicate)
2015-11-30 09:11:35 Failure Failed

Action 2015-12-06 03:00:01 ------- Capture to 3rd backup.quanta
2015-12-06 03:17:56 Captured 1077 items, 838.9 MB (59% duplicate)
2015-12-06 03:27:38 Failure Failed

Action 2015-12-10 10:35:35 ------- Capture to 3rd backup.quanta
2015-12-10 10:55:26 Captured 1643 items, 922.4 MB (58% duplicate)
2015-12-10 11:07:33 Failure Problem closing archive

James Bucanek

Ralph,

I'm traveling, so I can't take a detailed look at your logs until I get back tomorrow, but just a few notes until then....

From what I've seen, it's unlikely to be the archive itself. The errors I mentioned in the previous post are things like "device not ready" and "no such directory". These are not data or I/O errors, but rather indicate a device that has disconnected.

DiskWarrior, Disk Utility, et al. are wonderful tools, but only address the volume directory structure. They say nothing about the reliability of the drive, and certainly don't speak to whether the drive has, or could, disconnect at an inappropriate time.

Finally, the "Problem closing archive" can be a bit of a red herring. Whenever there's a problem with an operation, the action will stop and then try to close the archive. If the problem was something that prevents QRecall from accessing the archive files, you'll also get a "Problem closing archive", in addition to the original problem.

More tomorrow...

James Bucanek

Ralph,

Here's a random sampling of a few capture issues I found in the logs you sent me:

2015-11-08 11:12:38.071 -0800 ------- Capture to 3rd backup.quanta
2015-11-08 11:38:12.467 -0800 Failure Problem closing archive
2015-11-08 11:38:12.468 -0800 Details ErrDescription: Operation timed out

2015-11-10 07:54:44.131 -0800 ------- Capture to 3rd backup.quanta
2015-11-10 08:46:52.575 -0800 Failure Could not capture file
2015-11-10 08:46:52.575 -0800 Details archive I/O error
2015-11-10 08:46:52.575 -0800 Details Cause: <IO> cannot read hash page(s) { ErrDescription='Operation timed out', POSIXErr=60, Position=5731762176, API=pread, Path='/Volumes/BUD3/3rd backup.quanta/hash.index', Length=8192 }

2015-11-11 03:54:28.850 -0800 ------- Capture to 3rd backup.quanta
2015-11-11 03:56:06.984 -0800 Failure Problem closing archive
2015-11-11 04:55:24.154 -0800 Details ErrDescription: Operation timed out

2015-11-11 06:24:16.879 -0800 ------- Capture to 3rd backup.quanta
2015-11-11 07:25:35.814 -0800 Failure Could not capture file
2015-11-11 07:25:35.814 -0800 Details failed to write envelope header
2015-11-11 07:25:35.814 -0800 Details ErrDescription: Device not configured

2015-11-19 03:37:19.406 -0800 ------- Capture to 2nd backup.quanta
2015-11-19 04:39:06.269 -0800 Details cannot read envelope content length
2015-11-19 04:39:06.269 -0800 Details ErrDescription: Device not configured

2015-11-30 08:56:33.900 -0800 ------- Capture to 2nd backup.quanta
2015-11-30 09:11:35.848 -0800 Details problem closing file
2015-11-30 09:11:35.848 -0800 Details ErrDescription: Operation timed out

2015-12-06 03:00:01.518 -0800 ------- Capture to 3rd backup.quanta
2015-12-06 03:27:38.240 -0800 Details problem closing file
2015-12-06 03:27:38.240 -0800 Details ErrDescription: Operation timed out

2015-12-10 10:35:35.671 -0800 ------- Capture to 3rd backup.quanta
2015-12-10 10:44:21.337 -0800 Details cannot read envelope content length
2015-12-10 10:44:21.337 -0800 Details ErrDescription: Device not configured

There are a couple of things to note. First, all of the errors are either "Operation timed out" or "Device not configured." These are not file content errors, media errors, file structure errors, or volume structure errors. These errors indicate that the storage device has gone offline or, in the case of a remote connection, that the connection to the server has been lost.

The other thing that points to this not being an issue with this particular archive or its volume structures is that these events are the exception. There are scores of successful captures interleaved between these failures. If the archive was corrupted, the media unreliable, or the volume directory structure was damaged, these other captures would have run into the same problems, but they didn't.

I still suspect that the drive containing the archive is either spontaneously going off-line or unmounting, or the network connection to the device or server is timing out, disconnecting, going to sleep, shutting down, going off-line, etc.
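If you want to tally the error types in a log yourself, here's a rough sketch. It assumes the log lines look like the excerpts in this post (an `ErrDescription: ...` or `ErrDescription='...'` field); the grouping into "connection lost" versus "other" is just my reading from above, and the names `CONNECTION_LOST`, `ERR_RE`, and `classify` are invented for this example. The second sample line is abridged.

```python
import re
from collections import Counter

# Descriptions seen in the excerpts in this thread. Treating them as
# "device or connection lost" follows the interpretation given in this post;
# the category names are this sketch's own, not QRecall's.
CONNECTION_LOST = {"operation timed out", "device not configured"}

# Matches both "ErrDescription: text" and "ErrDescription='text'".
ERR_RE = re.compile(r"ErrDescription(?::\s*|=')([^'}\n]+)")


def classify(lines):
    """Count error descriptions by category across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = ERR_RE.search(line)
        if not match:
            continue  # line carries no error description
        description = match.group(1).strip().lower()
        category = "connection lost" if description in CONNECTION_LOST else "other"
        counts[category] += 1
    return counts


SAMPLE = [
    "2015-11-08 11:38:12.468 -0800 Details ErrDescription: Operation timed out",
    "2015-11-10 08:46:52.575 -0800 Details Cause: <IO> cannot read hash page(s) "
    "{ ErrDescription='Operation timed out', POSIXErr=60 }",
    "2015-11-11 07:25:35.814 -0800 Details ErrDescription: Device not configured",
    "2015-11-11 07:25:35.814 -0800 Details failed to write envelope header",
]

RESULT = classify(SAMPLE)

if __name__ == "__main__":
    for category, count in RESULT.items():
        print(category, count)
```

Anything that lands in "other" would be worth a closer look, since that's where a genuine data or media error would show up.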

Ralph Strauch

James,

I've trashed the archive that wouldn't repair, replaced it with a new archive, and backed up both computers successfully and uneventfully. I've now swapped that drive out and brought my other offsite drive back in, and backups are running fine on it as well, so hopefully I'm back in business.

I don't think the problem was a bad drive, since it seemed to affect both drives equally during the period it was happening. I did seem to be having network problems, which definitely contributed. Some of those involved scheduled backups run automatically at night when both machines were otherwise asleep, so I'll just skip that for a while and back up the MBP manually during the day, at least until I accumulate a history of stability with v2.0.

I am still concerned about the seven failures cited above, which all appear to occur as the archive attempted to close immediately after writing the backup to the archive and the backup statistics to the log. The timing of these failures, occurring at a specific well-defined point in a long ongoing process, makes it hard to attribute them to a source external to that process, like a mechanical or network failure.

I'm curious -- what app are you pasting the log entries from, to get that formatting and the line numbers?

James Bucanek

Ralph Strauch wrote: I don't think the problem was a bad drive, since it seemed to affect both drives equally during the period it was happening.

I hope I didn't imply that there was definitively something wrong with your drive. I meant to emphasize that the failures are all "drive related," meaning that they indicate a drive that has spontaneously gone offline, unmounted, or simply stopped responding. If the drive is being accessed via a network or file server, there are lots of moving parts between your system and the actual hard drive that can cause these symptoms.

I did seem to be having network problems, which definitely contributed. Some of those involved scheduled backups run automatically at night when both machines were otherwise asleep ...

And I think this is the most likely explanation. A dropped network, a file server that goes to sleep while it's still being used, or a computer that's running code while it's asleep (power nap issue) would all cause the kinds of errors you see in the log.

I am still concerned about the seven failures cited above, which all appear to occur as the archive attempted to close immediately after writing the backup to the archive and the backup statistics to the log. The timing of these failures, occurring at a specific well-defined point in a long ongoing process, makes it hard to attribute them to a source external to that process, like a mechanical or network failure.

Except for the other evidence.

In between the failures you highlighted are identical failures that occur before the archive begins to close, and a slew of successful captures without any problems at all. I think the high number of failures occurring while closing the archive is simply because that's when the greatest amount of archive activity begins. Most (incremental) captures spend most of their time reading the local hard drive looking for changes. It's when the archive is about to be closed that things get busy, and if there's a problem reading from or writing to the archive, that's when it's statistically most likely to happen.
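For a sense of how long that busy window was, here's a small sketch that subtracts the "Captured" timestamp from the "Failure" timestamp for the seven captures listed earlier in the thread. The timestamps are copied from that list; the names `EVENTS`, `gap_seconds`, and `GAPS` are made up for this example, and treating the whole gap as the "closing" phase is an assumption, since the log doesn't say exactly when the close begins.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# ("Captured ..." time, "Failure ..." time) for each of the seven captures.
EVENTS = [
    ("2015-11-08 11:32:24", "2015-11-08 11:38:12"),
    ("2015-11-11 07:26:51", "2015-11-11 07:39:03"),
    ("2015-11-16 07:31:53", "2015-11-16 07:42:48"),
    ("2015-11-19 07:53:57", "2015-11-19 08:47:46"),
    ("2015-11-30 09:08:35", "2015-11-30 09:11:35"),
    ("2015-12-06 03:17:56", "2015-12-06 03:27:38"),
    ("2015-12-10 10:55:26", "2015-12-10 11:07:33"),
]


def gap_seconds(captured, failed):
    """Seconds between the capture statistics line and the failure line."""
    start = datetime.strptime(captured, FMT)
    end = datetime.strptime(failed, FMT)
    return int((end - start).total_seconds())


GAPS = [gap_seconds(captured, failed) for captured, failed in EVENTS]

if __name__ == "__main__":
    for (captured, _), gap in zip(EVENTS, GAPS):
        print(f"{captured}  {gap // 60:3d} min {gap % 60:02d} s")
```

The gaps run from 3 minutes to almost 54 minutes, so the end of a capture is a sizable stretch of archive activity rather than a single instant.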

I'm curious -- what app are you pasting the log entries from, to get that formatting and the line numbers?

The forum has various tags you can use to format text, like [b]bold[/b], [i]italic[/i], and [u]underline[/u]. You can also surround a line or block of text with [code]code goes here[/code] tags and it will be formatted like the log listing in my earlier post. Just select the block of text and click the "Code" button above the post entry field.