Author |
Message |
9 years ago
|
#1
|
Ralph Strauch
Joined: Oct 24, 2007
Messages: 194
Offline
|
My scheduled backup last night looks from the log like it completed the backup then quit with a "problem closing archive." When I tried to rerun it, the archive needed repair. After the repair the archive includes last night's layer, showing the same number of files and size the log for last night showed, so apparently the backup had completed and then something screwed up as it was closing. The repair log showed two locations of invalid data. When I woke the computer up this morning it showed an alert that "logout had failed because some app refused to quit. I don't remember if it was qrecall or not. I've been having these logout failures at night sometimes since updating to El Cap, and discovered to that was because the update apparently turned on an option to log out after an hour of inactivity, which had not been previously set. I don't know if that caused any problem with qrecall, or not. I'm sending a report.
|
|
|
9 years ago
|
#2
|
James Bucanek
Joined: Feb 14, 2007
Messages: 1572
Offline
|
Ralph Strauch wrote:My scheduled backup last night looks from the log like it completed the backup then quit with a "problem closing archive." When I tried to rerun it, the archive needed repair. After the repair the archive includes last night's layer, showing the same number of files and size the log for last night showed, so apparently the backup had completed and then something screwed up as it was closing. The repair log showed two locations of invalid data.
You're problems started a little before that. Earlier in this capture, and in previous captures, QRecall has been warning of failed I/O. Most of these it successfully recovered from. Here's an example:
2015-11-16 04:05:55.842 Recovered from archive data read failures
2015-11-16 04:05:55.843 cannot read envelope content length
2015-11-16 04:05:55.843 POSIXErr: 6
2015-11-16 04:05:55.843 ErrDescription: Device not configured
2015-11-16 04:05:55.843 Path: /Volumes/BUD2/2nd backup.quanta/repository.data
2015-11-16 04:05:55.843 Position: 1230587166720
2015-11-16 04:05:55.843 Length: 131072 This indicates that QRecall encounter a fatal error (Device not configured) reading from your primary data file. It tried again, and was successful, so it logged this as a transient error. If you start to see a lot of transient errors, it's an indication that something is failing. It could be a degraded cable or network connection. More likely, it's a hard drive on its last legs. At the end of the capture, QRecall tried to update the index files and failed with another I/O error. This one it couldn't recover from:
2015-11-16 07:42:48.430 Problem closing archive
2015-11-16 07:42:48.430 POSIXErr: 60
2015-11-16 07:42:48.430 ErrDescription: Operation timed out
2015-11-16 07:42:48.430 Path: /Volumes/BUD2/2nd backup.quanta/repository_p8w8k16m2.1_8k.checksum32 This failure corrupted the structure of your archive, requiring you to repair it. The data errors you encountered during the repair indicate permanent media failures. This is another indication that your drive may be knocking on death's door. I would highly recommend that you investigate the health of your backup drive and its connection with your system. You might need a new drive.
|
- QRecall Development - |
|
|
9 years ago
|
#3
|
Ralph Strauch
Joined: Oct 24, 2007
Messages: 194
Offline
|
I ran the drive through Disk Utility, Diskwarrior, and TechTool Pro (full surface scan) and everything looked fine. I realize this doesn't totally clear the disk, but looking more carefully at when it happens suggests that the problem a byproduct of a scheduled backup trying to run when both computers involved are normally asleep. I normally leave my backup drive connected to an iMac (USB2) and backup my MBP over wifi, though I do sometimes move the backup drive to the MBP for time-intensive operations like verify and repair, to take advantage of the greater speed provided by USB3. I have a scheduled backup on the MBP for 3am, and the MBP is usually asleep with the lid closed at that time. Until I upgraded to El Cap that backup usually didn't run and I would just start a manual backup when I turned the computer on in the morning. More recently, though, that scheduled backup is trying to run, and that's when my "problem closing archive" occurs. See the backup failures at 2015-11-16 03:04:33 and 2015-11-19 03:37:19. In both cases, the backup seem to run for a while, but not actually complete, and then fail about 4 hours later. Backups run manually during the day seem to complete properly. After running the above utilities to check the drive I was able to successfully back up the MBP at both 2015-11-17 10:48:14 and 2015-11-18 12:10:41 and to backup the iMac at 11/17 19:04 (not shown on the report because that backup was run from the iMac). Since yesterday?s failure I don?t seem to be able to repair the archive. Two consecutive Repair cycles both show ?Invalid data? at the same locations, and the archive will not open because the index is incorrect. Is this archive totally lost, or is there a way of getting rid of the invalid data and repairing the rest of the archive? When I get it up and running again I?ll delete the 3am backup for the MBP and just run daily backups manually, and with luck that will take care of the problem. BTW, the missing filter references that show up in each backup log are due to excluded files on the other machine.
|
|
|
9 years ago
|
#4
|
James Bucanek
Joined: Feb 14, 2007
Messages: 1572
Offline
|
Ralph Strauch wrote:I ran the drive through Disk Utility, Diskwarrior, and TechTool Pro (full surface scan) and everything looked fine.
Since you did a full surface scan, and given your other observations, let's assume the hard drive is not the source of the I/O errors.
looking more carefully at when it happens suggests that the problem a byproduct of a scheduled backup trying to run when both computers involved are normally asleep. ... I have a scheduled backup on the MBP for 3am, and the MBP is usually asleep with the lid closed at that time. Until I upgraded to El Cap that backup usually didn't run and I would just start a manual backup when I turned the computer on in the morning. More recently, though, that scheduled backup is trying to run, and that's when my "problem closing archive" occurs. See the backup failures at 2015-11-16 03:04:33 and 2015-11-19 03:37:19.
I think you've nailed it. It appears that QRecall is running while one or both systems are asleep. Of course, they're not completely asleep. This is undoubtedly a byproduct of OS X's "Power Nap" mode. During a power nap, certain (Apple) processes, like Mail and Time Machine, are allowed to perform limited maintenance tasks. Other processes (like QRecall) aren't supposed to run at a all, but apparently they are. Since both of your systems are asleep, it's difficult to ascertain which one is responsible for the I/O errors (although I suspect the MBP, for reasons I'll detail in a moment).
In both cases, the backup seem to run for a while, but not actually complete, and then fail about 4 hours later.
Looking carefully at your capture log from 11-19 I see that the capture started at 03:37:58 and ran for about a half minute before being suspended again at 03:38:37. (I know it was suspended, because the beta version of QRecall dumps status messages to the log almost continusously, so if there are any big gaps in the log it's because absolutely nothing was happening.) An hour later (04:39:06) QRecall starts logging again and immediately encounters an I/O error. This is undoubtedly because hardware and/or network connections are not fully functional during a power nap. I suspect that either the last I/O before it was put back to sleep, or the first I/O when it started the next power nap, was interrupted and caused the error. QRecall retries the I/O, recovers from the error, and runs a few more minutes (last log entry was 04:40:16). At this point it's suspended again until 05:43:04, and the cycle repeats: I/O error, retry, recover, run a minute or two, get suspended, wash, rinse, repeat. This pattern continues until 07:51.12, when the system was legitimately woken up again, which I can tell because the scheduler logged receipt of a "system woke up" event. QRecall continues with the capture, and almost finishes when ... The system apparently goes back to sleep around 07:54:42 while closing the archive. The system wakes back up again at 08:47:22 (another "woke up" event was logged), and at that point is when the timeout failure that caused the "Problem closing archive" occurred. I can't tell if this was because the laptop's network interface hadn't fully woken up again, the forced sleep interrupted the request, or if the server was now asleep and not responding. Regardless, that was the sequence of events. There are several things that can be done to address this. If you're willing to change your habits, you could add power management requests to the 3 AM capture action so it wakes the system up when it starts and puts it back to sleep again when it's finished. This would require you to leave the laptop lid open (the system won't wake up if the screen is closed). This would ensure the system is completely awake, allowing the capture to complete. Approach number two would be to stop the laptop from power napping by turning off the Power Nap feature in the Energy Saver preference pane (under Schedule). That would probably restore the behavior you saw under Yosemite. Additionally, I'm going to explore the possibility of modifying QRecall so it can suspend itself during a "power nap" and also talk to Apple as to why it's running in the first place.
Since yesterday?s failure I don?t seem to be able to repair the archive. Two consecutive Repair cycles both show ?Invalid data? at the same locations, and the archive will not open because the index is incorrect. Is this archive totally lost, or is there a way of getting rid of the invalid data and repairing the rest of the archive?
The archive isn't totally lost, but you are stuck. The logs reveal a problem/situation/bug in the code that regenerates the error correction codes which is preventing the repair from completing. I'm looking into that today. You can wait until I have a fix, or I can tell you how to manually strip the error correction codes from the archive so you can repair it and resume using it.
|
- QRecall Development - |
|
|
9 years ago
|
#5
|
Ralph Strauch
Joined: Oct 24, 2007
Messages: 194
Offline
|
Thanks for the quick response and explanation. It all makes sense. I think I had power nap turned off in Yosemite and previously, but I guess El Cap turned it on as a default. I've traced a couple of other problems to El Cap's changing default settings, I think. If the fix is coming along quickly, I'll wait, but if stripping the error codes is relatively simple and it would be fit your schedule better not to be diverted to that fix right now, please send me the instructions for stripping them,
|
|
|
9 years ago
|
#6
|
James Bucanek
Joined: Feb 14, 2007
Messages: 1572
Offline
|
Ralph, Here's a link to the 2.0.0a22 alpha release of QRecall. If you have the time, please download, install, and use it to repair your "2nd backup" archive again. It should die exactly the same way the earlier repairs did, but when it does it will log some information about the failure that I'm missing. After it fails, please send another diagnostic report so I can pick through the wreckage. Either way, to quickly get your archive back up and running, do this. In the Finder, Right/Control+click on the archive and choose Show Package Contents. Inside the archive you'll see a repository.data file and a slew of "companion" files with names like these:
repository_8k.checksum32
repository_p8w8k16m2.0.anvin_reed_sol
repository_p8w8k16m2.0_8k.checksum32
repository_p8w8k16m2.1.anvin_reed_sol
repository_p8w8k16m2.1_8k.checksum32 Trash all of these "companion" files, just don't delete the repository.data file (that's where all of the good stuff is). Repair the archive and go about your business. If you want to back to using correction codes again, use the archive settings to generate new codes.
|
- QRecall Development - |
|
|
9 years ago
|
#7
|
Ralph Strauch
Joined: Oct 24, 2007
Messages: 194
Offline
|
I've now repaired my archive with Qrecall v2.0.0a22 alpha and sent you that report, removed the repository companion files, and started the archive repair again to get the archive back. I'll let you know in the morning how it works out. The repair that I just did had one probably insignificant glitch you should know about. I shut the lid about 1/3 of the way through, then opened it again about an hour later and it seemed to pick up OK. I'm in the habit of shutting the lid when I'm done working, and it's hard to remember not to do that when a long process I haven't been paying any attention to is running in the background. I use power management on the iMac for a scheduled 1am backup and it works fine. I somehow never thought of that on the MBP. One question that occurs to me there, though, is that backing up the MBP involves 2 computers, not one. I'm running the backup from the MBP, but the drive is mounted to the iMac. Does that cause any problems? Also, if I command the MBP backup to wake the computer before the backup but not to sleep it after the backup will it then go to sleep according to the Energy Saver inactivity settings? I ask that because I'm hesitant to tell Qrecall to sleep the computer when I might sometimes run that backup command during the day when I'm working on other things. I'll probably keep running the MBP backups as I've been doing, but it will be nice to have a clearer understanding of my options. Thanks again for the stellar support you provide.
|
|
|
9 years ago
|
#8
|
Ralph Strauch
Joined: Oct 24, 2007
Messages: 194
Offline
|
I've now repaired my archive and it completed successfully, though the log still shows Data Problems -- Invalid data at the same two locations that have been showing up. Is this now a permanent feature of this archive? After the repair completed I ran a Merge Layers action followed by a backup, which seems to have run fine with one minor glitch. I clicked on the backup while the Merge was in process, figuring it would run after the Merge completed, which it did. But the Monitor window only showed the running Merge and did not show the waiting backup, as it has done in the past. When I came back to the computer after the backup had started, the Monitor window showed "Idle" and the only indication I could find that the the backup was actually running was the "Archive busy" window that came up when I tried to open the archive.
|
|
|
9 years ago
|
#9
|
James Bucanek
Joined: Feb 14, 2007
Messages: 1572
Offline
|
Ralph Strauch wrote:The repair that I just did had one probably insignificant glitch you should know about. I shut the lid about 1/3 of the way through, then opened it again about an hour later and it seemed to pick up OK. I'm in the habit of shutting the lid when I'm done working, and it's hard to remember not to do that when a long process I haven't been paying any attention to is running in the background.
Ideally, this shouldn't ever be a problem. OS X is pretty good about recovering everything when it wakes up and picking up right where it left off. Problems can occur if something happens while the laptop is asleep that it can't recover from, such as shutting down the file server it was using, unplugging an external drive, losing a network connection, or having another system modify the archive. But if any of those things do occur, QRecall should fail with an appropriate message.
I use power management on the iMac for a scheduled 1am backup and it works fine. I somehow never thought of that on the MBP. One question that occurs to me there, though, is that backing up the MBP involves 2 computers, not one. I'm running the backup from the MBP, but the drive is mounted to the iMac. Does that cause any problems?
It could, but QRecall should deal with it. There are basically two scenarios: the iMac's file server is responding or it isn't. If it is, the laptop will connect and perform the capture. If it isn't, the laptop will either fail to mount the volume or fail to open the archive. Either way, there's no harm done.
Also, if I command the MBP backup to wake the computer before the backup but not to sleep it after the backup will it then go to sleep according to the Energy Saver inactivity settings?
Yes. When QRecall starts a time consuming activity (like a capture), it registers with the power management service to tell it not to put the filesystem to sleep until it's finished. Once finished, it releases the power manager. If there are no other services that want to prevent sleep, your system will go to sleep on its normal schedule.
I ask that because I'm hesitant to tell Qrecall to sleep the computer when I might sometimes run that backup command during the day when I'm working on other things. I'll probably keep running the MBP backups as I've been doing, but it will be nice to have a clearer understanding of my options.
Also not a problem. The sleep request following an action is just that—a request. When the action is finished, it will post a request for your system to go to sleep in about 10 minutes. A dialog will appear notifying you that a request to put your system to sleep has been made. If you ignore that dialog, your system will go to sleep (or shutdown, or restart, or whatever the request was). Dismiss the dialog to cancel the request and continue using your computer.
Ralph Strauch wrote:I've now repaired my archive and it completed successfully, though the log still shows Data Problems -- Invalid data at the same two locations that have been showing up. Is this now a permanent feature of this archive?
Not any more . One of the last things the repair action does is erase any invalid data it found. This wasn't happening before, because the repair actions weren't finishing. But now that one has, your archive should be free of invalid data.
After the repair completed I ran a Merge Layers action followed by a backup, which seems to have run fine with one minor glitch. I clicked on the backup while the Merge was in process, figuring it would run after the Merge completed, which it did. But the Monitor window only showed the running Merge and did not show the waiting backup, as it has done in the past. When I came back to the computer after the backup had started, the Monitor window showed "Idle" and the only indication I could find that the the backup was actually running was the "Archive busy" window that came up when I tried to open the archive.
That's a mystery. Send a diagnostic report and I'll look into it. It might be another scheduler issue (*sigh*).
|
- QRecall Development - |
|
|
9 years ago
|
#10
|
Ralph Strauch
Joined: Oct 24, 2007
Messages: 194
Offline
|
I ran another backup using v2.0.0a22 alpha to see what would happen with the Monitor. When I started the backup the window flashed briefly showing the backup starting, then shifted immediately to "Idle" and stayed there. The only indication of the running backup was the "Archive busy" window when I tried to open the archive. I came back later and and a crash window was showing, indicating that Qrecall had quit, but the app was still running and the log indicates that the backup completed normally. I filled in that crash report window and also sent another report from the app, so you should have both. Both of these last two backups were run with the drive mounted on the MBP. Next I'm going to more the drive to the iMac and run backups of both machines that way. The iMac is running the previous beta version.
|
|
|
9 years ago
|
#11
|
James Bucanek
Joined: Feb 14, 2007
Messages: 1572
Offline
|
Ralph, Thanks, as always, for the diagnostic reports. When you have a chance update to the 2.0.0b23 beta that was just released and see if that resolves any of these issues.
|
- QRecall Development - |
|
|
9 years ago
|
#12
|
Ralph Strauch
Joined: Oct 24, 2007
Messages: 194
Offline
|
Yes, the b23 version seems to take care of the Monitor problem, so all is fine again. I do notice that closing the archive seems to be taking longer, whichh I assume is related the error correction?
|
|
|
9 years ago
|
#13
|
James Bucanek
Joined: Feb 14, 2007
Messages: 1572
Offline
|
Ralph Strauch wrote:I do notice that closing the archive seems to be taking longer, whichh I assume is related the error correction?
Closing the app at the end of an action, or closing an archive browser window in the QRecall app? If you're talking about at the end of an action, the answer is "probably". Error correction (and encryption) generally make every modification a little slower. If you're talking about closing a browser window, it's likely due to the resource leak I fixed in b23. In b21, closing a browser window failed to properly release all of the memory and resources for the archive. This would result in a memory leak of thousands, or even hundreds of thousands, of objects. But fixing the bug has its own downside; it takes a moment to destroy and cleanup after all of those objects. If it's annoying, let me know. There may be a way to dispose of these objects on a background thread so it doesn't hang up the app.
|
- QRecall Development - |
|
|
9 years ago
|
#14
|
Ralph Strauch
Joined: Oct 24, 2007
Messages: 194
Offline
|
I was referring the period of time at the end of a backup when the Monitor is reporting "archive closing -- the interval between the log entries for "Captured xxx items. . ." and "Capture finished." Going back and look at a series of log entries, it's pretty clear that it is related to error correction, as the following entries show. Error correction was installed on the 17th and 18th, then removed on the 21st and the time interval drops significantly. Then it goes back up after I inflated the archive again on the 21st (except for the backup on the 22nd, which was exceptionally short overall). Action 2015-11-17 11:08:33 Captured 2591 items, 2.26 GB (8% duplicate) Action 2015-11-17 11:09:30 ------- Capture finished (20:42) Action 2015-11-18 12:27:25 Captured 7949 items, 1.5 GB (19% duplicate) Action 2015-11-18 12:45:45 ------- Capture finished (34:03) 21-Nov-15 ? removed error correction Action 2015-11-22 08:44:15 Captured 12494 items, 1.72 GB (29% duplicate) Action 2015-11-22 08:44:24 ------- Capture finished (1:24:21) Action 2015-11-22 13:31:27 ------- Inflate 2nd backup.quanta Action 2015-11-22 17:00:44 Captured 720 items, 547.4 MB (56% duplicate) Action 2015-11-22 17:00:49 ------- Capture finished (01:50) Action 2015-11-23 10:39:31 Captured 912 items, 480.9 MB (83% duplicate) Action 2015-11-23 10:45:13 ------- Capture finished (10:16) Action 2015-11-24 07:55:11 Captured 1797 items, 876.2 MB (46% duplicate) Action 2015-11-24 08:04:39 ------- Capture finished (19:57) This is not a problem you need to do anything about, just an observation -- something I might not have even noticed normally, except for the fact that I was troubleshooting during this period so paying closer attention to what was happening, and that make a ten minute interval seem long when I'm watching for it to get over.
|
|
|
9 years ago
|
#15
|
Ralph Strauch
Joined: Oct 24, 2007
Messages: 194
Offline
|
This problem is still with us, apparently. I ran a scheduled backup of my MBP over wifi to a backup drive attached to my iMac last night, and the log shows this morning that the archive failed with a "problem closing file," as it did on 11/16. The archive shows what looks like a complete layer from last night's backup, though, so apparently it was written before the problem occurred. I tried this morning to rerun the backup manually, and it failed again Action 2015-12-06 09:23:24 Failure Program exception Action 2015-12-06 09:23:24 invalid position added to bucket Action 2015-12-06 09:23:24 Failure Command failed Action 2015-12-06 09:23:24 An internal program error occurred. Please report this problem to the developer. Since we dealt with this problem in November I've been running this backup manually during the day with no problems. This was the first time I've run it as a scheduled backup overnight. This is not the drive that had the problem earlier but the one that had been offsite, and I just swapped back in on Friday. The problem occurred on the third backup run after the swap. I guess I'll have to run another repair on this archive, but I'll wait to hear from you before I start it.
|
|
|
|
|
|