Message |
|
My bad. The log entries are property sorted, the problem is that the dates are wrong. What's happened is that the timestamps in the last week of December have the wrong year, 2016-12-27 instead of 2015-12-27, which puts them far into the future. The date formatter used to create the log entries was erroneously using the "YYYY-MM-dd..." format, when it should have been using "yyyy-MM-dd..." format. The YYYY specifier is the "year designation used in ISO year-week calendar as defined by ISO 8601", whatever the hell that is. Anyway, it's not the calendar year. I've fixed it, but you'll have to wait until your log records get rotated out before they disappear. If you find it super annoying, just trash all of the log files in ~/Library/Logs/QRecall and restart. If you want to fix this without losing your log history, you'll have to get tricky. You could use an editor to replace all instances of "2016-12-" with "2015-12-", for example.
|
|
|
Well, that was four days I really wish I could have back. Regardless, I now have a solution to the keychain access problem for privileged actions (specifically, capture). It turns out there's some tricky issues with running code outside the user's login session that prevents it from accessing the user's security information—as it should. The problem was, this is one of those situations where an outside process really did need to access your user's private information and the OS X security framework was, naturally, not inclined to make this easy. The fix for this problem will appear in the next release.
|
|
|
Bruce Giles wrote:So, everything appears to be back to normal, but is there anything you can do in QRecall to keep from generating so many error messages so quickly? Or is that the OS itself that was generating those?
This is 100% OS X. QRecall installs a number of "services" that are managed via launchd. Each service is defined by a configuration file (the .plist files in ~/Library/LaunchAgents for example). This is entirely set-and-forget; QRecall writes these configuration files and then expects launchd to manage the processes. I'm actually surprised you had this behavior. In most circumstances, launchd will throttle a misbehaving service to keep it from chewing up all of the available resources. So if a service doesn't start, or starts but immediately terminates, launchd will ignore it for awhile before trying again. But 300 times a second is pretty much the opposite of "ignoring." I'm going to file a bug report on this one, because it's clearly a simple misconfiguration that leads to serious problems. For future reference, if you had just launched QRecall once it would/should have fixed the problem. When you launch the QRecall app, it tries to establish a connection with the scheduler (that launchd is supposed to keep running). If it can't, it then checks to see if the scheduler is properly installed. It should have discovered that the scheduler .plist file was misconfigured and rewrote it. That might not have immediately solved the problem (because launchd tends to cache the .plist info), but the next restart should have stopped it.
Had I not noticed the problem when I did, I can imagine that I would have run out of disk space fairly soon, and I hear that OS X gets really hard to work with when it runs out of disk space.
The word you're looking for is "nightmare." HTH
|
|
|
Gary, Thanks for the info. I found the problem, which was an off-by-one error. When you set the keep count to 1, it was actually getting set to 0 which freaked out the scheduler. I've fixed it, but more importantly if the log rotation schedule is out of bounds the scheduler now simply logs it, picks a valid value, and keeps running. I have no idea what possessed me to make that a fatal error.
|
|
|
Ralph Strauch wrote:I don't think the problem was a bad drive, since it seemed to affect both drives equally during the period it was happening.
I hope I didn't imply that there was definitively something wrong with your drive. I meant to emphasize that the failures are all "drive related," meaning that they indicate a drive that has spontaneously gone offline, unmounted, or simply stopped responding. If the drive is being accessed via a network or file server, there are lots of moving parts between your system and the actual hard drive that can cause these symptoms.
I did seem to be having network problems, which definitely contributed.Some of those involved scheduled backups run automatically at night when both machines were otherwise asleep ...
And I think this is the most likely explanation. A dropped network, a file server that goes to sleep while it's still being used, or a computer that's running code while it's asleep (power nap issue) would all cause the kinds of errors you see in the log.
I am still concerned about the seven failures cited above, which all appear to occur as the archive attempted to close immediately after writing the backup to the archive and the backup statistics to the log,. The timing of these failures, occurring at a specific well-defined point in a long ongoing process, makes it hard to attribute them to a source external to that process, like a mechanical or network failure.
Except for the other evidence. In between the failures you highlighted are identical failures that occur before the archive beings to close, and a slew of successful captures without any problems at all. I think the fact that there are a high number of failures occuring while closing the archive is simply because that's when the greatest amount of archive activity begins. Most (incremental) captures spend most of their time reading the local hard drive looking for changes. It's when the archive is about to be closed that things get busy, and if there's a problem read/writing to the archive, that's when it's statistically most likely to happen.
I'm curious -- what app are you pasting the log entries from, to get that formatting and the line numbers?
The forum has various tags you can use to format text, like [b]bold[/b], [i]italic[/i], and [u]underline[/u]. You can also surround a line or block of text with the [b]
[/b]code goes here[b]
[/b] tags and they'll be formatted like the log listing in my earlier post. Just select the block of text and click the "Code" button above the post entry field.
|
|
|
My bad. I was trying to give you a "safer" command to use, but failed shell expansion rules 101. Here's the command:
sudo rm -rf /.fseventsd Please don't mistype this one; one errant space and you could wipe your entire drive.
|
|
|
Alexandre Takacs wrote:(that's me canceling after 7h)
You have a lot more patience than I do. The "Locating changes" phase is playing back the filesystem event log to determine what folders contain changes. QRecall requests a playback of changes and then waits to receive them. This should take, at most, a minute, and usually only takes a few seconds. I suspect something is wrong and OS X simply failed, or started and never finished, sending the events and QRecall was left waiting forever. This might be a transient problem. First, try restarting your system and try again. If it happens again, you might consider resetting your filesystem history on Macintosh HD. There's an official API for doing this programmatically, but I don't know of any utilities you can download that will do that for you. The "unsupported" method is simply wipe the filesystem history for your volume and restart your system. Open up a Terminal window and execute the following command:
sudo rm /.fseventsd/* The command will prompt you for your admin password. After it finishes, immediately restart your system. Be very careful about entering this command exactly as shown. The sudo rm command can do some serious damage if incorrectly used. One of those should fix your problem. If it doesn't, I suspect a deeper issue with OS X.
|
|
|
Ralph, Here's a random sampling of a few capture issues I found in the logs you sent me:
2015-11-08 11:12:38.071 -0800 ------- Capture to 3rd backup.quanta
2015-11-08 11:38:12.467 -0800 Failure Problem closing archive
2015-11-08 11:38:12.468 -0800 Details ErrDescription: Operation timed out
2015-11-10 07:54:44.131 -0800 ------- Capture to 3rd backup.quanta
2015-11-10 08:46:52.575 -0800 Failure Could not capture file
2015-11-10 08:46:52.575 -0800 Details archive I/O error
2015-11-10 08:46:52.575 -0800 Details Cause: <IO> cannot read hash page(s) { ErrDescription='Operation timed out', POSIXErr=60, Position=5731762176, API=pread, Path='/Volumes/BUD3/3rd backup.quanta/hash.index', Length=8192 }
2015-11-11 03:54:28.850 -0800 ------- Capture to 3rd backup.quanta
2015-11-11 03:56:06.984 -0800 Failure Problem closing archive
2015-11-11 04:55:24.154 -0800 Details ErrDescription: Operation timed out
2015-11-11 06:24:16.879 -0800 ------- Capture to 3rd backup.quanta
2015-11-11 07:25:35.814 -0800 Failure Could not capture file
2015-11-11 07:25:35.814 -0800 Details failed to write envelope header
2015-11-11 07:25:35.814 -0800 Details ErrDescription: Device not configured
2015-11-19 03:37:19.406 -0800 ------- Capture to 2nd backup.quanta
2015-11-19 04:39:06.269 -0800 Details cannot read envelope content length
2015-11-19 04:39:06.269 -0800 Details ErrDescription: Device not configured
2015-11-30 08:56:33.900 -0800 ------- Capture to 2nd backup.quanta
2015-11-30 09:11:35.848 -0800 Details problem closing file
2015-11-30 09:11:35.848 -0800 Details ErrDescription: Operation timed out
2015-12-06 03:00:01.518 -0800 ------- Capture to 3rd backup.quanta
2015-12-06 03:27:38.240 -0800 Details problem closing file
2015-12-06 03:27:38.240 -0800 Details ErrDescription: Operation timed out
2015-12-10 10:35:35.671 -0800 ------- Capture to 3rd backup.quanta
2015-12-10 10:44:21.337 -0800 Details cannot read envelope content length
2015-12-10 10:44:21.337 -0800 Details ErrDescription: Device not configured
There are a couple of things to note. First, all of the errors are either "Operation timed out" or "Device not configured." These are not file content errors, media errors, file structure errors, or volume structure errors. These error indicate that storage device has gone off line, or in the case a remote connection the connection to the server has been lost. The other thing that points to this not being an issue with this particular archive or its volume structures are that these events are the exception. There are scores of successful captures interleaved between these failures. If the archive was corrupted, the media unreliable, or the volume directory structure was damaged, these other captures would have run into the same problems—but they didn't. I still suspect that the drive containing the archive is either spontaneously going off-line or unmounting, or the network connection to the device or server is timing out, disconnecting, going to sleep, shutting down, going off-line, etc.
|
|
|
Jeff, This is turning out the be quite the mystery. It looks like QRecall is doing everything right, but it's still not working. I've posted the problem in the Apple developers forum in hopes of discovering a solution. Here's what's happening. When you add a password to the keychain, it's stored there along with a list of applications that are allowed to freely access it. This list includes the QRecall application and the QRecallHelper process. When the QRecall application asks for the password, it gets it. That's why you can open the archive in a browser window. Most other actions, like capture, are performed by the QRecallHelper process. When it asks for the password, the keychain says no such record exists. That's why your capture (and most everything else you try) won't run. If you open the keychain record for your archive encryption password, you'll see that both QRecall and QRecallHelper are both listed as trusted apps, but for some reason it's not working. (I suspect one of the recent security updates, but it's too soon to tell). For now, I suggest removing the password and storying your encryption key in plain text. Not quite as secure, but your archive data will still be encrypted.
|
|
|
Jeff, In all likelihood, this is a bug. There have been a number of recent changes in how passwords are handled/obtained, and I wouldn't at all be surprised if something fell through the cracks. Please send a diagnostic report; open the QRecall application and choose Help > Send Report. This will contain a lot of low-leve information that will help me diagnose the problem.
|
|
|
Ralph,
I'm traveling, so I can't take a detailed look at you logs until I get back tomorrow, but just a few notes until then....
From what I've seen, it's unlikely to be the archive itself. The errors I mentioned in the previous post are things like " device not ready" and "no such directory". These are not data or I/O errors, but rather indicate a device that has disconnected.
DiskWarrior, disk utility, et. al. are wonderful tools, but only address the volume directory structure. They say nothing about the reliability of the drive, and certainly don't speak to whether the drive has, or could, disconnect at an inappropriate time.
Finally, the "Problem closing archive" can be a bit of a red herring. Whenever there's a problem with an operation, the action will stop and then try to close the archive. If the problem was something that prevents QRecall from accessing the archive files, you'll also get a "Problem closing archive", in addition to the original problem.
More tomorrow...
|
|
|
Ralph, Everything I see in the log would indicate I/O problems with the drive. The capture starts at 12-10 10:35 and runs just fine until 10:44:21 when QRecall encounters a "Device Not Configured" error trying to write to the archive. This is a fatal I/O error that's usually associated with a device that has gone off-line. There's no way to recover from that kind of error, so the capture bails. During its exit, it tries to close the archive but encounters another I/O failure, "Operation timed out". That's when it logged the "Problem closing archive" message. You then started a repair at 17:09. That ran swimmingly until about 21:42. That's when the error correction messages start. These are not unexpected. Since the capture was interrupted while writing to the data files, it's likely that there's a mismatch between the data that was written and the error correction codes. So what undoubtedly happened is that some of the data doesn't match some of the correction codes. This is fine, and the repair will reconstruct the codes as needed. This is confirmed, to a degree, because the invalid correction codes were for an offset very close to the offset being written when the capture failed. Once the repair was past that bad patch, the repair resumes running and encountered no other issues until it was almost finished at 12-11 02:49. At this point, everything goes south. As it tries to finalize the archive it encounters a string of "No such volume", "No such directory", and "Device not configured" errors, as if the volume had been spontaneously unmounted or powered down. Needless to say, the repair wasn't successful.
|
|
|
Ralph Strauch wrote:You?ve said elsewhere that the invalid data will be discarded and that the archive will contain only valid files, and I?m a bit confused about the result of that process. Will I lose files which were found to contain invalid data, or only the versions of those files which contained the invalid data?
During a repair, QRecall scans every byte of your archive's repository.data file. Normally, every byte is accounted for. If the repair can't make sense of a string of bytes, this range of the file is erased and you get a "Data problems found" message in the log. The question, however, is whether any of this "damaged" data means anything important. If the bad data was originally a data or file record, then it will result in the loss of one or more files. That, in turn, could leave one or more folders missing a file. You need to look beyond the "Invalid data" messages in the log and look for folder and file specific messages. For example, if a bad byte was found in the middle of a data record, that would result in the loss of a file. That lost file will record a "Discarded incomplete file" message in the log, along with the details of which file that was lost. There are no such messages in your log, so none of the "bad" data in your archive was related to anything important. This isn't entirely surprising. After compact and merge actions finish, there are lots of regions in the file that are left empty or contain unreferenced data, the loss of which won't have any consequences for your archive.
As I said in the thread on Encryption, <http://forums.qrecall.com/posts/list/567.page>, I?ve been mounting my backup drives on the MBP because my Airport Extreme can?t read a FileVault encrypted drive. But if this problem is caused by backing up one computer to a drive mounted on another perhaps shifting from FileVault to Qrecall encryption and going back to mounting the drive on the AE would be a solution. What do you think?
Technically, there is very little difference between these two solutions. In both situations, you have an external hard drive mounted on a server (either the iMac or the Airport Basestation), running an AppleShare server, being accessed over a network by a remote client (the MacBook Pro). The advantage of using the Airport Basestation is that the basestation runs 24/7, so you don't have to worry about when the iMac is running. The advantage of using the iMac is that actions will be faster on the iMac (because it's directly connected to the volume) and will be more reliable (local busses almost always have lower error rates than LANs).
|
|
|
This should be fixed in 2.0.0b26 (just released).
|
|
|
Quick update: I think I found the problem with the archive window not opening. It's definitely a display problem and (as I suspected) occurs after the archive has been opened, but before the window is displayed. Still trying to determine the underlaying cause, but at least I have a bead on it.
|
|
|
|
|