Message |
|
Ming-Li Wang wrote:It happened in the early morning on 8/19, Taiwan time, so it's already in the diag. report I gave you that day.
I found the log records for the action. The log window didn't display them because of a subtle bug in the way the log window collection log records, that I don't have a good solution for at the moment. QRecall log records have a sequence value that organizes them into a hierarchy. The top level values in the hierarchy are the Julian Day Number and the process ID. Now while it's unlikely that two QRecall actions that run within the same day would have the same process ID, it isn't impossible and that's what happened here. A helper with a process ID of 506 ran at 01:38:25.86 that morning and the capture action started at 03:20:00.904 also had a process ID of 506. The log window assumed they were the same process and commingled their log records. I'll put this on the "do something about this" list, but as I said, there's no simple solution. It's exacerbated in your case because you have a lot of QRecall actions scheduled (several run every 15 minutes), increasing the possibility of a duplicate process ID. Most of these capture actions do nothing, because nothing changed. You might consider changing the schedule of your actions that capture a specific subset of files (like documents, photos, or music) to use the new "Capture Items Changed" event schedule. This new event schedule monitors the folder(s) to be captured for changes and immediately starts the capture once changes are detected. The advantages are that it's much more responsive (the action can start the minute a change is detected) and it's much more efficient because the action doesn't run until there's actually something to capture.
|
|
|
Ming-Li Wang wrote:Still got the rainbow wheel again when trying to click QR Monitor's icon in the menubar this morning. The system is otherwise normal, meaning I could launch Firefox and other applications without delay. I launched QR from the Launchpad instead, but again clicking anything on the menu gave me the rainbow wheel. Be patient, and the menu did pop up after a few sec., and I was able to get the Log window open, and send the diag. report. The toplog was also sent, but I believe you'll see a system mostly idle throughout the night. The OSX Activity Monitor again showed QR Scheduler is alive, but it's apparently not working.
After spending a lot of quality time with the logs, the crash reports, the toplog.txt files you've been collecting, and some experimenting, I think I'm getting a handle on what's going on, although I still don't know why. I finally figured out the rainbow cursor problem: The helper and scheduler processes are hung up. It's not a system load or resource issue, these processes are simply locked up. When you launch QRecall, the first thing it does it try to contact the monitor and scheduler processes and, you guessed it, promptly locks up waiting for them to respond. The messages will eventually time out and the QRecall app will run, but will report it can't install the monitor and it's waiting for the scheduler to start, even though it's already running. The new version (2.0.0b12) tries to address these issues by reworking how messages are retried. The new logic might still lock up, but for no more than a minute or two (worst case). So the monitor and scheduler should recover within a reasonable amount of time (minutes not hours) and get back to running. The root of the problem is the helper processes that are started around 4:00 AM simply stop running—and I have no idea why. They don't log any error messages (the log messages from the process simply stop), they don't terminate unexpectedly (nothing in the system log), they don't crash (no crash log for the process), they just stop running. The scheduler and monitor continue to try and communicate with the helper, which locks them up too waiting for a response that never happens. Without a sample of what's going on within the QRecallHelper process, I'm at a loss to explain why it would do this. It also doesn't make much sense that it only does it at 4:00 AM and not other times. QRecall 2.0.0b12 addresses some of these issues and adds additional debugging information that will hopefully allow me to get closer to the problem.
Ming-Li Wang wrote:Since the problem doesn't seem to be related to competing use of system resources (my own tentative conclusion while you investigate), I decided to take a closer look at the logs and see if I can spot anything that escaped my attention.
And you are correct, this is not resource issue.
1. A capture action is scheduled for my system partition at 3:20am daily. The archive's layer map says it's done this morning, with 793 captured items totaling 273.7 MB. I can't find any entry for it in the QR Log (see attached screenshot), even with maximum detail level.
If you send another diagnostic report I can review the log.
Aug 19 03:20:00 carlo com.apple.xpc.launchd[1] (com.qrecall.hoist[505]): Endpoint has been activated through legacy launch(3) APIs. Please switch to XPC or bootstrap_check_in(): QRHoist
This is normal. Apple wants developers to switch from using Mach ports to XPC. So whenever QRecall starts a privileged service using Mach ports, the system spits out this message. But until Apple provides end-to-end XPC communications, QRecall has to use Mach ports.
Aug 19 03:20:16 carlo QRecallHelper[506]: Failed to initialize IconCache named: com.apple.iconservices with error: Error Domain=NSCocoaErrorDomain Code=4097 "Couldn?t communicate with a helper application." (connection to service named com.apple.iconservices) UserInfo=0x7ff2b941f840 {NSDebugDescription=connection to service named com.apple.iconservices} Aug 19 03:20:16 carlo QRecallHelper[506]: Error returned from iconservicesagent: Error Domain=NSCocoaErrorDomain Code=4097 "Couldn?t communicate with a helper application." (connection to service named com.apple.iconservices) UserInfo=0x7ff2be2bf6f0 {NSDebugDescription=connection to service named com.apple.iconservices} Aug 19 03:20:16 carlo QRecallHelper[506]: ImageIO: CGImageDestinationFinalize image destination must have at least one image
These are errors from OS X's launch services framework. As QRecall captures items, it queries the launch services API for things like the item's icon, localized display name, filename character encoding, and so on. There are number of issues with launch services going back more than a decade, and sometimes it can't/won't return this information. In more recent versions of OS X, it logs messages like this to the console. (Earlier versions of OS X would simply crash.) You'll probably notice corresponding messages in the QRecall log like "cannot obtain icon for ...". These problems are entirely cosmetic, as this information is only used to display the items in the archive browser. They're annoying, but can be safely ignored. Sometimes these issue are caused by a corrupted launch services database and rebuilding your LSDB might make these go away (Google "rebuild launch services database").
2. The capture actions scheduled for 3:50am was indeed run and concluded successfully, with nothing to capture. Don't know why QR executed it again after reboot this morning (as reported in my previous post).
If the scheduler terminated abnormally, it might not have remembered that it already ran that action would immediately run it again when restarted.
3. The first in a string of "Unable to communicate with helper" warnings (at 4:20:25 am) in the QR Log has a note that says:
[NOTE: this exception originated in the server.] | connection timeout: did not receive reply
The others says:
cannot connect with helper port: QRecallHelper.2307054f
These are the helper communication problems that are at the root of the issue.
4. QR scheduler managed to get itself up and executed a capture job at 6:34:21am.
Small victories.
|
|
|
Here's an experiment to try: Late in the evening, open a Terminal window and start the following command:
top -l 0 -S -s 30 -n 6 -o cpu > ~/Desktop/toplog.txt and let it run all night. When you get back to your computer in the morning, switch to the Terminal window and press Control+C to stop it. The toplog.txt file will contain a sample of your system's load averages, total memory usage, swap space, and the top 6 CPU hogs, and other performance stats, recorded every 30 seconds. Compress that file and email it to me. If there are processes running overnight that are demanding a lot of CPU and/or memory, this should catch them.
|
|
|
Ming-Li Wang wrote:Another follow-up: I tried to restart the scheduler with the command line you taught me in another thread:
QRecallScheduler & It didn't work. Got
-bash: QRecallScheduler: command not found
Sorry, that was shorthand. You need to specify the path to the scheduler executable:
~/Library/Application\ Support/QRecall/QRecallScheduler &
Then I tried
sudo QRecallScheduler & and got a process id. Good.
Not good. You got a job ID for sudo. sudo then went to run the executable, couldn't find it, and quit. And it's moot anyway, because you can't run the scheduler as root.
But the two pending actions are still "waiting". Restarting QR monitor didn't help.
Restarting the monitor won't help. The monitor is just that: a monitor. It just reports on what's going on. It doesn't tell actions what to do (although it will rely commands you select in the stop menu to the action).
There's still no sign of QR scheduler in OSX Activity Monitor, either. The process name for the id is "sudo". Guess I'll have to logout and re-login again.
Even after you get the scheduler running again, you'll still need to stop those actions or choose the "Ignore Hold and Run" command from the stop button's drop-down menu in the monitor. When the scheduler starts a QRecallHelper process, it begins by placing it in a "wait for scheduler" state. Once the scheduler determines that the action is ready to run, it sends it a message that it can start. While in the "waiting" state, the action is suspended until the scheduler gives it the green light. And this is the problem I'm seeing in your log. The scheduler starts the QRecallHelper process, but then can't communicate with it ("lost communications with helper" errors). So what's happening is that the helper gets launched but never runs. I see these kind of communication failures when the system is overloaded and can't allocate the kernel resources necessary to deliver the message, or can't do it fast enough. What I suspect is that when the scheduler is trying to start these actions, your system is under tremendous load; either something is using (or was using) a lot of CPU, memory, or both. This is bogging down the communications between the scheduler and helper processes, and generally making your system run very sluggishly. And just restarting the scheduler won't help, unless the scheduler closed down cleanly and is restarted within 30 seconds. The scheduler doesn't know what QRecallHelper processes are running and, once it's forgotten about them, can't automatically reconnect with them to tell them they can run. So once you have an orphaned QRecallHelper process sitting in a "wait" state, you will either have to start it manually (using the "Ignore Hold and Run" command) or just stop the action.
|
|
|
Ming-Li Wang wrote:Ok, just came in, and found QR Monitor window open with 2 actions "waiting on scheduler". Apparently the scheduler went down again, at 12:22pm according to the log.
Please send that log via a diagnostic report. If you haven't figured this out by now, you can't send too many diagnostic reports.
I posted the last message at 11:44 local time, so the scheduler stopped less than an hour later. But it seemed to have run into trouble much earlier, for the two "waiting" actions (both weekly merge actions) were scheduled to take place at around 5am this morning. The rainbow wheels I saw this morning should be related, for QR seems be acting normally (no rainbow wheel) now with the scheduler gone.
I suspect these are related, but I doubt it's a causal relationship.
BTW, there are two new QR processes in the Activity Monitor that I've never seen before: QRecallHelper (2 of them). Is this normal?
Yes, those are the two actions that are waiting to run. You can quit them via the activity monitor. Every (major) task that is performed on an archive is performed by an instance of QRecallHelper (sometimes two instances). So if you start a recall, a verify, a merge, and a compact action, you'll see four QRecallHelper instances running. The helper also performs installs, system configuration checks, updating the sleep/wake schedule, and other privileged activities.
|
|
|
Ming-Li, I reviewed the diagnostic reports you sent, especially the one sent before you tried to kill or restart anything, and what I found was ... nothing. QRecall wasn't doing much of anything at the time. Both the scheduler and monitor were idle, and there were no running actions. So I suspect something else is in play, and the QRecall application might be the cause. The next time this happens, launch the Activity Monitor. Starting in the memory tab, sort processes by total memory use (greatest to least) and note which processes are using the most memory. Also note the amount of swap used. Then go to the CPU tab, sort (greatest to least) by CPU% and see if there are one or more runaway processes. Now on to my theory: If you are in the habit of leaving the QRecall application running, and its log window open, this might be your problem. As long as the log window is open, it will read in all of the log records on file. In your situation, this could be a lot. You have a lot of actions defined, and you run those actions very frequently. You're also running a beta version of the software (which produces a lot more log messages than the regular version). Your log file preferences are set to keep log messages for up to 6 months. All of these combine to create a kind of "perfect storm" that sets up the log window to read a ridiculous amount of data. If that's what happening, your system's poor performance could simply be the amount of memory and swap space consumed the QRecall application's log window. Now if you aren't leaving the log window open, then something else is bogging down your system, which is why you need to use the Activity Monitor to conform it. If it turns out that it is the QRecall log window, you can not leave it open or try reducing amount of log history you keep.
|
|
|
After spending some quality time with your log, I found a couple of instances where the scheduler stopped due to uncaught exceptions, mostly to do with spurious mount notifications for the recovery partition. (Under normal circumstances, the recovery volume should never be mounted, so it's a mystery why the scheduler would be receiving these...) I've plugged those leaks, but even those wouldn't have caused the scheduler to exit with a non-zero status. So why launchd isn't restarting it remains a mystery. I even specifically found instances where you had force-quit the scheduler, and launchd didn't restart it. But see my reply to the slow-down issue you reported for a possible clue. On a side note, it's best to try quitting (SIGTERM) the scheduler first before resorting to a "force quit" (SIGKILL). A quit gives the scheduler a chance to save the information it has about running actions before terminating. This helps it restart smoothly.
|
|
|
Ming-Li Wang wrote:Looking closely, I found the "overlap" was from 7/4 to 8/1, and 8/2 might be the day I started trying out beta 10. (Not sure, but likely.) If that is true, then does it mean both b10 and b11 thought, when making their first backup, they were continuing the volume started on 7/4 (the day I reinstalled my system from scratch), and yet b11 thought, at the same time, the volume has changed from the previous day. How is that possible?
I don't know, and there's no easy way to determine what identifiers QRecall was using at the time to match the volume. Oh, and to make matters more complicated, QRecall 1.x uses a different method for identifying volumes. Prior to 2.0, QRecall used a combination of the filesystem identifier (part of the BSD filesystem API), volume creation date, size, and name.
|
|
|
Ming-Li Wang wrote:I did use "sudo", didn't I?
Yes, you did. I don't know how I missed that... I couldn't reproduce this on my development system, but I did see it on a test MacMini, so I'll investigate this further. Thanks for sending the diagnostic reports with the scheduler failure. And to answer your question, no the QRecall application can no longer force the scheduler process to start/stop. Recent changes to the launchd and launchctl tools make process management the sole responsibility of launchd. It's a moot point, since adding a feature to manually start the scheduler process is more hack than solution. If you want to do that, just start it from the command line:
QRecallScheduler &
|
|
|
Ming-Li Wang wrote:One problem remains: I still can't restart QR scheduler after forcing it to quit in the Activity Monitor. Quit and restarted QR several times, but the scheduler never came up. A diagnostic report has been sent.
I'll look into this. QRecall's scheduler is managed by the launchd service. launchd is set to run the scheduler all the time, unless the QRecallScheduler process exits with a non-zero status. launchd will also throttle, or eventually stop running, the scheduler if it repeatedly crashes or is forcibly terminated too often. If you're force quitting (SIGKILL) the scheduler, I would expect it to restart. If you're soft quitting (SIGTERM), the scheduler should immediately clean up and quit with a status of 0; launchd should immediately restart it. If you interrupt (SIGINT) the scheduler, it will deliberately exit with a non-zero status so that launchd does NOT restart it. It will also exit with a non-zero status under certain circumstances, such as with some internal errors. Relaunching the QRecall application won't help, because it's launchd's run the process. If launchd has decided not to run the scheduler, you'll need to log out and back in (if "Run actions while logged out" is turned off), or restart the system (if "Run action while logged out" is turned on), to get it going again. Anyway, diagnostic reports will help. They should tell me if the scheduler is terminating early, terminating with a non-zero status, or repeatedly crashing.
Other issues: 1. I used "sudo qrecall captureprefs list -r -s /" to check the QR Capture Preferences settings, and it aborted with the following error midway through the process:
* failed to read directory
This isn't surprising, because you don't have permission to read those directories. Use sudo qrecall captureprefs list -r -s / instead. I'll put an item on the wish list for qrecall captureprefs -r to continue if it can't read the contents of a directory.
2. There are a few dozen of "Unable to communicate with helper" warnings and one "Unexpected problem; scheduler stopping immediately" error in the log.
Hmmm. The "scheduler stopping immediately" error might be the reason your scheduler isn't running...
|
|
|
Accurately identifying volumes is a tricky business. The name of a volume is essentially useless for identifying it; there are probably 50 million hard drives all named "Macintosh HD" in the world. Plus, renaming a volume doesn't make a new volume. Awhile back, OS X began assigning each volume a UUID (Universally Unique Identifier), where possible. This is used in aliases and bookmarks to reliably identify a volume, and is QRecall's preferred method. The UUID is used, for example, to determine what device to mount when trying to resolve a link to a file that's on an unmounted volume. If you install a new drive, or repartition a drive, OS X will assign that volume a unique ID. Mirroring a drive, block-level cloning of a volume, and similar techniques can, however, result in two volumes with the same UUID. This can result in confusion. OS X tries to combat this by spontaneously changing a UUID if it discovers it isn't unique. But that typically only happens if you clone a drive and then mount it alongside its original. On filesystems/devices that can't support UUIDs, QRecall falls back to using a combination of the volume's size, the date it was created, and (as a last resort) its name. So to answer your question, in a roundabout way, some of the things you're doing (installing new hardware, repartitioning) will result in QRecall treating the new volume as a brand new volume. Other techniques, like block-level cloning, may trick QRecall into thinking the new volume is the same one it has seen before. If you're interested, you can see the UUID (if any) of a volume using the diskutil info command.
Ming-Li Wang wrote:While I don't particularly mind QR starting a new set from time to time, this practice does have its down side: when digging through the archive for some earlier versions of a file, I can't reach all versions of the same file easily. Hence, a question: is there a way to "merge" the sets? If not, please consider it a feature request. Thanks!
QRecall > Help > Guide > Advanced > Combine Items.
|
|
|
Bryan Derman (of Derman Enterprises) has developed a set of scripts that will incrementally backup a QRecall archive to optical media or disk images. You can download these scripts, read the documentation, and get the full history from his blog entry Off-Site and Archival Storage for QRecall Backups. In a nutshell, these scripts divide a working QRecall archive into smaller segments which are stored on disk images. Each segment / disk image can be burned directly to an optical disk, stored across smaller hard drives, uploaded to a server, or added to a cloud synchronization service. As the archive grows, new segments are automatically burned/created. The result is a set of discs or images that, in the future, can be reassembled to recreate the original archive. There are a couple of caveats. First, you must turn off periodic merge and compact actions. These scripts depend on a feature of how QRecall adds new data to an archive; newly captured data is always appended to the end of an archive, until the archive is merged and compacted. So by not regularly merging, new segments can be written/burned for just the new data that's been appended to the archive. This can go on indefinitely, or at least until your archive runs out of disk space. At some point, the archive will need be merged and compacted, at which time you'll have to reburn/recopy a new set of discs/images from the beginning. But with adequate disk space, combined with QRecall's efficient data de-duplication, this might only be necessary once a year, possibly even less. Bryan has done an amazing job with these shell scripts. They're flexible, can be customized, are very well documented, and handle a variety of DVD and BluRay disc sizes. Bryan has made these scripts freely available and encourages feedback from anyone interested in trying them out. Currently, they have only been tested with QRecall 1.2. When Bryan approached me, looking for a way to backup an archive using optical media, I dismissed the idea as impractical. Because of the way QRecall archives are normally merged and compacted, it would require that the entire set of backup images/discs to be repeatedly re-burned. But Bryan convinced me that it's perfectly manageable to not merge the archive and let it grow indefinitely. And there are business practices that would make this a necessity, in situations where all data must be kept in perpetuity. I encourage anyone who is looking for an off-site backup solution for their archive to give these scripts a try. Bryan and I will be working together to integrate this functionality into a future version of QRecall.
|
|
|
Johannes,
My stock answer to this question is "No, you can't run two versions of QRecall simultaneously."
The problem is that all versions of QRecall are going to use the same set of resources (preferences, schedule, action documents, status files, archives, and so on). And much more complicated is the fact that QRecall installs a number of system services. These services are all identified by name ("com.qrecall.scheduler" for example). OS X doesn't allow two or more services to be registered with the same name, and QRecall (for security reasons) will refuse to use any service or component that's not the exact same version. So there can only be one scheduler service running, and only the same version of QRecall will talk to that scheduler, which effectively limits the usable versions to one.
Having said all of that, there might be a one-time-only exception as we transition from 1.2 to 2.0. QRecall 2.0 uses a different (safer, more secure) launch mechanism for running privileged commands, like a capture. Because of this, it might be possible to install QRecall 1.2 in user account A and QRecall 2.0 in user account B as long as both users have the "Run actions when logged out" option turned off.
In this situation, all QRecall user services (monitor, scheduler, system service, spotlight importer, and so on) will be installed in that user's account, and just for that user. The only system service that need to be installed is the privileged launch mechanism, but since 1.2 and 2.0 use different mechanisms, they shouldn't stomp on either other.
Warning: This is entirely theoretical, so install at your own risk. If you run into problems, please let me know that you're also running a parallel copy of QRecall 1.2 in any bug reports.
|
|
|
Ming-Li Wang wrote:I've since removed the action but now I get two ghost items in the QR pop-up menu that refuse to go away. (screenshot attached below)
This is an unrelated bug that I'm working on. Workaround: Quit the QRecall Monitor process to make them go away.
|
|
|
Ming-Li Wang wrote:Got "internal program error" midway through a compact action. The Log shows "unexpected buffer contents" messages. Report filed.
Got it, and thanks for the test archive. I don't think I would have found the bug without it. Yes, this is why we do beta testing. This archive had just the right (wrong?) sequence of reads and writes needed to confuse the redundancy buffer management. The fix for this will be in b11.
|
|
|
|
|