QRecall

James Bucanek

Ming-Li Wang wrote:Still got the rainbow wheel again when trying to click QR Monitor's icon in the menubar this morning. The system is otherwise normal, meaning I could launch Firefox and other applications without delay. I launched QR from the Launchpad instead, but again clicking anything on the menu gave me the rainbow wheel. Be patient, and the menu did pop up after a few sec., and I was able to get the Log window open, and send the diag. report. The toplog was also sent, but I believe you'll see a system mostly idle throughout the night. The OSX Activity Monitor again showed QR Scheduler is alive, but it's apparently not working.

After spending a lot of quality time with the logs, the crash reports, the toplog.txt files you've been collecting, and some experimenting, I think I'm getting a handle on what's going on, although I still don't know why.

I finally figured out the rainbow cursor problem: The helper and scheduler processes are hung up. It's not a system load or resource issue, these processes are simply locked up. When you launch QRecall, the first thing it does it try to contact the monitor and scheduler processes and, you guessed it, promptly locks up waiting for them to respond.

The messages will eventually time out and the QRecall app will run, but will report it can't install the monitor and it's waiting for the scheduler to start, even though it's already running.

The new version (2.0.0b12) tries to address these issues by reworking how messages are retried. The new logic might still lock up, but for no more than a minute or two (worst case). So the monitor and scheduler should recover within a reasonable amount of time (minutes not hours) and get back to running.

The root of the problem is the helper processes that are started around 4:00 AM simply stop running—and I have no idea why. They don't log any error messages (the log messages from the process simply stop), they don't terminate unexpectedly (nothing in the system log), they don't crash (no crash log for the process), they just stop running. The scheduler and monitor continue to try and communicate with the helper, which locks them up too waiting for a response that never happens.

Without a sample of what's going on within the QRecallHelper process, I'm at a loss to explain why it would do this. It also doesn't make much sense that it only does it at 4:00 AM and not other times.

QRecall 2.0.0b12 addresses some of these issues and adds additional debugging information that will hopefully allow me to get closer to the problem.

Ming-Li Wang wrote:Since the problem doesn't seem to be related to competing use of system resources (my own tentative conclusion while you investigate), I decided to take a closer look at the logs and see if I can spot anything that escaped my attention.

And you are correct, this is not resource issue.

1. A capture action is scheduled for my system partition at 3:20am daily. The archive's layer map says it's done this morning, with 793 captured items totaling 273.7 MB. I can't find any entry for it in the QR Log (see attached screenshot), even with maximum detail level.

If you send another diagnostic report I can review the log.

Aug 19 03:20:00 carlo com.apple.xpc.launchd[1] (com.qrecall.hoist[505]): Endpoint has been activated through legacy launch(3) APIs. Please switch to XPC or bootstrap_check_in(): QRHoist

This is normal. Apple wants developers to switch from using Mach ports to XPC. So whenever QRecall starts a privileged service using Mach ports, the system spits out this message. But until Apple provides end-to-end XPC communications, QRecall has to use Mach ports.

Aug 19 03:20:16 carlo QRecallHelper[506]: Failed to initialize IconCache named: com.apple.iconservices with error: Error Domain=NSCocoaErrorDomain Code=4097 "Couldn?t communicate with a helper application." (connection to service named com.apple.iconservices) UserInfo=0x7ff2b941f840 {NSDebugDescription=connection to service named com.apple.iconservices}

Aug 19 03:20:16 carlo QRecallHelper[506]: Error returned from iconservicesagent: Error Domain=NSCocoaErrorDomain Code=4097 "Couldn?t communicate with a helper application." (connection to service named com.apple.iconservices) UserInfo=0x7ff2be2bf6f0 {NSDebugDescription=connection to service named com.apple.iconservices}

Aug 19 03:20:16 carlo QRecallHelper[506]: ImageIO: CGImageDestinationFinalize image destination must have at least one image

These are errors from OS X's launch services framework. As QRecall captures items, it queries the launch services API for things like the item's icon, localized display name, filename character encoding, and so on. There are number of issues with launch services going back more than a decade, and sometimes it can't/won't return this information. In more recent versions of OS X, it logs messages like this to the console. (Earlier versions of OS X would simply crash.) You'll probably notice corresponding messages in the QRecall log like "cannot obtain icon for ...". These problems are entirely cosmetic, as this information is only used to display the items in the archive browser. They're annoying, but can be safely ignored. Sometimes these issue are caused by a corrupted launch services database and rebuilding your LSDB might make these go away (Google "rebuild launch services database").

2. The capture actions scheduled for 3:50am was indeed run and concluded successfully, with nothing to capture. Don't know why QR executed it again after reboot this morning (as reported in my previous post).

If the scheduler terminated abnormally, it might not have remembered that it already ran that action would immediately run it again when restarted.

3. The first in a string of "Unable to communicate with helper" warnings (at 4:20:25 am) in the QR Log has a note that says:

[NOTE: this exception originated in the server.] | connection timeout: did not receive reply

The others says:

cannot connect with helper
port: QRecallHelper.2307054f

These are the helper communication problems that are at the root of the issue.

4. QR scheduler managed to get itself up and executed a capture job at 6:34:21am.

Small victories.

Ming-Li Wang

James Bucanek wrote:

1. A capture action is scheduled for my system partition at 3:20am daily. The archive's layer map says it's done this morning, with 793 captured items totaling 273.7 MB. I can't find any entry for it in the QR Log (see attached screenshot), even with maximum detail level.

If you send another diagnostic report I can review the log.

It happened in the early morning on 8/19, Taiwan time, so it's already in the diag. report I gave you that day.

B12 was installed yesterday afternoon, but it didn't work. Not a single action was executed until I rebooted the system this morning.

As I'll be out of town for a few days, I won't be able to tell if B12 can survive the early morning schedule without issue until next Tuesday.

James Bucanek

Ming-Li Wang wrote:It happened in the early morning on 8/19, Taiwan time, so it's already in the diag. report I gave you that day.

I found the log records for the action. The log window didn't display them because of a subtle bug in the way the log window collection log records, that I don't have a good solution for at the moment.

QRecall log records have a sequence value that organizes them into a hierarchy. The top level values in the hierarchy are the Julian Day Number and the process ID. Now while it's unlikely that two QRecall actions that run within the same day would have the same process ID, it isn't impossible and that's what happened here. A helper with a process ID of 506 ran at 01:38:25.86 that morning and the capture action started at 03:20:00.904 also had a process ID of 506. The log window assumed they were the same process and commingled their log records.

I'll put this on the "do something about this" list, but as I said, there's no simple solution.

It's exacerbated in your case because you have a lot of QRecall actions scheduled (several run every 15 minutes), increasing the possibility of a duplicate process ID. Most of these capture actions do nothing, because nothing changed.

You might consider changing the schedule of your actions that capture a specific subset of files (like documents, photos, or music) to use the new "Capture Items Changed" event schedule.

This new event schedule monitors the folder(s) to be captured for changes and immediately starts the capture once changes are detected. The advantages are that it's much more responsive (the action can start the minute a change is detected) and it's much more efficient because the action doesn't run until there's actually something to capture.

Ming-Li Wang

Many thanks for the detailed explanation, as always.

James Bucanek wrote:You might consider changing the schedule of your actions that capture a specific subset of files (like documents, photos, or music) to use the new "Capture Items Changed" event schedule.

This is something I've meant to try since v2 beta testing, but never got around to really set it up. Good. Will change the every-15-min. action (there's only one) right away.

One question: how reliable is OSX/HFS's file change mechanism? Do I need to compliment the action with, e.g., daily capture?

James Bucanek

Ming-Li Wang wrote:One question: how reliable is OSX/HFS's file change mechanism?

There are broadly two ways to use the filesystem change event information: to determine if something changed and to determine what changed.

The first use is rock solid. If anything within the watched folder hierarchy changes, OS X will notify the observing process (the scheduler) which will start the action.

The second use is where things get tricky. There are some extreme edge cases where filesystem events can't be used to reliably determine exactly which folders contain changes. This is why backup software like Time Machine can run into trouble. QRecall's capture uses a mixed solution that uses filesystem change information to skip folder during most captures (which is fast). But that trust has a time limit and, once exceeded, QRecall will perform an exhaustive examination of the items in every folder (which slow, but sure).

Do I need to compliment the action with, e.g., daily capture?

No. The "items changed" event simply triggers the start of the capture action, it doesn't tell the action what to capture. So it doesn't matter if you start the action using a daily schedule, with an "items changed" event, or using the command line tool, the capture action will still capture the same set of changes.