QRecall

Excellent news!

Steven J Gold wrote:Deduplication (and therefore, QRecall) should be my friend, so I've jumped onto the Beta hoping to become a user.

Welcome aboard! Now we'll see if we can get you to stay.

But QRecall pales in comparison to TimeMachine in one important aspect: Impact of QRecall on system usability.

I rarely notice when Time Machine is doing it's hourly stuff,, but QRecall severely impacts my computer. Suddenly, my Mac's operations like web browsing, file manipulation -- anything which requires disk access -- slow down and I wonder "What the heck's going on?" and the answer is QRecall. Especially for operations like compacting an archive or verifying.

I freely admit that the CPU and memory requirements involved in comparing every block of file data to a corpus of potentially millions of previously captured blocks, calculating checksums, generating error detection and correction codes, compression, and other tasks are vast compared to that of simply copying data from one drive to another. Keeping QRecall from interfering with your other work, therefore, can be a challenge.

(1) Time Machine sets the "use low-priority I/O" flag which gives precedence to non-Time Machine I/O, thereby lessening the impact of a backup. Could this be made a settable option in QRecall?

QRecall already does this. But, again, QRecall is doing orders of magnitude more work than Time Machine, so it still takes a toll on performance.

(2) There are times I *really* need all the performance I can get from my Mac (like when running Windows in a virtualized environment) and many of QRecall's operations take *hours* to complete. I've learned the hard way quitting QRecall in the mist of a long operation (like a compact) may have dire consequences!

I sure hope there are no dire consequences! (Unless you're force quitting a merge or compact action, in which case there can be dire consequences.) If you need to stop an action, use the stop button in the monitor window or send the QRecallHelper process a quit (TERM) signal, and not a force quit (KILL) signal. The stop or TERM signal will tell QRecall to find a break in its work, gracefully stop, and close the archive.

Could we *please* have a "Pause" button? Something which would pause QRecall until a "Resume" button is pressed?

In the QRecall Monitor window, there's a little stop (X) button in the upper right corner of each action progress pane. Click-and-hold (or Control+click) that button and you'll get a pop-up menu. In that menu you'll find two sets of commands of interest.

First is the Pause For > submenu. You can select to pause (suspend) the action for a brief time period between 5 minutes and 4 hours. Once you've paused an action, the menu item changes into a Resume command that will end the pause mode immediately.

There are also Stop and Stop and Reschedule commands. The first simply stops (quits, TERM signals) the process, identical to simply clicking the stop button. The second stops the action, but prompts you to immediately set a time at which the action is restarted. (This command only applies to actions started via an action document, because only actions can be scheduled.)

And here are some other tips:

In the QRecall Monitor menu (or in the QRecall application's Action menu, if you're not using the monitor menu), you'll find a Hold all Schedules > submenu that will suspend all of the scheduled actions for a period of time. This command affects action that are scheduled to run but haven't started yet, while the Pause for > commands suspend an already running action.

Time consuming actions can be scheduled to run late at night or early in the morning. You can add power requests to the schedule so that the computer starts up when the action is about to run and/or shuts down or goes to sleep when it's finished.

The really time consuming actions, like Compact, don't have to be run very frequently. Compact really doesn't need to be run more than once a month. How often you run merge actions is a matter of taste, but these too can be deferred for days or weeks with no significant impact on your archive. In fact, merge is more efficient if run less frequently. The other big time consumer is verify, but that can be paused or stopped without any consequences.

If you have a really large (more than 2 TB) archive, the new "Defer Data De-duplication" feature added in QRecall 2.0 is there to address the performance issues of capturing data to a large archive by putting off the de-duplication work until you compact the archive. This is because the effort required to de-duplicate data grows exponentially as the archive gets bigger. By deferring de-duplication, you can capture data all week long very quickly, and then de-duplicate it en masse over the weekend.

And if you want to get really tricky, there are new application-based conditions you can add to action schedules. For example, you could ignore (not run) actions whenever your Windows emulation application was running. It's also not a good idea to try and capture virtual machine images while they are in use. Instead, you can create an event schedule to capture these after your Windows emulation application quits.

Let us know if this helps and if you have any other questions or suggestions.

Adrian Chapman wrote:In version 2 the ability to use the new "capture on event" schedule seems ideal using "Run when capture items change" coupled with a delay, but I can't see how I can configure things so that they aren't captured by the hourly action but can still be captured to the same archive using the "Run when capture items change" action.

You're not being dense; it's impossible.

If I understand your question, the scenario is thus: You have top-level folder that you periodically capture in its entirely.

Contained within this top-level folder you have either (a) a folder who's changes are insignificant and you aren't interested in capturing regularly and/or (b) a folder who's contents are extremely important and you want to capture those much more frequently than the top-level folder.

For this example, lets say your top-level item is your home folder. You've created an action that captures your home folder every hour. Your home folder contains an iTunes folder (a), that you're not worried about capturing more than once a day, and a Working Projects folder (b), that you want to immediately capture if there are any changes.

(a) is easy to setup. Use the QRecall Capture Preferences service to set that folder to "Ignore Changes". In the case of the iTunes folder, you'd probably set it to ignore changes for 24 hours, or set it to ignore changes during the hours you typically work.

(b) can be addressed by setting up a second capture action, using a "Capture Items Changed" event. You'd create a capture action for the Working Projects folder and set it to, say, capture 2 minutes after any change.

You don't have to do anything to exclude/ignore the Working Projects folder from the hourly home folder capture. The Working Projects capture action will capture all the changed items in that folder as soon as they change. So when the home folder capture gets around to running, it's unlikely that there are any uncaptured changes in the Working Projects folder. QRecall only captures changes, so there will be nothing in the Working Projects folder to capture.

What you can't do is have a folder that is both ignored (a) and frequently captured (b). That's logically inconsistent.

Ming-Li Wang wrote:1. Those event-driven actions take place more often than I expected due to meaningless (non)change to .DS_Store files.

This, sadly, I can't do much about. OS X's file change event service doesn't report exactly what changed in the filesystem, just that "something in folder XYZ changed." So it's impossible to prevent changes to specific items from running an action.

I would like to exclude, e.g., all .DS_Store files, all *.bak files, all *.*-journal files, and all "minidump" folders, just to name a few.

This has been on the to-do list for awhile.

I've long intended to add a filter mechanism that would allow you to use file globbing and/or regex to exclude items. But it's such a geeky power-user feature that I keep putting it off.

Since you asked, I'll give it a +1.

2. The current explanation for the option "Ignore new events for ___ (min./hours/days/weeks)" &helips;

There are two issues here. First, I haven't updated the help for the new 2.0 features yet, so the behavior of the "Ignore" period for the new "Capture Items Changed" event hasn't been documented yet.

Second, the terminology is a little off because the "changed items" event are stateless and the whole events schedule mechanism is designed around stateful events, like mounting and unmounting a volume, logging in and out, or launching and quitting an application. Said another way, when a folder's content changes (the event) it stays changed forever (until captured). When an application launches, that launch event ceases to exist when the application quits again.

Because of these differences, how events and event state are treated during the "Ignore" period of event schedule is?as you've discovered?quit different for change events vs. stateful events.

The two need different terminology, and I plan to give them different interfaces, but that's probably going to have to wait until 2.1 when the UI gets an overhaul.

Ming-Li Wang wrote:B12 went through the grueling early Sat. morning schedule (moved to this morning as usual) for the first time here and survived. It's still working fine, and doesn't give me spinning rainbow wheels.

Finally, some progress!

From the diagnostic report you sent, I can confirm that all of the actions ran successfully.

Found some peculiar entries in the Log Window, however.

1. Five consecutive "could not communicate with helper" entries at address 8:46am, right about when I touched my computer for the first time in the morning. There were 6 of them yesterday morning (I forgot to add Tuesday morning to the group of "Sat. morning actions", so I didn't report back yesterday morning), also at about the time I touched my computer for the first time.

I was able to determine the cause of these messages, and it goes a long way to unraveling a slew of odd issues other have encountered.

Recent versions of OS X have gotten extremely aggressive about suspending apps that it doesn't think are relevant. In the past, the QRecall Monitor process could reasonably expect to get distributed notifications and timer messages when the system was idle, but apparently that's no longer true. Based on the log entries, it appears that OS X is electing to completely suspend the QRecall Monitor application when your display is asleep. Here's the sequence of events that I found:

At 3:20 AM a capture action ran.
20 seconds later, the capture action finished and terminated.
At 8:46 AM the monitor app receives a notification from the capture action that it would like be monitored.
The monitor app tries to connect with the capture action (the one that terminated 5 hours ago) and fails (no surprise).

This explains the spat of "Could not communicate with helper" errors that you're seeing when you wake up your display. Even though the capture action sent a notification to the monitor app at 3:20, OS X hung on to that message and didn't deliver it until 8:46.

2. There are 2 errors associated with the actions that took place overnight. The first at 4:49:55am, "Helper did not start" after a successful "merge" action. Then "helped did not start" again at 5:34:55am, which stopped a "verify" action. Both errors gave the same description (with different timestamps, of course):

These message are both erroneous and have the same root cause as the "Could not communicate..." errors.

Both the merge and verify actions also tried to establish a connection with the monitor app (this time using a direct Mach port connection, rather than using distributed notifications?for complicated reasons having to do with bootstrap namespaces). Again, instead of delivering those messages, OS X left the QRecall Monitor app suspended so it never processed or responded to those messages. The messages eventually timed out, and threw an exception which cause the "Helper did not start" error to be logged.

The "Helper did not start" error is actually a mistake in this context. It's supposed to report that helper's main run loop never got started, but it can also get logged if the run loop starts but then throws an uncaught exception, which is what happened here. I've fixed that bug. Now the "Helper did not start" error only gets logged if the helper encountered a fatal error before its main run loop is started, and not afterwards.

I've addressed the two communications problems—the monitor waking up with a request to monitor an action that's already finished and terminated, and the situation of an action trying to send status messages to a monitor app that's suspended and unresponsive.

All of these remedies will appear in (lucky) 2.0.0b13, which should get released before the end of the week.

Thanks again for all of the great feedback and diagnostic reports. I probably would have never found this on my own!

Ming-Li Wang wrote:One question: how reliable is OSX/HFS's file change mechanism?

There are broadly two ways to use the filesystem change event information: to determine if something changed and to determine what changed.

The first use is rock solid. If anything within the watched folder hierarchy changes, OS X will notify the observing process (the scheduler) which will start the action.

The second use is where things get tricky. There are some extreme edge cases where filesystem events can't be used to reliably determine exactly which folders contain changes. This is why backup software like Time Machine can run into trouble. QRecall's capture uses a mixed solution that uses filesystem change information to skip folder during most captures (which is fast). But that trust has a time limit and, once exceeded, QRecall will perform an exhaustive examination of the items in every folder (which slow, but sure).

Do I need to compliment the action with, e.g., daily capture?

No. The "items changed" event simply triggers the start of the capture action, it doesn't tell the action what to capture. So it doesn't matter if you start the action using a daily schedule, with an "items changed" event, or using the command line tool, the capture action will still capture the same set of changes.

Ming-Li Wang wrote:It happened in the early morning on 8/19, Taiwan time, so it's already in the diag. report I gave you that day.

I found the log records for the action. The log window didn't display them because of a subtle bug in the way the log window collection log records, that I don't have a good solution for at the moment.

QRecall log records have a sequence value that organizes them into a hierarchy. The top level values in the hierarchy are the Julian Day Number and the process ID. Now while it's unlikely that two QRecall actions that run within the same day would have the same process ID, it isn't impossible and that's what happened here. A helper with a process ID of 506 ran at 01:38:25.86 that morning and the capture action started at 03:20:00.904 also had a process ID of 506. The log window assumed they were the same process and commingled their log records.

I'll put this on the "do something about this" list, but as I said, there's no simple solution.

It's exacerbated in your case because you have a lot of QRecall actions scheduled (several run every 15 minutes), increasing the possibility of a duplicate process ID. Most of these capture actions do nothing, because nothing changed.

You might consider changing the schedule of your actions that capture a specific subset of files (like documents, photos, or music) to use the new "Capture Items Changed" event schedule.

This new event schedule monitors the folder(s) to be captured for changes and immediately starts the capture once changes are detected. The advantages are that it's much more responsive (the action can start the minute a change is detected) and it's much more efficient because the action doesn't run until there's actually something to capture.

Ming-Li Wang wrote:Still got the rainbow wheel again when trying to click QR Monitor's icon in the menubar this morning. The system is otherwise normal, meaning I could launch Firefox and other applications without delay. I launched QR from the Launchpad instead, but again clicking anything on the menu gave me the rainbow wheel. Be patient, and the menu did pop up after a few sec., and I was able to get the Log window open, and send the diag. report. The toplog was also sent, but I believe you'll see a system mostly idle throughout the night. The OSX Activity Monitor again showed QR Scheduler is alive, but it's apparently not working.

After spending a lot of quality time with the logs, the crash reports, the toplog.txt files you've been collecting, and some experimenting, I think I'm getting a handle on what's going on, although I still don't know why.

I finally figured out the rainbow cursor problem: The helper and scheduler processes are hung up. It's not a system load or resource issue, these processes are simply locked up. When you launch QRecall, the first thing it does it try to contact the monitor and scheduler processes and, you guessed it, promptly locks up waiting for them to respond.

The messages will eventually time out and the QRecall app will run, but will report it can't install the monitor and it's waiting for the scheduler to start, even though it's already running.

The new version (2.0.0b12) tries to address these issues by reworking how messages are retried. The new logic might still lock up, but for no more than a minute or two (worst case). So the monitor and scheduler should recover within a reasonable amount of time (minutes not hours) and get back to running.

The root of the problem is the helper processes that are started around 4:00 AM simply stop running—and I have no idea why. They don't log any error messages (the log messages from the process simply stop), they don't terminate unexpectedly (nothing in the system log), they don't crash (no crash log for the process), they just stop running. The scheduler and monitor continue to try and communicate with the helper, which locks them up too waiting for a response that never happens.

Without a sample of what's going on within the QRecallHelper process, I'm at a loss to explain why it would do this. It also doesn't make much sense that it only does it at 4:00 AM and not other times.

QRecall 2.0.0b12 addresses some of these issues and adds additional debugging information that will hopefully allow me to get closer to the problem.

Ming-Li Wang wrote:Since the problem doesn't seem to be related to competing use of system resources (my own tentative conclusion while you investigate), I decided to take a closer look at the logs and see if I can spot anything that escaped my attention.

And you are correct, this is not resource issue.

1. A capture action is scheduled for my system partition at 3:20am daily. The archive's layer map says it's done this morning, with 793 captured items totaling 273.7 MB. I can't find any entry for it in the QR Log (see attached screenshot), even with maximum detail level.

If you send another diagnostic report I can review the log.

Aug 19 03:20:00 carlo com.apple.xpc.launchd[1] (com.qrecall.hoist[505]): Endpoint has been activated through legacy launch(3) APIs. Please switch to XPC or bootstrap_check_in(): QRHoist

This is normal. Apple wants developers to switch from using Mach ports to XPC. So whenever QRecall starts a privileged service using Mach ports, the system spits out this message. But until Apple provides end-to-end XPC communications, QRecall has to use Mach ports.

Aug 19 03:20:16 carlo QRecallHelper[506]: Failed to initialize IconCache named: com.apple.iconservices with error: Error Domain=NSCocoaErrorDomain Code=4097 "Couldn?t communicate with a helper application." (connection to service named com.apple.iconservices) UserInfo=0x7ff2b941f840 {NSDebugDescription=connection to service named com.apple.iconservices}

Aug 19 03:20:16 carlo QRecallHelper[506]: Error returned from iconservicesagent: Error Domain=NSCocoaErrorDomain Code=4097 "Couldn?t communicate with a helper application." (connection to service named com.apple.iconservices) UserInfo=0x7ff2be2bf6f0 {NSDebugDescription=connection to service named com.apple.iconservices}

Aug 19 03:20:16 carlo QRecallHelper[506]: ImageIO: CGImageDestinationFinalize image destination must have at least one image

These are errors from OS X's launch services framework. As QRecall captures items, it queries the launch services API for things like the item's icon, localized display name, filename character encoding, and so on. There are number of issues with launch services going back more than a decade, and sometimes it can't/won't return this information. In more recent versions of OS X, it logs messages like this to the console. (Earlier versions of OS X would simply crash.) You'll probably notice corresponding messages in the QRecall log like "cannot obtain icon for ...". These problems are entirely cosmetic, as this information is only used to display the items in the archive browser. They're annoying, but can be safely ignored. Sometimes these issue are caused by a corrupted launch services database and rebuilding your LSDB might make these go away (Google "rebuild launch services database").

2. The capture actions scheduled for 3:50am was indeed run and concluded successfully, with nothing to capture. Don't know why QR executed it again after reboot this morning (as reported in my previous post).

If the scheduler terminated abnormally, it might not have remembered that it already ran that action would immediately run it again when restarted.

3. The first in a string of "Unable to communicate with helper" warnings (at 4:20:25 am) in the QR Log has a note that says:

[NOTE: this exception originated in the server.] | connection timeout: did not receive reply

The others says:

cannot connect with helper
port: QRecallHelper.2307054f

These are the helper communication problems that are at the root of the issue.

4. QR scheduler managed to get itself up and executed a capture job at 6:34:21am.

Small victories.

Here's an experiment to try:

Late in the evening, open a Terminal window and start the following command:

top -l 0 -S -s 30 -n 6 -o cpu > ~/Desktop/toplog.txt

and let it run all night. When you get back to your computer in the morning, switch to the Terminal window and press Control+C to stop it.

The toplog.txt file will contain a sample of your system's load averages, total memory usage, swap space, and the top 6 CPU hogs, and other performance stats, recorded every 30 seconds. Compress that file and email it to me.

If there are processes running overnight that are demanding a lot of CPU and/or memory, this should catch them.

Ming-Li Wang wrote:Another follow-up: I tried to restart the scheduler with the command line you taught me in another thread:
QRecallScheduler &
It didn't work. Got

-bash: QRecallScheduler: command not found

Sorry, that was shorthand. You need to specify the path to the scheduler executable:

~/Library/Application\ Support/QRecall/QRecallScheduler &

Then I tried
sudo QRecallScheduler &
and got a process id. Good.

Not good. You got a job ID for sudo. sudo then went to run the executable, couldn't find it, and quit. And it's moot anyway, because you can't run the scheduler as root.

But the two pending actions are still "waiting". Restarting QR monitor didn't help.

Restarting the monitor won't help. The monitor is just that: a monitor. It just reports on what's going on. It doesn't tell actions what to do (although it will rely commands you select in the stop menu to the action).

There's still no sign of QR scheduler in OSX Activity Monitor, either. The process name for the id is "sudo". Guess I'll have to logout and re-login again.

Even after you get the scheduler running again, you'll still need to stop those actions or choose the "Ignore Hold and Run" command from the stop button's drop-down menu in the monitor.

When the scheduler starts a QRecallHelper process, it begins by placing it in a "wait for scheduler" state. Once the scheduler determines that the action is ready to run, it sends it a message that it can start. While in the "waiting" state, the action is suspended until the scheduler gives it the green light.

And this is the problem I'm seeing in your log. The scheduler starts the QRecallHelper process, but then can't communicate with it ("lost communications with helper" errors). So what's happening is that the helper gets launched but never runs. I see these kind of communication failures when the system is overloaded and can't allocate the kernel resources necessary to deliver the message, or can't do it fast enough. What I suspect is that when the scheduler is trying to start these actions, your system is under tremendous load; either something is using (or was using) a lot of CPU, memory, or both. This is bogging down the communications between the scheduler and helper processes, and generally making your system run very sluggishly.

And just restarting the scheduler won't help, unless the scheduler closed down cleanly and is restarted within 30 seconds. The scheduler doesn't know what QRecallHelper processes are running and, once it's forgotten about them, can't automatically reconnect with them to tell them they can run. So once you have an orphaned QRecallHelper process sitting in a "wait" state, you will either have to start it manually (using the "Ignore Hold and Run" command) or just stop the action.

Ming-Li Wang wrote:Ok, just came in, and found QR Monitor window open with 2 actions "waiting on scheduler". Apparently the scheduler went down again, at 12:22pm according to the log.

Please send that log via a diagnostic report. If you haven't figured this out by now, you can't send too many diagnostic reports.

I posted the last message at 11:44 local time, so the scheduler stopped less than an hour later. But it seemed to have run into trouble much earlier, for the two "waiting" actions (both weekly merge actions) were scheduled to take place at around 5am this morning. The rainbow wheels I saw this morning should be related, for QR seems be acting normally (no rainbow wheel) now with the scheduler gone.

I suspect these are related, but I doubt it's a causal relationship.

BTW, there are two new QR processes in the Activity Monitor that I've never seen before: QRecallHelper (2 of them). Is this normal?

Yes, those are the two actions that are waiting to run. You can quit them via the activity monitor.

Every (major) task that is performed on an archive is performed by an instance of QRecallHelper (sometimes two instances). So if you start a recall, a verify, a merge, and a compact action, you'll see four QRecallHelper instances running. The helper also performs installs, system configuration checks, updating the sleep/wake schedule, and other privileged activities.

Ming-Li,

I reviewed the diagnostic reports you sent, especially the one sent before you tried to kill or restart anything, and what I found was ... nothing. QRecall wasn't doing much of anything at the time. Both the scheduler and monitor were idle, and there were no running actions.

So I suspect something else is in play, and the QRecall application might be the cause.

The next time this happens, launch the Activity Monitor. Starting in the memory tab, sort processes by total memory use (greatest to least) and note which processes are using the most memory. Also note the amount of swap used. Then go to the CPU tab, sort (greatest to least) by CPU% and see if there are one or more runaway processes.

Now on to my theory: If you are in the habit of leaving the QRecall application running, and its log window open, this might be your problem. As long as the log window is open, it will read in all of the log records on file. In your situation, this could be a lot.

You have a lot of actions defined, and you run those actions very frequently. You're also running a beta version of the software (which produces a lot more log messages than the regular version). Your log file preferences are set to keep log messages for up to 6 months. All of these combine to create a kind of "perfect storm" that sets up the log window to read a ridiculous amount of data. If that's what happening, your system's poor performance could simply be the amount of memory and swap space consumed the QRecall application's log window.

Now if you aren't leaving the log window open, then something else is bogging down your system, which is why you need to use the Activity Monitor to conform it.

If it turns out that it is the QRecall log window, you can not leave it open or try reducing amount of log history you keep.

After spending some quality time with your log, I found a couple of instances where the scheduler stopped due to uncaught exceptions, mostly to do with spurious mount notifications for the recovery partition. (Under normal circumstances, the recovery volume should never be mounted, so it's a mystery why the scheduler would be receiving these...)

I've plugged those leaks, but even those wouldn't have caused the scheduler to exit with a non-zero status. So why launchd isn't restarting it remains a mystery.

I even specifically found instances where you had force-quit the scheduler, and launchd didn't restart it.

But see my reply to the slow-down issue you reported for a possible clue.

On a side note, it's best to try quitting (SIGTERM) the scheduler first before resorting to a "force quit" (SIGKILL). A quit gives the scheduler a chance to save the information it has about running actions before terminating. This helps it restart smoothly.

Ming-Li Wang wrote:Looking closely, I found the "overlap" was from 7/4 to 8/1, and 8/2 might be the day I started trying out beta 10. (Not sure, but likely.) If that is true, then does it mean both b10 and b11 thought, when making their first backup, they were continuing the volume started on 7/4 (the day I reinstalled my system from scratch), and yet b11 thought, at the same time, the volume has changed from the previous day. How is that possible?

I don't know, and there's no easy way to determine what identifiers QRecall was using at the time to match the volume.

Oh, and to make matters more complicated, QRecall 1.x uses a different method for identifying volumes. Prior to 2.0, QRecall used a combination of the filesystem identifier (part of the BSD filesystem API), volume creation date, size, and name.

Ming-Li Wang wrote:I did use "sudo", didn't I?

Yes, you did.

I don't know how I missed that...

I couldn't reproduce this on my development system, but I did see it on a test MacMini, so I'll investigate this further.

Thanks for sending the diagnostic reports with the scheduler failure. And to answer your question, no the QRecall application can no longer force the scheduler process to start/stop. Recent changes to the launchd and launchctl tools make process management the sole responsibility of launchd.

It's a moot point, since adding a feature to manually start the scheduler process is more hack than solution.

If you want to do that, just start it from the command line:

QRecallScheduler &