QRecall Community Forum

Archive slow to open from action
Forum Index » Problems and Bugs
Bruce Giles


Joined: Dec 5, 2007
Messages: 95
Offline
When I open my QRecall archive (about 500 GB in size) from the Open menu, or the QuickStart window, it takes about 10 SECONDS to open. But when I run an action that uses this archive, the Activity window sticks at "Opening archive" for about 10 MINUTES. Any idea why this is happening?

Here's a brief excerpt from the log file that shows what's happening:

Action 2019-09-12 23:00:07 ------- Capture Home Folder
Action 2019-09-12 23:00:07 archive: /Volumes/Seagate 2GB Backup/QRecall Backups/Macintosh HD.quanta
Action 2019-09-12 23:10:40 Minutia Took snapshot of volume Macintosh HD
Action 2019-09-12 23:10:40 Capture bgiles
James Bucanek


Joined: Feb 14, 2007
Messages: 1568
Offline
Bruce,

I have noticed that the Mojave snapshot process can be quite lengthy at times. I'm not sure exactly what the criteria are, but I suspect it competes with other changes being made to the volume at the same time.

I noticed this first when I was stepping through a test version of the capture action using the debugger and thought the process had deadlocked. I paused the executable to find it waiting in the create_snapshot() function. After about three minutes, it finally finished and went on.

If this is a problem for some reason, you can still disable snapshots by turning off the "Capture a Snapshot" option in the advanced settings. Just be warned that this option might go away in future versions, because without taking a snapshot, capture is deeply problematic in Mojave and later.

- QRecall Development -
Bruce Giles


Joined: Dec 5, 2007
Messages: 95
Offline
Hi James,

It's really not a problem -- it's just something I notice frequently because it shows up in the Activity window. But I'm not convinced that "other changes being made to the volume at the same time" is the culprit. This happens, as far as I can tell, EVERY time the action runs, whether scheduled or manually initiated, even when the computer is essentially idle. I also have a second archive, which just backs up my Virtual Machines folder, and I have the same issue with that one as well, although the delay period is shorter.

I will try disabling the snapshots just to see if it makes a difference, and I'll report back after I try that. If I tried manually snapshotting a volume from the command line, would I likely see the same delay? (I don't know how to do that, but I presume, between the man page and Google, I could figure it out.)

-- Bruce
James Bucanek


Joined: Feb 14, 2007
Messages: 1568
Offline
I'd be very curious to see what you discover.

The magic tool is tmutil.

It has a localsnapshot command to create a local snapshot of all APFS volumes. (I'm not aware of any way of creating a snapshot for a specific volume, the way QRecall does.)

You can also list the snapshots on a volume (listlocalsnapshots and listlocalsnapshotdates), and delete them by date (deletelocalsnapshots).

Both QRecall and macOS are responsible for deleting their stale snapshots, so any snapshots you create will eventually get deleted.
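Off the top of my head, a typical Terminal session looks something like this (syntax from memory, so check the man page before relying on it; the date is just a placeholder for whatever listlocalsnapshotdates reports, and some of these may want sudo):

tmutil localsnapshot                            # snapshot all mounted APFS volumes
tmutil listlocalsnapshots /                     # list the snapshots on the startup volume
tmutil listlocalsnapshotdates /                 # just the date stamps
tmutil deletelocalsnapshots YYYY-MM-DD-HHMMSS   # delete the snapshots with that date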


- QRecall Development -
Bruce Giles


Joined: Dec 5, 2007
Messages: 95
Offline
OK, I still don't know what's going on. Before experimenting with tmutil, I decided I ought to check my boot drive. I rebooted into single-user mode, and ran /sbin/fsck -fy. During the "Checking the fsroot tree" phase, it found four errors:

error: directory valence check: directory (old 0xb0069): nchildren (25) does not match drec count (0)
error: directory valence check: directory (old 0x150050): nchildren (25) does not match drec count (0)
error: directory valence check: directory (old 0x180042): nchildren (25) does not match drec count (0)
error: directory valence check: directory (old 0x290064): nchildren (25) does not match drec count (0)

I don't know what that was about, but the errors seem to have been fixed, as they did not appear again when I re-ran fsck. I also booted into Recovery Mode and then used Disk Utility to check both the boot disk and the backup disk. It didn't find problems with either of them.

So I rebooted, and ran QRecall and tried my capture action. The delay still occurred. While I was waiting for something to happen in QRecall, I tried tmutil localsnapshot. It created a snapshot in about a second. I was then able to use listlocalsnapshots to list it, and deletelocalsnapshots to delete it. It took about 30 seconds to delete. QRecall was still waiting for the archive to open, so I created another snapshot, and listed it. It showed up as:

com.apple.TimeMachine.2019-09-13-194448

I tried the list command several more times while I was waiting, and got the same result every time. Finally, after about 10 minutes, the archive opened, and QRecall started capturing. I immediately ran the list command again, and this time I got this:

com.apple.TimeMachine.2019-09-13-194448
com.qrecall.capture.596.590111295

Finally, the QRecall snapshot is listed. From this, it would appear that the delay is indeed occurring in the time between when the action starts and when the snapshot is created, but I don't know why, since I was able to quickly manually create and delete snapshots during the period when QRecall appeared to be hanging. When the QRecall capture finished, I verified that the QRecall snapshot was no longer listed, and then I deleted the one I had manually created.

Next, I went into QRecall's advanced preferences and set "Capture a snapshot" to false. Tried another capture, and the 10-minute delay still occurred. Again I used tmutil to list snapshots, and confirmed that none were created by QRecall this time. The capture finished with 4 capture errors. All were "Unable to read extended attribute data". I suspect these were due to the snapshots being turned off.

I went back to QRecall's preferences, and set "Capture a snapshot" back to true. Did another capture, and discovered that the snapshot still wasn't being created, even though the preference was now set. Maybe it needs a reboot? So I rebooted, and did another capture and confirmed that the delay still occurred, but QRecall is again creating snapshots, and the capture errors did not appear this time.

From all of the above, it seems pretty evident that snapshots are not the problem. The delay occurs whether QRecall creates snapshots or not. I just now had QRecall send you a report. Maybe that'll provide a clue.

-- Bruce
James Bucanek


Joined: Feb 14, 2007
Messages: 1568
Offline
Bruce,

If you suspect that snapshots are not the problem, the next step would be to send a diagnostic report during that 10-minute period. The diagnostic report will take a sample of all running QRecall processes. If the capture action is stuck, the report should pinpoint exactly where.

- QRecall Development -
Bruce Giles


Joined: Dec 5, 2007
Messages: 95
Offline
Just now sent one.

-- Bruce
James Bucanek


Joined: Feb 14, 2007
Messages: 1568
Offline
Well, color me dumbfounded.

Bruce, you were absolutely right. This is not a snapshot issue.

TL;DR Try reindexing the archive

The diagnostic report pinpointed what was eating up the time, but I have absolutely no (definitive) explanation as to why.

So here's what's going on (technical details). The archive maintains a very (sometimes very, very) large hash table used to search for duplicate data that's already been added to the archive. This table is so big, it's impractical to make a copy of it every time you perform a capture. So when QRecall captures a modest amount of data, instead of copying and updating the master hash file, it writes the handful of updates to an "adjunct" hash file. This adjunct hash file is read in again when the next capture starts, on the theory that the adjunct file will be orders of magnitude smaller than the master hash file.

Eventually, the adjunct hash entries will exceed a threshold and QRecall will "bite the bullet," copying the master hash file and updating it. At that point there are no adjunct entries and the whole thing starts over again.
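If it helps to picture the mechanism, it's roughly this pattern (a simplified sketch with made-up names and a made-up threshold, not the actual QRecall source):

#include <stdio.h>
#include <stdint.h>

/* Hypothetical types and threshold, for illustration only. */
typedef struct { uint64_t checksum; uint64_t recordOffset; } AdjunctEntry;
enum { kAdjunctMergeThreshold = 100000 };

/* Small capture: append new hash entries to the adjunct file
   instead of copying and rewriting the huge master hash file. */
void appendAdjunctEntry(FILE *adjunct, AdjunctEntry entry) {
    fwrite(&entry, sizeof entry, 1, adjunct);
}

/* Next capture: replay the adjunct file into the in-memory hash so
   lookups see everything captured last time. Once the count exceeds
   the threshold, the master hash file gets copied and updated, and
   the adjunct file starts over empty. */
size_t loadAdjunctEntries(FILE *adjunct, void (*insert)(AdjunctEntry)) {
    AdjunctEntry entry;
    size_t count = 0;
    while (fread(&entry, sizeof entry, 1, adjunct) == 1) {
        insert(entry);   /* this insert step is where your capture is getting stuck */
        count++;
    }
    return count;
}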

So back to the problem. Your capture is getting stuck reading in the adjunct hash entries. Here's the (interesting part of the) sample trace:
9814 -[CaptureCommand execute] (in QRecallHelper)
  9814 -[RepositoryCommand prepareRepository] (in QRecallHelper)
    9814 -[RepositoryPackage prepareWithMode:] (in QRecallHelper)
      9814 -[DataHash prepareWithMode:] (in QRecallHelper)
        9317 -[DataHash loadAdjunctEntries] (in QRecallHelper)
          9315 -[NegativeChecksumMap add:] (in QRecallHelper)
          2 -[DataHash addEntry:] (in QRecallHelper)
            2 -[DataHash insertEntryIndexIntoHash:forChecksum:] (in QRecallHelper)
        497 -[DataHash loadAdjunctEntries] (in QRecallHelper)
          497 DDReadFile (in QRecallHelper)
            497 read (in libsystem_kernel.dylib) + 10 [0x7fff5df83ef2]

During the sample period, 5% of the time was spent reading the adjunct file and 95% of the time was spent inserting those entries into the in-memory hash cache.

And here's where it gets weird. That insert function ([NegativeChecksumMap add:]) is literally 5 lines long. It consists of an exclusive or, a bit shift, a mask, an address calculation, and an add. A modern CPU should be able to do several hundred million of these a second. It should be so fast, it shouldn't even show up in the stack trace. Yet, it's accounting for 95% of the delay...
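For what it's worth, the body of that method has roughly this shape (paraphrased, with hypothetical names and map size; not the shipping code):

#include <stdint.h>
#include <stddef.h>

#define MAP_SIZE (1u << 24)   /* illustrative size; assume a power of two */

/* One exclusive or, one shift, one mask, one address calculation, one add.
   Nothing here should account for 95% of a ten-minute delay. */
void negativeMapAdd(uint8_t *map, uint64_t checksum) {
    uint64_t folded = checksum ^ (checksum >> 32);     /* exclusive or + bit shift */
    size_t index = (size_t)(folded & (MAP_SIZE - 1));  /* mask */
    map[index] += 1;                                   /* address calculation + add */
}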

My only guess is that it might be hitting virtual memory, assuming there are other large (memory footprint) processes running at the same time. Or the negative map has been mapped into memory and the page loads are just really, really slow for some reason. Basically what I'm saying is that VM paging/contention is the only thing I can think of that would account for this miserable performance.

So that's the problem. One "solution" would be to reindex the archive. A reindex will rebuild all of the index files, including the hash file, from scratch. At the end, the hash file will be complete and up-to-date, and there won't be any adjunct entries to read or write. Of course, this just kicks the problem down the road, as the adjunct entries will again start to accumulate as small captures are completed.

But start with a reindex and see if that resolves the problem.

- QRecall Development -
Bruce Giles


Joined: Dec 5, 2007
Messages: 95
Offline
James Bucanek wrote: Well, color me dumbfounded.

Bruce, you were absolutely right. This is not a snapshot issue.


Never doubt me. My gut feelings are better than most people's facts.

It worked! I started both archives on a reindex last night (they each took about 3 hours), and tried a capture with both this morning. They each took about 15 or 20 seconds to open the archive, and then the capture started.

I could almost swear that I tried to repair at least one of the archives a couple of weeks ago, and it didn't help, but maybe I just thought about it, without actually doing it. Or does Reindex do something that Repair doesn't?

I'm not sure why there would be a VM issue, or if there was, why it would slow things down so much. This Mac is a Late 2015 iMac, with a 4 GHz i7 processor, 16 GB RAM, and a 2 TB hard drive with only about 640 GB in use. It's not the current state-of-the-art, but it's no slouch either. I'm usually running a Windows 10 VM (VMware Fusion), which at various times has been allocated from 2 GB to 8 GB, but even when it's running at 8 GB, there's not that much going on in the rest of the iMac.

I'll try to keep a closer eye on the captures, and if they start slowing down again, I'll let you know. Thanks!

-- Bruce
James Bucanek


Joined: Feb 14, 2007
Messages: 1568
Offline
Bruce Giles wrote: Or does Reindex do something that Repair doesn't?

Both reindex and repair rebuild all of the index files (using the same code).

The only difference between a reindex and repair is that the reindex doesn't touch the master repository.data file.

A repair opens the master data file read+write and will erase or repair any damaged records or inconsistencies that it finds.

A reindex opens the master data file read-only and simply fails if it finds any inconsistencies.

- QRecall Development -