5 years ago · #1
Norbert Karls
Joined: Feb 21, 2013
Messages: 13
Hi there James, hope to see you in good health.

I've recently tried to run a compact operation on my main OS archive. It ran well into the night until macOS decided to shoot it down because the system volume had filled up with swap (that would have been 16~17G of swap, plus all the physical memory not given to other applications, so likely another ~20G). After it was killed during compacting, I thought it would surely be best to have a look at the archive before trying to add more data to it, so I asked the QRecall GUI to open it. Sure enough, it told me the archive was damaged and needed repairs. I agreed, and worriedly watched the system volume fill up again. I moved some apps to external storage this time, cleaned out the Downloads folder, etc., so now I have not 17 but 30 gigs of free space again, and was relieved to see swap usage stagnate at some 20G. iStat Menus showed QRecallHelper using almost 200G of RAM (can't be true, but still, wow) when, after some three days, I decided to stop the attempt because the system volume was again dangerously low on space.

I then decided to give this the best shot I can. I copied the archive from its storage box (which runs on SSD-cached spinning rust) to a spare SSD for more IO/s, created a named pipe so logging would not eat IO or storage space (see the sketch at the end of this post), and used the command line to start a repair action like
qrecall repair /path/to/broken.quanta --recover /path/to/recovered.quanta --logfile /path/to/named_pipe --monitor --options noautorepair

Then I watched. I constantly drain the named pipe with a »tail -f«, so nothing piles up there. Watching the log scroll by, it looked fairly normal at first, but after a few minutes, at around 42% during »Examining archive data«, it started writing a »Details at 'position' [numbers]« line for every eight bytes (!) of 'position'. At this pace, writing an ~80B log line for every 8B of repository data, I can understand very well how that might fail in my storage situation. repository.data is about 180G in size, and we're just shy of 80G now, so if it doesn't pick up pace again at all I'm looking at several weeks until this is done. There were also lines reading »#debug# 2712 consecutive envelope exceptions«, and many blocks like
2020-04-06 12:29:11.956 +0200 #debug# -[DiagnosticCommand packagePump] caught exception at scan position 79413639240 [2.73752138.46183.30357]
2020-04-06 12:29:11.956 +0200 #debug# invalid envelope type [2.73752138.46183.30357.1]
2020-04-06 12:29:11.956 +0200 Subrosa .Exception: Data [2.73752138.46183.30357.1.1]
2020-04-06 12:29:11.956 +0200 #debug# Class: BufferBlasterSource [2.73752138.46183.30357.1.2]
2020-04-06 12:29:11.956 +0200 Details Type: 0x07000000 [2.73752138.46183.30357.1.3]
2020-04-06 12:29:11.956 +0200 Details Position: 79413639240 [2.73752138.46183.30357.1.4]
2020-04-06 12:29:11.956 +0200 Subrosa .DataBad: 1 [2.73752138.46183.30357.1.5]
2020-04-06 12:29:11.956 +0200 Details File: repository.data [2.73752138.46183.30357.1.6]
2020-04-06 12:29:11.956 +0200 Details Length: 132864 [2.73752138.46183.30357.1.7]
2020-04-06 12:29:11.956 +0200 #debug# backtrace 28001dafb9bec [2.73752138.46183.30357.1.8]
and just now there are lines reading
2020-04-06 12:41:21.550 +0200 Details [0]: { position=14920574123313796, length=4, <IO> re-read past eof { Position=14920574123245568, .EOF=1, Path=/path/to/broken.quanta/repository.data, Length=131072 } } [2.73752138.46183.801.4472.2.
which leave me a bit worried. Why would something point to a place that I could never afford to buy enough storage for?

The log file currently grows at about a megabyte per minute. Given that there's 80B of log for 8B of repository data, we're currently »Examining archive data« at a pace of about a megabyte per ten minutes. If things stay like this, we're looking at a million minutes, translating to almost 700 days. And that's just for the examining, not the actual rebuilding which comes after.

Questions:
- Is this operation likely to finish at all or does the archive seem to be absolutely broken? Why/why not?
- Is logging expensive? If so, can I give a command line argument asking those many many »Details at…« lines to not be logged at all?
- Is it normal that those logged Details are only eight Bytes in size, and that there are this many of them?
- I found an »available memory« option in the Advanced preferences and set it to 8192. Will this help future runs? (Right now, QRecallHelper seems to keep its resident memory below 8G.)
- Is it possible to run a Compact operation in an extra safe way, like when it is running on a network share that might disappear anytime?

Please do stay at home and healthy, we can ill afford to lose you.

Olfan
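For reference, the named-pipe setup described above boils down to something like the following; the paths are placeholders, and starting the tail reader first keeps the pipe drained so the qrecall writer never blocks:

    mkfifo /path/to/named_pipe
    tail -f /path/to/named_pipe &    # or redirect to /dev/null once you stop watching
    qrecall repair /path/to/broken.quanta --recover /path/to/recovered.quanta \
        --logfile /path/to/named_pipe --monitor --options noautorepair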
5 years ago · #2
James Bucanek
Joined: Feb 14, 2007
Messages: 1572
Norbert,

Sorry to hear you've run into these problems. You have the poor misfortune to have hit a couple of QRecall deficiencies (most of which have already been addressed in future versions, just not the one you have now).

First, I'm surprised the compact action got killed for taking up too much RAM. Compact shouldn't use that much RAM; it builds a couple of large record indexes (a few hundred MB at most) and after that it mostly just reads and writes records to relocate them. I suppose there could be a memory leak in there that I'm unaware of, but that's the only explanation that jumps to mind.

Sadly, compact is (currently) one of those actions that can't be gracefully recovered from if it crashes. That's why the archive must be repaired afterwards. An interrupted compact can leave a lot of detritus in the archive's primary data file that the repair must deal with, and that's what the repair is running into.

The primary data file is composed of a sequence of records. Each record has a header, type, content, and a checksum. The repair simply starts at the beginning of the file and tries to read every record. If the next record is intact (header, type, content structure, and checksum all verify and appear to contain usable data), that record is added back to the archive and the repair moves to the next record. If the record is damaged, or appears to be suspect in any way, a failure is logged, the repair advances 8 bytes (all records are aligned to 8-byte boundaries), and it tries again. That's what you're seeing in the log: an attempt to read a record at every 8-byte offset in the file.

Now what is supposed to happen is that the repair logs the first few failures, then stops logging failures until it finds the next valid record, where it logs a summary of how much data it had to skip over to get there. But that logic doesn't always work in the current version, and that's why you're getting GB of log output. So with that in mind,
Is this operation likely to finish at all or does the archive seem to be absolutely broken? Why/why not?
A lot of work has gone into making the repair as robust as possible, so if you can read the file, it should be repairable.

First, I would suggest simply piping the log to /dev/null. Second, since you're working on a copy of the archive, there's no reason to use the --recover option; you're just creating a lot more work for the repair (and approximately doubling the amount of time it will take to finish). BTW, it's nice to see someone using the command line tools.

My suggestion is to start a simple repair, ignoring the log, and see if it finishes. (I fully expect it to.) Having said that, I'm still not sure why the repair would be using an excessive amount of memory. But if you can get the repair across the finish line, the archive should be usable again (and largely intact).
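Concretely, that simplified repair would look something like this (reusing the placeholder paths from your post, and only the flags you already used):

    qrecall repair /path/to/broken.quanta --logfile /dev/null --monitor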
Is logging expensive? If so, can I give a command line argument asking those many many »Details at…« lines to not be logged at all?
Logging isn't overly expensive, but in your case QRecall is logging too much. It's simply a bug.
Is it normal that those logged Details are only eight Bytes in size, and that there are this many of them?
Yes. No.
I found an »available memory« option in the Advanced preferences and set it to 8192. Will this help future runs? (Right now, QRecallHelper seems to keep its resident memory below 8G)
Not much. QRecall already tries to keep its memory usage to what's physically installed (up to 8GB). If you have that much memory, performance will be better if you leave the memory limits alone. That option is really only for situations where you want to keep QRecall from trying to use all of the physical RAM, where it might compete with other long-running processes (like database servers or other work you're trying to get done).
Is it possible to run a Compact operation in an extra safe way, like when it is running on a network share that might disappear anytime?
That exact situation is impossible to protect against. It's a catch-22: once the volume is unreachable, there's no way to take any corrective action on the data. The only way to protect against this would be to make a copy of the entire archive and compact that copy; that would require twice the storage and actually perform more I/O than the compact does. Having said that, I'll tease that QRecall 3.0 now does this on filesystems that support cloning (i.e. APFS). The archive's primary data file is first cloned, and the clone of the archive is compacted. If anything goes wrong, the original version of the archive is still there and it simply reverts back to the previous file.

Finally, let me close with a suggestion. Once the archive is repaired and you're ready to try compacting it again, I suggest setting a time limit for the compact by adding a schedule condition that will "Stop after 3 (or so) hours". The compact is incremental and, if canceled, will pick up where it left off the next time it starts. So if you're concerned that the compact will run out of VM, simply prevent it from running too long and start it again the next day.
Please do stay at home and healthy, we can ill afford to lose you
And the same to you and yours! I don't know what I'd do if I didn't have any customers.
- QRecall Development -

5 years ago · #3
Norbert Karls
Joined: Feb 21, 2013
Messages: 13
Hi again, thanks a lot for the quick reply, as always I've learned a lot.

So the repair slides through the file, one word at a time, trying to find the start of an actual data block. What I'm witnessing could then be one or several large files that were purged by a rolling merge some time in the past and are no longer part of the actual archive, and as the repair operation doesn't know (or doesn't trust the index to reliably know) where a data block starts, it just has to keep trying until the next actual block of data starts. I agree that logging all of that does seem a little obsessive. Now that I know that, I'll just let it finish. At some point there has to be actual data again, and then the rest of the operation should finish in a more timely manner.

Memory leak: it is possible that some other application had been causing the runaway swapping and QRecall was just the victim. I can imagine scenarios like overlap with capture/merge actions, automatic updates of other applications, Photos looking for new faces to recognize, load peaks from some of my own projects, etc., which, when happening simultaneously, can accumulate to ludicrous levels of RAM use. With a user in front of the screen, macOS will ask for an application to be closed and/or space on the system volume to be freed. While I'm asleep, though, processes will be shot. If there haven't been any other hints at a possible memory leak recently, you really shouldn't start dissecting your product over a rumor.

Command line tools: ever since QRecall 2.0 I've been using the GUI only for browsing and restoring. I need the CLI to manage exclusions anyway, and I love not having to leave the shell (which might seem counter to the Mac experience, but I'm actually a unix dev abusing Macs as tools, so…).

Timeout an action: thanks for this suggestion, this seems very much worth trying. I do see the »Stop after…« condition in the UI action editor; is there an equivalent for the command line? If not, don't bother writing one just for me. I can just configure an action once graphically and then, instead of composing the whole operation on the command line, run that action in the shell by name.

Customers: what customers? You've been refusing to take money for upgrades for as long as I've known you, and that's about a full decade now. During that time I've received more, quicker and better support than with some paid maintenance plans I've been subscribed to. I'm not sure whether you're aware, but that's hardly a proper vendor-customer relationship. It might even be slightly irresponsible: your business has to stay afloat even after everyone already has a QRecall license. So when 3.0 goes live I'm just going to buy an extra license again, like I did for 2.0.

Stay safe,
Norbert

post scriptum: speaking of staying afloat while completely reeling off topic: Dawn to Dusk isn't just you, is it? There are other people who know the code base and understand the inner workings of QRecall, right? I've been a company's bus factor of 1.0 for too long not to be scared by the thought that there might be only one person in the world holding mission-critical knowledge…
5 years ago · #4
James Bucanek
Joined: Feb 14, 2007
Messages: 1572
Norbert Karls wrote: At some point there has to be actual data again, and then the rest of the operation should finish in a more timely manner.
That's the hope!
Timeout an action: ... is there an equivalent for the command line?
The equivalent would be to obtain the PID of the QRecallHelper process that gets started by the tool, then start a timer that will send a SIGTERM after a while (à la (sleep 10800; kill $QRHELPERPID) &).
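Spelled out as a rough sketch (the pgrep lookup and the exact helper process name are assumptions on my part; adjust as needed):

    # start the repair in the background, discarding the log as suggested above
    qrecall repair /path/to/broken.quanta --logfile /dev/null &

    sleep 10    # give the tool a moment to spawn its helper

    # grab the PID of the newest QRecallHelper (the process name as it
    # appears in Activity Monitor)
    QRHELPERPID=$(pgrep -n QRecallHelper)

    # after 3 hours (10800 s), ask the helper to stop with SIGTERM
    (sleep 10800; kill $QRHELPERPID) &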
I can just configure an action once graphically and then, instead of composing the whole operation on the command line, run that action in the shell by name.
That's the "much easier" way.
Customers: what customers? You've been refusing to take money for upgrades for as long as I've known you, and that's about a full decade now.
Well, I really like my customers and I still want to build momentum. I have a new plan for 3.0 that will hopefully provide some subscription income, so wait for that.
speaking of staying afloat while completely reeling off topic: Dawn to Dusk isn't just you, is it?
It's largely me. I have contractors for a lot of tasks. I keep hoping to get enough regular revenue to hire some full-time engineering and support staff, but I haven't quite crested that milestone yet. There are other engineers, and there are disaster plans to go open-source if this COVID thing goes sideways...
- QRecall Development -

5 years ago · #5
Norbert Karls
Joined: Feb 21, 2013
Messages: 13
Hi again, just popping in to share a little follow-up on what has been happening lately.

I copied the broken archive to an external SSD I had lying around and connected it to an unused, elderly MacBook Pro. This way I could make sure there isn't some weird hardware issue with my work Mac, and I wouldn't have all that extra I/O going on all the time. Here's what I observed (a simple watch loop for reproducing this is sketched at the end of this post):

- QRecallHelper will not exceed its allowance of »wired« or »reserved« memory, but it will take on more and more »virtual« memory over time. As htop's VIRT column shows mapped files, too, I'm not worried about this reaching huge values. iStat Menus' memory dropdown has QRecallHelper growing wildly as well, though. For comparison, Safari reaches some 4GiB after a few days.
- After around 24 hours, iStat Menus will list QRecallHelper using some 50GiB of RAM (which can't be true, the sum of physical and swap memory is less than that, but what does it count then?), and some time later QRecallHelper will just be gone from the process list. The qrecall command line utility will then just stop counting time and stall forever until killed (a friendly kill is enough). It will complain about being unable to reach its (long dead) QRecallHelper.
- I tried manipulating the clock, and QRecall happily adopted my changes, counting run times of several thousand hours, so it isn't dying upon rolling over a full day even though it does die after about a day's worth of runtime.
- The logs don't mention any reason for the process ending; there's just the same chatter on and on for hours, and then it just stops. A caught error leading to termination would be logged and could then be found near the end of the logfile; this is not the case, though. There's also no core dump or any meaningful entry in macOS' Console.app on the matter.

I then put my hopes into trying an incremental approach instead of doing all the work at once, so I set up a repair action to be run every ten hours with a condition to stop after eight hours. It does stop after eight hours, and it will start anew for another eight hours every time. Unfortunately, it will start from scratch every time and not continue where it left off. I can't find log entries saying that the existing data had to be scrapped or anything, so I assume that while a compact operation can run incrementally, a repair action cannot.

Finally, I looked around the various options to the commands, hoping to find a way to have it just extract individual files or layers, or some way to force reindexing while ignoring errors to create an index that would allow me to get something instead of nothing, but without any luck.

I am readying to formally declare this archive beyond repair. I'll still keep it around to try it on any other Macs I find myself possessing and/or any new versions of QRecall as they come out; maybe someday some combination of hardware and software will perform a miracle cure. For my daily-driver Mac I've had to start a new archive, though. I'm not feeling safe with no system backup at all. Better to lose ten years' worth of history than to not prepare for the future.

Here's a thing that came to my mind: a repair action will first dig through the entirety of an archive analyzing it before then starting to take action (with the exception of a reindex, but it refuses that here). Couldn't we have some kind of scavenge mode, especially with the --recover option, that starts writing a new archive immediately, adding files and layers at the pace they are discovered in the source archive?
This way, when the process dies or fails, there's at least everything up to that point. Knowing far too little about the inner workings of the repository, this may of course range anywhere from a quick'n'easy one-hour hack to something utterly impossible on your end.

Anyhow, I'm ready to put this case to rest. If my observations gave you any new ideas, just dictate actions and I will take them and report back.

Do stay healthy,
Norbert
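The watch loop mentioned above is nothing fancy; a minimal sketch, assuming stock macOS pgrep and ps and that the helper process is named QRecallHelper as shown in the process list:

    # print the helper's resident (RSS) and virtual (VSZ) sizes once a minute
    while pgrep -q QRecallHelper; do
        ps -o pid,rss,vsz,etime,comm -p "$(pgrep -n QRecallHelper)"
        sleep 60
    done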
5 years ago · #6
James Bucanek
Joined: Feb 14, 2007
Messages: 1572
Norbert Karls wrote: Hi again,
Hello!
QRecallHelper will not exceed its allowance of »wired« or »reserved« memory, but it will take on more and more »virtual« memory over time
Note that QRecall uses a lot of "mapped" files, so most of its virtual memory is really "virtual." In other words, it's just reserving address space; it doesn't mean any actual memory is involved (although some will be, eventually), and it is separate from the VM backing store. For example, when a process "maps" a 500MB file to memory, what happens is the system simply reserves a span of logical memory addresses (say, 0x0006000c6a0000000 through 0x0006000c6bfffffff) and sets up the contents of that file to correspond to those addresses. The "virtual" memory of the process immediately increases by 0.5 GB, but nothing has really happened; no actual memory has been allocated or used at this point. The real measure of a process's impact is how much of the VM store and heap space it has consumed, which you can't tell from a casual look at its virtual memory footprint. That's why the latest macOS and the Activity Monitor app have a "memory pressure" metric that takes these kinds of things into account.
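If you want to see that distinction yourself, macOS's vmmap tool breaks a process's address space down by region; something like the following (vmmap and pgrep are standard macOS tools, not part of QRecall):

    # summary view: mapped-file regions inflate the virtual size, while the
    # dirty/swapped numbers show what the process actually costs
    sudo vmmap -summary $(pgrep -n QRecallHelper)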
The logs don't mention any reason for the process ending, there's just the same chatter on and on for hours and then it just stops. A caught error leading to termination would be logged and could then be found near the end of the logfile; this is not the case, though.
That would indicate that the process is crashing for some reason, and it appears to be repeatable. That information would be in a system crash log. If you send a diagnostic report from QRecall (Help > Send Report...), those crash logs will be sent to us and could indicate what's failing.
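If you'd like to peek before sending the report, macOS keeps per-user crash reports on disk (file names vary by macOS version):

    # list the most recent user-process crash reports, newest first
    ls -lt ~/Library/Logs/DiagnosticReports | head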
I then put my hopes into trying an incremental approach
The repair (and index) is really an "all or nothing" proposition. It has to finish, or it has to start over.
Finally I looked around the various options to the commands, hoping to find a way to have it just extract individual files or layers or some way to force reindexing ignoring errors to create an index that would allow me to get something instead of nothing, but without any luck.
That's literally what the repair command does, so we need to figure out why it's crashing. (Note that a program can ignore errors; it can't ignore a crash.)
Here's a thing that came to my mind: A repair action will first dig through the entirety of an archive analyzing it before then starting to take action (with the exception of a reindex, but it refuses that here). Couldn't we have some kind of scavenge mode, especially with the --recover option, that will start writing a new archive immediately, adding files and layers at the pace they are discovered in the source archive?
That's pretty much what it's already doing... the problem is the crash. However, it did occur to me that some aspects of the repair, like dealing with damaged records, could be made more incremental, and I'll add that to the wish list.
Do stay healthy
We're in this together!